PHP & MariaDB based pageview stats best approach

MGCJerry

Limp Gawd
Joined
Sep 18, 2005
Messages
144
The place I used to post my code questions has degraded a lot, so I'm going to try [H] and see how it goes. I've had lots of good advice from here. :) I typically bounce my ideas off others to get an overall sense of what people think.

I've come from being a long-time PHPNuke user (groan), and have learned which ideas work, which don't, and how NOT to do certain things (seriously, writing a search function for each and every module? Write one reusable function, or make it into a class, and be done with it)... As a result I've set up a few of my sites using a lightweight in-house CMS which has a basic page view counter ('blah' page has been viewed 150 times). I've been entertaining the idea of expanding the view counter to give me stats down to the day and hour. I like what PHPNuke had for its stat breakdown, but having 4 tables and running 4 queries to update them is... well, not my idea of lightweight.

Here's what I have in mind, but it uses one table and 2 queries per page view. The site is already at 16 queries, but one could be removed with this "new" stats idea, so I would only be up to 17 queries per page load (some pages have more, depending on which site)... Memory usage also averages 1.5-1.6MB (which seems high to me, but it does make sense given a few things I have going on).

Proposed Structure:
table name 'views'
'stid' (primary key, auto increment),
'statdate' (date),
'h1' int(8) unsigned,
'h2' int(8) unsigned,
'h3' int(8) unsigned,
... and so on up to 'h24'.

* A query to check for a row matching the current date; if none exists, create one.
* An UPDATE against the current hour's column (a view at 1AM does h1 = h1 + 1, one at 13:36 does h13 = h13 + 1, etc).
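As a sketch, those two steps could even collapse into a single statement, assuming a UNIQUE key on statdate (column and table names as proposed above; the hour column name would be built in PHP from date('G')):

```sql
-- One statement per page view: inserts the day's row if missing,
-- otherwise bumps the current hour's counter (here hour 13).
-- Relies on a UNIQUE index on statdate to trigger the update path.
INSERT INTO views (statdate, h13)
VALUES (CURDATE(), 1)
ON DUPLICATE KEY UPDATE h13 = h13 + 1;
```

That trims the design from 2 queries per view down to 1, at the cost of building the column name dynamically.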

This will allow me to pull stats for a specific day and see an hourly breakdown without some stupid logic needed to parse through the data.

My original idea was to insert a new row every hour, but after a year of stats I would have 8,760 rows! 'nuke_stats_hour' from my old nuke site is 100,728 rows! (2002-2015). My new idea would only be 365 rows per year or 4,745 for 13 years of 'stats'.

I know there are plenty of good already-made scripts. I know there is Google Analytics. Why do I want to reinvent the wheel?... Plenty of reasons. Because I'm old school (*shakes walker* get off my lawn & my LAN). Because I don't want bloat. Because I want simplicity; it's king. Because function > form. Because programming this is trivial. Because I like programming. Because, like the rest of the site (minus a couple 3rd party extras), I want it to be mine to do with as I please.

So does this sound like a solid, simple plan? I'm quite open to an alternate way of doing things or other suggestions.

Edit:
I went ahead and did this, and as a test I populated my views table with 51,000 records and ran a number of speed tests. The speed was acceptable according to HeidiSQL, along with using EXPLAIN to optimize the queries. I did have to add an index to improve performance when looking up stats for specific items.

Another change was to add a field for the 'page'. This way I can see which pages are viewed at which times and dates. I'll be removing the page counter from the content table and relying on this system for stats. I will also be able to remove 2 old queries that are now redundant.
 
The number of records in a 'views' table will have to be equal to the actual page impressions. I could very well be missing your point, but how are you planning to get around that?

This will allow me to pull stats for a specific day and see an hourly breakdown without some stupid logic needed to parse through the data.

If so, I would do at least something like this:

CREATE TABLE views (
  pageid INT UNSIGNED,  -- the page you are displaying
  viewdate TIMESTAMP DEFAULT CURRENT_TIMESTAMP,  -- you can parse this using Maria's date/time functions
  INDEX (pageid)
);

Now, this looks very minimalistic, but - you can use the 'between' clause to pick literally any time period to analyse.
SELECT COUNT(*) FROM views WHERE pageid = somepage AND viewdate BETWEEN date1 AND date2;
There are functions that convert the timestamp into various 'readable' forms and vice versa.
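For instance, a minimal sketch of an hourly breakdown using those functions (table and column names from the sketch above; the pageid and dates are placeholders):

```sql
-- Hourly page views for one page on one day,
-- grouping the raw timestamps into up to 24 buckets.
SELECT HOUR(viewdate) AS hr, COUNT(*) AS views
FROM views
WHERE pageid = 42
  AND viewdate >= '2016-01-25' AND viewdate < '2016-01-26'
GROUP BY HOUR(viewdate)
ORDER BY hr;
```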
 
The structure of the table has changed a little to accommodate a few changes.

The pageviews are purely counted using this system. The only thing that *may* get out of sync is the "stat_sum" (which is also increased along with the current hour), but I have a query/button that can fix that if it becomes an issue.

As of right now, I can do a query "WHERE stat_date=25 AND stat_month=1 AND stat_year=2016" and that will give me all the rows for today. A row is tied to a named site page, so I can have at most as many rows per day as I have named pages. I find doing this gives me a very versatile way of getting data from the database.

I have a couple of sample images to show; the pageview data is artificial. Excuse the mess in the graphs, as these need more editing since the code was written a while back and needs to be more templated and portable.

Here is a small sample of the database table... Here is the actual structure;
`stat_date` char(2) NOT NULL
`stat_month` char(2) NOT NULL
`stat_year` char(4) NOT NULL
`stat_page` varchar(50) NOT NULL
`stat_sum` int(8) unsigned DEFAULT '0'
`h0` int(8) unsigned DEFAULT '0'
`h1` int(8) unsigned DEFAULT '0'
`h2` int(8) unsigned DEFAULT '0'
`h3` int(8) unsigned DEFAULT '0'
`h4` int(8) unsigned DEFAULT '0'
`h5` int(8) unsigned DEFAULT '0'
`h6` int(8) unsigned DEFAULT '0'
`h7` int(8) unsigned DEFAULT '0'
`h8` int(8) unsigned DEFAULT '0'
`h9` int(8) unsigned DEFAULT '0'
`h10` int(8) unsigned DEFAULT '0'
`h11` int(8) unsigned DEFAULT '0'
`h12` int(8) unsigned DEFAULT '0'
`h13` int(8) unsigned DEFAULT '0'
`h14` int(8) unsigned DEFAULT '0'
`h15` int(8) unsigned DEFAULT '0'
`h16` int(8) unsigned DEFAULT '0'
`h17` int(8) unsigned DEFAULT '0'
`h18` int(8) unsigned DEFAULT '0'
`h19` int(8) unsigned DEFAULT '0'
`h20` int(8) unsigned DEFAULT '0'
`h21` int(8) unsigned DEFAULT '0'
`h22` int(8) unsigned DEFAULT '0'
`h23` int(8) unsigned DEFAULT '0'
`last_visit` char(13)
`ipaddy` varchar(15)
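For reference, a per-view update against this structure might look like the following sketch. The h13 column name would be built in PHP from the current hour, and the last_visit/ipaddy values here are placeholders, not the site's real format:

```sql
-- Bump the current hour's bucket and the running total
-- for one page's row on the current date.
UPDATE prefix_pageviews
SET h13 = h13 + 1,
    stat_sum = stat_sum + 1,
    last_visit = '1453741000000',
    ipaddy = '203.0.113.7'
WHERE stat_date = '25' AND stat_month = '01'
  AND stat_year = '2016' AND stat_page = 'index';
```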

statdbsample.jpg


Using some simple queries and a simple graph class I have, here are the results of a simple query (one query per graph, to avoid pulling large amounts of data and processing it in PHP)... The "Daily Activity" chart is today's overall site-wide page views. That text will be changed later.
statsample.jpg
 
Disclaimer: I am not trying to be a smartass :D
Also, I do not have experience with frameworks like nuke. And I don't know how proficient you are writing from scratch.
Now, that being said.

Yeah I totally get how this works right now.

So, I guess I could point out some caveats in this approach if you'd like.

First, requirements-wise: are you interested in keeping track only of impressions?
Are you cool with the fact that one person or robot will juke your stats by constantly refreshing a given page?

Second, this is server time, right? Are you expecting users from different timezones? Because, if so, the results will be skewed.
It will seem like some pages get a lot of attention during the night, when in fact users are accessing them during their daytime - which is probably what you'd actually like to know.

Also, you seem worried about optimization.
This is good, looks like you're trying to be responsible with your coding.
But, sadly, without a thorough knowledge of the inner workings of PHP, the HTTP server (Apache? Nginx?), the RDBMS (in this case Maria), and the browser, it's quite possible you will not be able to pinpoint your bottlenecks.

(one query per graph to avoid pulling large amounts of data and doing it in php)
Do you mean one query per 'graph bar' (hour)?
Performing a query does not always mean pulling unneeded data. Could you give an example of an approach that would, as you put it, pull large amounts of data?

How good are you with SQL? Do you know the 'group by' clause?

For example:

CREATE TABLE views (
  page_id VARCHAR(50),  -- you seem to go by title, so varchar in this case (could also be an int id)
  visit_time DATETIME   -- stores time in year-month-day hour:minute:second format
);

That's how I implemented logs in an application I've written for work. It tracks events such as access denial, SQL errors, data modification. So, currently I have about 25 000 entries.

Now, to solve your problem, I used just one query:

SELECT COUNT(*), visit_time FROM views GROUP BY HOUR(visit_time);

(Note: the visit_time shown for each group is an arbitrary row from that hour's bucket; selecting a non-aggregated column like this relies on the non-strict GROUP BY mode that MariaDB runs in by default.)

Result:
Code:
+----------+---------------------+
| count(*) | data                |
+----------+---------------------+
|        3 | 2011-04-11 06:48:08 |
|     1044 | 2011-02-16 07:51:17 |
|     3855 | 2011-02-16 08:02:37 |
|     3418 | 2011-02-21 09:04:58 |
|     3807 | 2011-02-17 10:53:58 |
|     2843 | 2011-02-17 11:00:09 |
|     3880 | 2011-02-16 12:16:25 |
|     3145 | 2011-02-16 13:35:30 |
|     3364 | 2011-02-15 14:42:18 |
|     1813 | 2011-02-18 15:14:24 |
|       68 | 2011-05-02 16:00:42 |
|        2 | 2016-01-11 17:13:17 |
|        4 | 2015-04-09 19:02:08 |
+----------+---------------------+
13 rows in set (0.01 sec)
 
Disclaimer noted. It's nice to have these kinds of discussions. I also like getting input from the outside, because sometimes it's the only way to see another point of view. ;) Also, I'm on medication for my sinus thing, so my writing might be a bit odd...

The whole CMS I'm working on (minus template engine, and a couple optional libraries) is from scratch.

I'm primarily interested in impressions. Other more generic metrics are already available within the system but I do plan on expanding those like I am doing to the stats now. Yes, I know that a bot, or visitor can really do a number on my datasets. I actually ran across a bot that I found does NOT obey robots.txt (disallowed), and it hammered my site pretty good - and skewed the data to the tune of 500 pageviews. .htaccess now tells that bot where he can go. :D

The time used is server time, which is currently set to America/New_York (my timezone, and a configurable option in the CMS). The server is physically located in Dallas, TX (Central time). If I do decide to make the stats visible to users, they will have to be logged in and have their timezone properly set. Right now, stats are purely administrative.

I'm not overly worried about optimization; I just don't want to do things unnecessarily. My whole outlook is: do something, then look at it and see if it can be improved without getting 'too small' to follow in later updates. Programming is not only for computers, but also for the programmers (me, in this case) maintaining the code. Smaller != better (at least not every time). I also try to avoid excessive nesting like "fakeFunction(anotherOne(thisFunction($data, 'arg0', 'arg1')));".

But I know a lot of optimization comes from the database structure, the database itself (Maria), as well as server-specific (Apache) and even some OS-level stuff. If you've seen how nuke works and how it's written, you'll know where I'm coming from. ;) I guess the word I want is concise. I've also been working with my host to make sure I don't introduce any code on my live site that creates too much load.

As far as the graphs... I have the day's activity chart, & the yearly chart.
► I run a query to get the daily activity (1-30) and hand it off to the chart creator.
► I then run another query to get the yearly activity (potentially thousands of rows) and hand it off to the chart creator.

Alternatively, I could get everything needed in one query, then loop through all the returned rows and assign the data to their respective arrays (daily and yearly) before generating each graph. On my dev server, I found the 2-query version easier to follow (visually and mentally), with not much difference in speed (taking microtime() before and after the data work). The differences were within the run-to-run overlap between both tests.

I'm not going to say I'm an expert on SQL, but I am familiar with 'group by' and use it quite often along with joins, and subqueries. One of my original plans was to use 2 tables, and a join query or some kind of sub-query. Then I came up with the idea I'm using now. Group by is actually part of the query that gets the pageviews from this table.
Query: SELECT stat_page, SUM(stat_sum) AS stat_sum, MAX(last_visit) AS last_visit FROM prefix_pageviews GROUP BY stat_page ORDER BY stat_sum DESC, stat_page;
pagestats.jpg


The actual site if you're curious. It was an old pen and paper RPG site that I have run online a few times as well: http://2thextreme.org (the code is primarily written on my dev server before being placed online).
 
Well, if it fits the requirements and it's easier for you to maintain I'd say you're set.

One thing though, I think keeping the sum in the database might be unneeded and possibly resource taxing.

Your views table will be mainly written to and seldom (relatively) read. With each view you will have to increment both the proper 'h' field and also increment the sum field.

How about calculating the sum only during pulling data for the charts?
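A sketch of computing the total at read time against the column layout shown earlier, instead of maintaining the stat_sum column on every write:

```sql
-- Derive the daily total from the hour buckets at read time,
-- so each page view only has to touch one hN column.
SELECT stat_page,
       (h0+h1+h2+h3+h4+h5+h6+h7+h8+h9+h10+h11
        +h12+h13+h14+h15+h16+h17+h18+h19+h20+h21+h22+h23) AS day_total
FROM prefix_pageviews
WHERE stat_date = '25' AND stat_month = '01' AND stat_year = '2016';
```

The trade-off is a slightly heavier read in exchange for a lighter, always-consistent write path.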

Also, a thing I liked in PHP was building multi-dimensional associative arrays during data retrieval. For example, $row['id']['name'].
 
You do make a good point. I did the sum in the database because I'm assuming it may be faster to increment 2 fields on update vs doing a sum(field) via query when viewing the stats. But like you said, relatively speaking the stats are high write, low read. I'll have to look into this further. Funny enough, I originally didn't have a sum field, it was added later in the project.

Multi-dimensional arrays are great fun, if sometimes a bit confusing. The menu in use on the site is a nested multi-dimensional array, and that code really needs another pass on the administration side. It's confusing. My 'problem' is I like using arrays for function arguments: one function that does a lot of stuff but has multiple types of outputs (especially math-related or UI-related).
 
You could get away with caching the sum like that, or indeed improve your activity logging tenfold if you found a way to have those types of queries deferred.

Is there a way to do delayed updates in your database engine?
Maybe move the actual queries to a subroutine and place them in some sort of queue instead (and have a server side robot process that queue using, say, a task scheduler).
That way an impression won't be burdened with waiting for the stats part to execute.
The user will have a smoother experience.

It might be overkill, but I'm just saying - that count bump will have to happen on every page view.

You say you pass arrays as args - guessing sometimes you pass multiple arrays? Comma-separated? With some directives to instruct the function?
Configurable function behavior (especially returned type) is probably bad in the long run. At least in my case it proved to be.
As your code grows, I'm guessing so might the number of used data types.
While that grows, so will the number of minute differences between each type's handling.

That in turn will cause your functions to constantly grow in bloat possibly exponentially due to extra code for checks.

Or maybe have the function itself take a guess at what's being passed: is_array(), is_numeric() and such.

The 'more significant' rules to adhere to here are, I think, keeping the functions as short as possible and maintaining a consistent vocabulary in naming them.

Anyway, in the end you should focus on getting the job done. No need to kill oneself with a giant book about sql transactions and stored procedures if all you need right now is already available.
 
Yea, I think the idea of a cron/scheduled task to update stats could be good for an extremely busy site, but it's highly overkill in my situation. I have MariaDB 10, which supports DELAYED on inserts and LOW_PRIORITY on updates. According to the docs, "If you use the LOW_PRIORITY keyword, execution of the UPDATE is delayed until no other clients are reading from the table." This may actually be a good idea for the stats, since their inserts and updates are not critical and don't need to happen *immediately*.
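As a sketch, assuming the table name used earlier, the keyword simply slots in after UPDATE. One caveat worth noting: LOW_PRIORITY only has an effect on storage engines that use table-level locking (such as MyISAM), and INSERT DELAYED is likewise limited to certain engines - InnoDB ignores both.

```sql
-- Defer the stats write until no clients are reading the table
-- (effective on table-locking engines such as MyISAM).
UPDATE LOW_PRIORITY prefix_pageviews
SET h13 = h13 + 1
WHERE stat_date = '25' AND stat_month = '01'
  AND stat_year = '2016' AND stat_page = 'index';
```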

Yes, a few of my functions received arguments as arrays but not a lot of them. One of the main things I use is for handling language translations that are run within the logic and then are used as display elements on the page.

I have a class of arrays that handles languages and options.
Code:
    var $aactivate = array(
        0 => _DEACTIVATE,
        1 => _ACTIVATE
    );
    var $enabledis = array(
        0 => _DISABLED,
        1 => _ENABLED
    );
    var $accesstype = array(
        0 => _DISABLED,
        1 => _EVERYONE,
        2 => _ANONYMOUS_ONLY,
        3 => _REGUSER_ONLY,
        4 => _SUBSCRIBER_ONLY,
        5 => _ADMIN_ONLY,
    );
// 20 or so other arrays along the same lines...

One function, displayActive($data, $arraydata=null, $options=null), is probably the most complex; it handles my display output based on the "language" arrays above. Suppose I have a field where something is enabled or disabled: displayActive(1, null, 'highlight') will add a "highlight" CSS class to emphasize that value based on input (CSS: enabled is green, disabled is red).

$options can be either an array or a string: strings for simple highlighting operations, a specific array for UI icons.

Now suppose I want to show whether the data is "enabled" or "disabled" before it's sent to the template engine.
displayActive(1, $cmsarray->enabledis, 'highlight')... This will show "enabled" highlighted green...
displayActive(0, $cmsarray->enabledis)... This will show "disabled"....
displayActive(0, $cmsarray->enabledis, true)... This will show "disabled" highlighted red...
Using 'true' instead of 'highlight' has the same effect. This is to preserve some scripts that use the old version of this function until they can be updated.

Another added ability was using icons to indicate current status, with a red X image or a green checkmark.
$opts = array('icon', 'check');
displayActive(0, $cmsarray->enabledis, $opts)... This will show an image-based "X" icon. If the function can't find the theme-based image (named 'check' in this example), it will use the 'default' icon.

There are a handful of available options. On any invalid $options input, the function simply returns the input; if the input itself is invalid, it returns false. Typically the input is tied to whatever $arraydata is, but if $arraydata is omitted the function simply defaults to "enabled/disabled".

I purposely made this function this 'complicated' because I wanted it simple enough for general usage with minimal arguments (displayActive(0) will return "disabled") but also able to do more in specific scripts without having to create individual functions. This function was actually consolidated from 4 other functions. Other than this one and a few more, the rest is kept as simple as possible.

The all-uppercase names are constants from the language file. Language can be translated from this inc.arrays.php file or from the language file (english.php).

Here is an example of the resulting code when using this function to send data to the template...
Code:
            $template->insert_loop('blocks.list_header.blockslist', array(
                'BLOCK_STATUS' => displayStatus($blocks['active'], $cmsarrays->accesstype, true)
            ));

The html template has "This block is {BLOCK_STATUS}", and in this case if the block is enabled, it will be highlighted green else it will be highlighted red.
 
I'm glad you're finding this useful.

Okay, about that particular set of problems.

I do know and use code like this:

var $enabledis = array(
    0 => _DISABLED,
    1 => _ENABLED
);

However, I reverse them. It becomes:

var $enabledis = array(
    _DISABLED => 0,
    _ENABLED => 1
);

That way you don't need to look up the option definitions to hunt for the right number.
By placing them on opposite sides, you get human-readable, self-describing constants.

Say:

if ($_SESSION['accesstype'] == $accesstype['ADMIN_ONLY']) {
    // open ADMIN controls, with $_SESSION populated at logon from the DB
}

Also, if the args are arrays, you probably want the default arguments not to be null. Null is really nasty to check for. How about giving some real default arguments?

Something along the lines of:

function displayActive($data, $arraydata = array(), $options = array('highlight' => DEFAULT_HIGHLIGHT_COLOUR)) { ... }
 