Producing and using website statistics
Posted by Steve on Mon 31 Jan 2005 at 00:28
It's very useful to be able to view the statistics of websites, to see how visitors are finding your sites, which pages are the most popular, etc. Debian contains several packages for presenting this information to you, and here we'll look at two of them.
When it comes to viewing statistics of your website there are a few things that you have to bear in mind:
- The Apache logfiles can't tell you exactly how many visitors you've had: users may share a proxy or cache, and badly behaved browsers distort the counts.
- You can't tell how people move around your website due to caching at the client side.
- You may never know how your users arrived at your site because Referrer information may be missing, or incorrect.
- You can't tell how long users read your pages for, nor can you tell how they left your site or where they went next.
With those caveats out of the way the package you'll choose to display your statistics will probably depend on two things:
- How easy the setup and maintenance is.
- Whether the information presented is that which you care about.
A very rough count of the visits to your website can be obtained by counting the lines in your Apache access log with the following command:
wc -l /var/log/apache/access.log
However this doesn't take very much into account. For example, a single visit to view the front page might result in multiple requests, such as those needed to load a CSS file and a group of graphics.
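One rough way to narrow the count towards page views is to skip requests for supporting files. The sketch below assumes the common or combined log format, where the seventh whitespace-separated field is the requested path. The function name and the list of extensions are illustrative choices, not anything Apache defines.

```shell
#!/bin/sh
# Count requests that are not for obviously static assets.
# Assumption: common/combined log format, so field 7 is the request path.
count_page_requests() {
    awk '$7 !~ /\.(css|js|gif|jpe?g|png|ico)(\?.*)?$/ { n++ }
         END { print n + 0 }' "$1"
}

# Usage (logfile path as used in this article):
#   count_page_requests /var/log/apache/access.log
```

This still counts every page request rather than every visit, so it remains an approximation.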
Similar simple statistics can be obtained from the command line, such as the number of unique visitors to your site:
awk '{print $1}' /var/log/apache/access.log | sort -u | wc -l
(This extracts the first field of each line in the logfile, which is the hostname or IP address of the visitor, sorts these entries removing duplicates, and then counts them.)
However this doesn't take into account "visits per day", or "visits per month". In short, if you wish to view interesting statistics like these you'll need to create a lot of different scripts.
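As an illustration of the kind of script this involves, here is a minimal sketch that counts unique client addresses per day. It assumes the common or combined log format, where the fourth field looks like "[31/Jan/2005:00:28:00"; the function name is made up for this example.

```shell
#!/bin/sh
# Print "day unique-hosts" for each day found in an access log.
# Assumptions: field 1 is the client address and field 4 is the
# timestamp in the form [DD/Mon/YYYY:HH:MM:SS
unique_hosts_per_day() {
    awk '{
        split($4, t, ":")          # t[1] is "[DD/Mon/YYYY"
        day = substr(t[1], 2)      # drop the leading "["
        if (!seen[day, $1]++)      # first time this host appears on this day
            hosts[day]++
    }
    END { for (d in hosts) print d, hosts[d] }' "$1" | sort
}

# Usage (logfile path as used in this article):
#   unique_hosts_per_day /var/log/apache/access.log
```

Note that the final sort is a plain text sort, so days are ordered by their leading digits rather than chronologically across months.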
Alternatively you could install a real statistics viewer which has already been created, such as awstats or webalizer.
Both of these tools work in much the same way: they read the logfile which Apache has produced, process the entries internally, and then produce a collection of HTML pages containing the statistics.
The Debian packages are set up to work with the default Apache configuration which Debian uses, which has the logfile located at /var/log/apache/access.log. If you've moved this for your sites then you'll need to make changes.
Awstats
Awstats is a versatile logfile processor which is written in Perl.
You can see a sample of the output which it produces by looking at the online Awstats sample page - this shows you the unique visitors per month, top search requests which users used to find your site, and other information.
The awstats package is configured via the files in the /etc/awstats/ directory. There is a global configuration file, and a local one which may be modified to make changes.
The most obvious changes to make are the following settings:
LogFile="/var/log/apache/access.log"

# Enter the log file type you want to analyze.
# Possible values:
# W - For a web log file
# S - For a streaming log file
# M - For a mail log file
# F - For a ftp log file
# Example: W
# Default: W
#
LogType=W

# Examples for Apache combined logs (following two examples are equivalent):
# LogFormat = 1
# LogFormat = "%host %other %logname %time1 %methodurl %code %bytesd %refererquot %uaquot"
#
LogFormat=4

SiteDomain=""
By default when awstats runs it merely produces a datafile in /var/lib/awstats, which is dated by the time it was run. It doesn't produce static output files unless you update the configuration.
To view the statistics you must invoke a CGI script, which takes the condensed data and produces output you can inspect in your browser.
To do that you must visit the following URL in your browser:
http://www.example.com/cgi-bin/awstats.pl
If you wish to have static HTML pages created instead you must run the following command line:
/usr/share/doc/awstats/examples/awstats_buildstaticpages.pl -update \
  -config=/etc/awstats/awstats.conf \
  -dir=/var/www/stats/ \
  -awstatsprog=/usr/lib/cgi-bin/awstats.pl
This will use the configuration file "/etc/awstats/awstats.conf" to build some static pages, which it will place in "/var/www/stats".
As you can see this is quite a mouthful! However it's a simple thing to add to a script to run once a day.
If you do this then you should disable the default updating of the statistics, which happens every ten minutes, by removing the file /etc/cron.d/awstats. If you are building static pages only once a day it is a waste of time updating the statistics for online viewing more often.
Handling multiple sites involves copying the configuration file /etc/awstats/awstats.conf to a new name, /etc/awstats/awstats.name.conf.
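A daily script wrapping the static-page command might look like the sketch below. The paths are the ones used in this article; the function name and the DRY_RUN variable are hypothetical additions that let you print the command instead of running it.

```shell
#!/bin/sh
# Build static awstats pages; intended to be run once a day from cron.
# BUILDER and the awstats.pl location are the paths quoted in this article.
BUILDER=/usr/share/doc/awstats/examples/awstats_buildstaticpages.pl

build_static_stats() {
    # $1 = configuration file, $2 = output directory
    # When DRY_RUN is set (a hypothetical switch for this sketch),
    # the command is echoed rather than executed.
    ${DRY_RUN:+echo} "$BUILDER" -update \
        -config="$1" \
        -dir="$2" \
        -awstatsprog=/usr/lib/cgi-bin/awstats.pl
}

# Usage:
#   build_static_stats /etc/awstats/awstats.conf /var/www/stats/
```

Trying it first with DRY_RUN=1 set shows exactly what would be executed before you hand it over to cron.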
Once this is done you can then update the statistics for a single host by specifying on the command line:
-config=name
This will update the statistics for the named configuration file.
You can also examine the simple script /usr/share/doc/awstats/awstats-update, which attempts to update all configuration files. Modifying this to build static pages for each host is a simple enough matter.
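If you would rather write such a loop yourself, the following is a minimal sketch. It assumes the awstats.name.conf naming described above and the awstats.pl location used earlier in this article; both function names are invented for this example, and it is not the packaged awstats-update script.

```shell
#!/bin/sh
# List the "name" part of every awstats.name.conf in a directory.
list_awstats_configs() {
    # $1 = configuration directory
    for f in "$1"/awstats.*.conf; do
        [ -e "$f" ] || continue        # the glob matched nothing
        name=${f##*/awstats.}          # strip the directory and "awstats."
        name=${name%.conf}             # strip the trailing ".conf"
        echo "$name"
    done
}

# Update the statistics for each named configuration in turn.
update_all_awstats() {
    list_awstats_configs "${1:-/etc/awstats}" | while read -r name; do
        /usr/lib/cgi-bin/awstats.pl -update -config="$name"
    done
}

# Usage:
#   update_all_awstats
```

The plain awstats.conf file is deliberately skipped, because it has no name component between "awstats." and ".conf".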
Webalizer
Webalizer is a flexible webstats producer which is written in C, which helps make it nice and fast.
Installing the Debian package is as simple as running:
apt-get install webalizer
This will lead you through some basic questions by using debconf to prompt for answers.
By default the package will install a daily cron job which causes the system to process the logfiles once a day. It will always run after the default Apache logfile rotation, which means that instead of examining the logfile /var/log/apache/access.log it will use the previous one, /var/log/apache/access.log.1.
To configure the software you must look at the global file /etc/webalizer.conf.
There are at least two options you will need to adjust:
# LogFile defines the web server log file to use. If not specified
# here or on on the command line, input will default to STDIN.
LogFile /var/log/apache/access.log.1

# OutputDir is where you want to put the output files. This should
# should be a full path name, however relative ones might work as well.
# If no output directory is specified, the current directory will be used.
OutputDir /var/www/webalizer
The rest of the options you can adjust as you wish.
This works well for single sites, but if you have a group of websites all on the same machine you might need to make some changes.
The way that I handle multiple websites on one host is to place all the files beneath a common directory /home/www, such as:
/home/www/
|-- www.site1.com
|   |-- htdocs
|   |   `-- stats
|   `-- logs
`-- www.site2.com
    |-- htdocs
    |   `-- stats
    `-- logs
Here we have two sites, www.site1.com and www.site2.com; each has its own logs/ subdirectory where Apache places the logfiles.
To handle this simply, copy the default webalizer.conf file from /etc into each of the log directories:
cp /etc/webalizer.conf /home/www/www.site1.com/logs
cp /etc/webalizer.conf /home/www/www.site2.com/logs
Now if you make the changes to the configuration file so that each one has:
LogFile access.log
OutputDir ../stats/
You can update the stats by running:
cd /home/www/www.site1.com/logs
webalizer -q
cd /home/www/www.site2.com/logs
webalizer -q
(The -q flag merely makes the program run quietly.)
These commands can be placed inside a shell script and invoked automatically by a cron job belonging to a user who can write to the stats directories. You can then remove the default job by running:
rm /etc/cron.daily/webalizer
The default output of webalizer can be seen in the sample reports available on the Webalizer site, which contain information about the number of unique visitors per month, the most popular directories, and the most popular files.
Each aspect of the report can be customized by following instructions in the configuration file.
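For the multiple-site layout described above, the per-site update commands could be collapsed into one loop. This is only a sketch: it assumes every site lives under a common base directory with a logs/ subdirectory containing its own webalizer.conf, and the function name and the WEBALIZER override variable are hypothetical.

```shell
#!/bin/sh
# Run webalizer once in every site's logs directory.
update_site_stats() {
    # $1 = base directory holding one subdirectory per site
    #      (/home/www in this article's layout)
    webalizer_cmd=${WEBALIZER:-webalizer}   # hypothetical override, handy for testing
    for logs in "$1"/*/logs; do
        # Skip sites that have no local configuration file.
        [ -f "$logs/webalizer.conf" ] || continue
        # Use a subshell so the directory change does not leak out of the loop.
        ( cd "$logs" && "$webalizer_cmd" -q )
    done
}

# Usage:
#   update_site_stats /home/www
```

Adding a new site then needs no change to the script, only a new directory with its own webalizer.conf.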
On this server I run awstats successfully, except for the automatic update.
I would like to use the logrotate script to automatically update my sites, as described in the FAQ (http://awstats.sourceforge.net/docs/awstats_faq.html#ROTATE), so as not to lose data during the Apache logrotate execution.
But it doesn't work on Debian, neither stable nor testing.
Do you know of this problem, or a way to solve it?
Or maybe you can tell me whether your method loses data.
thanks.
"Doesn't work" is a little vague, but perhaps the error can be solved by following the instructions in /usr/share/doc/awstats/README.Debian - especially the notes on file permissions.
Steve
-- Steve.org.uk
In /etc/logrotate.d/ I create a file named after the domain so for example /etc/logrotate.d/domainname which handles the logrotation for that domain. Simply make use of the prerotate and endscript features. Here is a working example:
/home/user/www/logs/access.log {
daily
missingok
rotate 120
compress
delaycompress
notifempty
create 644 root root
sharedscripts
prerotate
/home/user/www/AWStats/cgi-bin/awstats.pl -update -config=/home/user/www/AWStats/cgi-bin/awstats.conf >/dev/null 2>&1
endscript
postrotate
/etc/init.d/apache reload #>/dev/null 2>&1
endscript
}
Comment or uncomment depending on what you wish to be notified about. For initial tests, comment out >/dev/null 2>&1 so you receive any errors. After that, uncomment it so you don't get unneeded email. However, I do like to see that the apache process was successfully reloaded without errors, so I leave it commented.
Hope this helps! - Rob
Awstats appears to be more feature rich than webalizer, as it may be used to produce reports on a larger array of logs (httpd, mail, ftp...), and there are a number of useful plugins developed for it. Also, awstats' html reports are much easier on the eyes than those generated by webalizer. On the other hand, awstats takes a bit more effort to configure and may require you to change (weaken?) the permissions of your apache logs. Of the two, I felt awstats had more inherent security risks than webalizer; and, in fact, a rather infamous flaw was discovered recently in awstats that led to the compromise of some well known sites.
Webalizer is quite easy to set up, as the debian developers have put a lot of effort into the package and debconf walks you through the installation nicely. Webalizer's reports are perfectly useful and adequate.
<Directory "/usr/lib/cgi-bin">
AllowOverride None
Options ExecCGI -MultiViews +SymLinksIfOwnerMatch
Order allow,deny
Allow from all
AuthType basic
AuthName "cgi-bin restricted"
AuthUserFile /etc/awstats/awstats.htpasswd
<Files "awstats.pl">
Require valid-user
</Files>
</Directory>
then create password file with:
htpasswd -c /etc/awstats/awstats.htpasswd yourlogin
(Just remember not to use -c when adding further users, since it creates the file afresh.) From now on only authorized people can view your statistics.
Beyond that, if you are security paranoid, you can additionally protect the password by allowing access to the stats over https only. This works even if you have multiple vhosts and only one IP: by default awstats shows statistics for the domain in which it was run, but you can also use:
https://www.httpsdomain.org/cgi-bin/awstats.pl?config=www.httponly.org
to access other domains.
So if you add a redirect to each vhost section of your apache2 http config, similar to:
Redirect /cgi-bin/awstats.pl https://www.httpsdomain.org/cgi-bin/awstats.pl?config=www.httponly.org
you have a simpler path to type, and access to the unencrypted awstats page is blocked.
(Sorry for posting the same thing twice, but the forum has a small bug that removed some lines previously. This time everything should be OK, I hope.)
awk '{print $1}'
(This extracts the first part each line in the logfile, which is the hostname or IP address of the visitor, sorts these entries removing duplicates and then counts them)
Does it?! Which version of awk are you using? ;-) ITYM something along the lines of
awk '{print $1}' | sort -u | wc -l
or awk '{print $1}' | sort | uniq -c | sort -rn -k1
for something more report-like to look at.
Of course you are correct. I've updated the text.
I must have made a bad edit at the time ...
Steve
--
Thanks, yet another easy to understand article. As a newbie I'm not that sure how to invoke the command 'webalizer -q' using cron and a shell script. Could you amplify by any chance?
I have attempted to invoke the command with a crontab line to no avail.
In theory there would be many logs files to update depending on the number of domains on the server.
So is it possible to setup one script and cron job that updates all stat files??
Rgs Pete
Yes using one script is the best way. Say /usr/local/bin/update-stats has the following contents:
cd /home/www/www.steve.org.uk/logs
/usr/bin/webalizer -q
cd /home/www/www.debian-administration.org/logs
/usr/bin/webalizer -q
Then to run this script once a day use this in your crontab file:
0 0 * * * /usr/local/bin/update-stats
(You can see a simple introduction to crontab here.)
Alternatively you could use logrotate to run the script, as described in this article. To do that modify /etc/logrotate.d/apache (or apache2) to have:
/home/www/*/logs/*.log {
daily
missingok
rotate 5
compress
delaycompress
notifempty
create 644 root root
sharedscripts
prerotate
/usr/local/bin/update-stats
endscript
postrotate
/etc/init.d/apache2 restart
endscript
}
(Of course I'm assuming you store your websites beneath /home/www/foo.com {htdocs logs cgi-bin} - that might not be how you do things ...)
Steve
--
So if I setup the file update-stats in the following dir
/usr/local/bin/update-stats
And update-stats contains the following lines
cd /home/compass/posh-promdresses.co.uk/logs
/usr/bin/webalizer -q
and append this to the crontab file
0 0 * * * /usr/local/bin/update-stats
The stats update magic should work right??
Need I adjust the webalizer.conf file which currently reads
# LogFile /var/log/apache/access.log.0
LogFile /var/log/apache/access.log.1
# OutputDir is where you want to put the output files. This should
# should be a full path name, however relative ones might work as well.
# If no output directory is specified, the current directory will be used.
OutputDir /var/www/webalizer
If I enter /usr/local/bin/update-stats and call update-stats I get a permission denied error :-(
Rgs Pete
Check the permissions of the output directory, and of the logfile.
Perhaps your user doesn't have read/write permission.
(Anything beneath /var/www is going to be unwritable to non-root users. Unless you make changes...)
Steve
--
../stats and logfile location is LogFile /var/log/apache2/access.log.1.
i am getting the following error
./logs: line 36: LogFile: command not found
./logs: line 42: OutputDir: command not found
./logs: line 65: Incremental: command not found
./logs: line 81: ReportTitle: command not found
./logs: line 92: HostName: command not found
./logs: line 244: HideSite: command not found
./logs: line 247: HideReferrer: command not found
./logs: line 250: HideReferrer: command not found
./logs: line 253: HideURL: command not found
./logs: line 254: HideURL: command not found
./logs: line 255: HideURL: command not found
./logs: line 256: HideURL: command not found
./logs: line 257: HideURL: command not found
./logs: line 263: GroupURL: command not found
./logs: line 303: IgnoreSite: command not found
./logs: line 304: IgnoreReferrer: command not found
./logs: line 325: MangleAgents: command not found
Thanks for your help
Looks like you're trying to execute the configuration file - that looks like a bash error.
Show the commands you're running as well as the result and it might be more clear what is going on...
Steve
--
But I have copied (as you suggested, with the file name logs) a separate conf file under the apache2-default folder. There I have the webalizer conf file (output dir changed to ../stats), and I have created a stats folder under apache2-default.
When I run webalizer -q, the output for this apache2-default folder is not written to the apache2-default/stats folder, which stays empty.
hope this clears the doubt
thanks for your help
Can you help me understand why I am not getting output in the particular folder?
Thanks
Go to /usr/share/doc/webalizer/; there you'll find "cron-multiple-config". Edit the file to remove the header comments. Now `cp /usr/share/doc/webalizer/cron-multiple-config /etc/cron.daily/webalizer` (back up /etc/cron.daily/webalizer first in case you ever decide to go back to a single webalizer site). Now create /etc/webalizer/. Move /etc/webalizer.conf into it. Now create multiple webalizer.conf files in that dir, one per site, and edit the fields accordingly.
I've found only sawmill, but it's shareware.
I copied webalizer.conf over under a new name and changed paths for each of my web sites. Easy as pie.
Very informative article. Thanks!