Web Usage Reports
Web usage reports are produced by IU WebHost Services with a tool called Analog. Analog comes from Dr. Stephen Turner of the University of Cambridge Statistical Laboratory. You can read about Analog and its basic model in the Analog-specific sections of this document found below. Your report is accessible at http://www-reports.iu.edu/[account name]. Find out about our local web usage processing and how we're using Analog for you in the first section below.
Table of Contents
IU WebHost Log Processing Using Analog
Frequently Asked Questions
Futures for IU WebHost Log Processing
How the web works
The Reports
[Editor's note: the next two sections contain IU WebHost specific information]
IU WebHost Log Processing Using Analog
IU WebHost Services supports a variety of web servers, most notably www.indiana.edu and www.iupui.edu. Browser requests for information from the web servers and their virtual hosts are recorded in web logs. These logs are processed in several steps to provide web usage reports that are based on Information Provider accounts. Logs from the various IU WebHost supported servers and virtual hosts are "split" each night by "ownership" into groups of log entries that will be processed as a unit by Analog. "Ownership" is based on a set of rules and equates loosely to Information Provider accounts. Analog will be run nightly for each Information Provider account, producing a set of reports as described below in The Reports.
The set of reports for one Information Provider account may actually contain statistics from more than one web server or virtual host. The "directory" and "request" reports will reflect this possibility by displaying identifying server information.
You will see the statistics accumulate over the course of the month as the set of reports is replaced nightly. End of month processing will produce the computer-readable output (meant to be used with a desktop product such as a spreadsheet package) and a trend report (the normal set of reports, but accumulated over several months). The end of month processing will usually take place within a few days of the first of the month.
In order to accommodate a variety of needs for the consumers of the web access statistics, more complete information is made available in the ".computer" format of the reports (Computer-readable Output). The ".html" format reports have some limits (or floors) for the amount of data they display. The Organization Report displays statistics for those organizations which made at least 0.1% of the requests. There is a similar limit for the Domain Report: at least 0.2% of the requests. All of the statistics for these two reports are available through the ".computer" format. The Directory Report is limited to at least 0.1% of the requests and the Failure Report is limited to the top 200 requests. These two limits apply to both the ".html" and the ".computer" formats. Limits have also been set for the browser and referrer reports until more is known regarding the quantity of data they will generate.
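To make the idea of a percentage floor concrete, here is a minimal Python sketch of how a "0.1% of the requests" limit removes small entries from a report. It is only an illustration: the function name and the sample counts are invented for this example, and it is not how Analog itself applies floors.

```python
def apply_percent_floor(requests_by_item, floor_percent):
    """Keep only the items whose share of all requests is at least floor_percent."""
    total = sum(requests_by_item.values())
    if total == 0:
        return {}
    return {
        item: count
        for item, count in requests_by_item.items()
        if 100.0 * count / total >= floor_percent
    }


# Hypothetical request counts per organization, for illustration only.
sample_counts = {"indiana.edu": 9000, "example.com": 995, "example.org": 5}

# With a 0.1% floor, example.org (0.05% of 10,000 requests) is dropped.
visible = apply_percent_floor(sample_counts, 0.1)
```

Entries that fall below the floor still count toward the totals; they are just not listed individually.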
Several of the reports contain statistics "by page". It is important to note the file extensions that are considered to be pages for local log processing. Those extensions currently include ".html", ".htm", ".shtml", ".shtm", and "no extension".
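If you want to apply the same "page" rule to your own data, the test can be sketched in Python roughly as follows. The function name is hypothetical, and comparing extensions without regard to case is an assumption of this sketch, not a statement about the local Analog configuration.

```python
import os

# Extensions treated as pages; "" stands for "no extension".
PAGE_EXTENSIONS = {".html", ".htm", ".shtml", ".shtm", ""}


def is_page(path):
    """Return True if the requested path would count as a page under the rule above."""
    filename = path.rsplit("/", 1)[-1]          # part after the last slash
    extension = os.path.splitext(filename)[1]   # "" when there is no extension
    return extension.lower() in PAGE_EXTENSIONS  # lower-casing is an assumption
```

For example, `is_page("/dept/index.html")` and `is_page("/dept/")` are both true, while `is_page("/dept/logo.gif")` is false.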
For some readers it may not be immediately obvious how to determine the total number of requests for your site, successful and otherwise. You can determine this figure by adding "Successful requests", "Failed requests" and "Redirected requests" from the General Summary Report. You may also add the figures for all of the entries in the Status Code Report.
Your Analog reports will be accessible at http://www-reports.iu.edu/[account name] where you'll find a list of available reports. Any of these reports may be saved through your browser's options to your desktop for future reference.
Many factors were considered while investigating some forty-five products for web access statistics processing. Among these factors were:
- "year 2000" compliance
- accuracy
- feature richness
- ability to "split" statistics by a number of factors into appropriate accounts
- ability to process logs from many platforms
- technical architecture requirements
- speed and efficiency of processing
- processing limitations (such as maximum number of logs, log size, log format, chronological order of logs, etc.)
- cost and support
- report distribution alternatives
- feedback from other users of the product
Frequently Asked Questions
- Are there limits for the reports?
- When the new reports were initially released in January 2000 there were limits set on some reports in order to minimize file size and transmission time. In order to accommodate the broadest range of needs most limits have been removed. Some limits are still in effect in the "html" version of reports but have been removed in the "computer" version of reports.
- What counts as a page?
- File extensions currently considered as pages include ".html", ".htm", ".shtml", ".shtm", and "no extension".
- How do I find the total requests for my site (not just the successful requests)?
- For some readers it may not be immediately obvious how to determine the total number of requests for your site, successful and otherwise. You can determine this figure by adding "Successful requests", "Failed requests" and "Redirected requests" from the General Summary Report. You may also add the figures for all of the entries in the Status Code Report.
- What are unresolved numerical addresses?
- Currently domain name resolution takes place when a request is processed by the Apache web server (or not at all), not by Analog. It is not always possible for the IP addresses to be resolved to host names. When this is the case, the entries will be listed in the Domain and Organization reports as "unresolved numerical addresses".
- Can I customize my Analog reports to see data for just graphics files?
- It is possible that in the future some customization of reports by IPs will be possible. In the meantime, the reports are configured to meet the needs of most IPs.
- Is referrer information available?
- Both browser (agent) and referrer data are being made available. See The Reports for more information.
- Why can't I just have access to the raw logs for my account?
- At this point in time many computers at IU are identified by a person's network userid. As a result the raw logs can be used to track personally identified information. The query string portion of the request, as well as the authuser field, are also log file sources of personally identifiable data. University policy is specific about the distribution of personally identifiable information. The web server logs must be processed to eliminate this information prior to release to account owners.
Futures for IU WebHost Log Processing
While use of Analog and associated changes in IU WebHost log processing were prompted by Year 2000 issues, WebHost hopes to provide you with an improving service. In the future we hope to provide additional services such as graphics that will illustrate data contained in the reports. And there may be other possibilities such as account owner customization of report configuration commands and access to "sanitized" raw log files.
[Editor's note: below are excerpts from Dr. Turner's document found at http://www.analog.cx/]
How the Web works
This section is about what happens when somebody connects to your web site, and what statistics you can and can't calculate. There is a lot of confusion about this. It's not helped by statistics programs which claim to calculate things which cannot really be calculated, only estimated. The simple fact is that certain data which we would like to know and which we expect to know are simply not available. And the estimates used by other programs are not just a bit off, but can be very, very wrong. For example (you'll see why below), if your home page has 10 graphics on, and an AOL user visits it, most programs will count that as 11 different visitors!
This section is fairly long, but it's worth reading carefully. If you understand the basics of how the web works, you will understand what your web statistics are really telling you.
I should say that this section has benefited from several earlier expositions of these ideas. In particular, I can recommend four excellent articles: Interpreting WWW Statistics by Doug Linder; Making Sense of Web Usage Statistics by Dana Noonan; Getting Real about Usage Statistics by Tim Stehle; and, the most negative of all, Why Web Usage Statistics are (Worse Than) Meaningless by Jeff Goldberg.
Basic Model
Let's suppose I visit your web site. I follow a link from somewhere else to your front page, read some pages, and then follow one of your links out of your site. So, what do you know about it? First, I make one request for your front page. You know the date and time of the request and which page I asked for (of course), and the internet address of my computer (my host). I also usually tell you which page referred me to your site, and the make and model of my browser. I do not tell you my username or my e-mail address.
Next, I look at the page (or rather my browser does) to see if it's got any graphics on it. If so, and if I've got image loading turned on in my browser, I make a separate connection to retrieve each of these graphics. I never log into your site: I just make a sequence of requests, one for each new file I want to download. The referring page for each of these graphics is your front page. Maybe there are 10 graphics on your front page. Then so far I've made 11 requests to your server.
After that, I go and visit some of your other pages, making a new request for each page and graphic that I want. Finally, I follow a link out of your site. You never know about that at all. I just connect to the next site without telling you.
Caches
It's not always quite as simple as that. One major problem is caching. There are two major types of caching. First, my browser automatically caches files when I download them. This means that if I visit them again, the next day say, I don't need to download the whole page again. Depending on the settings on my browser, I might check with you that the page hasn't changed: in that case, you do know about it, and analog will count it as a new request for the page. But I might set my browser not to check with you: then I will read the page again without you ever knowing about it.
The other sort of cache is on a larger scale. I'm in the UK. Because the link across the Atlantic is sometimes very congested, we've set up a national cache. (Many individual ISPs also do the same thing.) I can set my browser to get your pages from the national cache instead of directly from you. If anyone else in the country has used the cache to look at your pages recently, the cache will have saved them, and will give them out to me without ever telling you about it. So hundreds of people could read your pages, even though you'd only sent them out once. Also, if the page I wanted wasn't already stored in the cache, the cache would ask for it from you on my behalf. This would mean that the request appeared to come from the cache, rather than from me. If several people did this, you would think that only one host was accessing the cache, rather than lots of different ones.
What you can know
The only things you can know for certain are the number of requests made to your server, when they were made, which files were asked for, and which host asked you for them. You can also know what people told you their browsers were, and what the referring pages were. You should be aware, though, that many browsers lie deliberately about what sort of browser they are, or even let users configure the browser name. Also, a few browsers send incorrect referrers, telling you the last page that the user was on even if they weren't referred by that page.
What you can't know
- i. You can't tell the identity of your readers. Unless you explicitly require users to provide a password, you don't know who connected or what their e-mail addresses are.
- ii. You can't tell how many visitors you've had. You can guess by looking at the number of distinct hosts that have requested things from you. But this is not always a good estimate for three reasons. First, if users get your pages from a local cache server, you will never know about it. Secondly, sometimes many users appear to connect from the same host: either users from the same company or ISP, or users using the same cache server. Finally, sometimes one user appears to connect from many different hosts. AOL now allocates users a different hostname for every request. So if your home page has 10 graphics on, and an AOL user visits it, most programs will count that as 11 different visitors!
- iii. You can't tell how many visits you've had. Many programs, under pressure from advertisers' organisations, define a "visit" (or "session") as a sequence of requests from the same host until there is a half-hour gap. This is an unsound method for several reasons. First, it assumes that each host corresponds to a separate person and vice versa. This is simply not true in the real world, as discussed in the last paragraph. Secondly, it assumes that there is never a half-hour gap in a genuine visit. This is also untrue. I quite often follow a link out of a site, then step back in my browser and continue with the first site from where I left off. Should it really matter whether I do this 29 or 31 minutes later? Finally, to make the computation tractable, such programs also need to assume that your logfile is in chronological order: it isn't always, and analog will produce the same results however you jumble the lines up.
- iv. Cookies don't solve these problems. Some sites try to count their visitors by using cookies. But this can only work if you refuse to let people read your pages who can't or won't take a cookie. And you still have to assume that your visitors will use the same cookie for their next request.
- v. You can't follow a person's path through your site. Even if you assume that each person corresponds one-to-one to a host, you don't know their path through your site. It's very common for people to go back to pages they've downloaded before. You never know about these subsequent visits to that page, because their browser has cached them. So you can't track their path through your site accurately.
- vi. You often can't tell where they entered your site, or where they found out about you from. If they are using a cache server, they will often be able to retrieve your home page from their cache, but not all of the subsequent pages they want to read. Then the first page you know about them requesting will be one in the middle of their true visit.
- vii. You can't tell how they left your site, or where they went next. They never tell you about their connection to another site, so there's no way for you to know about it.
- viii. You can't tell how long people spent reading each page. Once again, you can't tell which pages they are reading between successive requests for pages. They might be reading some pages they downloaded earlier. They might have followed a link out of your site, and they might or might not return later. They might have interrupted their reading for a quick game of Minesweeper. You just don't know.
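To see why the half-hour rule in point iii is arbitrary, here is a minimal Python sketch of that rule. It is not something analog does, and the function name and sample data are invented for illustration. The sketch sorts the requests by time first, which is exactly the chronological-order assumption mentioned above.

```python
from datetime import datetime, timedelta


def count_visits(requests, gap=timedelta(minutes=30)):
    """Count "visits" under the half-hour rule.

    `requests` is a list of (host, time) pairs. A new visit is counted
    whenever a host's request arrives `gap` or more after its previous one.
    """
    last_seen = {}
    visits = 0
    # Sorting enforces chronological order, which the rule depends on.
    for host, when in sorted(requests, key=lambda item: item[1]):
        previous = last_seen.get(host)
        if previous is None or when - previous >= gap:
            visits += 1
        last_seen[host] = when
    return visits


# Hypothetical host and times, for illustration only.
start = datetime(2000, 1, 15, 10, 0)
back_after_29 = [("host.example.com", start),
                 ("host.example.com", start + timedelta(minutes=29))]
back_after_31 = [("host.example.com", start),
                 ("host.example.com", start + timedelta(minutes=31))]
```

The same reader returning after 29 minutes counts as one visit and after 31 minutes as two, although nothing about their behaviour has really changed.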
I've presented a somewhat negative view here, emphasising what you can't find out. Web statistics are still informative: it's just important not to slip from "this page has received 30,000 requests" to "30,000 people have read this page." In some sense these problems are not really new to the web -- they are present just as much in print media too. For example, you only know how many magazines you've sold, not how many people have read them. In print media we have learnt to live with these issues, using the data which are available, and it would be better if we did on the web too, rather than making up spurious numbers.
Analog's definitions
This section describes how analog defines its terms, and exactly what is counted in each category. It gets a bit technical at times -- if you're just trying to understand the reports, I recommend you read the section on Analog's reports first.
We start with some basic definitions. The host is the computer which has asked you for a file. The file might be a page (i.e., an HTML document) or it might be something else, such as an image. The total requests counts all the files which have been requested, including pages, graphics, etc. (Some people call this the number of hits, but that word is also used in other ways by other people, so I avoid it). The requests for pages obviously only counts pages. The referrer for a request is the place that the user (or his computer) heard about your file from. If he followed a link to reach a page, it will be the previous page. In the case of a graphic on a page, the referrer will be the page containing the graphic.
Analog recognises four categories of request, based on the HTTP status code of the request. You can see the total number of requests for each status code, and what the codes mean, in the Status Code Report. (Or see the HTTP spec for a detailed description.)
First, successful requests are those with HTTP status codes in the 200's (where the document was returned) or with code 304 (where the document was requested but was not needed because it had not been recently modified and the user could use a cached copy). Sometimes the logfile line doesn't contain a status code. These lines are also assumed by analog to be successes.
Redirected requests are those with other codes in the 300's, indicating that the user was directed to a different file instead. The most common cause of these requests is that the user has incorrectly requested a directory name without the trailing slash. The server replies with a redirection ("you probably mean the following") and the user then makes a second connection to get the correct document (although usually the browser does it automatically without the user's intervention or knowledge). The other common cause of redirected requests is their use as "click-thru" advertising banners.
Failed requests are those with codes in the 400's (error in request) or 500's (server error). They come about for a variety of reasons, but the most common are when the requested file is not found or is read-protected.
Finally, requests returning informational status code are those with status codes in the 100's. These are very rare at the moment. There are a few other types of logfile lines listed in the General Summary. Lines without status code refers to those logfile lines without a status code, and the successful requests in the General Summary only counts the ones with a status code: except if the line contains the name of the file requested, and the filename is being counted (not starred in the LOGFORMAT), then it's listed in the successes. Corrupt logfile lines are those which analog didn't manage to parse. And unwanted logfile entries are ones which we have specifically excluded. Successful requests for pages refers to those lines on which the file requested was given and was defined as a page (this includes *.html, *.htm, and */ by default).
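The four categories of request described above can be summarised in a short Python sketch. This is an illustration of the definitions only, not analog's own code, and the function name is hypothetical.

```python
def classify_status(code):
    """Map an HTTP status code to one of the request categories described above.

    `code` is an integer, or None for a logfile line without a status code.
    """
    if code is None:
        # Lines without a status code are assumed to be successes
        # (the General Summary lists them separately, as noted above).
        return "successful"
    if 200 <= code <= 299 or code == 304:
        return "successful"      # document returned, or cached copy still valid
    if 300 <= code <= 399:
        return "redirected"      # other codes in the 300's
    if 400 <= code <= 599:
        return "failed"          # error in request (400's) or server error (500's)
    if 100 <= code <= 199:
        return "informational"   # very rare
    return "unknown"             # outside the ranges discussed in this section
```

Note that 304 has to be tested before the general 300's range, because it counts as a success rather than a redirection.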
The Reports
The General Summary contains some overall statistics about the data being analysed: the most important being the number of successful requests (the total number of files downloaded, including graphics); the number of requests for pages (just counting the various pages on your site); the number of distinct hosts (the number of different computers requests have come from); and the amount of data transferred in bytes (or MBytes).
The Daily and Hourly Summaries tell you the total number of successful requests in each day of the week, or each hour of the day, over the time period given at the very top of the report. (It's not the average, nor is it the figure for just the last week or last day).
The Monthly and Weekly reports tell you how many requests there were in each time period. They also tell you which was the busiest time period.
The Domain Report lists the countries of the computers that made the requests (assuming that the domain name was successfully resolved).
The Organisation Report attempts to list the organisations (companies, institutions, ISPs etc.) which the computer was registered under (assuming that the domain name was successfully resolved).
The File Type Report lists the extensions (representing file types) of the requested files.
The File Size Report breaks down requested files by size.
The Status Code Report lists the number of each HTTP status code that you had.
The Request Report lists which files were downloaded.
The Directory Report lists which directories the requested files came from.
The Failure Report lists the filenames which caused errors.
The Failed Referrer Report is essentially a broken link report.
The Referrer Report lists which pages linked to your files.
The Referring Site Report lists the servers those referrers were on.
The Redirection Report lists the filenames which resulted in redirections: mainly directories without the final slash, and "click-thru"'s.
The Redirected Referrer Report lists the referrers which led to redirections.
The Browser Report lists the detailed versions of browsers used, and the Browser Summary collects them by vendor.
Computer-readable Output
The computer-readable output is designed to be easy to read into spreadsheets. Each line in the preformatted output begins with a letter indicating which report the line is part of. The code letters for the reports are:
x  GENERAL       General Summary
m  MONTHLY       Monthly Report
W  WEEKLY        Weekly Report
d  DAILY         Daily Summary
h  HOURLY        Hourly Summary
o  DOMAIN        Domain Report
Z  ORGANISATION  Organisation Report
t  FILETYPE      File Type Report
z  SIZE          File Size Report
c  STATUS        Status Code Report
i  DIRECTORY     Directory Report
r  REQUEST       Request Report
I  FAILURE       Failure Report
K  FAILREF       Failed Referrer Report
f  REFERRER      Referrer Report
s  REFSITE       Referring Site Report
E  REDIR         Redirection Report
k  REDIRREF      Redirected Referrer Report
B  FULLBROWSER   Browser Report
b  BROWSER       Browser Summary
After that, there follows a field indicating the remaining columns in the report. The field consists of the letters RrPpBbD, which represent number of requests, percentage of requests, number of pages, percentage of pages, number of bytes, percentage of bytes, and date. Then there are the numerical data and then the name of the item. Times actually take up several fields: year, month, date, hour & minute, or as many of those as are necessary to identify the time.
The first line of most reports has f instead of the normal column letters, followed by the floor for the report, in the form it would be written for a FLOOR command, followed by the SORTBY using the code letters.
SORTBY Codes:
r  REQUESTS
p  PAGES
b  BYTES
d  DATE
a  ALPHABETICAL
x  RANDOM
Examples Using FLOOR Codes for the Domain Report:
DOMFLOOR 1000r # all domains with at least 1000 requests
DOMFLOOR 1000p # at least 1000 requests for pages
DOMFLOOR 1000000b # at least 1,000,000 bytes transferred
DOMFLOOR 1Mb # at least 1 megabyte
DOMFLOOR 0.5%r # 0.5% of the requests (ditto %p and %b)
DOMFLOOR 0.5:r # 0.5% of the maximum number of requests
# for any domain (ditto :p and :b)
DOMFLOOR 970701d # last access since 1st July 1997
DOMFLOOR -00-01-00d # last access in last month (see
# documentation on FROM and TO commands)
DOMFLOOR -100r # domains with top 100 number of requests
# (ditto -100p, -100b, -100d)
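As a rough illustration of the line layout described above (report letter, column letters, numerical data, item name), here is a Python sketch that turns one report line into a dictionary. It assumes the fields are separated by tabs and appear in exactly that order. It deliberately skips General Summary lines, the "f" header lines, and any line with a date column, since times span several fields. The function name and the sample line are invented for this example and are not taken from a real report.

```python
# Meaning of the column letters described above (D, the date, is not handled here).
COLUMN_NAMES = {
    "R": "requests",
    "r": "percent_requests",
    "P": "pages",
    "p": "percent_pages",
    "B": "bytes",
    "b": "percent_bytes",
}


def parse_report_line(line, sep="\t"):
    """Parse one report line into a dict, or return None for lines this sketch skips."""
    fields = line.rstrip("\n").split(sep)
    if len(fields) < 3:
        return None
    report, columns = fields[0], fields[1]
    if report == "x" or columns == "f" or "D" in columns:
        return None  # general summary, floor/sort header, or date columns
    if any(letter not in COLUMN_NAMES for letter in columns):
        return None  # unrecognised column letters
    values = fields[2:2 + len(columns)]
    if len(values) < len(columns):
        return None  # fewer numerical fields than column letters
    row = {"report": report, "name": sep.join(fields[2 + len(columns):])}
    for letter, value in zip(columns, values):
        row[COLUMN_NAMES[letter]] = float(value)
    return row


# Hypothetical Domain Report line, for illustration only.
sample_line = "o\tRrBb\t1200\t12.5\t3400000\t9.75\t.com (Commercial)"
parsed = parse_report_line(sample_line)
```

Check a few lines of your own ".computer" file against these assumptions before relying on a parser like this.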
The general summary is a bit different. After an initial x, there is a two-character code saying what the line contains. The possible codes are
VE  Version of analog
HN  HOSTNAME
PS  Program start time
FR  Time of first request
LR  Time of last request
SR  Total successful requests
PR  Total successful requests for pages
FL  Total failed requests
RR  Total redirected requests
NF  Number of distinct files requested
NH  Number of distinct hosts served
CL  Number of corrupt lines in the logfile
BT  Total number of bytes transferred
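As a sketch of how the General Summary lines might be used, the Python below collects the two-character codes into a dictionary and adds SR, FL and RR to get the total number of requests, successful and otherwise. It assumes tab-separated fields and that those three values are plain integers. The function names and sample lines are hypothetical.

```python
def read_general_summary(lines, sep="\t"):
    """Collect General Summary lines (those starting with x) into {code: value}."""
    summary = {}
    for line in lines:
        fields = line.rstrip("\n").split(sep)
        if len(fields) >= 3 and fields[0] == "x":
            summary[fields[1]] = sep.join(fields[2:])
    return summary


def total_requests(summary):
    """Successful + failed + redirected requests."""
    return int(summary["SR"]) + int(summary["FL"]) + int(summary["RR"])


# Hypothetical lines, for illustration only.
sample_lines = [
    "x\tSR\t950",
    "x\tFL\t30",
    "x\tRR\t20",
    "o\tRr\t10\t1.0\t.edu",
]
sample_summary = read_general_summary(sample_lines)
```

With these sample lines, `total_requests(sample_summary)` gives 1000. Lines from other reports are ignored.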
Using the Computer-readable Output (with Excel)
There are many ways to make use of the computer-readable output generated by Analog, using spreadsheet, database, or text-editor software. What we are attempting to do here is just suggest one possible method to begin to make use of the Analog data, using Microsoft Excel spreadsheet software. You may want to find other references for specifics about using Excel or whatever software package you eventually decide to use, as WebHost Services does not have the staff to provide such support. For specific questions about Excel or other software packages, a valuable resource is the IU Knowledge Base. You can also call the Support Center at 855-6789.
First of all, you will want to view the computer-readable output with your web browser. From your browser window, save the computer-readable output as a text (*.txt) file to your workstation, where it will be accessible by your Excel software.
Importing the data: After starting Excel, you can import the Analog data through the "File" drop-down menu by selecting "Open". Locate the file on your workstation. You may need to change the file type that Excel looks for to All Files (*.*) or Text (*.txt) to do this. Select the file to be opened, and the Text Import Wizard window will automatically open. Choose the "Delimited" option and start with row 1 of the data file. Click "Next" to move on to select "tab" as the delimiter and the double quote (") as the text qualifier. Click "Next" again to format the column data, and choose "General" for all columns. Now you can click "Finish" and the data will be imported to Excel.
Sorting the data: Since there are actually multiple types of reports represented in one Analog data file, you might want to use Excel filters to split the reports into separate worksheets or workbooks. First, select the leftmost column. The data in this column identifies the type of report. As an example, to separate out the rows associated with only the Request Report, select the first column and go to the "Data" drop-down menu. Choose "Filter" and then "Autofilter". Then, in the first cell of the first column, select "r" for the Request Report. Excel will then display only the rows you have selected with the filter.
Next, you'll want to copy the appropriate report data either to an existing worksheet or to a new worksheet that you create at this step. To create a new worksheet, go to the "Insert" menu and select the "Worksheet" option. You can change the name of this new worksheet to something more meaningful by double-clicking on the worksheet tab. To copy the data, first select the data you want from the spreadsheet (you may still have a row displaying information relating to the filter selection that you will need to remove). Then choose the "Edit" menu and select "Copy". Switch to your empty worksheet or workbook, and paste the selection in the first cell by clicking on the cell, going to "Edit", and choosing "Paste". This procedure successfully isolates one type of report from the full Analog data file.
From this point you might want to delete columns that contain data in which you have no interest. And you may want to do some more filtering, perhaps selecting rows with specific attributes, such as filenames with particular character-strings. You can use the filter command over again and save the data to other worksheets. Continue this sort of process until you have the data separated into representations that fulfil your particular needs. Excel offers many options for working with and manipulating your Analog data.
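If you would rather not do the splitting by hand, the same idea can be sketched in a few lines of Python using the standard csv module. This is only one possible approach, and WebHost Services does not support it. It assumes, as in the Excel steps above, that the file is tab-delimited with the double quote as text qualifier. The function names and the file names in the usage note are hypothetical.

```python
import csv
from collections import defaultdict


def split_by_report(lines, sep="\t"):
    """Group the rows of a computer-readable file by their leading report letter."""
    groups = defaultdict(list)
    for row in csv.reader(lines, delimiter=sep, quotechar='"'):
        if row:  # skip blank lines
            groups[row[0]].append(row[1:])
    return dict(groups)


def write_report_csv(rows, path):
    """Write one report's rows to a comma-separated file that a spreadsheet can open."""
    with open(path, "w", newline="") as handle:
        csv.writer(handle).writerows(rows)
```

For example, after saving the computer-readable output as "analog.txt", you could open it with `open("analog.txt", newline="")`, pass the file object to `split_by_report`, and then call `write_report_csv(groups["r"], "requests.csv")` to isolate the Request Report, much as the filter-and-copy steps above do.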



