jim.shamlin.com

Server Log Analysis Basics

This document contains basic information about reading and interpreting server logs for Apache. The use of analysis techniques to provide "meaningful" information is well beyond the scope of the present document - volumes have been written on that topic, much of it highly speculative. These are just the basic facts.


Combined Log Format

Virtually all of the servers I've worked with have used the combined log format, which provides fairly extensive information about every request placed to the server (if there's anything else one might want to know, and that can be known, I can't think of it).

Other log formats exist, such as the "common" format (which omits some information), "extended" format (which contains more information about the request headers and network connection than seems necessary), and a wide array of proprietary log formats.


Log Content

Each entry in the log contains nine chunks of information, thus:

999.999.999.999 
- 
username 
[12/Mar/1994:17:29:07 -0400] 
"GET /path/file.html HTTP/1.1"
200
4315 
"http://www.server.com/path/file.html" 
"Mozilla/2.14 (Windows; U; Windows NT 6.0; en-GB; rv:1.9.2)"

In the actual log file, this is all contained in a single line - I broke it down for clarity and to prevent it from breaking the layout of this page (spaces would exist in the places where I added line breaks), and some of the information has been replaced with fake values.

Each log line tells a story of a file request. For the example above, the story would be:

At 5:29:07 p.m. on March 12, 1994, a person logged in as "username" (at an unidentified terminal) placed a request for the resource located at /path/file.html from this server using a GET request and HTTP version 1.1 by clicking a link from http://www.server.com/path/file.html. The request could be traced back to IP address 999.999.999.999 and the user seemed to be using Netscape version 2.14, which was configured to display content in British English, on a computer running Windows NT version 6.0. The server successfully processed the request and sent back 4,315 bytes of data.

That's quite a lot of information, all pertaining to a single file request. A similar story is told for every file that every user requested from the server.
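As a rough illustration of how the nine chunks can be pulled apart by a program, here is a minimal Python sketch. The pattern and the field names (host, ident, user and so on) are my own assumptions for illustration, not anything defined by Apache, and the pattern does not handle quotation marks escaped inside a quoted field.

```python
import re

# Pattern for one combined-format line. The group names are illustrative
# choices, and escaped quotes inside quoted fields are not handled.
COMBINED = re.compile(
    r'(?P<host>\S+) (?P<ident>\S+) (?P<user>\S+) '
    r'\[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-) '
    r'"(?P<referrer>[^"]*)" "(?P<agent>[^"]*)"'
)

def parse_line(line):
    """Return a dict of the nine fields, or None if the line does not match."""
    match = COMBINED.match(line)
    return match.groupdict() if match else None

# The example entry from above, joined back into a single line.
sample = (
    '999.999.999.999 - username [12/Mar/1994:17:29:07 -0400] '
    '"GET /path/file.html HTTP/1.1" 200 4315 '
    '"http://www.server.com/path/file.html" '
    '"Mozilla/2.14 (Windows; U; Windows NT 6.0; en-GB; rv:1.9.2)"'
)

fields = parse_line(sample)
print(fields["host"], fields["status"], fields["size"])
```

Every value comes back as text, so the status and size need converting before doing any arithmetic with them.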

Log Size

While I'm on the topic of the amount of data per hit, a quick aside about the size of server log files ...

The example above is about 200 characters, which consumes 200 bytes of data, for a single "hit." In practice, I've found it to be more in the neighborhood of 250 to 300 bytes per hit. So an easy way to guesstimate the size of log files is to figure about 275 KB (kilobytes) per 1,000 hits to the server.
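The guesstimate is simple multiplication, sketched here in Python. The 275-byte figure is just the midpoint of the range quoted above, and the function name is a made-up one for illustration.

```python
# Midpoint of the 250-300 bytes-per-hit range; an estimate, not a measurement.
BYTES_PER_HIT = 275

def estimated_log_bytes(hits, bytes_per_hit=BYTES_PER_HIT):
    """Rough size in bytes of a log file covering the given number of hits."""
    return hits * bytes_per_hit

print(estimated_log_bytes(1_000))      # 275000 bytes, about 275 KB
print(estimated_log_bytes(1_000_000))  # 275000000 bytes, about 275 MB
```

If your own log lines run longer or shorter, pass a different bytes_per_hit value.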

If you host on a shared server, where you have a quota of storage space, the log files will, over time, consume that quota. If you host on a dedicated server, they will consume drive space, and the server will perform horribly (or crash) if the log files consume all available drive space (which they eventually will).

You can save about 100 bytes of data per line if you decide not to log referrers and user-agents, but I don't advise that. These two bits of information are critical to analyzing your Web site traffic patterns (knowing the sources from which your visitors come and tracking their click-path through the site) and assessing visitors' capabilities (deciding which Web browsers to support and test with).

I may eventually consolidate the notes from my work with high-volume servers (techniques for dealing with the logs that cover 10 million hits per day) ... but even if your site is not high-volume, you'll eventually need to find a way to deal with the log files.


Log Line Components

Looking at each chunk of data in the log line:

999.999.999.999 

The first chunk is the IP address of the user or remote host that requested the file. If your server is set to perform host-name lookups, it would indicate the "translated" name, which is often something like cpe-74-78-155-22.ny.res.rr.com (which is the actual address of some person, probably a cable modem user in New York state, by the look of it) or crawl-66-249-67-117.googlebot.com (which is one of Google's servers that checks Web sites periodically). I generally turn lookups off to improve server performance, but if a site is low-volume, there's not much harm in leaving them on for convenience (you don't have to look up host names when doing analysis to eliminate "fake" traffic from bots and crawlers).
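When lookups are on, a first pass at weeding out crawler traffic can be as simple as checking how the host name ends. The sketch below is only that, a sketch: the suffix list is a hypothetical one-entry list built from the example above, and a name in the log is not proof of identity, since reverse DNS records are controlled by whoever owns the address.

```python
# Hypothetical, deliberately short list; extend it with the crawlers you see.
CRAWLER_SUFFIXES = (".googlebot.com",)

def is_crawler_host(hostname):
    """True if the logged host name ends with a known crawler suffix."""
    return hostname.lower().endswith(CRAWLER_SUFFIXES)

print(is_crawler_host("crawl-66-249-67-117.googlebot.com"))  # True
print(is_crawler_host("cpe-74-78-155-22.ny.res.rr.com"))     # False
```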

-

The second chunk is usually missing. I've seldom seen anything but a hyphen. I got curious and looked it up in the server documentation, and it's the RFC 1413 identity of the client (as reported by the identd service on the client's machine), which is seldom used and very easily forged, so it's not much use on an external Web server (but may be used on company intranets).

username

The third chunk is the username. This is also passed as a hyphen unless the user has logged in via HTTP authentication and is requesting a file in a protected directory. It can be useful for detecting a breached or shared user account.

[12/Mar/1994:17:29:07 -0400]

The fourth chunk, enclosed in square brackets, indicates the date and time the request was received along with the offset from Zulu time, or UTC (plus or minus a number of hours and minutes), according to the server's internal clock.
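For anyone processing the date chunk in a program, here is a minimal Python sketch that reads it with the standard library. The function name is my own, and the month abbreviation (%b) is matched according to the current locale, so this assumes an English or default "C" locale.

```python
from datetime import datetime, timezone

# day/month-abbreviation/year:hour:minute:second, then the UTC offset
LOG_TIME_FORMAT = "%d/%b/%Y:%H:%M:%S %z"

def parse_log_time(field):
    """Parse the date chunk, with or without its square brackets."""
    return datetime.strptime(field.strip("[]"), LOG_TIME_FORMAT)

stamp = parse_log_time("[12/Mar/1994:17:29:07 -0400]")
print(stamp.isoformat())                           # 1994-03-12T17:29:07-04:00
print(stamp.astimezone(timezone.utc).isoformat())  # 1994-03-12T21:29:07+00:00
```

Converting everything to UTC is handy when comparing logs from servers in different time zones.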

"GET /path/file.html HTTP/1.1"

The fifth chunk, enclosed in quotation marks, includes the HTTP request method, the file requested (path from the document root, or within the executables directory), and the HTTP version by which the request was received.
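The three parts of the request chunk are separated by spaces, so they can be split apart once the quotation marks are removed. A small sketch, with a function name and dictionary keys that are my own choices; anything that does not have exactly three parts (a malformed or empty request, for instance) is simply rejected.

```python
def split_request(request):
    """Split 'METHOD /path HTTP/x.y' into its parts, or return None."""
    parts = request.split()
    if len(parts) != 3:
        return None
    method, path, version = parts
    return {"method": method, "path": path, "version": version}

print(split_request("GET /path/file.html HTTP/1.1"))
```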

200

The sixth chunk is the HTTP response code generated by the server in response to the request.
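The first digit of the response code gives its general class, which is often all that's needed for a quick look at the logs. A minimal sketch, using the standard class names from the HTTP specification and a made-up function name:

```python
# Standard HTTP status classes, keyed by the first digit of the code.
STATUS_CLASSES = {
    1: "informational",
    2: "success",
    3: "redirection",
    4: "client error",
    5: "server error",
}

def status_class(code):
    """Return the general class for a response code given as text or number."""
    return STATUS_CLASSES.get(int(code) // 100, "unknown")

print(status_class("200"))  # success
print(status_class(404))    # client error
```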

4315

The seventh chunk is the amount of data (in bytes) sent from the server in response to the request, not counting the response headers. It contains a hyphen if no data was sent.

"http://www.server.com/path/file.html"

The eighth chunk is the referring address - a file that contained a link that was followed to get to the requested page (or a page that contained a reference to an embedded media file, such as an image or video). It contains a hyphen if it is not received.

"Mozilla/2.14 (Windows; U; Windows NT 6.0; en-GB; rv:1.9.2)"

The last chunk is the user-agent string passed by the client software (typically, the user's Web browser).


Other Notes

Some additional information about the data described above:

Otherwise, the information in the log file is generally reliable. I wouldn't go so far as to say it's impossible to spoof other data, just that there doesn't seem to be much point in doing so, and I'm not aware that forgery has been a widespread problem.


Interpretation

It's beyond the scope of the present document to get into the methods for statistical analysis that derive "meaning" from this data - but here is a sampling of the kinds of questions log data can answer:

Each of these questions, and many others, can be answered by aggregating and running comparisons on the data contained in the server log file.
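To give a sense of what "aggregating" means in practice, here is a minimal Python sketch that counts requests per file and per response code. The pattern, the function name and the three sample lines are all made up for illustration (the sample lines reuse the fake values from the example entry), and a real analysis would read lines from the log file instead.

```python
import re
from collections import Counter

# Only the request and status chunks are captured; the rest is skipped over.
LINE = re.compile(
    r'\S+ \S+ \S+ \[[^\]]+\] "(?P<request>[^"]*)" (?P<status>\d{3}) \S+ '
    r'"[^"]*" "[^"]*"'
)

def tally(lines):
    """Count hits per requested path and per response code."""
    paths, statuses = Counter(), Counter()
    for line in lines:
        match = LINE.match(line)
        if not match:
            continue  # skip lines that are not in combined format
        parts = match.group("request").split()
        if len(parts) == 3:
            paths[parts[1]] += 1
        statuses[match.group("status")] += 1
    return paths, statuses

# Made-up sample lines, for illustration only.
sample_lines = [
    '999.999.999.999 - - [12/Mar/1994:17:29:07 -0400] "GET /path/file.html HTTP/1.1" 200 4315 "-" "Mozilla/2.14"',
    '999.999.999.999 - - [12/Mar/1994:17:29:09 -0400] "GET /path/file.html HTTP/1.1" 200 4315 "-" "Mozilla/2.14"',
    '999.999.999.999 - - [12/Mar/1994:17:30:01 -0400] "GET /missing.html HTTP/1.1" 404 210 "-" "Mozilla/2.14"',
]

paths, statuses = tally(sample_lines)
print(paths.most_common(1))
print(statuses)
```

The same approach extends to the other chunks: swap in the referrer or user-agent as the thing being counted.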