Server Log Analysis Basics
This document contains basic information about reading and interpreting Apache server logs. The use of analysis techniques to provide "meaningful" information is well beyond the scope of the present document - volumes have been written on that topic, much of it highly speculative. These are just the basic facts.
Combined Log Format
Virtually all of the servers I've worked with have used the combined log format, which provides fairly extensive information about every request placed to the server (if there's anything else one might want to know, and that can be known, I can't think of it).
Other log formats exist, such as the "common" format (which omits some information), "extended" format (which contains more information about the request headers and network connection than seems necessary), and a wide array of proprietary log formats.
Each entry in the log contains nine chunks of information, thus:
999.999.999.999 - username
[12/Mar/1994:17:29:07 -0400]
"GET /path/file.html HTTP/1.1"
200 4315
"http://www.server.com/path/file.html"
"Mozilla/2.14 (Windows; U; Windows NT 6.0; en-GB; rv:1.9.2)"
In the actual log file, this is all contained in a single line - I broke it down for clarity and to prevent it from breaking the layout of this page (spaces would exist in the places where I added line breaks), and some of the information has been replaced with fake values.
Each log line tells a story of a file request. For the example above, the story would be:
At 5:29:07 p.m. on March 12, 1994, a person logged in as "username" (at an unidentified terminal) placed a request for the resource located at /path/file.html from this server using a GET request and HTTP version 1.1 by clicking a link from http://www.server.com/path/file.html. The request could be traced back to IP address 999.999.999.999 and the user seemed to be using Netscape version 2.14, which was configured to display content in British English, on a computer running Windows NT version 6.0. The server successfully processed the request and sent back 4,315 bytes of data.
That's quite a lot of information, all pertaining to a single file request. A similar story is told for every file that every user requested from the server.
While I'm on the topic of the amount of data per hit, a quick aside about the size of server log files ...
The example above is about 200 characters, which consumes 200 bytes of data, for a single "hit." In practice, I've found it to be more in the neighborhood of 250 to 300 bytes per hit. So an easy way to guesstimate the size of log files is to figure about 275 KB per 1,000 hits to the server.
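That guesstimate is simple enough to sketch in a few lines of Python. The 275-bytes-per-hit figure is just my observed average, not a constant - substitute your own:

```python
# Back-of-envelope estimate of combined-format log growth.
# 275 bytes/hit is an observed average, not a fixed value - adjust to taste.
BYTES_PER_HIT = 275

def log_size_bytes(hits):
    """Estimated log file size in bytes for a given number of hits."""
    return hits * BYTES_PER_HIT

print(log_size_bytes(1_000))      # 275000  (~275 KB per 1,000 hits)
print(log_size_bytes(1_000_000))  # 275000000  (~275 MB per million hits)
```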
If you host on a shared server, where you have a quota of storage space, the log files will, over time, consume that quota. If you host on a dedicated server, it will consume drive space, and the server will perform horribly (or crash) if the log files consume all available drive space (which they eventually will).
You can save about 100 bytes of data per line if you decide not to log referrers and user-agents, but I don't advise that. These two bits of information are critical to analyzing your Web site traffic patterns (knowing the sources from which your visitors come and tracking their click-path through the site) and assessing visitors' capabilities (deciding which Web browsers to support and test with).
I may eventually consolidate the notes from my work with high-volume servers (techniques for dealing with the logs that cover 10 million hits per day) ... but even if your site is not high-volume, you'll eventually need to find a way to deal with the log files.
Log Line Components
Looking at each chunk of data in the log line:
The first chunk is the IP address of the user or remote host that requested the file. If your server is set to perform host-name lookups, it would indicate the "translated" name, which is often something like cpe-74-78-155-22.ny.res.rr.com (which is the actual address of some person, probably a cable modem user in New York state, by the look of it) or crawl-66-249-67-117.googlebot.com (which is one of Google's servers that checks Web sites periodically). I generally turn lookups off to improve server performance, but if a site is low-volume, there's not much harm in leaving them on for convenience (you don't have to look up host names when doing analysis to eliminate "fake" traffic from bots and crawlers).
The second chunk is usually missing. I've seldom seen anything but a hyphen. I got curious and looked it up in the server documentation, and it's the "ident" field (RFC 1413), a scheme for identifying the client machine that's seldom used and very easily forged, so it's not much use on an external Web server (but may be used on company intranets).
The third chunk is the username. This is also passed as a hyphen unless the user has logged in via http authentication and is requesting a file in the protected directory. It can be useful for detecting a breached or shared user account.
The fourth chunk, enclosed in square brackets, indicates the date and time the request was received along with the offset from UTC ("zulu" time) - plus or minus a number of hours and minutes - according to the server's internal clock.
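If you ever need to work with that timestamp programmatically, Python's strptime can take the bracketed value apart directly (here I'm using the timestamp from the example above):

```python
from datetime import datetime

# Parse the combined-log timestamp. %d/%b/%Y:%H:%M:%S covers the date and
# time; %z consumes the UTC offset ("-0400" in the example).
stamp = "12/Mar/1994:17:29:07 -0400"
when = datetime.strptime(stamp, "%d/%b/%Y:%H:%M:%S %z")
print(when.isoformat())  # 1994-03-12T17:29:07-04:00
```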
"GET /path/file.html HTTP/1.1"
The fifth chunk, enclosed in quotation marks, includes the HTTP request method, the file requested (path from the document root, or within the executables directory), and the HTTP version by which the request was received.
The sixth chunk is the HTTP response code generated by the server in response to the request.
The seventh chunk is the amount of data (in bytes) sent from the server in response to the request.
The eighth chunk is the referring address - a file that contained a link that was followed to get to the requested page (or a page that contained a reference to an embedded media file, such as an image or video). It contains a hyphen if no referrer was sent.
"Mozilla/2.14 (Windows; U; Windows NT 6.0; en-GB; rv:1.9.2)"
The last chunk is the user-agent string passed by the client software (typically, the user's Web browser).
Some additional information about the data described above:
- A log entry is generated when a file is requested from a Web server. If a page is retrieved from the client's cache, or a cache on a proxy server (which may exist at the user's ISP, your hosting provider, or any router in between), then no log entry is made.
- The remote IP address is not specific to an individual human being. In instances where multiple computers are connected to the Internet through a single router (such as a home or office network), all users will pass the same IP address to the server. Likewise, if an ISP uses load-balancing proxies, a single user may pass a handful of different IP addresses during the course of a visit. And of course, IP addresses may fluctuate over time, such that different users may have a given IP address (and one user may have several) over the course of an hour, day, week, or month.
- The user-agent string is not reliable. Some Web browsers are programmed to "spoof" a user-agent or present ambiguous information (a leftover from the browser wars), and a "robot" may spoof a user-agent to masquerade as a real person.
Otherwise, the information in the log file is generally reliable. I wouldn't go so far as to say it's impossible to spoof other data, just that there doesn't seem to be much point in doing so, and I'm not aware that forgery has been a widespread problem.
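Since the user-agent is the least trustworthy field, a common first pass when cleaning log data is a crude substring check for self-identified crawlers. This is strictly a heuristic - polite robots identify themselves, but nothing forces them to, and the marker list here is just a few common examples:

```python
# Crude crawler detection by user-agent substring. Heuristic only: a robot
# spoofing a browser's user-agent will slip right past this.
BOT_MARKERS = ("bot", "crawl", "spider", "slurp")

def looks_like_bot(user_agent):
    ua = user_agent.lower()
    return any(marker in ua for marker in BOT_MARKERS)

print(looks_like_bot("Googlebot/2.1 (+http://www.google.com/bot.html)"))  # True
print(looks_like_bot("Mozilla/2.14 (Windows; U; Windows NT 6.0)"))        # False
```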
It's beyond the scope of the present document to get into the methods for statistical analysis that derive "meaning" from this data - but here is a sampling of the kinds of questions log data can answer:
- What times of day (and days of the week) does my Web site get the most traffic?
- How many times was each page on my server downloaded?
- What browsers are people using to look at my home page?
- How do people navigate through my site to arrive at a specific page?
- Who is stealing my bandwidth by linking to images on my server from other sites?
- Do any of my users appear to be sharing out their passwords to my members-only content?
- When people come to my site from Google, what search terms did they use to find my site?
- Are people downloading my video files completely, or quitting before they finish loading them?
- How long does the average user stay on my home page before clicking to the next file?
- What percentage of users click the various links on my menu page?
- What other Web sites are sending visitors to mine?
Each of these questions, and many others, can be answered by aggregating and running comparisons on the data contained in the server log file.
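To give a flavor of what "aggregating and running comparisons" looks like in practice, here is a minimal sketch answering the first question - hits per hour of day - over a few hand-made combined-format lines (in real use you'd read these from your log file instead):

```python
import re
from collections import Counter

# Pull the hour-of-day out of the bracketed timestamp: "[12/Mar/1994:17..."
TIME = re.compile(r'\[\d{2}/\w{3}/\d{4}:(\d{2})')

def hits_per_hour(lines):
    """Count log entries by hour of day (0-23)."""
    hours = Counter()
    for line in lines:
        m = TIME.search(line)
        if m:
            hours[int(m.group(1))] += 1
    return hours

sample = [
    '999.999.999.999 - - [12/Mar/1994:17:29:07 -0400] "GET / HTTP/1.1" 200 4315 "-" "-"',
    '999.999.999.999 - - [12/Mar/1994:17:45:12 -0400] "GET /a HTTP/1.1" 200 100 "-" "-"',
    '999.999.999.999 - - [13/Mar/1994:09:01:55 -0400] "GET /b HTTP/1.1" 200 200 "-" "-"',
]
print(dict(hits_per_hour(sample)))  # {17: 2, 9: 1}
```

The same pattern - extract one field, tally with a Counter - answers most of the counting questions above (top referrers, status codes, browsers); the click-path and session questions take more bookkeeping.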