Parsing Logs
Here's a bit of Perl code for parsing server logs into meaningful bits of information. It's based on the combined log format (which is discussed in greater detail elsewhere).
I'm fairly confident the function is sturdy: at the time of this writing I've been using it for over a decade, including on some sites that draw millions of hits per day, and it has held up extremely well.
readLog Function
The function for reading the log file is shown below. You will need to set a configuration variable $accesslog to indicate the location of your log file.
sub readLog{
  # read each log line and parse it - create an array of anything unusable
  open(DALOG,"$accesslog") || &scriptError(10);
  while(<DALOG>){
    # read in the log line (strip the newline before copying it)
    $llrh = $llti = $llui = $lldt = $llhr = $llrc = $llbs = $llrs = $llua = "";
    chomp($_);
    $ll = $_;
    if ($ll =~ /(.*)\s(.*)\s(.*)\s\[([^\]]*)\]\s"([^"]*)"\s(.*)\s(.*)\s"([^"]*)"\s"([^"]*)"/){
      $llrh = $1;
      $llti = $2;
      $llui = $3;
      $lldt = $4;
      $llhr = $5;
      $llrc = $6;
      $llbs = $7; $llbs = 0 if ($llbs eq '-');
      $llrs = $8; $llrs = "" if ($llrs eq '-');
      $llua = $9; $llua = "" if ($llua eq '-');
    }else{
      # unusable line: set it aside and skip the analysis for it
      $lluu[@lluu] = $_;
      next;
    }
    # break it down a bit more
    $llisp = $llddf = $llddd = $llddm = $llddy = $llrhm = $llrhv = "";
    $lldtf = $lldth = $lldtm = $lldts = "";
    $llrff = $llrfn = $llrfx = $llrfq = "";
    $llrsu = $llrsf = $llrsq = "";
    # ISP: the last two parts of the remote host
    if($llrh =~ /.*\.(.*)\.(.*)\Z/){$llisp = "$1.$2";}
    # date and time: drop the GMT offset, then split into components
    $lldt =~ s/\s.*\Z//;
    ($llddf,$lldth,$lldtm,$lldts) = split(':',$lldt);
    $lldtf = "$lldth:$lldtm:$lldts";
    ($llddd,$llddm,$llddy) = split('/',$llddf);
    # request: method, file (with query string), HTTP version
    ($llrhm,$llrff,$llrhv) = split(/\s/,$llhr);
    if ($llrff =~ /(.*)\?(.*)/){
      $llrfn = $1;
      $llrfq = $2;
    }else{
      $llrfn = $llrff;
    }
    if ($llrfn =~ /.*\.(.*)/){
      $llrfx = $1;
    }
    # referrer: file, query string, and host
    if ($llrs =~ /(.*)\?(.*)/){
      $llrsu = $llrsf = $1;
      $llrsq = $2;
    }else{
      $llrsu = $llrsf = $llrs;
    }
    $llrsu =~ s/\Ahttps?\:\/\///;
    $llrsu =~ s/\/.*//;
    # HERE: Analyze the data.
  }
  close(DALOG);
}
The function above merely breaks apart log lines into meaningful information - it does not include any analysis of the resulting data.
Contingencies
If the server log file is not found or not readable, the function will pass the value 10 to the scriptError subroutine.
If the function encounters a line it cannot parse, it will add that line to @lluu, an array of bad log lines, so that it does not corrupt the data.
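The scriptError subroutine is not shown on this page, so here is a minimal sketch of what it might look like, along with a one-line report on the bad lines. The message table, the errorMessage and badLineReport helpers, and the choice to stop the script with die are all my assumptions for illustration; only the code 10 comes from the function above.

```perl
use strict;
use warnings;

# Hypothetical message table; only code 10 is implied by readLog.
my %error_text = (
    10 => 'access log not found or not readable',
);

# Turn a numeric code into a readable message (kept separate so it is easy to test).
sub errorMessage {
    my ($code) = @_;
    my $text = exists $error_text{$code} ? $error_text{$code} : 'unknown error';
    return "Error $code: $text";
}

# Assumed behaviour: report the problem and stop the script.
sub scriptError {
    my ($code) = @_;
    die errorMessage($code) . "\n";
}

# Summarise the lines that could not be parsed (pass in @lluu).
sub badLineReport {
    my @bad = @_;
    return sprintf "%d unparseable line(s)", scalar @bad;
}
```

After readLog has run, something like print badLineReport(@lluu), "\n"; would show how many lines were set aside.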
Resulting Data
For each log line, the following data are parsed:
- $llrh - Remote IP or Host
- Example: adsl-75-43-223-116.dsl.lsan03.sbcglobal.net
- This will return the remote IP address of the user, or the remote host's resolved name if host name look-ups are turned on. Useful for getting a raw head-count.
- $llisp - User's ISP
- Example: sbcglobal.net
- This will parse the last two bits from the remote host, which is usually the gateway of the user's Internet Service Provider. Useful for a better assessment of the ISPs and companies from which your traffic originates.
- $llti - Terminal ID
- Example: -
- Usually passed as a hyphen, as this field is very seldom used. Not useful for anything.
- $llui - User ID
- Example: -
- For parts of the site that are password protected using standard HTTP authentication, this will reflect the username. Otherwise a hyphen will be passed. Useful for seeing which accounts are active and, coupled with the remote IP, whether passwords are being shared or have been compromised.
- $lldt - Datetime
- Example: 20/Mar/2010:17:00:31
- This is the date and time string from the server, with the GMT offset removed (it is the same for all log lines). Not particularly useful unless broken into components.
- $llddf - Date: full
- Example: 20/Mar/2010
- This gives the day/month/year, which is useful for daily traffic analysis.
- $llddd - Date: day
- Example: 20
- The day only, which is useful to identify patterns and trends in your traffic.
- $llddm - Date: month
- Example: Mar
- The month only, which is not useful unless you're comparing several months' worth of data.
- $llddy - Date: year
- Example: 2010
- The year only, which is not useful unless you're comparing several years' worth of data, which is highly unlikely.
- $lldtf - Time: full
- Example: 17:00:31
- The full time, in hours:minutes:seconds, on a 24-hour clock (so 17:00 is 5:00 pm). This is not particularly useful until broken into components.
- $lldth - Time: hh
- Example: 17
- The hour only, which is useful to identify patterns and trends in your traffic.
- $lldtm - Time: mm
- Example: 00
- The minute only, which might be useful to identify patterns and trends in your traffic if you have a high-volume site and wish to monitor minute-to-minute fluctuations.
- $lldts - Time: ss
- Example: 31
- The second only, which is not particularly useful.
- $llhr - HTTP Request
- Example: GET /cgi/myscript.pl?key=value&key=value HTTP/1.1
- The full HTTP request details - method, URI, and version. This is not particularly useful until broken into components.
- $llrhm - Request Method
- Example: GET
- The HTTP request method only.
- $llrhv - HTTP Version
- Example: HTTP/1.1
- The HTTP version only, which is virtually always "HTTP/1.1".
- $llrff - File: Full
- Example: /cgi/myscript.pl?key=value&key=value
- The full requested resource, including any query string. It should be broken down into file and query string to avoid a lot of one-off data.
- $llrfn - File: File only
- Example: /cgi/myscript.pl
- The file name only, excluding the query string, useful for determining which resources are the most/least popular.
- $llrfx - File: Extension
- Example: pl
- The extension only, useful in determining which kinds of resources are being used (and, coupled with referrer, detecting bandwidth thieves).
- $llrfq - File: Query String
- Example: key=value&key=value
- The query string sent. This is not useful as an aggregate, but may be analyzed on a file-by-file basis (for example, to see what input is most common for a particular script).
- $llrc - Response Code
- Example: 200
- Indicates the HTTP response code, which can help identify files that are not being completely downloaded, bad references, etc.
- $llbs - Bytes Sent
- Example: 14677
- The number of bytes sent. This can be aggregated to assess your bandwidth consumption or cross-referenced with referrers to detect abuse of your resources.
- $llrs - Referrer
- Example: http://www.ocmoto.com/index.php?topic=26173.0
- The URL of a page that referred the user to the current one. Since it includes query-string data, it is not useful to aggregate.
- $llrsu - Referrer: URL
- Example: www.ocmoto.com
- The host name of the referrer only. Useful for detecting which Web sites are sending traffic to you and excluding your own site from certain analyses.
- $llrsf - Referrer: File
- Example: http://www.ocmoto.com/index.php
- The referring address, stripped of query string, to help aggregate data about sites that send users to yours, or click-trails within your own site.
- $llrsq - Referrer: Query String
- Example: topic=26173.0
- The query string of the referrer, useful in determining what keywords users are entering into search engines to find your site.
- $llua - User-Agent
- Example: Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.1.8) Gecko/20100202 Firefox/3.5.8 (.NET CLR 3.5.30729)
- Passed by the user's software (usually a web browser), this is useful for assessing which browsers are being used to view your site, or for eliminating "false traffic" from crawlers and bots. This value is highly variable and easily forged.
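To make the list above concrete, here is a small stand-alone sketch that runs one log line through the same patterns and prints a few of the results. The log line is a hypothetical one I stitched together from the examples above (the -0700 offset is made up), and I store the values in a hash %f keyed by the variable names rather than in separate variables, purely to keep the sketch short.

```perl
use strict;
use warnings;

# Hypothetical log line built from the examples above; the "-0700" offset is assumed.
my $line = join ' ',
    'adsl-75-43-223-116.dsl.lsan03.sbcglobal.net', '-', '-',
    '[20/Mar/2010:17:00:31 -0700]',
    '"GET /cgi/myscript.pl?key=value&key=value HTTP/1.1"',
    '200', '14677',
    '"http://www.ocmoto.com/index.php?topic=26173.0"',
    '"Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.1.8) Gecko/20100202 Firefox/3.5.8 (.NET CLR 3.5.30729)"';

my %f;    # parsed values, keyed by the variable names used on this page

# Same pattern as readLog: the nine fields of the combined log format.
if ($line =~ /(.*)\s(.*)\s(.*)\s\[([^\]]*)\]\s"([^"]*)"\s(.*)\s(.*)\s"([^"]*)"\s"([^"]*)"/) {
    @f{qw(llrh llti llui lldt llhr llrc llbs llrs llua)} = ($1, $2, $3, $4, $5, $6, $7, $8, $9);
}

# ISP: the last two dot-separated parts of the remote host.
$f{llisp} = $f{llrh} =~ /.*\.(.*)\.(.*)\z/ ? "$1.$2" : '';

# Date and time: drop the offset, then split on ":" and "/".
$f{lldt} =~ s/\s.*\z//;
@f{qw(llddf lldth lldtm lldts)} = split /:/, $f{lldt};
@f{qw(llddd llddm llddy)}       = split m{/}, $f{llddf};

# Request: method, file with query string, HTTP version.
@f{qw(llrhm llrff llrhv)} = split /\s/, $f{llhr};
@f{qw(llrfn llrfq)} = $f{llrff} =~ /(.*)\?(.*)/ ? ($1, $2) : ($f{llrff}, '');
$f{llrfx} = $f{llrfn} =~ /.*\.(.*)/ ? $1 : '';

# Referrer: file, query string, then host only.
@f{qw(llrsf llrsq)} = $f{llrs} =~ /(.*)\?(.*)/ ? ($1, $2) : ($f{llrs}, '');
($f{llrsu} = $f{llrsf}) =~ s{\Ahttps?://}{};
$f{llrsu} =~ s{/.*}{};

# Show a handful of the derived values.
printf "%-6s %s\n", $_, $f{$_} for qw(llisp llddf lldth llrfn llrfx llrsu llrsq);
```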
Usage
The function provided merely breaks down a log entry into smaller bits of data for further analysis. To get any value from it, you will still need to write the code that does the analytical tasks - but this function should save you the effort of parsing the component data.
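As a rough idea of what that analytical code could look like, here is a sketch that tallies hits per file, hits per hour, and bytes per referring host. The tallyHit and topKeys helpers and the three tally hashes are hypothetical names of my own, the sample values are made up, and counting only 2xx responses is just one possible choice.

```perl
use strict;
use warnings;

my (%hits_by_file, %hits_by_hour, %bytes_by_referrer);

# Hypothetical helper: call once per parsed line with the values readLog extracted.
sub tallyHit {
    my ($llrfn, $llrsu, $lldth, $llbs, $llrc) = @_;
    return unless $llrc =~ /\A2\d\d\z/;    # count successful responses only (a choice, not a rule)
    $hits_by_file{$llrfn}++;
    $hits_by_hour{$lldth}++;
    $bytes_by_referrer{$llrsu} += $llbs if $llrsu ne '';
    return;
}

# Return the keys of a tally hash, busiest first (ties broken alphabetically).
sub topKeys {
    my ($tally, $limit) = @_;
    my @keys = sort { $tally->{$b} <=> $tally->{$a} or $a cmp $b } keys %$tally;
    $#keys = $limit - 1 if @keys > $limit;
    return @keys;
}

# Made-up values standing in for four parsed log lines.
tallyHit('/index.html',      'www.ocmoto.com', '17', 14677, '200');
tallyHit('/index.html',      '',               '17', 14677, '200');
tallyHit('/cgi/myscript.pl', 'www.ocmoto.com', '18', 512,   '200');
tallyHit('/missing.html',    '',               '18', 0,     '404');

# Most requested files first.
print "$_: $hits_by_file{$_}\n" for topKeys(\%hits_by_file, 10);
# prints:
# /index.html: 2
# /cgi/myscript.pl: 1
```

Inside readLog, the equivalent call would go at the "HERE: Analyze the data" comment, for example tallyHit($llrfn, $llrsu, $lldth, $llbs, $llrc);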
And finally
This is online for my own use and reference, but feel free to snag it if you think it would be useful. It's a trifle and I don't expect to be credited or compensated in any way ... but nor does it come with any sort of guarantee.