jim.shamlin.com

Parsing Logs

Here's a bit of code for parsing server logs into meaningful bits of information. It's based on the combined log format (which is discussed in greater detail elsewhere).

I'm fairly confident the function is sturdy: at the time of this writing I've been using it for over a decade, including on some sites that draw millions of hits per day, and it has held up extremely well.


readLog Function

The function for reading the log file is shown below. You will need to set a configuration variable $accesslog to indicate the location of your log file.

sub readLog{
  # read each log line and parse it - create an array of anything unusable
  open(DALOG,"$accesslog") || &scriptError(10); 
  while(<DALOG>){
     # read in the log line
     $llrh = $llti = $llui = $lldt = $llhr = $llrc = $llbs = $llrs = $llua = "";
     $ll = $_;
     chomp($_);
     if ($ll =~ /(.*)\s(.*)\s(.*)\s\[([^\]]*)\]\s"([^"]*)"\s(.*)\s(.*)\s"([^"]*)"\s"([^"]*)"/){
        $llrh = $1;
        $llti = $2;
        $llui = $3;
        $lldt = $4;
        $llhr = $5;
        $llrc = $6;
        $llbs = $7; $llbs = 0 if ($llbs eq '-');
        $llrs = $8; $llrs = "" if ($llrs eq '-');
        $llua = $9; $llua = "" if ($llua eq '-');
        }else{
        # unparsable line: set it aside and skip the rest of this pass
        $lluu[@lluu] = $_;
        next;
        }
     # break it down a bit more
     $llisp = $llddf = $llddd = $llddm = $llddy = $llrhm = $llrhv = "";
     $lldtf = $lldth = $lldtm = $lldts = "";
     $llrff = $llrfn = $llrfx = $llrfq = "";
     $llrsu =  $llrsf = $llrsq = "";
     if($llrh =~ /.*\.(.*)\.(.*)\Z/){$llisp = "$1.$2";}
     $lldt =~ s/\s.*\Z//;
     ($llddf,$lldth,$lldtm,$lldts) = split(':',$lldt);
     $lldtf = "$lldth:$lldtm:$lldts";
     ($llddd,$llddm,$llddy) = split('/',$llddf);
     ($llrhm,$llrff,$llrhv) = split(/\s/,$llhr);
     if ($llrff =~ /(.*)\?(.*)/){
        $llrfn = $1;
        $llrfq = $2;
        }else{
        $llrfn =  $llrff;
        }
     if ($llrfn =~ /.*\.(.*)/){
        $llrfx = $1;
        }
     if ($llrs =~ /(.*)\?(.*)/){
        $llrsu = $llrsf = $1;
        $llrsq = $2;
        }else{
        $llrsu = $llrsf =  $llrs;
        }
     # reduce the referrer to its host name: strip the scheme, then the path
     $llrsu =~ s/\Ahttps?\:\/\///;
     $llrsu =~ s/\/.*//;

     # HERE: Analyze the data.

     }
  close(DALOG);
  }

The function above merely breaks apart log lines into meaningful information - it does not include any analysis of the resulting data.
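
To see the pattern at work, here is a minimal sketch that runs the same regular expression against a single sample line. The line is stitched together from the examples on this page; the -0700 offset and the %f hash are assumptions made for illustration and are not part of readLog.

```perl
use strict;
use warnings;

# Sample combined-format line built from the examples on this page.
# The "-0700" offset is an assumed value.
my $line = 'adsl-75-43-223-116.dsl.lsan03.sbcglobal.net - - '
         . '[20/Mar/2010:17:00:31 -0700] '
         . '"GET /cgi/myscript.pl?key=value&key=value HTTP/1.1" 200 14677 '
         . '"http://www.ocmoto.com/index.php?topic=26173.0" '
         . '"Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.1.8) '
         . 'Gecko/20100202 Firefox/3.5.8 (.NET CLR 3.5.30729)"';

# The same pattern readLog uses.
my $pattern = qr/(.*)\s(.*)\s(.*)\s\[([^\]]*)\]\s"([^"]*)"\s(.*)\s(.*)\s"([^"]*)"\s"([^"]*)"/;

# Hypothetical hash keyed by the variable names used in readLog.
my @names = qw(llrh llti llui lldt llhr llrc llbs llrs llua);
my %f;
if ($line =~ $pattern) {
    @f{@names} = ($1, $2, $3, $4, $5, $6, $7, $8, $9);
}

# Show each captured field next to its name.
for my $name (@names) {
    printf "%-5s %s\n", $name, $f{$name};
}
```

Note that at this stage the date field still carries the offset; readLog strips it in the second half of the loop.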


Contingencies

If the server log file is not found or not readable, the function will pass an error code of 10 to the scriptError subroutine, which is not shown here and must be supplied by your own script.
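
The scriptError subroutine is not part of this page, so what follows is only a guess at the simplest thing that would work: a stub that looks up a message for the code and stops the script. The %errorText table and its wording are hypothetical; only the code 10 comes from readLog.

```perl
use strict;
use warnings;

# Hypothetical message table; only code 10 is used by readLog.
my %errorText = (
    10 => "The access log could not be opened.",
);

# Stop the script with a message for the given error code.
sub scriptError {
    my ($code) = @_;
    my $msg = exists $errorText{$code} ? $errorText{$code} : "Unknown error.";
    die "Error $code: $msg\n";
}

# Usage, as in readLog:
#   open(DALOG, "$accesslog") || &scriptError(10);
```

A script that serves a web page would more likely print an error page than call die, but the shape is the same.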

If the function encounters a line that does not match the expected format, it will add that line to an array of bad log lines, @lluu, to prevent it from corrupting the data.
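
It's worth glancing at @lluu after a run, since a large number of bad lines usually means the log is in a different format. A minimal sketch of a report, assuming readLog has already filled the array; the badLineReport name and the sample data are made up for illustration.

```perl
use strict;
use warnings;

our @lluu;   # filled by readLog with lines that did not match the pattern

# Build a short summary of the unparsable lines; $max caps how many are listed.
sub badLineReport {
    my ($max) = @_;
    my $count = scalar @lluu;
    return "All log lines parsed cleanly.\n" if $count == 0;

    my $report = "$count unparsable line(s):\n";
    my $shown  = $count < $max ? $count : $max;
    for my $i (0 .. $shown - 1) {
        $report .= "  $lluu[$i]\n";
    }
    return $report;
}

# Stand-in data; in practice readLog supplies it.
@lluu = ('garbage line one', 'another bad line');
print badLineReport(5);
```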


Resulting Data

For each log line, the following data are parsed:

$llrh - Remote IP or Host
Example: adsl-75-43-223-116.dsl.lsan03.sbcglobal.net
This will return the remote IP address of the user, or the remote host's resolved name if host name look-ups are turned on. Useful for getting a raw head-count.
$llisp - User's ISP
Example: sbcglobal.net
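This will parse the last two segments from the remote host name, which is usually the gateway of the user's Internet Service Provider. Useful for a better assessment of the ISPs and companies from which your traffic originates. If host name look-ups are turned off and only an IP address is logged, this will hold the last two octets of the address instead, which is not meaningful.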
$llti - Terminal ID
Example: -
Usually passed as a hyphen, as this field is very seldom used. Not useful for anything.
$llui - User ID
Example: -
For parts of the site that are password protected using standard HTTP authentication, this will reflect the username. Otherwise a hyphen will be passed. Useful for seeing which accounts are active and, coupled with the remote IP, whether passwords are being shared or have been compromised.
$lldt - Datetime
Example: 20/Mar/2010:17:00:31
This is the date and time string from the server, with the GMT offset removed (it is the same for all log lines). Not particularly useful unless broken into components.
$llddf - Date: full
Example: 20/Mar/2010
This gives the day/month/year, which is useful for daily traffic analysis.
$llddd - Date: day
Example: 20
The day only, which is useful to identify patterns and trends in your traffic.
$llddm - Date: month
Example: Mar
The month only, which is not useful unless you're comparing several months' worth of data.
$llddy - Date: year
Example: 2010
The year only, which is not useful unless you're comparing several years' worth of data, which is highly unlikely.
$lldtf - Time: full
Example: 17:00:31
The full time signature, in hours:minutes:seconds. Military time is used (so 17:00 is 5:00 pm). This is not particularly useful until broken into components.
$lldth - Time: hh
Example: 17
The hour only, which is useful to identify patterns and trends in your traffic.
$lldtm - Time: mm
Example: 00
The minute only, which might be useful to identify patterns and trends in your traffic if you have a high-volume site and wish to monitor minute-to-minute fluctuations.
$lldts - Time: ss
Example: 31
The second only, which is not particularly useful.
$llhr - HTTP Request
Example: GET /cgi/myscript.pl?key=value&key=value HTTP/1.1
The full HTTP request details - method, URI, and version. This is not particularly useful until broken into components.
$llrhm - Request Method
Example: GET
The HTTP request method only.
$llrhv - HTTP Version
Example: HTTP/1.1
The HTTP version only, which is virtually always "HTTP/1.1".
$llrff - File: Full
Example: /cgi/myscript.pl?key=value&key=value
The full requested URI, including any query string. It should be broken down into file and query string to avoid a lot of one-off data.
$llrfn - File: File only
Example: /cgi/myscript.pl
The file name only, excluding the query string, useful for determining which resources are the most/least popular.
$llrfx - File: Extension
Example: pl
The extension only, useful in determining which kinds of resources are being used (and, coupled with referrer, detecting bandwidth thieves).
$llrfq - File: Query String
Example: key=value&key=value
The query string sent. This is not useful as an aggregate, but may be analyzed on a file-by-file basis (for example, to see what input is most common for a particular script).
$llrc - Response Code
Example: 200
Indicates the HTTP response code, which can help identify files that are not being completely downloaded, bad references, etc.
$llbs - Bytes Sent
Example: 14677
The number of bytes transferred. This can be aggregated to assess your bandwidth consumption or cross-referenced with referrers to detect abuse of your resources.
$llrs - Referrer
Example: http://www.ocmoto.com/index.php?topic=26173.0
The URL of a page that referred the user to the current one. Since it includes query-string data, it is not useful to aggregate.
$llrsu - Referrer: URL
Example: www.ocmoto.com
The host name of the referring site only. Useful for detecting which Web sites are sending traffic to you and excluding your own site from certain analyses.
$llrsf - Referrer: File
Example: http://www.ocmoto.com/index.php
The referring address, stripped of query string, to help aggregate data about sites that send users to yours, or click-trails within your own site.
$llrsq - Referrer: Query String
Example: topic=26173.0
The query string of the referrer, useful in determining what keywords users are entering into search engines to find your site.
$llua - User-Agent
Example: Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.1.8) Gecko/20100202 Firefox/3.5.8 (.NET CLR 3.5.30729)
Passed by the user's software (web browser), this is useful in assessing what software is being used to view your site or eliminating "false traffic" from crawlers and bots. This value is highly variable and easily forged.
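
One practical wrinkle with the date fields: the month comes through as a name, so the day/month/year pieces don't sort chronologically as they stand. A rough sketch of one way around that, building a sortable key from the three values; the sortableDate name and the %monthNum table are assumptions for illustration, not part of readLog.

```perl
use strict;
use warnings;

# Map the three-letter month names used in the log to month numbers.
my %monthNum = (
    Jan => 1, Feb => 2,  Mar => 3,  Apr => 4,
    May => 5, Jun => 6,  Jul => 7,  Aug => 8,
    Sep => 9, Oct => 10, Nov => 11, Dec => 12,
);

# Turn day, month name, and year into a YYYY-MM-DD string.
# Returns nothing if the month name is not recognized.
sub sortableDate {
    my ($day, $mon, $year) = @_;
    return unless defined $mon && exists $monthNum{$mon};
    return sprintf("%04d-%02d-%02d", $year, $monthNum{$mon}, $day);
}

# Inside readLog you would pass $llddd, $llddm, $llddy.
print sortableDate("20", "Mar", "2010"), "\n";   # prints 2010-03-20
```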

Usage

The function provided merely breaks down a log entry into smaller bits of data for further analysis. To get any value from it, you will still need to write the code that does the analytical tasks - but this function should save you the effort of parsing the component data.
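
As a starting point, here is a minimal sketch of the sort of tally that could go at the "HERE: Analyze the data" marker. It counts successful requests per file and adds up bytes per day. The tallyLine subroutine, the two hashes, and the sample values are all hypothetical; inside readLog you would pass the parsed variables instead.

```perl
use strict;
use warnings;

# Hypothetical tallies, not part of readLog.
my (%hitsByFile, %bytesByDate);

# Call once per parsed line.
sub tallyLine {
    my ($file, $date, $bytes, $code) = @_;
    return unless $code eq '200';      # count successful requests only
    $hitsByFile{$file}++;
    $bytesByDate{$date} += $bytes;
}

# Stand-in values; in readLog this would be
#   tallyLine($llrfn, $llddf, $llbs, $llrc);
tallyLine('/index.html',      '20/Mar/2010', 14677, '200');
tallyLine('/cgi/myscript.pl', '20/Mar/2010',   512, '200');
tallyLine('/index.html',      '20/Mar/2010', 14677, '200');
tallyLine('/missing.html',    '20/Mar/2010',   300, '404');

# List the most-requested files first, breaking ties by name.
for my $file (sort { $hitsByFile{$b} <=> $hitsByFile{$a} || $a cmp $b }
              keys %hitsByFile) {
    print "$hitsByFile{$file}\t$file\n";
}
```

Keying on $llrfn rather than $llrff keeps query strings from splitting one script into many one-off entries.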


And finally

This is online for my own use and reference, but feel free to snag it if you think it would be useful. It's a trifle and I don't expect to be credited or compensated in any way ... but nor does it come with any sort of guarantee.