Parsing Apache access log files using PHP

Recently, I needed to whip up something quick to parse some apache_log files in ECLF (Extended Common Log Format). This file can often be found at: /var/log/http/access_log. Here is what data looks like inside the Apache ECLF access_log file: - - [20/Jan/2003:08:55:36 -0800] "GET /path/to/page.html HTTP/1.0" 200 4649 "" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)"

You can then break it up as follows:

Server Date / Time [SPACE]
"GET /path/to/page
HTTP/Type Request"
Success Code
Bytes Sent To Client
Client Software

The next step is creating the regular expression to parse out these fields of information from the access_log. First, I searched around the net using Google to find some solutions others might have already created. I found one, and while it was a well written article, unfortunately it was an article aimed at Perl programmers, not PHP.

Using the PHP preg function I could have attempted to break down and convert the author's Perl regular expression, but I have found in the past that it is easier and less time consuming (for me, anyway) to build my own regular expressions in PHP. I'm not suggesting my method is better, but it is faster for me anyway. You may want to put this code in your protected admin area (I did). Here is the code I worked up to parse out these fields using PHP:

$ac_arr = file('/path/to/copy/access_log');
$astring = join("", $ac_arr);
$astring = preg_replace("/(\n|\r|\t)/", "", $astring);
$records = preg_split("/([0-9]+\.[0-9]+\.[0-9]+\.[0-9]+)/", $astring, -1, PREG_SPLIT_DELIM_CAPTURE);

$sizerecs = sizeof($records);

// now split into records
$i = 1;
$each_rec = 0;
while($i<$sizerecs) {
$ip = $records[$i];
$all = $records[$i+1];

// parse other fields
preg_match("/\[(.+)\]/", $all, $match);
$access_time = $match[1];
$all = str_replace($match[1], "", $all);

preg_match("/\"GET (.[^\"]+)/", $all, $match);
$http = $match[1];
$link = explode(" ", $http);
$all = str_replace("\"GET $match[1]\"", "", $all);

preg_match("/([0-9]{3})/", $all, $match);
$success_code = $match[1];
$all = str_replace($match[1], "", $all);

preg_match("/\"(.[^\"]+)/", $all, $match);
$ref = $match[1];
$all = str_replace("\"$match[1]\"", "", $all);

preg_match("/\"(.[^\"]+)/", $all, $match);
$browser = $match[1];
$all = str_replace("\"$match[1]\"", "", $all);

preg_match("/([0-9]+\b)/", $all, $match);
$bytes = $match[1];
$all = str_replace($match[1], "", $all);

print("<br>IP: $ip<br>Access Time: $access_time<br>Page: $link[0]
<br>Type: $link[1]<br>Success Code: $success_code
<br>Bytes Transferred: $bytes<br>Referer: $ref
<br>Browser: $browser<hr>");

// advance to next record
$i = $i + 2;

NOTE: This is not an optimized regular expression. It is intentionally broken out to display how each individual piece of information could be parsed from the access_log file. This is usually how I will build regular expressions in the initial development phase so that I can more easily develop each piece. The optimization phase of the code development happens later.

Once the info is parsed into data chunks, it can next be written into a more friendly database import format using comma delimited, pipe delimited, tab delimited, whatever you want. I chose to use tabdelimited and all I had to do was add the following code just before the while iteration:

$new_format[$each_rec] = "$ip\t$access_time\t$link[0]\t$link[1]\t$success_code\t$bytes\t$ref\t$browser";

Now for creating a new file that is ready for importing into mySQL we just write the new file using the new_format array as a basis and use a new line as a record terminator:

$fhandle = fopen("/path/to/import_file.txt", "w") {
foreach($new_format as $data) {
fputs($fhandle, "$data\n");

mySQL Load Function

mySQL has built-in support for importing from text files and it is done using the mySQL LOAD function. This is much faster than creating a loop and using multiple INSERT commands. Here is the syntax for reference:

LOAD DATA infile /path/to/import_file.txt INTO TABLE newtablename (field1, field2, field3,)

You do not have to specify the field names if you match the field names exactly with the data. If you only want to use specific fields from the data, then you would indicate this.

[ Some folks reading this might be wondering why I wanted to put this data into a mySQL database and the answer is so I could aggregate and analyze the access log using a relational database. I could have done my querying on the data in a flat file of course.]

Let's create a simple mySQL table to accept this imported access log data.

ip VARCHAR(25),
accesstime DATETIME,
thepage VARCHAR(250),
thetype VARCHAR(25),
thecode CHAR(3),
thebytes INT,
theref VARCHAR(250),
browser VARCHAR(250),

mySQL veterans in the audience will notice that I'm using the mySQL DATETIME field for the accesstime but the field is not in the proper format. So I have to make a few adjustments to my original $access_time code to make it in DATETIME compliance which is YYYY-MM-DD HH:MM:SS

Here's the PHP code to do that:

$access_time = $match[1];
$date_parts = explode(":", $access_time);
$mdy = explode("/", $date_parts[0]);
$hour_parts = explode(" ", $date_parts[2]);
$access_time = "$mdy[2]-$mdy[1]-$mdy[0] $date_parts[1]:$date_parts[2]:$hour_parts[0]";

I was curious to see if mySQL would figure out that the textual "Jan, Feb, Mar" etc was really the month number, 01, 02, 03 in my import (I've never tried this before), so first I ran a query to see what mySQL did with the textual dates. Did it convert them on the fly?

The result? No such luck ... so I wrote up a simple associative array to handle this task which looks like this:

$m['Jan'] = '01';
$m['Feb'] = '02';
$m['Mar'] = '03';
$m['Apr'] = '04';
$m['May'] = '05';
$m['Jun'] = '06';
$m['Jul'] = '07';
$m['Aug'] = '08';
$m['Sep'] = '09';
$m['Oct'] = '10';
$m['Nov'] = '11';
$m['Dec'] = '12';

Here is the revised access_time code parsing:

$access_time = $match[1];
$date_parts = explode(":", $access_time);
$mdy = explode("/", $date_parts[0]);
$mnt = $mdy[1];
$mdy[1] = $m[$mnt];
$hour_parts = explode(" ", $date_parts[2]);
$access_time = "$mdy[2]-$mdy[1]-$mdy[0] $date_parts[1]:$date_parts[2]:$hour_parts[0]";

Again, I realize it isn't the prettiest code in the world and could definitely be optimized further, but it does the duty of getting the data in a format that is more usable by mySQL.

Now that the data is in the right format and is written to the import_file.txt I can issue the following mySQL LOAD function to import and insert the new rows into the table of data like this:

LOAD DATE INFILE '/path/to/file/import_file.txt' INTO TABLE statslog (ip, accesstime, thepage, thetype, thecode, thebytes, theref, browser);

Happy coding to you!

found at:
thank you for this code, if you like it please visit the link and rate it well ! - it's worth doing so!.

[ top ]