
Parse Apache access.log: Eliminate Spiders and Bots


PCHomepage (Programmer)
I've created a relatively simple function to convert the Apache access.log file to MySQL. The plan is to run it just once per site (if it doesn't time out!) to populate the table; after that it won't need to be run again, as other code will keep the table up to date automatically. This is needed because I do not have direct access to the Apache logs on the server, but the hosting company is willing to send me a copy of mine as a one-shot deal.

However, I cannot figure out how to keep it from recording the spider and bot hits. For my purposes they are irrelevant, so how can the function be changed to ignore them, or at least minimize them, to keep the database table more manageable in size?

Here is the function:

PHP:
function ParseToDatabase($path) {
	// Parses NCSA Combined Log Format lines:
	$pattern = '/^([^ ]+) ([^ ]+) ([^ ]+) (\[[^\]]+\]) "(.*) (.*) (.*)" ([0-9\-]+) ([0-9\-]+) "(.*)" "(.*)"$/';
	if (is_readable($path)) :
		$fh = fopen($path,'r') or die("Cannot open $path");
		while (!feof($fh)) :
			$s = fgets($fh);
			// Only process lines that actually match; otherwise the values
			// left over from the previous line would be inserted again.
			if ($s !== false && preg_match($pattern,$s,$matches)) :
				list($whole_match, $remote_host, $logname, $user, $date_time, $method, $request, 
					$protocol, $status, $bytes, $referer, $user_agent) = $matches;
				// Convert "[10/Oct/2009:13:55:36 -0700]" to a MySQL DATETIME:
				$date_time = trim($date_time, "[]");
				$date_time = date('Y-m-d H:i:s', strtotime($date_time));
				// NOTE: quotes in the RequestURI, Referer, or User-Agent will break
				// this statement unless DBConnect() escapes the values before running it.
				$sqlInsert = sprintf("INSERT INTO accesslog (RemoteHost, IdentUser, AuthUser, TimeStamp, Method, RequestURI, RequestProtocol, Status, Bytes, Referer, UserAgent) 
									  VALUES ('%s', '%s', '%s', '%s', '%s', '%s', '%s', '%s', '%s', '%s', '%s')",
										  $remote_host, 
										  $logname, 
										  $user,
										  $date_time, 
										  $method, 
										  $request, 
										  $protocol, 
										  $status, 
										  $bytes, 
										  $referer, 
										  $user_agent);
				DBConnect($sqlInsert, "Insert", "db_name");
			endif;
		endwhile;
		fclose($fh);
	else : 
		echo "Cannot access log file!";
	endif;
}
 
The Apache access logs for your vhost should be (assuming a Linux OS) in /home/accountname/logs/, and your control panel account should provide you with access to /accountname/ and all folders below it.

You are not likely to be allowed access to /var/log/ of course, unless you are on a VPS or dedicated box.

However, back to the plot:

You can exclude known 'bots by matching information in the User-Agent string, for which you will require an up-to-date list of 'bot user agents.

Several such lists are published online (they do get updated from time to time), and there is also a maintained JSON feed of crawler user agents.
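
By way of illustration only (the file name and the feed structure here are my assumptions; real feeds vary), loading such a list once and testing each User-Agent against it might look like this:

PHP:
// Sketch: load an up-to-date 'bot list from a JSON file, assumed here to
// decode to a flat array of user-agent substrings, then check the
// User-Agent captured by the parser against every entry.
function LoadBotList($jsonPath) {
	$entries = json_decode(file_get_contents($jsonPath), true);
	return is_array($entries) ? $entries : array();
}

function IsKnownBot($user_agent, $botList) {
	foreach ($botList as $bot) :
		if (stripos($user_agent, $bot) !== false) :
			return true;
		endif;
	endforeach;
	return false;
}

// Usage: load the list once before the parsing loop, then inside it,
// right after $user_agent has been extracted:
//   if (IsKnownBot($user_agent, $botList)) continue;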

Chris.

Indifference will be the downfall of mankind, but who cares?
Time flies like an arrow, however, fruit flies like a banana.
Webmaster Forum
 
Yes, the logs are there, but only as .gz files compressed by month. Because it's a shared server with other clients, the actual logs are not available any other way, which is why the hosting company offered to send me a copy. I suppose I could write a script to extract them to a file, but I don't want to go that route unnecessarily. The logs for all my sites are in that same folder, so the script would have to pull only the files for a given site, extract them, and save them to a file that I could then parse into the database (or it could even do it directly), but I fear that such a script would take a long, long time to run. I am open to the idea, though, if it's a better way of doing it.
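
If I did go that route, I assume PHP's zlib functions could stream the .gz files directly so nothing needs extracting to disk first; something like this sketch, where the sitename-*.gz naming pattern is only a guess at how the host names the files:

PHP:
// Sketch: stream each compressed monthly log for one site straight
// through the same parsing logic, without unpacking to a file first.
function ParseGzLogs($dir, $site) {
	foreach (glob("$dir/$site-*.gz") as $gzfile) :
		$fh = gzopen($gzfile, 'r') or die("Cannot open $gzfile");
		while (!gzeof($fh)) :
			$s = gzgets($fh);
			// ... same preg_match / INSERT handling as in ParseToDatabase() ...
		endwhile;
		gzclose($fh);
	endforeach;
}

Reading line by line this way keeps memory use flat, so the run time should be dominated by the INSERTs rather than by the decompression.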

On the question above, is there some way to minimize them by simply filtering against "spider" or "bot", which most seem to have in the User-Agent? I'm not really too familiar with string patterns or how to filter them unless I'm doing it through a database query, but this is just raw log text. It's okay if it misses some, but at least most will be eliminated that way. I'm thinking of something along these lines (the keyword list is just a guess):
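
PHP:
// Quick-and-dirty heuristic: most crawlers put "bot", "spider", or "crawl"
// somewhere in their User-Agent string; stripos() matches case-insensitively,
// so "Googlebot" and "BOT" are both caught.
function LooksLikeBot($user_agent) {
	foreach (array('bot', 'spider', 'crawl') as $word) :
		if (stripos($user_agent, $word) !== false) :
			return true;
		endif;
	endforeach;
	return false;
}

// In ParseToDatabase(), right after $user_agent has been extracted:
//   if (LooksLikeBot($user_agent)) continue;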
 
the actual logs are not available any other way
You could always use Google Analytics, which already has just about every method of filtering/analysing/reporting known to man.

(other analysis tools are available)

I did the "roll your own" PHP/MySQL site access logging many years ago and abandoned the project when Google made Urchin freely available. However, if you are determined to finish it, forget about filtering 'bots before saving to MySQL, because that information IS useful for 'proper' site analysis. Just make ignoring crawlers part of the reporting capabilities (see the sketch below).
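
For example (assuming your DBConnect() wrapper also has a Select mode; the query and keywords are only illustrative), the 'bots stay in the table and the report simply skips them:

PHP:
// Illustrative report query: crawlers remain stored in accesslog but
// are excluded at reporting time by User-Agent keywords.
$sqlReport = "SELECT RequestURI, COUNT(*) AS Hits
			  FROM accesslog
			  WHERE UserAgent NOT LIKE '%bot%'
				AND UserAgent NOT LIKE '%spider%'
				AND UserAgent NOT LIKE '%crawl%'
			  GROUP BY RequestURI
			  ORDER BY Hits DESC";
DBConnect($sqlReport, "Select", "db_name"); // "Select" mode is an assumption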

Having more information and not needing it is infinitely more valuable than needing it and realising you did not save it.

Chris.

Indifference will be the downfall of mankind, but who cares?
Time flies like an arrow, however, fruit flies like a banana.
Webmaster Forum
 