Perl and Apache Log search?

Status
Not open for further replies.

robertsdgm

IS-IT--Management
Apr 14, 2004
8
US
Hello All
It has been quite a long time since I have used Perl.
I have a huge Apache access log, and from it I need to print all the unique IP addresses and their access times.
How do I go about doing this if my general line output looks like the lines below?
Thanks
Dan
cbdw88209.utp.test.dan.com - - [20/May/2004:11:47:58 -0400] "GET /srs71/doc/js/tree-menu/menu-images
/menu_folder_closed.gif HTTP/1.1" 200 135
cbdw88209.utp.test.dan.com - - [20/May/2004:11:47:58 -0400] "GET /srs71/doc/js/tree-menu/menu-images
/menu_folder_closed.gif HTTP/1.1" 200 135
cbdw88209.utp.test.dan.com - - [20/May/2004:11:48:41 -0400] "GET /srs71/doc/srsbook_srsusr/srsusr25_
1.html HTTP/1.1" 200 4793
cbdw88209.utp.test.dan.com - - [20/May/2004:11:48:41 -0400] "GET /srs71/doc/srsbook_srsusr/images/i_
icon.jpg HTTP/1.1" 200 514
cbdw88209.utp.test.dan.com - - [20/May/2004:11:48:41 -0400] "GET /srs71/doc/srsbook_srsusr/images/li
st_values_up.jpg HTTP/1.1" 200 1991
 
Code:
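my %hash_of_keys;
open(my $fh, '<', 'access_log') or die "Cannot open access_log: $!";
while (my $line = <$fh>) {
  # Split 'host - - [time] "GET ...' into the host and everything after it
  my ($ip_or_host, $the_rest) = split(/ - - /, $line, 2);
  next unless defined $the_rest;   # no " - - " marker: wrapped or malformed line
  # Keep only the bracketed timestamp that precedes the quoted request
  my ($access_time) = split(/ "/, $the_rest, 2);
  # Record only the first access time seen for each host
  $hash_of_keys{$ip_or_host} = $access_time unless exists $hash_of_keys{$ip_or_host};
}
close($fh);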

Then iterate over the hash to print out the keys and values.
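
A minimal sketch of that iteration step, assuming the hash has already been filled by the loop above. The sample entries and the helper name report_lines are made up for illustration; sorting by host is just one way to get a stable report order.

```perl
use strict;
use warnings;

# Hypothetical sample data standing in for the hash built while reading the log.
my %hash_of_keys = (
    'cbdw88209.utp.test.dan.com' => '[20/May/2004:11:47:58 -0400]',
    'example.host.invalid'       => '[20/May/2004:11:50:02 -0400]',
);

# Build one "host<TAB>time" string per unique host, sorted by host name
# so the report order is the same on every run.
sub report_lines {
    my ($seen) = @_;
    return map { "$_\t$seen->{$_}" } sort keys %$seen;
}

print "$_\n" for report_lines(\%hash_of_keys);
```

Each hash key appears once, so the output is one line per unique host with the first access time that was stored for it.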

HTH
--Paul

Not tested, and it's late ...
 
I don't know if Paul's code will work exactly as planned. Each entry in the example log spans two lines: it appears that a newline character is inserted after every 100 characters. The second line of each entry is, as far as this project is concerned, garbage. I didn't test Paul's code, but I believe it will try to process the garbage lines just as it processes the address/access time lines. (I could be out in left field, though; if I am, just ignore me. :))

Not that it would be hard to fix, but since the easiest way to test whether a line is valid is with a regex, why not let it do all the work?

Code:
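my %results;
while (<DATA>) {   # DATA is for testing; read from the log's filehandle instead
  # Anchor at line start and capture the whole host, hyphens included;
  # wrapped continuation lines fail to match and are skipped
  if (/^(\S+) - - (\[.+?\])/) {
    # Keep only the first access time seen for each host
    $results{$1} = $2 unless exists $results{$1};
  }
}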

Then, just as Paul suggested, iterate through the hash to get the address/access time info.

One other thing to consider would be replacing the non-greedy quantifier in the regex if you wanted to make this run a bit faster.
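
As a rough sketch of what that replacement could look like: a negated character class, [^\]]+, matches up to the first closing bracket without the step-by-step expansion that .+? performs. The helper name parse_entry is made up for illustration, and whether the change is measurably faster on a given Perl and log file is an assumption worth benchmarking rather than a given.

```perl
use strict;
use warnings;

# Return (host, bracketed timestamp) for a valid entry, or an empty list
# for anything else, such as a wrapped continuation line.
sub parse_entry {
    my ($line) = @_;
    # [^\]]+ consumes every character that is not "]", so the match stops
    # at the first closing bracket, the same place .+?\] would stop.
    return $line =~ /^(\S+) - - (\[[^\]]+\])/ ? ($1, $2) : ();
}

my ($host, $time) = parse_entry(
    'cbdw88209.utp.test.dan.com - - [20/May/2004:11:47:58 -0400] "GET /srs71/doc/js/tree-menu/menu-images'
);
print "$host $time\n";
```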
 
AFAIK,

Log files should only occupy one line per record, at least in any I've seen. For big files I'd avoid the regex because of the overhead of firing up the regex engine for each line of a large log file, though it's most likely implied in the split anyway.
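
For what it's worth, split with a pattern still goes through the regex engine, so a version that truly avoids it would have to use plain string functions. Below is a minimal sketch using only index and substr; the name parse_without_regex is made up, and it assumes every valid entry contains the literal text " - - [" followed later by "]". Whether it actually beats a regex on a given log is untested here.

```perl
use strict;
use warnings;

# Extract (host, bracketed timestamp) using only index() and substr().
# Returns an empty list when the " - - [" marker or the closing "]" is missing.
sub parse_without_regex {
    my ($line) = @_;
    my $sep = index($line, ' - - [');
    return () if $sep < 1;                  # marker missing, or no host before it
    my $open  = $sep + 5;                   # offset of the "[" inside the marker
    my $close = index($line, ']', $open);
    return () if $close < 0;                # no closing bracket on this line
    return (substr($line, 0, $sep), substr($line, $open, $close - $open + 1));
}

my ($host, $time) = parse_without_regex(
    'cbdw88209.utp.test.dan.com - - [20/May/2004:11:48:41 -0400] "GET /srs71/doc/srsbook_srsusr/images/i_'
);
print "$host $time\n";
```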

just my 0.02c
--Paul
 