grep / sed question

adev111 · Nov 21, 2006

Hi,
I manually edit text files here that contain data in the following format.

1085961616.474 172 190.104.253.84 TCP_MISS/200 2146 GET http://cfg.mywebsearch.com/mysaconfg.jsp? [07fiG10lKmU4M3R:IQ.TCH] - DIRECT/63.236.66.14 text/html
1085961622.602 60 68.22.217.209 TCP_HIT/200 9476 GET http://image.linkexchange.com/01/73/69/31/banner468x60.gif - NONE/- image/gif
1085961627.502 159 190.104.253.84 TCP_REFRESH_HIT/304 339 GET http://scripts.lycos.com/catman/login.mail.lycos.com.cm/logout.js - DIRECT/192.43.217.199 application/x-javascript
1085961648.792 6 12.168.85.240 TCP_MISS/503 1585 GET http://erp.water.com:9930/ - NONE/- text/html

I edit these files so that they contain only the URLs, i.e. as below:

http://cfg.mywebsearch.com/mysaconfg.jsp

http://image.linkexchange.com/01/73/69/31/banner468x60.gif

Now I filter my URLs as follows:
1) I look for every line that does not contain _hit/200 or _miss/200 and remove it from the file.
2) Then I remove all the lines that contain a question mark "?" from the result of step 1.
3) Then, from the result of step 2, I copy the URL only.

Currently I only have 10 records in the text file, but down the line I'll be receiving files that are megabytes in size. So I am trying to make my life easier beforehand by creating a shell script.

Now here is my algorithm: find the lines that do not contain the expression _hit/200 or _miss/200 and delete them. Then, for the lines that are left, truncate everything before http: and everything after .js, .jsp, .html or .gif, which will leave me with only the URLs.

I am not so good at script syntax but here's what I've come up with so far.

Code:

grep "(_hit|MISS)/200)*.!\?*"log.txt --

Would someone tell me what I am doing wrong?

Thanks in advance

adev111 · Nov 21, 2006

Here's what I came up with so far.

awk '($4~/_(miss|HIT)\/200/){print $7}' NLANR-Logs.txt | grep -v "?"

Any insights?

Annihilannic · Nov 22, 2006

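Two things: the match is case-sensitive, so "miss" needs to be "MISS", and I would not rely on the grep -v "?" part. Plain grep treats a bare "?" literally, but under grep -E or egrep it is a repetition operator, so grep -F -v "?" is the safer spelling. You can do it all in the awk part anyway, e.g.: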

[tt]awk '($4~/_(MISS|HIT)\/200/) && !/\?/ {print $7}' NLANR-Logs.txt [/tt]
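
To see what that one-liner does, here is a minimal sketch that runs it against the four sample records from the question. It assumes each record sits on a single line, so that $4 is the result code and $7 is the URL; the function name extract_urls and the variable names are made up for illustration.

```shell
#!/bin/sh
# extract_urls is a hypothetical helper name; it reads log lines on stdin.
# Keep records whose 4th field contains _MISS/200 or _HIT/200, drop any
# record with a "?" anywhere on the line, and print the 7th field (the URL).
extract_urls() {
    awk '($4~/_(MISS|HIT)\/200/) && !/\?/ {print $7}'
}

# The four records from the question, one per line (an assumption about the real file).
sample='1085961616.474 172 190.104.253.84 TCP_MISS/200 2146 GET http://cfg.mywebsearch.com/mysaconfg.jsp? [07fiG10lKmU4M3R:IQ.TCH] - DIRECT/63.236.66.14 text/html
1085961622.602 60 68.22.217.209 TCP_HIT/200 9476 GET http://image.linkexchange.com/01/73/69/31/banner468x60.gif - NONE/- image/gif
1085961627.502 159 190.104.253.84 TCP_REFRESH_HIT/304 339 GET http://scripts.lycos.com/catman/login.mail.lycos.com.cm/logout.js - DIRECT/192.43.217.199 application/x-javascript
1085961648.792 6 12.168.85.240 TCP_MISS/503 1585 GET http://erp.water.com:9930/ - NONE/- text/html'

result=$(printf '%s\n' "$sample" | extract_urls)
printf '%s\n' "$result"
```

Only the banner468x60.gif URL comes out: the first record is dropped because of the "?", and the last two because their status is not 200. If you would rather keep the first URL with its query string stripped, that is a different rule from step 2 of the question.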

Annihilannic.

stevexff · Nov 22, 2006

Code:

#!/usr/bin/perl
use strict;
use warnings;

my %urls;

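while (<>) {
   # keep only cache hits and misses that returned status 200
   next unless /TCP_(HIT|MISS)\/200/;
   my @parts = split(/\s+/, $_);
   # field 7 is the URL; drop any query string after the "?"
   my ($url) = split(/\?/, $parts[6]);
   $urls{$url}++;
}

print "Count\t\tURL\n\n";

foreach (sort keys %urls) {
   print "$urls{$_}\t\t$_\n";
}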

Not tested (no linux box here), but should summarise counts per URL. To invoke

Code:

myscript [i]log.txt[/i]
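
If there is no Perl to hand, roughly the same per-URL summary could be sketched in awk. This is only a sketch under the same assumption of one record per line with the URL in field 7; the function name count_urls and the example.com records are invented for illustration and are not from the real logs.

```shell
#!/bin/sh
# count_urls is a hypothetical helper name; it reads log lines on stdin and
# prints "count<TAB>url" for each distinct URL, sorted by URL.
count_urls() {
    awk '$4 ~ /TCP_(HIT|MISS)\/200/ {
             url = $7
             sub(/\?.*/, "", url)   # drop the query string, like the Perl split
             count[url]++
         }
         END { for (u in count) print count[u] "\t" u }' | LC_ALL=C sort -k2
}

# Made-up records for illustration only.
demo='1 10 10.0.0.1 TCP_HIT/200 100 GET http://example.com/a.gif - NONE/- image/gif
2 10 10.0.0.2 TCP_MISS/200 100 GET http://example.com/a.gif?x=1 - DIRECT/10.0.0.9 image/gif
3 10 10.0.0.3 TCP_MISS/503 100 GET http://example.com/b.html - NONE/- text/html
4 10 10.0.0.4 TCP_MISS/200 100 GET http://example.com/b.html - DIRECT/10.0.0.9 text/html'

summary=$(printf '%s\n' "$demo" | count_urls)
printf '%s\n' "$summary"
```

With the made-up records above, a.gif is counted twice (once with and once without a query string) and b.html once, since the 503 record is skipped.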

Steve

[small]"Every program can be reduced by one instruction, and every program has at least one bug. Therefore, any program can be reduced to one instruction which doesn't work." (Object::PerlDesignPatterns)[/small]
