Filename parsing from logs

Cybex1 · Sep 3, 2011

I need to parse a log file and extract all filenames with a given file extension. I have something ugly that works, but I know there has to be a better way... I need to exclude records that are searches for filenames with that extension (hence the grep -v /search). I tried to avoid listing each HTML hex code during the strip-down process, but I couldn't get gawk to work with a regular expression. I tried something like gawk '{ /%[0-9][A-F]/; print $(NF)}'. Any help would be greatly appreciated! Below is what I have so far, followed by the desired results and then the sample data I have been using.

gawk '{print $7|"sort"}' httplog.txt|grep '\.rar'|grep -v '\/search'|gawk -F \.rar '{print $1 ".rar"}'|gawk -F \/ '{print $(NF)}'|gawk -F \= '{print $(NF)}'|gawk -F %2F '{print $(NF)}'|gawk -F %3B '{print $(NF)}'|gawk -F %252B '{print $(NF)}'|gawk -F %2B '{print $(NF)}'|gawk -F + '{print $(NF)}'|gawk -F html '{print $(NF)}'|sort|uniq

Replace.Studio.Business.Edition.v7.5.Retail-FOSI.rar
Replace.Studio.Pro.v7.5.Retail-FOSI.rar

test data
-----------httplog.txt---------------

http://184.84.69.115/live/t00/250lo....Studio.Business.Edition.v7.5.Retail-FOSI.rar

http://199.7.177.222/dl/98581971/748293d/Replace.Studio.Business.Edition.v7.5.Retail-FOSI.rar

http://213.244.183.200/s11.g?login=....Studio.Business.Edition.v7.5.Retail-FOSI.rar

http://64.233.169.101/__utm.gif?utm....Studio.Business.Edition.v7.5.Retail-FOSI.rar

http://64.233.169.101/__utm.gif?utm...l4573/Replace.Studio.Pro.v7.5.Retail-FOSI.rar

http://64.233.169.101/__utm.gif?utm....Studio.Business.Edition.v7.5.Retail-FOSI.rar

http://64.233.169.102/complete/search?q=Replace.Studio.Business.Edition.v7.5.Retail-FOSI.rar

http://64.233.169.132/search?q=cach....Studio.Business.Edition.v7.5.Retail-FOSI.rar

http://69.46.36.6/click/?&h=http://....Studio.Business.Edition.v7.5.Retail-FOSI.rar

http://74.125.113.99/search?hl=&q=Replace.Studio.Business.Edition.v7.5.Retail-FOSI.rar

http://74.125.113.99/search?q=Replace.Studio.Business.Edition.v7.5.Retail-FOSI.rar

http://74.125.113.99/url?sa=t&sourc....Studio.Business.Edition.v7.5.Retail-FOSI.rar

http://78.140.152.122/en/files/tspll4573/Replace.Studio.Pro.v7.5.Retail-FOSI.rar

http://78.140.152.122/files/tspll4573/Replace.Studio.Pro.v7.5.Retail-FOSI.rar

http://88.212.196.66/hit?q;t44.6;rh....Studio.Business.Edition.v7.5.Retail-FOSI.rar

http://88.212.196.66/hit?t44.6;rhtt....Studio.Business.Edition.v7.5.Retail-FOSI.rar

http://98.136.154.147/imp?Z=120x600....Studio.Business.Edition.v7.5.Retail-FOSI.rar

http://98.136.154.147/imp?Z=160x600....Studio.Business.Edition.v7.5.Retail-FOSI.rar

http://98.136.154.148/imp?Z=160x600....Studio.Business.Edition.v7.5.Retail-FOSI.rar

-----------httplog.txt---------------

Annihilannic · Sep 5, 2011

It seems like the sample data isn't exactly what you started with since the initial gawk '{print $7... already ends up with blank data.

In any case... using just the provided test data this should do the trick:

Code:

sed 's/%25/%/g;s/%../_/g' httplog.txt | egrep -o '[[:alnum:].-]+\.rar' | sort | uniq

I'm assuming you have GNU grep available (with the -o option) since you are using GNU awk.
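If grep -o turns out not to be available, roughly the same extraction can be sketched in awk alone, which also covers the earlier question about using a regular expression for the hex codes. This is only a sketch under assumptions: the two sample lines and the example.invalid host are made up for illustration, and the bracket expression is spelled out as A-Za-z0-9 in case the awk in use does not support [[:alnum:]].

```shell
# Hypothetical two-line sample; the second line carries a doubly encoded "/" (%252F).
sample='http://example.invalid/files/One.v1-X.rar
http://example.invalid/get?f=dir%252FTwo.v2-Y.rar'

names=$(printf '%s\n' "$sample" | awk '
{
    gsub(/%25/, "%")      # undo the double encoding first
    gsub(/%../, "_")      # then replace every remaining %XX with an underscore
    # Print each run of filename characters that ends in .rar,
    # then continue searching in the rest of the line.
    while (match($0, /[A-Za-z0-9.-]+\.rar/)) {
        print substr($0, RSTART, RLENGTH)
        $0 = substr($0, RSTART + RLENGTH)
    }
}' | sort -u)

printf '%s\n' "$names"
```

With the sample above this prints One.v1-X.rar and Two.v2-Y.rar, one per line. To run it against the real log, drop the printf and pass httplog.txt as the file argument to awk.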

Annihilannic
[small]tgmlify - code syntax highlighting for your tek-tips posts[/small]

Cybex1 · Sep 5, 2011

Annihilannic,

Good catch on the truncated data (print $7), I completely missed that when I posted... That said, thank you for your help! I like the code. I don't fully understand it, but it does work. The only thing I see is that it looks as though it is pulling browser queries as well. I did a 'uniq -c' and the counts from your code are higher than mine. What I can't have is a filename that was only searched for and not downloaded. I know there is no way to avoid all false positives, but I need to keep the “/search” records from being processed. It's hard to show here because the file that was searched for was later downloaded... Any thoughts on how to address this part? I just picked up the sed and awk book, so I look forward to trying to figure out your solution.

Annihilannic · Sep 5, 2011

Is anything more complex than a grep -v search in the pipeline required? I'm not sure whether you have more complex cases than that...

The sed just replaces all %25s with a % (since some of the "special" characters have been escaped twice) and then replaces any encoded character with an underscore, assuming that is a character that doesn't appear in your target filenames.
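To see the two substitutions one at a time, here is a small sketch that runs them separately on a single made-up line. The URL and the example.invalid host are hypothetical; only the sed expressions and the grep pattern come from the pipeline above, and grep -E is used as the equivalent of egrep.

```shell
# Hypothetical sample line: %3D is an encoded "=" and %252F a doubly encoded "/".
line='http://example.invalid/get?f%3Dfiles%252FReplace.Studio.Pro.v7.5.Retail-FOSI.rar'

# Step 1: collapse the double encoding (%25 -> %), so %252F becomes %2F.
step1=$(printf '%s\n' "$line" | sed 's/%25/%/g')

# Step 2: turn every remaining %XX sequence into an underscore.
step2=$(printf '%s\n' "$step1" | sed 's/%../_/g')

# Step 3: keep only the run of filename characters that ends in .rar.
name=$(printf '%s\n' "$step2" | grep -E -o '[[:alnum:].-]+\.rar')

printf '%s\n%s\n%s\n' "$step1" "$step2" "$name"
```

The last line printed is Replace.Studio.Pro.v7.5.Retail-FOSI.rar, because the underscore left behind by %2F is not in the bracket expression and so marks where the filename starts.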


Cybex1 · Sep 5, 2011

I thought the same thing after I looked at what the sed command was doing. This is what I came up with...

sed 's/%25/%/g;s/%../_/g' httplog.txt | grep -v '/search' | egrep -o '[[:alnum:].-]+\.rar' | sort | uniq

Again, Thank you!

Cybex1 · Sep 5, 2011

How would I use something other than an underscore? I am seeing some issues with filenames being cut short because the original filename has an underscore in it. I looked at the sed syntax but I can't see how to substitute it.

Annihilannic · Sep 5, 2011

Change the underscore in the sed script to another character that you don't expect to find in any of the filenames... but also add an underscore to the egrep regexp (after the hyphen for example) to include it as a valid filename character... as well as any other characters you expect to find in a filename.
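As a rough illustration of that change, the sketch below compares the original expressions with an adjusted pair on one made-up line. The URL, the filename My_Archive-1.0.rar and the choice of "|" as the stand-in character are all assumptions for the example; pick whatever character never occurs in your filenames. grep -E is used as the equivalent of egrep.

```shell
# Hypothetical sample line: underscore in the filename, encoded "/" (%2F) before it.
line='http://example.invalid/dl?p=files%2FMy_Archive-1.0.rar'

# Original expressions: "_" is the stand-in and is not in the bracket
# expression, so the match starts after the underscore in the filename.
short=$(printf '%s\n' "$line" | sed 's/%25/%/g;s/%../_/g' | grep -E -o '[[:alnum:].-]+\.rar')

# Adjusted expressions: "|" is the stand-in (assumed never to occur in a filename)
# and "_" is added to the bracket expression, before the trailing hyphen.
full=$(printf '%s\n' "$line" | sed 's/%25/%/g;s/%../|/g' | grep -E -o '[[:alnum:]._-]+\.rar')

printf 'before: %s\nafter:  %s\n' "$short" "$full"
```

Here the first version yields Archive-1.0.rar and the adjusted one yields My_Archive-1.0.rar. The hyphen stays last inside the brackets so that it is taken literally instead of forming a range.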

