Filename parsing from logs

Status
Not open for further replies.

Cybex1

Technical User
Sep 3, 2011
33
US
I need to parse a log file and extract all filenames with a given file extension. I have something ugly that works, but I know there has to be a better way. I need to exclude records that are searches for filenames with that extension (hence the grep -v /search). I tried to avoid listing each URL-encoded hex code during the strip-down process, but I couldn't get gawk to work with a regular expression. I tried something like: gawk '{ /%[0-9][A-F]/; print $(NF)}'. Any help would be greatly appreciated! Below is what I have so far, followed by the desired results. Below that is the sample data I have been using.



gawk '{print $7|"sort"}' httplog.txt|grep '\.rar'|grep -v '\/search'|gawk -F \.rar '{print $1 ".rar"}'|gawk -F \/ '{print $(NF)}'|gawk -F \= '{print $(NF)}'|gawk -F %2F '{print $(NF)}'|gawk -F %3B '{print $(NF)}'|gawk -F %252B '{print $(NF)}'|gawk -F %2B '{print $(NF)}'|gawk -F + '{print $(NF)}'|gawk -F html '{print $(NF)}'|sort|uniq

Replace.Studio.Business.Edition.v7.5.Retail-FOSI.rar
Replace.Studio.Pro.v7.5.Retail-FOSI.rar

test data
-----------httplog.txt---------------
-----------httplog.txt---------------
 
It seems like the sample data isn't exactly what you started with since the initial gawk '{print $7... already ends up with blank data.

In any case... using just the provided test data this should do the trick:

Code:
sed 's/%25/%/g;s/%../_/g' httplog.txt | egrep -o '[[:alnum:].-]+\.rar' | sort | uniq

I'm assuming you have GNU grep available (with the -o option) since you are using GNU awk.

Annihilannic
[small]tgmlify - code syntax highlighting for your tek-tips posts[/small]
 
Annihilannic,

Good catch on the truncated data (print $7); I completely missed that when I posted. That said, thank you for your help! I like the code. I don't fully understand it, but it does work. The only issue I see is that it appears to pick up browser search queries as well: I ran 'uniq -c' and the counts from your code are higher than mine. What I can't have is a filename that was only searched for and not downloaded. I know there is no way to avoid all false positives, but I need to keep the "/search" records from being processed. It's hard to show here because the file that was searched for was later downloaded. Any thoughts on how to address this part? I just picked up the Sed and Awk book, so I look forward to working through your solution.
 
Should anything more complex than a grep -v search in the pipeline be required? I'm not sure whether you have more complex cases than that...

The sed just replaces all %25s with a % (since some of the "special" characters have been escaped twice) and then replaces any encoded character with an underscore, assuming that is a character that doesn't appear in your target filenames.
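To see the two substitutions in isolation, here is a minimal sketch run against a single made-up request path. The path and the variable names are hypothetical (they are not from the original log), and it assumes a grep that supports -o. It uses grep -E, which is equivalent to egrep.

```shell
# Hypothetical request path: the "+" was escaped twice, so it shows up
# as %252B instead of %2B.
line='/download?file=%2Ffiles%252BExample.Tool.v1.0.rar'

# Pass 1: %25 -> %, which turns the double-escaped %252B into %2B.
pass1=$(printf '%s\n' "$line" | sed 's/%25/%/g')

# Pass 2: every remaining "%" plus the two characters after it becomes "_".
pass2=$(printf '%s\n' "$pass1" | sed 's/%../_/g')

# The "_" now acts as a separator, so the regexp only matches the filename.
name=$(printf '%s\n' "$pass2" | grep -oE '[[:alnum:].-]+\.rar')

printf '%s\n%s\n%s\n' "$pass1" "$pass2" "$name"
```

The last line printed is the extracted filename; the first two show the intermediate state after each sed pass.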

Annihilannic
 
I thought the same thing after I looked at what the sed command was doing. This is what I came up with...


sed 's/%25/%/g;s/%../_/g' httplog.txt | grep -v '/search' | egrep -o '[[:alnum:].-]+\.rar' | sort | uniq


Again, Thank you!
 
How would I use something other than an underscore? I am seeing some filenames cut short because the original filename has an underscore in it. I looked at the sed syntax but I can't see how to substitute it.
 
Change the underscore in the sed script to another character that you don't expect to find in any of the filenames, and also add an underscore to the egrep regexp so that it counts as a valid filename character, along with any other characters you expect to find in a filename. Put the additions before the hyphen: the hyphen has to stay last inside the brackets, otherwise it is read as a range.
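As a rough illustration of that change, the sketch below compares the two variants on one made-up path. The path, the variable names, and the choice of "|" as the replacement character are all assumptions; pick whatever character cannot occur in your filenames. The hyphen is kept last inside the brackets, because something like '.-_' would be read as a range.

```shell
# Hypothetical request path whose filename contains an underscore.
line='/download?file=%2Ffiles%2FMy_Tool.v2.0-GRP.rar'

# Original variant: "_" is the placeholder and is not in the bracket
# expression, so the match starts after the underscore in the filename.
short=$(printf '%s\n' "$line" | sed 's/%25/%/g;s/%../_/g' \
        | grep -oE '[[:alnum:].-]+\.rar')

# Adjusted variant: "|" is the placeholder (assumed never to appear in a
# filename) and "_" is added to the bracket expression before the hyphen.
full=$(printf '%s\n' "$line" | sed 's/%25/%/g;s/%../|/g' \
       | grep -oE '[[:alnum:]._-]+\.rar')

printf 'before: %s\nafter:  %s\n' "$short" "$full"
```

The "before" line shows the truncated name and the "after" line shows the complete one.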

Annihilannic
 