Hi,
I currently edit my log files by hand. The data in them is in the following format:
1085961616.474 172 190.104.253.84 TCP_MISS/200 2146 GET [07fiG10lKmU4M3R:IQ.TCH] - DIRECT/63.236.66.14 text/html
1085961622.602 60 68.22.217.209 TCP_HIT/200 9476 GET - NONE/- image/gif
1085961627.502 159 190.104.253.84 TCP_REFRESH_HIT/304 339 GET - DIRECT/192.43.217.199 application/x-javascript
1085961648.792 6 12.168.85.240 TCP_MISS/503 1585 GET - NONE/- text/html
I trim these files down so that they contain the URLs only.
I filter the URLs in these steps:
1) I find every line that is anything other than _hit/200 or _miss/200 and remove it from the file.
2) Then I remove all the lines that contain a question mark "?" from the result of step 1.
3) Then, from the result of step 2, I copy the URL only.
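Steps 1 and 2 could be sketched roughly like this with two grep calls. This is only a sketch under assumptions: filter_log is a made-up function name, the match is case-insensitive because the log writes TCP_HIT and TCP_MISS in upper case, and it assumes "_hit/200" or "_miss/200" never shows up inside a URL.

```shell
# filter_log is a hypothetical helper name, not an existing command.
# Usage: filter_log log.txt
filter_log() {
    # Step 1: keep only lines whose result code ends in _HIT/200 or _MISS/200
    #         (-i ignores case, -E enables the (a|b) alternation).
    # Step 2: drop every remaining line that contains a literal "?"
    #         (-v inverts the match, -F treats the pattern as a fixed string).
    grep -iE '_(hit|miss)/200' "$1" | grep -vF '?'
}
```

Something like `filter_log log.txt > filtered.txt` would then write the surviving lines to a new file and leave the original untouched.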
Currently I only have 10 records in the text file, but down the line I'll be receiving these files in megabytes. So I am trying to make my life easier beforehand by writing a shell script.
Here is my algorithm: find the lines that do not contain the expression _hit/200 or _miss/200 and delete them. Then, for the lines that are left, truncate everything before http: and everything after .js, .jsp, .html or .gif. That should leave me with only the URLs.
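The truncation part of that algorithm might look something like the sketch below. The assumptions: extract_urls is a made-up name, each real log line carries one URL starting with http: and containing no spaces, and grep supports the -o option (GNU and BSD grep do, but it is not in POSIX).

```shell
# extract_urls is a hypothetical helper name; it reads log lines on stdin.
extract_urls() {
    # -o prints only the part of each line that matches, which drops
    # everything before "http:" and everything after the file extension.
    # [^[:space:]]* keeps the match inside a single whitespace-free field.
    grep -oE 'http:[^[:space:]]*\.(jsp|js|html|gif)'
}
```

It could be fed the already filtered lines, for example `extract_urls < filtered.txt`, where filtered.txt is a placeholder file name. Lines whose URL does not end in one of the four extensions produce no output.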
I am not so good at script syntax but here's what I've come up with so far.
Code:
grep "(_hit|MISS)/200)*.!\?*"log.txt --
Would someone tell me what I am doing wrong?
Thanks in advance