
Gawk for loop, incrementing counter and fastest speed

Status
Not open for further replies.

madasafish

for MINUTE in `cat timelist`   # timelist: a file with 60 numbers ranging from 00 to 59

do

gawk -v hr=$HOUR -v min=$MINUTE -v rdate=$RDATE 'BEGIN {OFS=",";sstr=hr":"min;maxfs=0;print "DATE","TIME","COUNT","COUNT200","UNDER1","ONE2FIVE","FIVE2TEN","TEN2TWEN","TWEN2FOR","OVER40","HASHCOUNT","BLKED"}
{
if (substr ($3,1,5) == sstr)COUNT++
if (substr ($3,1,5) == sstr && ($14 == 200))COUNT200++
if (substr ($3,1,5) == sstr && ($14 == 200 && $16 < 1))UNDER1++
if (substr ($3,1,5) == sstr && ($14 == 200 && $16 >= 1 && $16 <5))ONE2FIVE++
if (substr ($3,1,5) == sstr && ($14 == 200 && $16 >= 5 && $16 <10))FIVE2TEN++
if (substr ($3,1,5) == sstr && ($14 == 200 && $16 >= 10 && $16 <20))TEN2TWEN++
if (substr ($3,1,2) == sstr && ($14 == 200 && $16 >= 20 && $16 <40))TWEN2FOR++
if (substr ($3,1,2) == sstr && ($14 == 200 && $16 >= 40 && $16 <999))OVER40++
if (substr ($3,1,2) == sstr && ($14 == 200 && $15 == "-"))HASHCOUNT++
if (substr ($3,1,2) == sstr && ($14 == 200 && $15 == 29))BLKED++
if (substr ($3,1,2) == sstr && ($14 == 200 && $15 > maxfs))maxfs=$15
}END{
print rdate,sstr,COUNT+0,COUNT200+0,UNDER1+0,ONE2FIVE+0,FIVE2TEN+0,TEN2TWEN+0,TWEN2FOR+0,OVER40+0,HASHCOUNT+0,BLKED+0,maxfs+0
}' $LOG #>> $REPORT

done
exit 0

1. Can anyone help me put the 60-iteration loop inside the gawk section rather than outside it, as shown above? What is shown above does not produce a good report: the heading is printed on every pass, and I only want one heading.
2. I cannot get the maxfs variable shown above to end up holding the largest value seen in that one-minute analysis.
3. I would appreciate it if this code could be made more efficient. It has to process very large files.

Here are a couple of lines from the log file it is working on.

Jul 8 12:00:25 libprx04.lang.dtv libuser: 10.91.161.3 - - [08/Jul/2011:12:00:25 +0100] "GET HTTP/1.1" 200 18742 0
Jul 8 12:00:25 libprx04.lang.dtv libuser: 10.125.60.150 - - [08/Jul/2011:12:00:25 +0100] "GET HTTP/1.1" 200 347 0

As always, thanks in advance.

Madasafish
 
No wonder it's slow.... it's processing the entire log file 60 times!

What about the hours... are they also controlled by a similar loop? Or are you just running this job once per hour and setting the HOUR variable appropriately?

Assuming that's the case, try something like this:

Code:
LOG=access_log
RDATE=$(date)
HOUR=12

gawk -v hr=$HOUR -v rdate="$RDATE" '
        BEGIN {
                OFS=","
                print "DATE","TIME","COUNT","COUNT200","UNDER1","ONE2FIVE","FIVE2TEN","TEN2TWEN","TWEN2FOR","OVER40","HASHCOUNT","BLKED"
        }
        {
                split($3,a,/:/)
                hour=a[1]
                min=a[2]
        }
        hour==hr {

                COUNT[min]++

                if ($14 == 200) {

                        COUNT200[min]++

                        if ($16 < 1               ) UNDER1[min]++
                        if ($16 >=  1 && $16 <   5) ONE2FIVE[min]++
                        if ($16 >=  5 && $16 <  10) FIVE2TEN[min]++
                        if ($16 >= 10 && $16 <  20) TEN2TWEN[min]++
                        if ($16 >= 20 && $16 <  40) TWEN2FOR[min]++
                        if ($16 >= 40 && $16 < 999) OVER40[min]++

                        if ($15 == "-") HASHCOUNT[min]++
                        if ($15 == 29) BLKED[min]++
                        if ($15 > maxfs[min]) maxfs[min]=$15
                }
        }
        END {
                for (min=0; min<60; min++) {
                        print rdate,sprintf("%02d:%02d",hr,min),COUNT[min]+0,COUNT200[min]+0,
                                UNDER1[min]+0,ONE2FIVE[min]+0,FIVE2TEN[min]+0,TEN2TWEN[min]+0,
                                TWEN2FOR[min]+0,OVER40[min]+0,HASHCOUNT[min]+0,BLKED[min]+0,maxfs[min]+0
                }
        }
' $LOG
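
On question 2: in the per-minute script, the maxfs line compares substr($3,1,2) (two characters, such as 12) with sstr (five characters, such as 12:00), so that condition can never be true and maxfs stays at 0. The array version does not use that test, but since $15 is sometimes "-", it may be safer to force a numeric comparison. A minimal sketch, assuming the same field layout ($3 time, $14 status, $15 size); the sample lines, the host name and the /x URL are made up for illustration:

```shell
# Hypothetical sample: three status-200 requests in minute 12:00
# with sizes 347, "-" and 18742.
sample='Jul 8 12:00:25 host libuser: 10.0.0.1 - - [08/Jul/2011:12:00:25 +0100] "GET /x HTTP/1.1" 200 347 0
Jul 8 12:00:26 host libuser: 10.0.0.1 - - [08/Jul/2011:12:00:26 +0100] "GET /x HTTP/1.1" 200 - 0
Jul 8 12:00:27 host libuser: 10.0.0.1 - - [08/Jul/2011:12:00:27 +0100] "GET /x HTTP/1.1" 200 18742 0'

maxfs_out=$(printf '%s\n' "$sample" | awk '
        {
                min = substr($3, 4, 2)          # minute part of HH:MM:SS
        }
        $14 == 200 {
                # "+ 0" forces a numeric comparison, so "-" counts as 0
                if ($15 + 0 > maxfs[min] + 0) maxfs[min] = $15 + 0
        }
        END {
                for (m in maxfs) print m, maxfs[m]
        }
')
echo "$maxfs_out"
```

With this sample the largest size for minute 00 is kept, and the "-" entry does not disturb it.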


Annihilannic.
 
Thank-you Annihilannic,

Your code is very neat and really does a lot to improve my understanding of how awk/nawk/gawk works. You use these "{}" blocks like "if statements". It is also leagues ahead in terms of speed.

I do have a few more questions which I hope you can answer.

1. If I run the code against the $LOG, there are 0 entries for the first ten minutes, "00" to "09". Thereafter everything looks good.

2. Before I get to the gawk section (shown above) of the script, I grep out certain lines first to create the $LOG. I do this in the belief that it will be quicker to process.
For example: egrep 'string1|string2|string3|string4' ./libproxy/libproxy_11-07-08_12:00.log > $LOG. Looking at your code, it seems I could cater for this at the "if ($14 == 200) {" line. Am I right?

Once again, thank-you for your valuable assistance with this.

Madasafish

 
I highly recommend The GNU Awk User's Guide, especially the Getting Started section, to get a brief overview of how awk works and to understand its general pattern { action } syntax.
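
As a quick illustration of that structure, here is a minimal sketch; the three input lines are made up. Each rule is a pattern followed by an action in braces, and the action runs only for input lines where the pattern is true.

```shell
rules_out=$(printf '%s\n' 'a 200' 'b 404' 'c 200' | awk '
        $2 == 200 { ok++ }                  # runs only when field 2 is 200
        $2 != 200 { bad++ }                 # runs for every other line
        END       { print ok + 0, bad + 0 } # END runs once, after the last line
')
echo "$rules_out"
```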

1) Oops. This updated END clause should fix it by prefixing the missing 0s that are used in the array indices.


Code:
        END {
                for (min=0; min<60; min++) {
                        min=sprintf("%02d",min)
                        print rdate,sprintf("%02d:%02d",hr,min),COUNT[min]+0,COUNT200[min]+0,
                                UNDER1[min]+0,ONE2FIVE[min]+0,FIVE2TEN[min]+0,TEN2TWEN[min]+0,
                                TWEN2FOR[min]+0,OVER40[min]+0,HASHCOUNT[min]+0,BLKED[min]+0,maxfs[min]+0
                }
        }
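
The reason the first version printed zeros for minutes 00 to 09 is that awk array subscripts are strings: the key "05" taken from the timestamp and the number 5 from the loop counter are different elements. A small sketch with made-up values:

```shell
index_out=$(awk 'BEGIN {
        count["05"] = 7                 # key as split() delivers it from "12:05:25"
        n = 5
        printf "%d ", count[n] + 0      # looks up "5", a different element
        printf "%d\n", count[sprintf("%02d", n)] + 0   # looks up "05"
}')
echo "$index_out"
```

From minute 10 onwards the numeric and string forms coincide, which is why only the first ten rows were affected.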

2) For this I would add another clause after the BEGIN clause:

Code:
        !/string1|string2|string3|string4/ { next }

This is for efficiency reasons... it "short-cuts" to process the next input line when we know we're not interested in this one.
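
A minimal sketch of where that clause sits and what it does; string1 and string2 stand in for the real search strings, and the input lines are invented:

```shell
next_out=$(printf '%s\n' 'string1 a' 'other b' 'string2 c' | awk '
        BEGIN { OFS = "," }
        !/string1|string2/ { next }     # skip this line; later rules never see it
        { seen++ }                      # only reached for matching lines
        END { print NR, seen + 0 }      # NR still counts every line read
')
echo "$next_out"
```

Because rules are tried in order, the filter needs to come before the rules that do the counting.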

Annihilannic.
 
Thank-you Annihilannic.

Your code above works perfectly with my Cygwin gawk and also on my home Linux box running Fedora's gawk. I have tweaked it to remove RDATE and included your other suggestions, and it all runs fast and correctly on both platforms. However, if I take your original code above, just to prove the point, replace gawk with nawk and #!/bin/bash with #!/bin/ksh, and try to run it on the office Solaris box (SunOS 5.8), it freezes at...

DATE,TIME,COUNT,COUNT200,UNDER1,ONE2FIVE,FIVE2TEN,TEN2TWEN,TWEN2FOR,OVER40,HASHCOUNT,BLKED

as does my tweaked version of it.

output below using "set -x"


./chkvod 06
+ + basename ./chkvod
USAGE=chkvod {hour} 00|06|11|23
+ LOGDIR=/tmp
+ HOUR=06
+ + date +%y-%m-%d
DATE=11-07-14
+ PROXYLOG=/tmp/libproxy_11-07-14_06:00.log
+ [ 1 = 0 ]
+ [ ! -f /tmp/libproxy_11-07-14_06:00.log ]
+ + date
RDATE=Thu Jul 14 17:36:59 BST 2011
+ nawk -v hr=06 -v rdate=Thu Jul 14 17:36:59 BST 2011
BEGIN {
OFS=","
print "DATE","TIME","COUNT","COUNT200","UNDER1","ONE2FIVE","FIVE2TEN","TEN2TWEN","TWEN2FOR","OVER40","HASHCOUNT","BLKED"
}
{
split($3,a,/:/)
hour=a[1]
min=a[2]
}
hour==hr {

COUNT[min]++

if ($14 == 200) {

COUNT200[min]++

if ($16 < 1 ) UNDER1[min]++
if ($16 >= 1 && $16 < 5) ONE2FIVE[min]++
if ($16 >= 5 && $16 < 10) FIVE2TEN[min]++
if ($16 >= 10 && $16 < 20) TEN2TWEN[min]++
if ($16 >= 20 && $16 < 40) TWEN2FOR[min]++
if ($16 >= 40 && $16 < 999) OVER40[min]++

if ($15 == "-") HASHCOUNT[min]++
if ($15 == 29) BLKED[min]++
if ($15 > maxfs[min]) maxfs[min]=$15
}
}
END {
for (min=0; min<60; min++) {
print rdate,sprintf("%02d:%02d",hr,min),COUNT[min]+0,COUNT200[min]+0,
UNDER1[min]+0,ONE2FIVE[min]+0,FIVE2TEN[min]+0,TEN2TWEN[min]+0,
TWEN2FOR[min]+0,OVER40[min]+0,HASHCOUNT[min]+0,BLKED[min]+0,maxfs[min]+0
}
}
/tmp/libproxy_11-07-14_06:00.log
DATE,TIME,COUNT,COUNT200,UNDER1,ONE2FIVE,FIVE2TEN,TEN2TWEN,TWEN2FOR,OVER40,HASHCOUNT,BLKED

I appreciate you may not have a Solaris box to test this on but would welcome any suggestion from other readers.

Once again, thanks in advance.

Madasafish


 
Hi Annihilannic,

Please ignore my post above. It seems there is a different issue: if I had waited 4 minutes I would have seen the results, but I kept interrupting the run before it completed, thinking it had frozen.

When I "time" the code using the same log file on my cygwin/Fedora box I get...
real 0m1.831s
user 0m1.372s
sys 0m0.262s

on Solaris box I get...
real 4m28.76s
user 1m58.49s
sys 0m3.04s

Mmmmmm...I wonder why it is orders of magnitude slower on Solaris?

Once again, thanks for your terrific assistance and ignore the immediate post above.

Madasafish



 