Gawk for loop, incrementing counter and fastest speed 1

madasafish · Jul 10, 2011

#for MINUTE in `cat timelist` A file with 60 numbers in ranging from 00 to 59

do

gawk -v hr=$HOUR -v min=$MINUTE -v rdate=$RDATE 'BEGIN {OFS=",";sstr=hr":"min;maxfs=0;print "DATE","TIME","COUNT","COUNT200","UNDER1","ONE2FIVE","FIVE2TEN","TEN2TWEN","TWEN2FOR","OVER40","HASHCOUNT","BLKED"}
{
if (substr ($3,1,5) == sstr)COUNT++
if (substr ($3,1,5) == sstr && ($14 == 200))COUNT200++
if (substr ($3,1,5) == sstr && ($14 == 200 && $16 < 1))UNDER1++
if (substr ($3,1,5) == sstr && ($14 == 200 && $16 >= 1 && $16 <5))ONE2FIVE++
if (substr ($3,1,5) == sstr && ($14 == 200 && $16 >= 5 && $16 <10))FIVE2TEN++
if (substr ($3,1,5) == sstr && ($14 == 200 && $16 >= 10 && $16 <20))TEN2TWEN++
if (substr ($3,1,2) == sstr && ($14 == 200 && $16 >= 20 && $16 <40))TWEN2FOR++
if (substr ($3,1,2) == sstr && ($14 == 200 && $16 >= 40 && $16 <999))OVER40++
if (substr ($3,1,2) == sstr && ($14 == 200 && $15 == "-"))HASHCOUNT++
if (substr ($3,1,2) == sstr && ($14 == 200 && $15 == 29))BLKED++
if (substr ($3,1,2) == sstr && ($14 == 200 && $15 > maxfs))maxfs=$15
}END{
print rdate,sstr,COUNT+0,COUNT200+0,UNDER1+0,ONE2FIVE+0,FIVE2TEN+0,TEN2TWEN+0,TWEN2FOR+0,OVER40+0,HASHCOUNT+0,BLKED+0,maxfs+0
}' $LOG #>> $REPORT

done
exit 0

1. Can anyone help me move the minute loop (00 to 59) inside the gawk section rather than outside it as shown above? The version above does not produce a good report: I only want one heading.
2. I cannot get the maxfs variable shown above to end up holding the largest value seen in each one-minute analysis.
3. I would appreciate it if this code could be made more efficient. It has to process very large files.

Here are a couple of lines from the log file it is working on.

Jul 8 12:00:25 libprx04.lang.dtv libuser: 10.91.161.3 - - [08/Jul/2011:12:00:25 +0100] "GET http://nes.is.dtv/dtv/app/services/telewest/vod/20061013/20110101/playback.js HTTP/1.1" 200 18742 0
Jul 8 12:00:25 libprx04.lang.dtv libuser: 10.125.60.150 - - [08/Jul/2011:12:00:25 +0100] "GET http://zap-mss07025.dtv.ntl:9090/st...rent_HUID=TWT1766069&hierarchy_UID=TWT1766069 HTTP/1.1" 200 347 0

As always, thanks in advance.

Madasafish

Annihilannic · Jul 11, 2011

No wonder it's slow.... it's processing the entire log file 60 times!

What about the hours... are they also controlled by a similar loop? Or are you just running this job once per hour and setting the HOUR variable appropriately?

Assuming that's the case, try something like this:

Code:

LOG=access_log
RDATE=$(date)
HOUR=12

gawk -v hr=$HOUR -v rdate="$RDATE" '
        BEGIN {
                OFS=","
                print "DATE","TIME","COUNT","COUNT200","UNDER1","ONE2FIVE","FIVE2TEN","TEN2TWEN","TWEN2FOR","OVER40","HASHCOUNT","BLKED"
        }
        {
                split($3,a,/:/)
                hour=a[1]
                min=a[2]
        }
        hour==hr {

                COUNT[min]++

                if ($14 == 200) {

                        COUNT200[min]++

                        if ($16 < 1               ) UNDER1[min]++
                        if ($16 >=  1 && $16 <   5) ONE2FIVE[min]++
                        if ($16 >=  5 && $16 <  10) FIVE2TEN[min]++
                        if ($16 >= 10 && $16 <  20) TEN2TWEN[min]++
                        if ($16 >= 20 && $16 <  40) TWEN2FOR[min]++
                        if ($16 >= 40 && $16 < 999) OVER40[min]++

                        if ($15 == "-") HASHCOUNT[min]++
                        if ($15 == 29) BLKED[min]++
                        if ($15 > maxfs[min]) maxfs[min]=$15
                }
        }
        END {
                for (min=0; min<60; min++) {
                        print rdate,sprintf("%02d:%02d",hr,min),COUNT[min]+0,COUNT200[min]+0,
                                UNDER1[min]+0,ONE2FIVE[min]+0,FIVE2TEN[min]+0,TEN2TWEN[min]+0,
                                TWEN2FOR[min]+0,OVER40[min]+0,HASHCOUNT[min]+0,BLKED[min]+0,maxfs[min]+0
                }
        }
' $LOG
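
On the per-minute maximum: the comparison `$15 > maxfs[min]` can fall back to a string comparison when the field holds "-" rather than a number. A minimal sketch of forcing a numeric comparison is below. The two-column sample data and the names `size`, `max` and `max_per_minute` are made up for illustration and are not the real log layout.

```shell
# Hypothetical sample data: field 1 is hh:mm:ss, field 2 is a size or "-".
max_per_minute=$(printf '%s\n' \
    '12:00:25 18742' \
    '12:00:40 -' \
    '12:00:59 347' \
    '12:01:03 9' \
    '12:01:30 120' |
awk '
    {
        split($1, t, ":")          # t[2] is the minute, e.g. "00"
        size = $2 + 0              # force a numeric value; "-" becomes 0
        if (size > max[t[2]]) max[t[2]] = size
    }
    END {
        for (m = 0; m < 2; m++) {
            key = sprintf("%02d", m)   # zero-padded key to match "00", "01"
            print key, max[key] + 0
        }
    }
')
echo "$max_per_minute"
```

Adding 0 to the field is the usual awk idiom for this; whether it matters for the real log depends on how often field 15 is non-numeric.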

Annihilannic.

madasafish · Jul 12, 2011

Thank you Annihilannic,

Your code is very neat and really does a lot to improve my understanding of how awk/nawk/gawk works. You use these "{}" blocks like "if" statements. It is also leagues ahead in terms of speed.

I do have a few more questions which I hope you can answer.

1. When I run the code against $LOG, there are 0 entries for minutes "00" to "09". Thereafter everything looks good.

2. Before I get to the gawk section (shown above) of the script, I grep out certain lines first to create $LOG. I do this in the belief that it will be quicker to process,
e.g. egrep 'string1|string2|string3|string4' ./libproxy/libproxy_11-07-08_12:00.log > $LOG. Looking at your code, I think I could cater for this at the "if ($14 == 200) {" line. Am I right?

Once again, thank-you for your valuable assistance with this.

Madasafish

Annihilannic · Jul 13, 2011

I highly recommend The GNU Awk User's Guide, especially the Getting Started section, to get a brief overview of how awk works and to understand its general pattern { action } syntax.
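
As a small illustration of that syntax, here is a sketch showing that a pattern in front of a block behaves like an "if" wrapped around the block's contents. The three-line input and the variable names are invented for the example.

```shell
# Hypothetical input: field 1 is a status code, field 2 is a name.
input=$(printf '%s\n' '200 alpha' '404 beta' '200 gamma')

# Pattern form: the action runs only for lines where the pattern is true.
with_pattern=$(printf '%s\n' "$input" | awk '$1 == 200 { print $2 }')

# Equivalent form: no pattern, so the action runs on every line and tests inside.
with_if=$(printf '%s\n' "$input" | awk '{ if ($1 == 200) print $2 }')

echo "$with_pattern"
echo "$with_if"
```

Both commands should print the same two names, alpha and gamma.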

1) Oops. This updated END clause should fix it by prefixing the missing 0s that are used in the array indices.

Code:

        END {
                for (min=0; min<60; min++) {
                        min=sprintf("%02d",min)
                        print rdate,sprintf("%02d:%02d",hr,min),COUNT[min]+0,COUNT200[min]+0,
                                UNDER1[min]+0,ONE2FIVE[min]+0,FIVE2TEN[min]+0,TEN2TWEN[min]+0,
                                TWEN2FOR[min]+0,OVER40[min]+0,HASHCOUNT[min]+0,BLKED[min]+0,maxfs[min]+0
                }
        }
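
The reason minutes 00 to 09 came out as zero is that awk array subscripts are strings, so "05" and 5 are different keys. A minimal sketch of the effect, using a made-up timestamp and a throwaway array called `count`:

```shell
key_check=$(awk 'BEGIN {
    split("12:05:25", t, ":")      # t[2] is the string "05", as in the log timestamps
    count[t[2]]++                  # stored under the key "05"

    plain = (5 in count)           # the number 5 is looked up as "5"
    k = sprintf("%02d", 5)         # zero-padded to "05"
    padded = (k in count)
    print plain, padded
}')
echo "$key_check"    # prints: 0 1
```

Minutes 10 to 59 were unaffected because their numeric and zero-padded forms are the same string.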

2) For this I would add another clause after the BEGIN clause:

Code:

        !/string1|string2|string3|string4/ { next }

This is for efficiency reasons... it "short-cuts" to process the next input line when we know we're not interested in this one.
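
A rough sketch of how that short-cut behaves, with three invented input lines and only two of the placeholder strings:

```shell
kept=$(printf '%s\n' 'a string1 x' 'b other x' 'c string3 x' |
awk '
    !/string1|string3/ { next }    # not interesting: skip all remaining rules
    { n++ }                        # only reached for lines that matched above
    END { print n + 0 }
')
echo "$kept"    # prints: 2
```

Because rules run in order, placing the filter first means later rules never see the rejected lines.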

Annihilannic.

madasafish · Jul 14, 2011

Thank you Annihilannic.

Your code above works perfectly using my Cygwin gawk and also on my home Linux box running Fedora gawk. I have tweaked it to remove RDATE and included your other suggestions. It all works fast and perfectly on those platforms. However, if I take your original code above (just to prove the point), replace gawk with nawk and #!/bin/bash with #!/bin/ksh, and try to run it on the office Solaris box (SunOS 5.8), it freezes at...

DATE,TIME,COUNT,COUNT200,UNDER1,ONE2FIVE,FIVE2TEN,TEN2TWEN,TWEN2FOR,OVER40,HASHCOUNT,BLKED

as does my tweaked version of it.

output below using "set -x"

./chkvod 06
+ + basename ./chkvod
USAGE=chkvod {hour} 00|06|11|23
+ LOGDIR=/tmp
+ HOUR=06
+ + date +%y-%m-%d
DATE=11-07-14
+ PROXYLOG=/tmp/libproxy_11-07-14_06:00.log
+ [ 1 = 0 ]
+ [ ! -f /tmp/libproxy_11-07-14_06:00.log ]
+ + date
RDATE=Thu Jul 14 17:36:59 BST 2011
+ nawk -v hr=06 -v rdate=Thu Jul 14 17:36:59 BST 2011
BEGIN {
OFS=","
print "DATE","TIME","COUNT","COUNT200","UNDER1","ONE2FIVE","FIVE2TEN","TEN2TWEN","TWEN2FOR","OVER40","HASHCOUNT","BLKED"
}
{
split($3,a,/:/)
hour=a[1]
min=a[2]
}
hour==hr {

COUNT[min]++

if ($14 == 200) {

COUNT200[min]++

if ($16 < 1 ) UNDER1[min]++
if ($16 >= 1 && $16 < 5) ONE2FIVE[min]++
if ($16 >= 5 && $16 < 10) FIVE2TEN[min]++
if ($16 >= 10 && $16 < 20) TEN2TWEN[min]++
if ($16 >= 20 && $16 < 40) TWEN2FOR[min]++
if ($16 >= 40 && $16 < 999) OVER40[min]++

if ($15 == "-") HASHCOUNT[min]++
if ($15 == 29) BLKED[min]++
if ($15 > maxfs[min]) maxfs[min]=$15
}
}
END {
for (min=0; min<60; min++) {
print rdate,sprintf("%02d:%02d",hr,min),COUNT[min]+0,COUNT200[min]+0,
UNDER1[min]+0,ONE2FIVE[min]+0,FIVE2TEN[min]+0,TEN2TWEN[min]+0,
TWEN2FOR[min]+0,OVER40[min]+0,HASHCOUNT[min]+0,BLKED[min]+0,maxfs[min]+0
}
}
/tmp/libproxy_11-07-14_06:00.log
DATE,TIME,COUNT,COUNT200,UNDER1,ONE2FIVE,FIVE2TEN,TEN2TWEN,TWEN2FOR,OVER40,HASHCOUNT,BLKED

I appreciate you may not have a Solaris box to test this on but would welcome any suggestion from other readers.

Once again, thanks in advance.

Madasafish

madasafish · Jul 14, 2011

Hi Annihilannic,

Please ignore my post above. It seems there is a different issue: if I had waited 4 minutes I would have seen the results, but I kept interrupting the processing before it completed, thinking it had frozen.

When I "time" the code using the same log file on my Cygwin/Fedora box I get...
real 0m1.831s
user 0m1.372s
sys 0m0.262s

On the Solaris box I get...
real 4m28.76s
user 1m58.49s
sys 0m3.04s

Mmmmmm... I wonder why it is orders of magnitude slower on Solaris?

Once again, thanks for your terrific assistance, and please ignore the post immediately above.

Madasafish
