Sub-totalling an array

Status
Not open for further replies.

madasafish

Technical User
Jul 18, 2006
Code:
gawk ' BEGIN { sitea=0 ; siteb=0 }
        {
                if ($1 ~ "192.168.6") { sitea++ } else { siteb++ }

                x[sitea","siteb","$7","$9]
        }

        END {
                for (i in x)
                print i

        }' $FILE


Needless to say, the above code does not give me the required result.
It reads an Apache access log, and I am seeking the TOTAL http hits for "sitea" and the TOTAL http hits for "siteb" for each URL and response code.

Example report:
SITEA,SITEB,URL,HTTP Response Code
29,53,/apps/tube/icon_Tube.png,200
60,102,/apps/mix/icon_mix.png,200
389,536,/publish/images/vpl.png,404
etc....

As always thanks in advance,

Madasafish
 
Hi Feherke,

I am using gawk V4. I believe it was you who introduced me to "patsplit" on another thread.

Madasafish
 
gawk --version
GNU Awk 4.0.0
Copyright (C) 1989, 1991-2011 Free Software Foundation.
 
Hi

Madasafish said:
I am using gawk V4.
Great. Then the real multidimensional array will do the job. I would prefer this way :
Code:
awk -vOFS=, '{x[$1][$7][$9]++}END{for(h in x)for(u in x[h])for(s in x[h][u])print x[h][u][s],h,u,s}' /var/log/httpd/access_log
Which produces count,host,path,status output like this :
Code:
1,192.168.0.1,/mustache/syntax.htm,200
1,192.168.6.1,/mustache/syntax.htm,200
1,192.168.0.1,/mustache/style.css,200
5,192.168.6.1,/mustache/style.css,304
7,192.168.6.1,/mustache/style.css,200

To exactly reproduce your sitea,siteb,path,status output format :
Code:
awk -vOFS=, '{x[$7][$9][$1~/192\.168\.6/?"a":"b"]++}END{for(u in x)for(s in x[u])print x[u][s]["a"]+0,x[u][s]["b"]+0,u,s}' /var/log/httpd/access_log
Which produces this output from the same input data :
Code:
1,1,/mustache/syntax.htm,200
5,0,/mustache/style.css,304
7,1,/mustache/style.css,200


Feherke.
feherke.github.com/
 
Absolutely brilliant!

Thank you very much, Feherke. A payment to the club is long overdue.

As always, now that the hard work has been done I think I can embellish it, only to find I get stuck again.

As it's a 3852-line CSV file, it lends itself to spreadsheet filters. I am trying to add a couple of filter columns and cannot understand why it will not work. In your print statement you print "u". If I try to split "u" or match a string in "u", it stops working, and I do not understand why.

Here is the code, which works only if my "embellishments" are commented out.
Code:
gawk -vOFS=, '{
                x[$7][$9][$1~/192\.168\.2/?"a":"b"]++
        }
        END {
                for(u in x)
                for(s in x[u])

                #split(u,z,"/")
                #if (z[2] ~ /debug/) ft1="Debug"
                #if (z[3] ~ /vod/) ft1="Vod"
                #if (z[4] ~ /flashapp.xml/) ft1="FlashApp"
                #if (u ~ /SSI|Sky|sky/) ft2="Sky"
                #if (u ~ /bbc/) ft2="BBC"

                print x[u][s]["b"]+x[u][s]["a"]+0,x[u][s]["b"]+0,x[u][s]["a"]+0,u,s,ft1,ft2

        }' $FILE

As always, Thanks in advance
Madasafish




 
Hi

Madasafish said:
In your print statement you print "u". If I try to split "u" or try to reference a string in "u" it stops working, and I do not understand why.
Just as you wrote, I print u.

But you are calling split(), doing 5 conditional assignments, and then printing. A for statement executes only the single statement that follows it. To make the for execute all of them, enclose them in braces ( {} ).
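
The difference can be seen in a minimal sketch; the array contents are made up for illustration. In the first program only the first print belongs to the for, so "done" is printed once, after the loop has finished. In the second, the braces make both statements part of the loop body.

```shell
without_braces=$(awk 'BEGIN {
    n = split("x y", arr, " ")
    for (i = 1; i <= n; i++)
        print "item", arr[i]
        print "done"          # not part of the loop, despite the indentation
}')

with_braces=$(awk 'BEGIN {
    n = split("x y", arr, " ")
    for (i = 1; i <= n; i++) {
        print "item", arr[i]
        print "done"          # now runs on every iteration
    }
}')

printf '%s\n---\n%s\n' "$without_braces" "$with_braces"
```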

Feherke.
feherke.github.com/
 
What about this ?
Code:
gawk -vOFS=, '{
          x[$7][$9][$1~/192\.168\.2/?"a":"b"]++
        }
        END {
          for(u in x) {
            ft1=ft2=""   # reset, so labels from a previous URL do not carry over
            split(u,z,"/")
            if (z[2] ~ /debug/) ft1="Debug"
            if (z[3] ~ /vod/) ft1="Vod"
            if (z[4] ~ /flashapp\.xml/) ft1="FlashApp"
            if (u ~ /SSI|Sky|sky/) ft2="Sky"
            if (u ~ /bbc/) ft2="BBC"
            for(s in x[u])
              print x[u][s]["b"]+x[u][s]["a"]+0,x[u][s]["b"]+0,x[u][s]["a"]+0,u,s,ft1,ft2
          }
        }' $FILE

Hope This Helps, PH.
FAQ219-2884
FAQ181-2886
 
Another embellishment.
Sorry :-(

Code:
 gawk -v OFS="," -v hoururl=$HOURURL -v hourdir=$HOURDIR '
        {
                split($4,b,/:/)
                hour=b[2]
                min=b[3]
                url=$7
                gsub(/%20/,"_",url)
                split(url,f,"/")
                appname=f[3]


                if (f[4] ~ /flashapp.xml/) {
                        url="\"=HYPERLINK(\"\""hoururl"/"appname".csv\"\",\"\""$7"\"\")\""
                        t[appname][hour][$1~/10\.185\.116/?"c":"d"]++
                        }

                x[url][$9][$1~/192\.168\.2/?"a":"b"]++
        }
        END {
                for(m in t) {
                        for(n in t[m]) {
                                n=sprintf("%02d",n) #<---Does not work
                                print n,t[m][n]["d"]+0,t[m][n]["c"]+0 > hourdir"/"m".csv"
                                }
                        }

                for(u in x) {
                        for(s in x[u]) {
                                print x[u][s]["b"]+x[u][s]["a"]+0,x[u][s]["b"]+0,x[u][s]["a"]+0,u,s,ft1,ft2
                                }
                        }

        }' ${INFILE} | sort -n -r >> ${OUTFILE}

Feherke's code is perfect. I have removed the filters mentioned earlier for clarity.
As you can see, I introduced another loop which writes hourly hits to separate files, one per "m". The hours "n" run from 00 to 23.

I have three gotchas:
1) For the hours 00 to 09 it prints 0 to 9 (single digits). I would ideally like double digits, 00 to 09.
2) The files it creates are not sorted by hour, 00 through 23. Can this be done within the gawk program?
3) I need a header of Hour,Site A,Site B in each file created.

As always, thanks in advance

Madasafish
 
Code:
for(m in t) {
        for(n in t[m]) {
                print "=\""n"\"",t[m][n]["d"]+0,t[m][n]["c"]+0 > hourdir"/"m
                }
        }

For the benefit of other readers:

For the hours 00 to 09 it prints 0 to 9. (single digits). I would ideally like double digits 00 to 09.

It was Excel that was truncating the leading zero. I managed to fix this using the above syntax for "n", which writes the hour in the form ="09" so that Excel keeps it as text.

The file/s it creates are not sorted for the hours 00 through to 23. Can this be done within the gawk prog?

Unfortunately this is beyond me, so I resolved it with an external bash "for loop" at the end, which uses the excellent sort command. I would welcome a gawk solution.

I need a header of Hour,Site A,Site B for each file created.

This was easily accommodated with the external bash "for loop". Again, I would be very interested in seeing a gawk solution with the above code.

Cheers,

Madasafish



 
1) "hour" is extracted from a string, so it is a string. To treat it as a number (in order to format it), I would use a function such as int().

[edit] Reading your code again, I noticed that you use n twice: once as the loop variable and once to store the result of sprintf. This is not good, because the reassigned n is then used as the index in t[m][n]. [/edit]

Code:
printf("%02d",int(n))
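
A minimal sketch of point 1 and of the edit note, using a made-up array t keyed by hour. The zero-padded text goes into a separate variable (label, a name chosen here for illustration) so that n remains a valid index. awk converts numeric strings for %d on its own, so int() appears only to make the conversion explicit.

```shell
padded=$(awk 'BEGIN {
    t["9"] = 42                            # hypothetical hit count for hour 9
    for (n in t) {
        label = sprintf("%02d", int(n))    # formatted copy, used for display only
        print label "," t[n]               # n is untouched, so the lookup still works
    }
}')

printf '%s\n' "$padded"
```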

2) To sort from inside awk the same way as outside awk:
Code:
system("MyPersonalExternalCommand")
 