need help with summarizing 2

rufflocks · Oct 1, 2004

I want to read in an input file of numbers.
I want the output to give me each unique number found and the count for each number found.

Currently, my script prints all the numbers with the count ascending.
sample output:
1 - 1
1 - 2
1 - 3
2 - 1
2 - 2
2 - 3
2 - 4
3 - 1
3 - 2

What I want is:
1 - 3
2 - 4
3 - 2

futurelet · Oct 1, 2004

Code:

{ gsub( /[^-0-9]/, "" ) ## Leave only numerals and minus sign.
  if ( ""==$0 || "-"==$0 )
    next
  a[$0]++
  if ( ""==lowest || $0 < lowest )
    lowest = $0
  if ( ""==highest || $0 > highest )
    highest = $0
}
END {
  for ( i=lowest; i<=highest; i++ )
    if ( i in a )
      printf "%4d -- %2d\n", i, a[i]
}

mikevh · Oct 1, 2004

This works, but relies on *nix-style sort command to get output in proper order ...

Code:

{ ++a[$0] }
END {for (i in a) print i, "-", a[i] | "sort +0 -n"}

mikevh · Oct 1, 2004

Here's a solution that doesn't rely on external sort.

Code:

{
    if (!($0 in a)) indices[++i] = $0
     ++a[$0] 
}
END {
    asort(indices)
    for (j=1; j<=i; j++) 
        printf "%4d - %2d\n", indices[j], a[indices[j]]
}

futurelet · Oct 2, 2004

[tt]asort()[/tt] isn't in Awk. What language are you using?

mikevh · Oct 2, 2004

gawk. Didn't know asort wasn't in awk.

Krunek · Oct 2, 2004

Hello, rufflocks!

Try this awk command, please:

Code:

awk 'NR == 1 { m = $1 }  $1 !~ m { print prev }  { prev = $0; m = $1 }  END { print prev }' datafile

God bless you.

Bye!

KP.

mikevh · Oct 2, 2004

krunek, I don't get it. Where are the counts?
futurelet, I like your solution but am disturbed by the looping where you check if (i in a) for each value from lowest to highest. If there are large gaps you spend lots of time looping over and discarding useless values.

I apologize for not knowing asort isn't in awk. As you can tell, I'm no awk expert. In an attempt to redeem myself, and assuming the version that pipes the output to *nix sort isn't acceptable, here's a variation on my earlier solution that just uses its own sort routine.

Code:

function bubble(a, n,    sorted, j, temp) {
    while (!sorted) {
        sorted = 1
        for (j=1; j<n; j++) {
            if (a[j] > a[j+1]) {
                temp = a[j]
                a[j] = a[j+1]
                a[j+1] = temp
                sorted = 0
            }
        }
    }
    return
}

{
    if (!($0 in a)) indices[++i] = $0
     ++a[$0] 
}
END {
    bubble(indices, i)
    for (k=1; k<=i; k++) 
        printf "%4d - %2d\n", indices[k], a[indices[k]]
}

futurelet · Oct 2, 2004

The numbers that rufflocks provided:
[tt]
1 - 1
1 - 2
1 - 3
2 - 1
2 - 2
2 - 3
2 - 4
3 - 1
3 - 2
[/tt]
are sample output, not input. Even if the data
were in that form, Krunek's solution would fail with
[tt]
1 - 1
1 - 2
1 - 3
11 - 1
11 - 2
[/tt]
You'd have to change
[tt]
$1 !~ m
[/tt]
to
[tt]
$1 != m
[/tt]

mikevh · Oct 2, 2004

Thanks, futurelet. Glad it wasn't just me. [3eyes]

futurelet · Oct 2, 2004

mivevh, your bubble sort program works perfectly, but, as you know, bubblesort is slow. Combsort is much faster for large arrays.

Code:

{ a[$0]++ }
END { n = combsort( a, b )
      for (i=1;i<=n;i++)
        printf "%4d -- %2d\n", b[i], a[b[i]] }

function combsort( array_in, array_out,
                   key,temp,i,numitems,sortflag,gap,top )
{ for ( key in array_in )
    array_out[++i] = key
  numitems = i
  gap = numitems
  while (gap > 1)
  { gap = int( gap / 1.3 )
    if ( !gap ) gap = 1
    else if ( 9==gap || 10==gap )  gap = 11
    top = numitems - gap
    sortflag = 1
    while ( sortflag )
    { sortflag = 0
      for (i=1; i <= top; i++ )
      { if( array_out[i] + 0 > array_out[i+gap] )
        { temp = array_out[i]
          array_out[i] = array_out[i+gap]
          array_out[i+gap] = temp
          sortflag = 1
        }
      }
      if ( gap >= top )  sortflag = 0
    }
  }
  return numitems
}

mikevh · Oct 3, 2004

Yeah, the bubble sort was the best I could do off the top of my head.

I think your combsort is a type of Shell sort or diminishing increment sort. I've used somewhat different versions of this in the past, and it is significantly faster. A couple questions about this version of it:

1. Why divide by 1.3? Why not, say, 2?
2. Why are these lines necessary?
if ( !gap ) gap = 1
else if ( 9==gap || 10==gap ) gap = 11
...
if ( gap >= top ) sortflag = 0

Don't see why you need those?

futurelet · Oct 3, 2004

> 1. Why divide by 1.3? Why not, say, 2?
> 2. Why are these lines necessary?
> if ( !gap ) gap = 1
> else if ( 9==gap || 10==gap ) gap = 11

The gurus who devised the algorithm decided that these numbers would yield maximal speed. ([tt]if (!gap) gap = 1[/tt] is needed to ensure that gap will end as 1 and not jump from 2 to 0.)

> if ( gap >= top ) sortflag = 0

This will sometimes reduce the number of while(sortflag) loops executed.

I created a file containing 99,000 random integers between 0 and 4999 and tested the program with awk95 and mawk. Mawk was 3 or 4 times faster. To get mawk to sort the items as numbers instead of strings:
[tt]
{ if( array_out + 0 > array_out[i+gap] + 0 )
[/tt]
Both awk95 and mawk produced the same output, but mawk performed fewer swaps!

futurelet · Oct 3, 2004

You're right.
[tt]if ( !gap ) gap = 1[/tt]
isn't necessary. (I don't do arithmetic.)

mikevh · Oct 3, 2004

Thanks, futurelet. Those lines I mentioned have a very ad hoc "Whoops, I goofed somewhere so now let me try to fix it" look about them, don't you think?

After fiddling with your combsort function, I came up with the following somewhat stripped-down version. Seems to work just fine.

I'm using gawk, and using "+ 0" in the comparison of array elements did not force a numeric comparison for me.
Using it when constructing array_out did.

Code:

function combsort( array_in, array_out,
                   key, numitems, gap, top, sorted, i, temp ) { 
    for ( key in array_in )
        array_out[++numitems] = key + 0
    gap = numitems
    while (gap = int( gap / 2 )) {
        top = numitems - gap
        sorted = 0
        while ( !sorted++ )
            for (i=1; i <= top; i+=gap )
                if( array_out[i] > array_out[i+gap] ) { 
                    temp = array_out[i]
                    array_out[i] = array_out[i+gap]
                    array_out[i+gap] = temp
                    sorted = 0
                }
    }
    return numitems
}

Thanks again.

PHV · Oct 4, 2004

I want to read in an input file of numbers.
I want the output to give me each unique number found and the count for each number found
Something like this ?
awk 'NF==1 && $1~/^[0-9]+$/{++a[$1]}END{for(i in a)printf "%d - %d\n",i,a}' /path/to/input

Hope This Helps, PH.
Want to get great answers to your Tek-Tips questions? Have a look at FAQ219-2884 or FAQ222-2244

mikevh · Oct 4, 2004

PHV, compare the first of my posts in this thread.

rufflocks · Oct 5, 2004

Thank you all for your input.

Krunek, I did try your solution but got:
awk: syntax error near line 1
awk: bailing out near line 1

I probably would have to tweek something.

PHV's solution worked for me. Thanks. Someone pointed me to unix line command "unique -c" that gave me similar results.

Thanks again.

futurelet · Oct 5, 2004

mikevh gave 3 good answers to rufflocks' question before PHV posted his response.

Therefore, I'm giving mikevh the star that he deserves.

mikevh · Oct 5, 2004

Thanks, futurelet.

Tek-Tips is the largest IT community on the Internet today!

Members share and learn making Tek-Tips Forums the best source of peer-reviewed technical information on the Internet!

need help with summarizing 2

Technical User

Programmer

Programmer

Programmer

Programmer

Programmer

Programmer

Programmer

Programmer

Programmer

Programmer

Programmer

Programmer

Programmer

Programmer

MIS

Programmer

Technical User

Programmer

Programmer

Similar threads

Log in

Part and Inventory Search

Sponsor