Tek-Tips is the largest IT community on the Internet today!

Members share and learn making Tek-Tips Forums the best source of peer-reviewed technical information on the Internet!

  • Congratulations strongm on being selected by the Tek-Tips community for having the most helpful posts in the forums last week. Way to Go!

need help with summarizing 2

Status
Not open for further replies.

rufflocks

Technical User
Jun 29, 2001
26
US
I want to read in an input file of numbers.
I want the output to give me each unique number found and the count for each number found.

Currently, my script prints all the numbers with the count ascending.
sample output:
1 - 1
1 - 2
1 - 3
2 - 1
2 - 2
2 - 3
2 - 4
3 - 1
3 - 2

What I want is:
1 - 3
2 - 4
3 - 2
 

Code:
{ gsub( /[^-0-9]/, "" ) ## Leave only numerals and minus sign.
  if ( ""==$0 || "-"==$0 )
    next
  a[$0]++
  if ( ""==lowest || $0 < lowest )
    lowest = $0
  if ( ""==highest || $0 > highest )
    highest = $0
}
END {
  for ( i=lowest; i<=highest; i++ )
    if ( i in a )
      printf "%4d -- %2d\n", i, a[i]
}
 
This works, but relies on *nix-style sort command to get output in proper order ...
Code:
{ ++a[$0] }
END {for (i in a) print i, "-", a[i] | "sort +0 -n"}
 
Here's a solution that doesn't rely on external sort.
Code:
{
    if (!($0 in a)) indices[++i] = $0
     ++a[$0] 
}
END {
    asort(indices)
    for (j=1; j<=i; j++) 
        printf "%4d - %2d\n", indices[j], a[indices[j]]
}

 
Hello, rufflocks!

Try this awk command, please:

Code:
awk 'NR == 1 { m = $1 }  $1 !~ m { print prev }  { prev = $0; m = $1 }  END { print prev }' datafile

God bless you.

Bye!

KP.
 
krunek, I don't get it. Where are the counts?
futurelet, I like your solution but am disturbed by the looping where you check if (i in a) for each value from lowest to highest. If there are large gaps you spend lots of time looping over and discarding useless values.

I apologize for not knowing asort isn't in awk. As you can tell, I'm no awk expert. In an attempt to redeem myself, and assuming the version that pipes the output to *nix sort isn't acceptable, here's a variation on my earlier solution that just uses its own sort routine.
Code:
function bubble(a, n,    sorted, j, temp) {
    while (!sorted) {
        sorted = 1
        for (j=1; j<n; j++) {
            if (a[j] > a[j+1]) {
                temp = a[j]
                a[j] = a[j+1]
                a[j+1] = temp
                sorted = 0
            }
        }
    }
    return
}

{
    if (!($0 in a)) indices[++i] = $0
     ++a[$0] 
}
END {
    bubble(indices, i)
    for (k=1; k<=i; k++) 
        printf "%4d - %2d\n", indices[k], a[indices[k]]
}
 
The numbers that rufflocks provided:
[tt]
1 - 1
1 - 2
1 - 3
2 - 1
2 - 2
2 - 3
2 - 4
3 - 1
3 - 2
[/tt]
are sample output, not input. Even if the data
were in that form, Krunek's solution would fail with
[tt]
1 - 1
1 - 2
1 - 3
11 - 1
11 - 2
[/tt]
You'd have to change
[tt]
$1 !~ m
[/tt]
to
[tt]
$1 != m
[/tt]

 
mivevh, your bubble sort program works perfectly, but, as you know, bubblesort is slow. Combsort is much faster for large arrays.
Code:
{ a[$0]++ }
END { n = combsort( a, b )
      for (i=1;i<=n;i++)
        printf "%4d -- %2d\n", b[i], a[b[i]] }

function combsort( array_in, array_out,
                   key,temp,i,numitems,sortflag,gap,top )
{ for ( key in array_in )
    array_out[++i] = key
  numitems = i
  gap = numitems
  while (gap > 1)
  { gap = int( gap / 1.3 )
    if ( !gap ) gap = 1
    else if ( 9==gap || 10==gap )  gap = 11
    top = numitems - gap
    sortflag = 1
    while ( sortflag )
    { sortflag = 0
      for (i=1; i <= top; i++ )
      { if( array_out[i] + 0 > array_out[i+gap] )
        { temp = array_out[i]
          array_out[i] = array_out[i+gap]
          array_out[i+gap] = temp
          sortflag = 1
        }
      }
      if ( gap >= top )  sortflag = 0
    }
  }
  return numitems
}

 
Yeah, the bubble sort was the best I could do off the top of my head.

I think your combsort is a type of Shell sort or diminishing increment sort. I've used somewhat different versions of this in the past, and it is significantly faster. A couple questions about this version of it:

1. Why divide by 1.3? Why not, say, 2?
2. Why are these lines necessary?
if ( !gap ) gap = 1
else if ( 9==gap || 10==gap ) gap = 11
...
if ( gap >= top ) sortflag = 0


Don't see why you need those?


 

> 1. Why divide by 1.3? Why not, say, 2?
> 2. Why are these lines necessary?
> if ( !gap ) gap = 1
> else if ( 9==gap || 10==gap ) gap = 11

The gurus who devised the algorithm decided that these numbers would yield maximal speed. ([tt]if (!gap) gap = 1[/tt] is needed to ensure that gap will end as 1 and not jump from 2 to 0.)

> if ( gap >= top ) sortflag = 0

This will sometimes reduce the number of while(sortflag) loops executed.

I created a file containing 99,000 random integers between 0 and 4999 and tested the program with awk95 and mawk. Mawk was 3 or 4 times faster. To get mawk to sort the items as numbers instead of strings:
[tt]
{ if( array_out + 0 > array_out[i+gap] + 0 )
[/tt]
Both awk95 and mawk produced the same output, but mawk performed fewer swaps!

 
You're right.
[tt]if ( !gap ) gap = 1[/tt]
isn't necessary. (I don't do arithmetic.)
 
Thanks, futurelet. Those lines I mentioned have a very ad hoc "Whoops, I goofed somewhere so now let me try to fix it" look about them, don't you think?

After fiddling with your combsort function, I came up with the following somewhat stripped-down version. Seems to work just fine.

I'm using gawk, and using "+ 0" in the comparison of array elements did not force a numeric comparison for me.
Using it when constructing array_out did.
Code:
function combsort( array_in, array_out,
                   key, numitems, gap, top, sorted, i, temp ) { 
    for ( key in array_in )
        array_out[++numitems] = key + 0
    gap = numitems
    while (gap = int( gap / 2 )) {
        top = numitems - gap
        sorted = 0
        while ( !sorted++ )
            for (i=1; i <= top; i+=gap )
                if( array_out[i] > array_out[i+gap] ) { 
                    temp = array_out[i]
                    array_out[i] = array_out[i+gap]
                    array_out[i+gap] = temp
                    sorted = 0
                }
    }
    return numitems
}
Thanks again.


 
I want to read in an input file of numbers.
I want the output to give me each unique number found and the count for each number found

Something like this ?
awk 'NF==1 && $1~/^[0-9]+$/{++a[$1]}END{for(i in a)printf "%d - %d\n",i,a}' /path/to/input

Hope This Helps, PH.
Want to get great answers to your Tek-Tips questions? Have a look at FAQ219-2884 or FAQ222-2244
 
PHV, compare the first of my posts in this thread.
 
Thank you all for your input.

Krunek, I did try your solution but got:
awk: syntax error near line 1
awk: bailing out near line 1

I probably would have to tweek something.

PHV's solution worked for me. Thanks. Someone pointed me to unix line command "unique -c" that gave me similar results.

Thanks again.
 
mikevh gave 3 good answers to rufflocks' question before PHV posted his response.

Therefore, I'm giving mikevh the star that he deserves.
 
Status
Not open for further replies.

Part and Inventory Search

Sponsor

Back
Top