
Count the occurrence of different field values.

Status
Not open for further replies.

dunk2
Technical User · Jun 10, 2002
Hi,

I want to count the occurrences of the different values in the second field of a file.
Here's an example of the file 'Vch_Head_NotSorted':
022276896167., 0000100.00,20050807
009377980266., 0000200.00,20050807
085142882117., 0000300.00,20050807
049156298766., 0001000.00,20050807
041342839181., 0001000.00,20050807
082753305176., 0001000.00,20050807
052027453477., 0000900.00,20050807
039021575680., 0000500.00,20050807
032238864563., 0000500.00,20050807
020162010606., 0000500.00,20050807

I've used the following Unix command, which sorts the file and counts the duplicates:
sort -k 2,2 Vch_Head_NotSorted | uniq -f 1 -c > Vch_Count_output.txt
This technique works fine for small files; however, my file has 2,500,000 rows and the output becomes fragmented, e.g.

4 169254214268., 0000001.00,20051129
11 024731477144., 0000002.00,20050807
4 167721924291., 0000002.00,20051129
70536 000008882400., 0000050.00,20010801
1 960128879889., 0000050.00,20020804
100 008308765031., 0000050.00,20021228
144 006136599792., 0000050.00,20040816
143 006452636359., 0000050.00,20040823
52 002133530269., 0000050.00,20040827
86 003102237868., 0000050.00,20040829
101 003655825721., 0000050.00,20040903
46 003931361433., 0000050.00,20040913

instead of ..

4 169254214268., 0000001.00,20051129
15 024731477144., 0000002.00,20050807
71209 000008882400., 0000050.00,20010801

I am not concerned with the first and third fields, just the count of occurrences of each distinct value in the second field.

Is there an awk based script that could produce this?
Many thanks in advance ;?)
 
uniq delimits fields with a space, so it is treating your second and third fields as one field. There is probably a better way, but try this (untested) script:

awk 'BEGIN{FS=","}{print $1 "," $2 ", " $3}' Vch_Head_NotSorted | sort -k 2,2 | uniq -f 1 -c > Vch_Count_output.txt
CaKiwi
 
Hi CaKiwi,

Unfortunately, your solution results in the same 'fragmented' output, e.g.
4 169254214268., 0000001.00, 20051129
3 024731477144., 0000002.00, 20050807
1 167721924291., 0000002.00, 20051129
3 476184123170., 0000002.00, 20050807
1 591235603972., 0000002.00, 20051129
1 730706678886., 0000002.00, 20050807
1 771207949566., 0000002.00, 20051129
3 903801674238., 0000002.00, 20050807
1 934188002323., 0000002.00, 20051129
1 956151281153., 0000002.00, 20050807
instead of
4 169254214268., 0000001.00, 20051129
15 024731477144., 0000002.00, 20050807

I'll try tweaking your solution.
Thanks,
dunk2
 
My understanding of uniq was incorrect. The -f flag only ignores leading fields, not trailing fields. Use the following awk script in place of uniq:

sort -k 2,2 Vch_Head_NotSorted | awk -f dunk1.awk > Vch_Count_output.txt

#dunk1.awk
BEGIN { FS = "," }
{
    if (sv2 != $2) {
        if (NR > 1) print cnt " " sv0
        sv0 = $0
        sv2 = $2
        cnt = 0
    }
    cnt++
}
END { print cnt " " sv0 }

CaKiwi
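For what it's worth, the sort can be avoided entirely with a single awk pass that counts each distinct second-field value in an associative array (an untested sketch; output order is arbitrary, and the first and third fields are dropped since only the count and the value matter):

```shell
# Count occurrences of each distinct value of the second
# comma-separated field, without sorting the input first.
awk -F',' '{ cnt[$2]++ } END { for (v in cnt) print cnt[v], v }' Vch_Head_NotSorted
```

With 2,500,000 rows the array only holds one entry per distinct second-field value, so memory should not be a problem unless nearly every value is unique.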
 
Many thanks CaKiwi,
it's a neat solution,
dunk2
 
