
Count the occurrence of different field values.

Status
Not open for further replies.

dunk2
Technical User · Jun 10, 2002
Hi,

I want to count the occurrences of the different values in the second field of a file.
Here's an example of the file 'Vch_Head_NotSorted':
022276896167., 0000100.00,20050807
009377980266., 0000200.00,20050807
085142882117., 0000300.00,20050807
049156298766., 0001000.00,20050807
041342839181., 0001000.00,20050807
082753305176., 0001000.00,20050807
052027453477., 0000900.00,20050807
039021575680., 0000500.00,20050807
032238864563., 0000500.00,20050807
020162010606., 0000500.00,20050807

I've used the following Unix command, which sorts the file and counts the duplicates:
sort -k 2,2 Vch_Head_NotSorted | uniq -f 1 -c > Vch_Count_output.txt
This technique works fine for small files; however, my file has 2,500,000 rows and the output becomes fragmented, e.g.

4 169254214268., 0000001.00,20051129
11 024731477144., 0000002.00,20050807
4 167721924291., 0000002.00,20051129
70536 000008882400., 0000050.00,20010801
1 960128879889., 0000050.00,20020804
100 008308765031., 0000050.00,20021228
144 006136599792., 0000050.00,20040816
143 006452636359., 0000050.00,20040823
52 002133530269., 0000050.00,20040827
86 003102237868., 0000050.00,20040829
101 003655825721., 0000050.00,20040903
46 003931361433., 0000050.00,20040913

instead of ..

4 169254214268., 0000001.00,20051129
15 024731477144., 0000002.00,20050807
71209 000008882400., 0000050.00,20010801

I am not concerned with the first and third fields, just the count of occurrences of each distinct value in the second field.

Is there an awk based script that could produce this?
Many thanks in advance ;?)
 
uniq delimits fields with a space, so it is treating your second and third fields as one field. There is probably a better way, but try this (untested) script:

awk 'BEGIN{FS=","}{print $1 "," $2 ", " $3}' Vch_Head_NotSorted | sort -k 2,2 | uniq -f 1 -c > Vch_Count_output.txt
CaKiwi
 
Hi CaKiwi,

Unfortunately, your solution results in the same 'fragmented' output, e.g.
4 169254214268., 0000001.00, 20051129
3 024731477144., 0000002.00, 20050807
1 167721924291., 0000002.00, 20051129
3 476184123170., 0000002.00, 20050807
1 591235603972., 0000002.00, 20051129
1 730706678886., 0000002.00, 20050807
1 771207949566., 0000002.00, 20051129
3 903801674238., 0000002.00, 20050807
1 934188002323., 0000002.00, 20051129
1 956151281153., 0000002.00, 20050807
instead of
4 169254214268., 0000001.00, 20051129
15 024731477144., 0000002.00, 20050807

I'll try tweaking your solution.
Thanks,
dunk2
 
My understanding of uniq was incorrect. The -f flag only ignores leading fields, not trailing fields. Use the following awk script in place of uniq:

sort -k 2,2 Vch_Head_NotSorted | awk -f dunk1.awk > Vch_Count_output.txt

#dunk1.awk
BEGIN { FS = "," }
{
    if (sv2 != $2) {
        if (NR > 1) print cnt " " sv0
        sv0 = $0
        sv2 = $2
        cnt = 0
    }
    cnt++
}
END { print cnt " " sv0 }

CaKiwi
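For what it's worth, the sort can be avoided entirely with a single awk pass that counts each distinct second-field value in an associative array (an untested sketch; output order is arbitrary, and the first and third fields are dropped since only the count and the value matter):

```shell
# Count occurrences of each distinct value of the second
# comma-separated field, without sorting the input first.
awk -F',' '{ cnt[$2]++ } END { for (v in cnt) print cnt[v], v }' Vch_Head_NotSorted
```

With 2,500,000 rows the array only holds one entry per distinct second-field value, so memory should not be a problem unless nearly every value is unique.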
 
Many thanks CaKiwi,
it's a neat solution,
dunk2
 
