Filtering the lines by field similarity 3

Scifi9547 · May 6, 2010

Hi folks. I have data something like this:

25,AGH,0.2
25,GHT,0.5
25,jsw,1.7
34,TUY,3.0
37,KJU,1.2
37,KJT,1.2

The logic I want to input here is that IF field 1 of lines are same, then isolate the line with the least field 3 value in to a new file.

For example - lines 1,2,3 have same field 1 value i.e. 25. So, now among those three lines, I need to isolate the line that has lowest field 3 value. So, here it is line 1 which has 0.2 in its field 3. I need to put these isolate lines in a seperate file.

If there are two lines with same field 1 and field 3 values, isolate both of them.

My desired out put will be

25,AGH,0.2
34,TUY,3.0
37,KJU,1.2
37,KJT,1.2

I am like a toddler in this field and tried different scripts and each time I embarrassed myself. lol.

I really appreciate any help. TIA

Scifi9547 · May 6, 2010

oops - small correction

Here is the input file *same as the above except with an additional line at 5. Lines 5,6.7 have same field 1, but i need to isolate only lines 6 and 7 because they have same least field 3 values (1.2).

25,AGH,0.2
25,GHT,0.5
25,jsw,1.7
34,TUY,3.0
37,KJJ,2.3
37,KJU,1.2
37,KJT,1.2

The output is same.

thanks

feherke · May 6, 2010

Hi

As you forgot to mention if $1 is ordered or not, I supposed not, so this solutions collects all data before displaying the output :

Code:

awk -F, '$1in t&&$3==t[$1]{s[$1]=s[$1]" "$2}!($1in t)||$3<t[$1]{s[$1]=$2;t[$1]=$3;}END{for(i in t){split(s[i],a," ");for(j in a)print i,a[j],t[i]}}' /input/file

Tested with [tt]gawk[/tt] and [tt]mawk[/tt].

Feherke.

http://free.rootshell.be/~feherke/

Scifi9547 · May 7, 2010

@feherke

THanks a lot for the code, but it doesn't seem to work. it runs but doesnt print anything. What am I missing here? thanks

Annihilannic · May 7, 2010

What OS/platform are you doing this on? You may need to use nawk instead of awk.

Annihilannic.

Scifi9547 · May 9, 2010

///What OS/platform are you doing this on? You may need to use nawk instead of awk.////

I am using windows XP professional. I copied and pasted in the script and tried and it did not work. I use gawk.

Annihilannic · May 9, 2010

Ah. Embedding an awk script in the command-line doesn't work so well under Windows XP (unless you are using a Cygwin bash shell or something).

Try pasting this part into a separate file, say least.awk:

Code:

BEGIN{RS=ORS=","}$1in t&&$3==t[$1]{s[$1]=s[$1]" "$2}!($1in t)||$3<t[$1]{s[$1]=$2;t[$1]=$3;}END{for(i in t){split(s[i],a," ");for(j in a)print i,a[j],t[i]}}

Then type gawk -f least.awk input.file at the command prompt.

Annihilannic.

Scifi9547 · May 9, 2010

///Then type gawk -f least.awk input.file at the command prompt.///

Thanks again. The script is working. But the out put is not what I expected. The out put is coming like this:

0.5 25 ,1.2 37 ,1.7 34 ,2.3 37 ,3.0 37 ,0.2 25 ,

But I need an out put like below:

25,AGH,0.2
34,TUY,3.0
37,KJU,1.2
37,KJT,1.2

To answer your previous question, yes they are sorted first by column 1 and then by column 3 (both ascending order). Sorry for the confusion. Here is the correct input.

25,AGH,0.2
25,GHT,0.5
25,jsw,1.7
34,TUY,3.0
37,KJU,1.2
37,KJT,1.2
37,KJJ,2.3

pat two days, I have been banging my head all the walls of my apartment while trying to write a script in perl. I wanted to try to decide which, awk or perl is friendlier in handling text data.

Annihilannic · May 9, 2010

Sorry, my mistake. It should be FS=OFS=",", not RS=ORS=",".

I was setting the record separator instead of the field separator (to replace Feherke's -F, option.

Annihilannic.

Scifi9547 · May 10, 2010

///Sorry, my mistake. It should be FS=OFS=",", not RS=ORS=",". ///

Thank you very much. Its working perfectly.

I have a related question. I have another table but with 19 columns.

2|1,23,8,mm9,+,682,13,0.70,1.0,-1,6,18,28,40,cga,cga,|||,13,13

I want the same function to be performed as above except that this time its Field 1 and Field 8. That means, if field 1 is same for consecutive lines, extract the line with least field 8 value.

I tried to modify the above code to apply for this table, but again I utterly failed.

Scifi9547 · May 10, 2010

The actual file is very big (1.4GB with 8 million lines). I did not post even a sample here because it might have looked very complex. ANyhow, I am posting first fewlines here:

2|1,23,8,NM_175153,chr9:64808744-64812188 + ,682,13,0.70,1.000,-1,6,18,28,40,cgaaacacagccc,cgaaacacagccc,|||||||||||||,13,13
2|1,23,8,NM_001039388,chr13:8860472-8870976 - ,288,12,2.8,1.000,-1,11,22,169,180,cacagcccaaaa,cacagcccaaaa,||||||||||||,12,12
2|1,23,8,NM_001079883,chr12:109241354-109241624 - ,271,12,2.8,1.000,1,8,19,83,94,aaacacagccca,aaacacagccca,||||||||||||,12,12
3|1,27,2,NM_007532,chr6:145024415-145024677 - ,263,12,3.9,1.000,-1,12,23,79,90,ctacagtccgac,ctacagtccgac,||||||||||||,12,12
3|1,27,2,NM_010125,chr2:103264225-103270449 + ,125,12,3.9,1.000,1,10,21,35,46,ttctacagtccg,ttctacagtccg,||||||||||||,12,12
4|1,16,30,NR_003492,chr11:88833952-88834328 + ,377,13,0.28,1.000,1,2,14,186,198,aactctgagggaa,aactctgagggaa,|||||||||||||,13,13
4|1,16,30,NR_027978,chr18:77952923-77954184 + ,1262,12,1.1,1.000,1,5,16,937,948,tctgagggaaat,tctgagggaaat,||||||||||||,12,12
4|1,16,30,NM_010771,chr18:35721812-35731677 + ,316,12,1.1,1.000,-1,1,12,129,140,caactctgaggg,caactctgaggg,||||||||||||,12,12
5|1,12,1,NM_026186,chr11:97627757-97627927 - ,171,11,2.2,1.000,-1,2,12,65,75,ctaatccctac,ctaatccctac,|||||||||||,11,11

There are about 19 columns seperated by commas. The file is sorted by first column and next by the 8th column in ascending order.

As long as lines have same first field value, I want to compare the 8th field values among them and isolate the line or lines with least 8th field value.

Thanks in advance

Annihilannic · May 10, 2010

As feherke mentioned in his original reply, if you had mentioned that it was sorted initially it would have changed things completely.

Also that solution stores all of the results in memory before printing them, so with such a large file it could be quite costly performance-wise.

With that new information, the solution is simpler because we only need to detect when the first field changes:

Code:

[green]BEGIN[/green] { [blue]FS[/blue]=[red]"[/red][purple],[/purple][red]"[/red] }
[blue]$1[/blue] != prev1 { [b]print[/b] }
{ prev1=[blue]$1[/blue] }

Annihilannic.

PHV · May 10, 2010

isolate the line or lines with least 8th field value

Code:

BEGIN{FS=","}
$1!=prev1{print;prev1=$1;least8=$8;next}
$8==least8

Hope This Helps, PH.
FAQ219-2884
FAQ181-2886

Scifi9547 · May 10, 2010

Feherke, Annihilannic, and PHV - Thank you very much brothers. Its working perfectly. You rock!

Tek-Tips is the largest IT community on the Internet today!

Members share and learn making Tek-Tips Forums the best source of peer-reviewed technical information on the Internet!

Filtering the lines by field similarity 3

Scifi9547

Technical User

Scifi9547

Technical User

feherke

Programmer

Scifi9547

Technical User

Annihilannic

MIS

Scifi9547

Technical User

Annihilannic

MIS

Scifi9547

Technical User

Annihilannic

MIS

Scifi9547

Technical User

Scifi9547

Technical User

Annihilannic

MIS

PHV

MIS

Scifi9547

Technical User

Similar threads

Part and Inventory Search

Sponsor