Tek-Tips is the largest IT community on the Internet today!

Members share and learn making Tek-Tips Forums the best source of peer-reviewed technical information on the Internet!

  • Congratulations IamaSherpa on being selected by the Tek-Tips community for having the most helpful posts in the forums last week. Way to Go!

Filtering the lines by field similarity 3

Status
Not open for further replies.

Scifi9547

Technical User
May 6, 2010
10
US
Hi folks. I have data something like this:

25,AGH,0.2
25,GHT,0.5
25,jsw,1.7
34,TUY,3.0
37,KJU,1.2
37,KJT,1.2

The logic I want to input here is that IF field 1 of lines are same, then isolate the line with the least field 3 value in to a new file.

For example - lines 1,2,3 have same field 1 value i.e. 25. So, now among those three lines, I need to isolate the line that has lowest field 3 value. So, here it is line 1 which has 0.2 in its field 3. I need to put these isolate lines in a seperate file.

If there are two lines with same field 1 and field 3 values, isolate both of them.

My desired out put will be

25,AGH,0.2
34,TUY,3.0
37,KJU,1.2
37,KJT,1.2

I am like a toddler in this field and tried different scripts and each time I embarrassed myself. lol.

I really appreciate any help. TIA

 
oops - small correction

Here is the input file *same as the above except with an additional line at 5. Lines 5,6.7 have same field 1, but i need to isolate only lines 6 and 7 because they have same least field 3 values (1.2).

25,AGH,0.2
25,GHT,0.5
25,jsw,1.7
34,TUY,3.0
37,KJJ,2.3
37,KJU,1.2
37,KJT,1.2

The output is same.

thanks
 
Hi

As you forgot to mention if $1 is ordered or not, I supposed not, so this solutions collects all data before displaying the output :
Code:
awk -F, '$1in t&&$3==t[$1]{s[$1]=s[$1]" "$2}!($1in t)||$3<t[$1]{s[$1]=$2;t[$1]=$3;}END{for(i in t){split(s[i],a," ");for(j in a)print i,a[j],t[i]}}' /input/file
Tested with [tt]gawk[/tt] and [tt]mawk[/tt].

Feherke.
 
@feherke

THanks a lot for the code, but it doesn't seem to work. it runs but doesnt print anything. What am I missing here? thanks
 
What OS/platform are you doing this on? You may need to use nawk instead of awk.

Annihilannic.
 
///What OS/platform are you doing this on? You may need to use nawk instead of awk.////


I am using windows XP professional. I copied and pasted in the script and tried and it did not work. I use gawk.
 
Ah. Embedding an awk script in the command-line doesn't work so well under Windows XP (unless you are using a Cygwin bash shell or something).

Try pasting this part into a separate file, say least.awk:

Code:
BEGIN{RS=ORS=","}$1in t&&$3==t[$1]{s[$1]=s[$1]" "$2}!($1in t)||$3<t[$1]{s[$1]=$2;t[$1]=$3;}END{for(i in t){split(s[i],a," ");for(j in a)print i,a[j],t[i]}}

Then type gawk -f least.awk input.file at the command prompt.

Annihilannic.
 

///Then type gawk -f least.awk input.file at the command prompt.///

Thanks again. The script is working. But the out put is not what I expected. The out put is coming like this:

0.5 25 ,1.2 37 ,1.7 34 ,2.3 37 ,3.0 37 ,0.2 25 ,

But I need an out put like below:

25,AGH,0.2
34,TUY,3.0
37,KJU,1.2
37,KJT,1.2

To answer your previous question, yes they are sorted first by column 1 and then by column 3 (both ascending order). Sorry for the confusion. Here is the correct input.

25,AGH,0.2
25,GHT,0.5
25,jsw,1.7
34,TUY,3.0
37,KJU,1.2
37,KJT,1.2
37,KJJ,2.3

pat two days, I have been banging my head all the walls of my apartment while trying to write a script in perl. I wanted to try to decide which, awk or perl is friendlier in handling text data.
 
Sorry, my mistake. It should be FS=OFS=",", not RS=ORS=",".

I was setting the record separator instead of the field separator (to replace Feherke's -F, option.

Annihilannic.
 
///Sorry, my mistake. It should be FS=OFS=",", not RS=ORS=",". ///

Thank you very much. Its working perfectly.

I have a related question. I have another table but with 19 columns.

2|1,23,8,mm9,+,682,13,0.70,1.0,-1,6,18,28,40,cga,cga,|||,13,13

I want the same function to be performed as above except that this time its Field 1 and Field 8. That means, if field 1 is same for consecutive lines, extract the line with least field 8 value.

I tried to modify the above code to apply for this table, but again I utterly failed.


 
The actual file is very big (1.4GB with 8 million lines). I did not post even a sample here because it might have looked very complex. ANyhow, I am posting first fewlines here:

2|1,23,8,NM_175153,chr9:64808744-64812188 + ,682,13,0.70,1.000,-1,6,18,28,40,cgaaacacagccc,cgaaacacagccc,|||||||||||||,13,13
2|1,23,8,NM_001039388,chr13:8860472-8870976 - ,288,12,2.8,1.000,-1,11,22,169,180,cacagcccaaaa,cacagcccaaaa,||||||||||||,12,12
2|1,23,8,NM_001079883,chr12:109241354-109241624 - ,271,12,2.8,1.000,1,8,19,83,94,aaacacagccca,aaacacagccca,||||||||||||,12,12
3|1,27,2,NM_007532,chr6:145024415-145024677 - ,263,12,3.9,1.000,-1,12,23,79,90,ctacagtccgac,ctacagtccgac,||||||||||||,12,12
3|1,27,2,NM_010125,chr2:103264225-103270449 + ,125,12,3.9,1.000,1,10,21,35,46,ttctacagtccg,ttctacagtccg,||||||||||||,12,12
4|1,16,30,NR_003492,chr11:88833952-88834328 + ,377,13,0.28,1.000,1,2,14,186,198,aactctgagggaa,aactctgagggaa,|||||||||||||,13,13
4|1,16,30,NR_027978,chr18:77952923-77954184 + ,1262,12,1.1,1.000,1,5,16,937,948,tctgagggaaat,tctgagggaaat,||||||||||||,12,12
4|1,16,30,NM_010771,chr18:35721812-35731677 + ,316,12,1.1,1.000,-1,1,12,129,140,caactctgaggg,caactctgaggg,||||||||||||,12,12
5|1,12,1,NM_026186,chr11:97627757-97627927 - ,171,11,2.2,1.000,-1,2,12,65,75,ctaatccctac,ctaatccctac,|||||||||||,11,11

There are about 19 columns seperated by commas. The file is sorted by first column and next by the 8th column in ascending order.

As long as lines have same first field value, I want to compare the 8th field values among them and isolate the line or lines with least 8th field value.

Thanks in advance
 
As feherke mentioned in his original reply, if you had mentioned that it was sorted initially it would have changed things completely. :)

Also that solution stores all of the results in memory before printing them, so with such a large file it could be quite costly performance-wise.

With that new information, the solution is simpler because we only need to detect when the first field changes:

Code:
[green]BEGIN[/green] { [blue]FS[/blue]=[red]"[/red][purple],[/purple][red]"[/red] }
[blue]$1[/blue] != prev1 { [b]print[/b] }
{ prev1=[blue]$1[/blue] }

Annihilannic.
 
isolate the line or lines with least 8th field value
Code:
BEGIN{FS=","}
$1!=prev1{print;prev1=$1;least8=$8;next}
$8==least8

Hope This Helps, PH.
FAQ219-2884
FAQ181-2886
 
Feherke, Annihilannic, and PHV - Thank you very much brothers. Its working perfectly. You rock!
 
Status
Not open for further replies.

Part and Inventory Search

Sponsor

Back
Top