Tek-Tips is the largest IT community on the Internet today!

Overlap filter between two files

Status
Not open for further replies.

demis001

Programmer
Aug 18, 2008
94
US
Need help,

data1

1 59851 59880 CATTCTAGTGTAAAGTTTTAGATCTTATAT
1 59881 59910 AACTGTGAGATTAATCTCAGATAATGACAC
1 59911 59940 AAAATATAGTGAAGTTGGTAAGTTATTTAG
1 59941 59970 TAAAGCTCATGAAAATTGTGCCCTCCATTC
1 59971 60000 CCATATAATTTAGTAATTGTCTAGGAACTT
1 60001 60030 CCACATACATTGCCTCAATTTATCTTTCAA
1 60031 60060 CAACTTGTGTGTTATATTTTGGAATACAGA
1 60061 60090 TACAAAGTTATTATGCTTTCAAAATATTCT
1 60091 60120 TTTGCTAATTCTTAGAACAAAGAAAGGCAT
1 60121 60150 AAATATATTAGTATTTGTGTACACCTGTTC
1 60151 60180 CTTCCTGTGTGACCCTAAGTTTAGTAGAAG
1 60181 60210 AAAGGAGAGAAAATATAGCCTAGCTTATAA
1 60211 60240 ATTTAAAAAAAAATTTATTTGGTCCATTTT

data2

1 59871 58954 ENSP00000317482 OR4F5
1 358460 357522 ENSP00000318226 OR4F29
1 611897 610959 ENSP00000329982 OR4F16
1 712376 711183 ENSP00000351335 AL669831.13
1 745077 742614 ENSP00000317958 FAM87B
1 869824 850393 ENSP00000349216 SAMD11

For each line in data2, I want the lines from data1 whose $2-$3 range overlaps the $2-$3 range of data2. That is, if ( data1[$2] >= data2[$2] && data1[$3] <= data2[$3] ), I want to print the result as follows:

Result (the first three fields are from data1 and the last three are from data2)


1 59881 59910 59871 58954 OR4F5
1 59911 59940 59871 58954 OR4F5
1 59941 59970 59871 58954 OR4F5

I have tried this, and awk complains about a syntax error:

awk 'NR==FNR{a[$1","$2","$3]=$0;next}$2, $3 in a {if(a[1]<= $2 && a[2]<=$3]) print $2","$3\t$0

Thanks

 
Your output does not jibe exactly with your sample input file and your 'testing' formula.

Re-validate either your 'formula' or your desired output.

I've implemented your formula as is:
nawk -f demis.awk data1 data2
Code:
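# Pass 1 (data1): remember the chromosome under a "start,end" key.
FNR==NR {
   f1[$2,$3]=$1
   next
}
# Pass 2 (data2): test every stored data1 range against this line.
{
   for (f1iter in f1) {
     split(f1iter, idxA, SUBSEP)
     if (int(idxA[1]) >= int($2) && int(idxA[2]) <= int($3))
        # chromosome, start, end from data1; $2, $3 and gene name from data2
        print f1[f1iter], idxA[1], idxA[2], $2, $3, $NF
   }
}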
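
A side note on the test itself: in the sample data2 the second coordinate is larger than the third (59871 vs 58954), so a check of the form start >= $2 && end <= $3 cannot be true for any data1 line. Below is a minimal sketch, assuming that "overlap" means the two ranges share at least one position, that the data2 coordinates may arrive in either order, and that data2 has the 5-column layout shown in the first post. The function name overlap_filter is made up for illustration, and none of these assumptions is confirmed by the original poster.

```shell
#!/bin/sh
# overlap_filter DATA1 DATA2   (hypothetical helper name)
# Prints data1 rows whose range shares at least one position with a
# data2 range on the same chromosome (assumed meaning of "overlap").
overlap_filter() {
    awk '
    # First file (data1): keep rows in input order.
    FNR == NR {
        n++
        chr[n] = $1; s[n] = $2 + 0; e[n] = $3 + 0
        next
    }
    # Second file (data2, 5 columns): order the bounds so that lo <= hi.
    {
        lo = $2 + 0; hi = $3 + 0
        if (lo > hi) { t = lo; lo = hi; hi = t }
        for (i = 1; i <= n; i++)
            if (chr[i] == $1 && s[i] <= hi && e[i] >= lo)
                print chr[i], s[i], e[i], $2, $3, $NF
    }' "$1" "$2"
}

# Demo on a few rows taken from the thread.
dir=$(mktemp -d)
printf '%s\n' \
    '1 59851 59880 CATTCTAGTGTAAAGTTTTAGATCTTATAT' \
    '1 59881 59910 AACTGTGAGATTAATCTCAGATAATGACAC' > "$dir/data1"
printf '%s\n' \
    '1 59871 58954 ENSP00000317482 OR4F5' \
    '1 358460 357522 ENSP00000318226 OR4F29' > "$dir/data2"
overlap_filter "$dir/data1" "$dir/data2"
rm -rf "$dir"
```

With the sample files from the first post this prints a single line, 1 59851 59880 59871 58954 OR4F5, which is not the result requested there, so the intended rule still needs to be pinned down.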

vlad
+----------------------------+
| #include<disclaimer.h> |
+----------------------------+
 
Sorry for the confusion; I want the desired output. I am able to modify the script to get the desired output, but I have no "nawk" on my Linux box. I tried both "gawk" and "awk", and the script did not give me any output with the demo files I provided.

Thanks
 
I want the desired output
So, please, reformulate the rules.
 
Data2 looks like this:

ENSG00000177693 ENST00000326183 1 1 59871 58954 ENSP00000317482 OR4F5 HGNC (curated) uc001aal.1 AL627309 79501
ENSG00000177799 ENST00000327169 1 1 358460 357522 ENSP00000318226 OR4F29 HGNC (curated) uc001aaw.1 BK004219 26683
ENSG00000185097 ENST00000332831 1 -1 611897 610959 ENSP00000329982 OR4F16 HGNC (curated) uc001abd.1 BK004219 26683
ENSG00000197049 ENST00000358533 1 1 712376 711183 ENSP00000351335 AL669831.13 Clone-based (Ensembl) AK290103

Rule:
If $4 of data2 == 1, I want to check:

if ( data1[$2] <= data2[$5] && data1[$3] <= data2[$6] )

If it is true, I want to print: $1 data1, $2 data1, $3 data1, $4 data2, $5 data2, $6 data2, $8 data2

The final script is more complicated than what I originally requested:

If $4 of data2 == -1, I want to modify $5 and $6 of data2 first:

$5 = (242951149 - $6) + 1
$6 = (242951149 - $5) + 1

Then I want to run the same test:

if ( data1[$2] <= data2[$5] && data1[$3] <= data2[$6] )

If it is true, I want to print: $1 data1, $2 data1, $3 data1, $4 data2, $5 data2, $6 data2, $8 data2
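
One thing to watch in the -1 case: if the two assignments are run in the order written, the second one sees the already updated $5 and simply puts $6 back to its original value. Below is a minimal sketch, assuming the intent is to mirror both coordinates from their original values (an assumption; the rule does not say so explicitly). The function name flip_coords and the variable CHR_LEN are made up for illustration; 242951149 is the constant from the rule.

```shell
#!/bin/sh
# flip_coords (hypothetical name): reads "col5 col6" pairs on stdin and
# prints the mirrored pair, using the constant quoted in the rule.
flip_coords() {
    awk -v CHR_LEN=242951149 '
    {
        # Compute both results from the ORIGINAL values before printing,
        # so the second formula does not see the updated first value.
        new5 = (CHR_LEN - $2) + 1
        new6 = (CHR_LEN - $1) + 1
        print new5, new6
    }'
}

# Coordinates of the OR4F16 record ($4 == -1) from the sample.
echo "611897 610959" | flip_coords
```

For 611897 and 610959 this gives 242340191 and 242339253, whereas running the two assignments one after the other gives 242340191 and 610959.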

Thanks



 
assuming data1:
Code:
1       59851   59880   CATTCTAGTGTAAAGTTTTAGATCTTATAT
1       59881   59910   AACTGTGAGATTAATCTCAGATAATGACAC
1       59911   59940   AAAATATAGTGAAGTTGGTAAGTTATTTAG
1       59941   59970   TAAAGCTCATGAAAATTGTGCCCTCCATTC
1       59971   60000   CCATATAATTTAGTAATTGTCTAGGAACTT
1       60001   60030   CCACATACATTGCCTCAATTTATCTTTCAA
1       60031   60060   CAACTTGTGTGTTATATTTTGGAATACAGA
1       60061   60090   TACAAAGTTATTATGCTTTCAAAATATTCT
1       60091   60120   TTTGCTAATTCTTAGAACAAAGAAAGGCAT
1       60121   60150   AAATATATTAGTATTTGTGTACACCTGTTC
1       60151   60180   CTTCCTGTGTGACCCTAAGTTTAGTAGAAG
1       60181   60210   AAAGGAGAGAAAATATAGCCTAGCTTATAA
1       60211   60240   ATTTAAAAAAAAATTTATTTGGTCCATTTT

data2:
Code:
ENSG00000177693    ENST00000326183    1    1    59871    58954    ENSP00000317482    OR4F5    HGNC (curated)    uc001aal.1    AL627309    79501
ENSG00000177799    ENST00000327169    1    1    358460    357522    ENSP00000318226    OR4F29    HGNC (curated)    uc001aaw.1    BK004219    26683
ENSG00000185097    ENST00000332831    1    -1    611897    610959    ENSP00000329982    OR4F16    HGNC (curated)    uc001abd.1    BK004219    26683
ENSG00000197049    ENST00000358533    1    1    712376    711183    ENSP00000351335    AL669831.13    Clone-based (Ensembl)        AK290103

gawk -f demis.awk data1 data2

demis.awk:
Code:
BEGIN {
   OFS=","

   # constant used to mirror the coordinates when $4 is -1
   NUM=int(242951149)
}
# first file (data1): key = start,end  value = chromosome
FNR==NR {
   f1[$2,$3]=$1
   next
}
# second file (data2): compare every stored data1 range with this record
{
   for (f1iter in f1) {
     split(f1iter, idxA, SUBSEP)
     if ( int($4) == int("1") )
        if (int(idxA[1]) <= int($5) && int(idxA[2]) <= int($6))
           print f1[f1iter], $2, idxA[2] " " $4 " " $5, $6, $8

     if ( int($4) == int("-1") ) {
        $5 = ( NUM - $6) + 1
        $6 = ( NUM - $5) + 1
        if (int(idxA[1]) <= int($5) && int(idxA[2]) <= int($6))
           print f1[f1iter], $2, idxA[2] " " $4 " " $5, $6, $8
     }
   }
}

Not validated - pls do validate.

For the future, please do use code TAGS when posting data or code samples if you want to increase the chances of your posts being answered.

vlad
+----------------------------+
| #include<disclaimer.h> |
+----------------------------+
 
Thank you,

Something is wrong with the split(f1iter, idxA, SUBSEP) part.

Result of the script is:

1,ENST00000327169,60240 1 358460,357522,OR4F29
1,ENST00000327169,60030 1 358460,357522,OR4F29
1,ENST00000327169,60090 1 358460,357522,OR4F29
1,ENST00000327169,60120 1 358460,357522,OR4F29
1,ENST00000327169,60000 1 358460,357522,OR4F29
1,ENST00000327169,60180 1 358460,357522,OR4F29
1,ENST00000327169,60210 1 358460,357522,OR4F29
1,ENST00000327169,59940 1 358460,357522,OR4F29
1,ENST00000327169,60060 1 358460,357522,OR4F29
1,ENST00000327169,59880 1 358460,357522,OR4F29
1,ENST00000327169,60150 1 358460,357522,OR4F29
1,ENST00000327169,59910 1 358460,357522,OR4F29
1,ENST00000327169,59970 1 358460,357522,OR4F29
1,ENST00000332831,60240 -1 242340191,610959,OR4F16
1,ENST00000332831,60030 -1 242340191,610959,OR4F16
1,ENST00000332831,60090 -1 242340191,610959,OR4F16
1,ENST00000332831,60120 -1 242340191,610959,OR4F16
1,ENST00000332831,60000 -1 242340191,610959,OR4F16
1,ENST00000332831,60180 -1 242340191,610959,OR4F16
1,ENST00000332831,60210 -1 242340191,610959,OR4F16
1,ENST00000332831,59940 -1 242340191,610959,OR4F16
1,ENST00000332831,60060 -1 242340191,610959,OR4F16
1,ENST00000332831,59880 -1 242340191,610959,OR4F16
1,ENST00000332831,60150 -1 242340191,610959,OR4F16
1,ENST00000332831,59910 -1 242340191,610959,OR4F16
1,ENST00000332831,59970 -1 242340191,610959,OR4F16
1,ENST00000358533,60240 1 712376,711183,AL669831.13
1,ENST00000358533,60030 1 712376,711183,AL669831.13
1,ENST00000358533,60090 1 712376,711183,AL669831.13
1,ENST00000358533,60120 1 712376,711183,AL669831.13
1,ENST00000358533,60000 1 712376,711183,AL669831.13
1,ENST00000358533,60180 1 712376,711183,AL669831.13
1,ENST00000358533,60210 1 712376,711183,AL669831.13
1,ENST00000358533,59940 1 712376,711183,AL669831.13
1,ENST00000358533,60060 1 712376,711183,AL669831.13
1,ENST00000358533,59880 1 712376,711183,AL669831.13
1,ENST00000358533,60150 1 712376,711183,AL669831.13
1,ENST00000358533,59910 1 712376,711183,AL669831.13
1,ENST00000358533,59970 1 712376,711183,AL669831.13

It duplicates one line many times.

 
I don't quite follow the output - what's being repeated?

How do you expect to 'relate' records/lines from both files?
What's the common key/fields relating lines/records from both files?

Can you give a very simple sample of both files (1 line each) AND a desired result based on the sample data, please.

vlad
+----------------------------+
| #include<disclaimer.h> |
+----------------------------+
 
Thank you,

I wrote it in Perl this time. I thought it might be easier in awk, but I have solved the problem using Perl.
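
For anyone who lands here wanting to stay with awk: the output posted earlier repeats because the rule start <= $5 && end <= $6 holds for every data1 line once the data2 coordinates are larger than everything in data1, and because the print statement outputs $2 of data2 (the ENST id) where the data1 start was expected. Below is a minimal sketch of the rules as stated in this thread, assuming the 12-column data2 layout, storing data1 in input order, and mirroring the -1 coordinates once per record from their original values (that last point is an assumption). The names range_filter and CHR_LEN are hypothetical, and this is not the Perl solution mentioned above, which is not shown in the thread.

```shell
#!/bin/sh
# range_filter DATA1 DATA2   (hypothetical helper name)
range_filter() {
    awk -v CHR_LEN=242951149 '
    # data1: remember rows in input order, not for-in order.
    FNR == NR {
        n++
        chr[n] = $1; s[n] = $2 + 0; e[n] = $3 + 0
        next
    }
    # data2: only the $4 values 1 and -1 are handled.
    $4 == 1 || $4 == -1 {
        c5 = $5 + 0; c6 = $6 + 0
        if ($4 == -1) {
            # Mirror once per record, from the original field values.
            c5 = (CHR_LEN - $6) + 1
            c6 = (CHR_LEN - $5) + 1
        }
        # Rule from the thread: both data1 values must not exceed
        # the corresponding data2 values.
        for (i = 1; i <= n; i++)
            if (s[i] <= c5 && e[i] <= c6)
                print chr[i], s[i], e[i], $4, c5, c6, $8
    }' "$1" "$2"
}

# Demo on rows taken from the thread.
dir=$(mktemp -d)
printf '%s\n' \
    '1 59851 59880 CATTCTAGTGTAAAGTTTTAGATCTTATAT' > "$dir/data1"
printf '%s\n' \
    'ENSG00000177693 ENST00000326183 1 1 59871 58954 ENSP00000317482 OR4F5 HGNC (curated) uc001aal.1 AL627309 79501' \
    'ENSG00000185097 ENST00000332831 1 -1 611897 610959 ENSP00000329982 OR4F16 HGNC (curated) uc001abd.1 BK004219 26683' > "$dir/data2"
range_filter "$dir/data1" "$dir/data2"
rm -rf "$dir"
```

Because the rule only checks that both data1 values lie below the data2 values, it still reports every data1 line for a distant data2 record; a stricter overlap test would be needed to narrow that down.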

 