compare lines in two files and extract common line plus more info

Status
Not open for further replies.

GiovanniC

Technical User
Nov 5, 2001
15
AU
Hi,

I'm trying to do the usual compare of two files and pull out the common field, but in addition to that, I want some additional information from file two to be extracted as well. I've tried a few different approaches, but I think it's time to ask the experts.

I've placed an example of the two files at this link

Basically, I want to compare the two files, pull out the common heading (which commences with '>') and also the additional information which is listed in any number of rows below this heading until the next header.

Hope someone has a solution. Many thanks,

Giovanni
 
iribach

Thanks for the very quick response and your confidence in my abilities ... but I'm afraid I'm not that far advanced to take the suggestion and put it into a workable script :-(
 
giovanni, do you speak my language?
comm fileA fileB
gives you three columns of output (both files must be sorted):
col1: lines unique to file A
col2: lines unique to file B
col3: lines common to file A and file B
are you on unix?
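
For reference, a minimal sketch of the comm behaviour described above. The file names and contents are invented for illustration, and both files are already sorted, which comm requires:

```shell
#!/bin/sh
# Sketch only: two tiny, already-sorted sample files (made-up contents).
export LC_ALL=C
tmp=$(mktemp -d)
printf 'apple\nbanana\ncherry\n' > "$tmp/fileA"
printf 'banana\ncherry\ndate\n'  > "$tmp/fileB"

# With no options, comm prints three tab-indented columns:
# lines only in fileA, lines only in fileB, lines in both.
comm "$tmp/fileA" "$tmp/fileB"

# -1 and -2 suppress the first two columns, leaving only the common lines.
common=$(comm -12 "$tmp/fileA" "$tmp/fileB")
printf '%s\n' "$common"

rm -rf "$tmp"
```

Note that comm compares whole lines, so it only finds a match when a line is identical in both files.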
 
iribach

I assume you asked me what language I speak ... only English unfortunately.

I can compare the two files and pull out the common item, but I don't know how to pull out the extra information from file2. I'm on UNIX.

Thanks

 
yes, a giovanni who speaks only english is rare.
you have 2 ways:
pipe native *nix tools together,
e.g.: aa|bb|cc|dd
or
write your own program in awk, perl, c or something better.
:)
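
As a rough sketch of the first route (piping standard tools), assuming header lines start with '>' and the ID is the first field. The sample data and the temporary file names are invented here:

```shell
#!/bin/sh
# Sketch only: find the header IDs that appear in both files with a pipeline.
export LC_ALL=C
tmp=$(mktemp -d)
printf '>TA0001\n>TA0002\n>TA0006\n' > "$tmp/fileA"
printf '>TA0001 info\nGATA\n>TA0003 info\nTAGA\n>TA0002 info\nGGAT\n' > "$tmp/fileB"

# comm needs sorted input, so sort the wanted IDs first.
sort "$tmp/fileA" > "$tmp/sortedA"

# grep keeps the header lines, awk keeps only the first field,
# sort orders them, and comm -12 prints the IDs present in both inputs.
common_ids=$(grep '^>' "$tmp/fileB" | awk '{print $1}' | sort |
             comm -12 - "$tmp/sortedA")
printf '%s\n' "$common_ids"

rm -rf "$tmp"
```

This only recovers the common headers. It does not carry along the rows below each header, which is why the second route (a small awk program) is the more natural fit for that part.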
 
I'm using gawk and I already have the scripts to compare the two files to pull out the common field. The example that I have posted in the document file shows that file2.txt has additional rows of information below the common field. It is this additional information that I am trying to capture and do not know how to write a script for.
 
Something like this ?
awk '
BEGIN{while((getline<"File1.txt")>0)++a[$0]}
/^>/{flg=a[$0]+0}
flg
' File2.txt
Feel free to explain the expected result in more detail.

Hope This Helps, PH.
Want to get great answers to your Tek-Tips questions? Have a look at FAQ219-2884 or FAQ222-2244
 
Thanks PHV, I'll try it out when I get back to work in a couple of weeks - but for now, I'm off to a long awaited holiday :)
 
Hi PHV,

I've just tried your suggestion but the script doesn't run. Couldn't work out what the problem was. But basically, I have two files. fileA contains the lines:

>TA0001
>TA0002
>TA0006
>TA0008
>TA0012

fileB looks like this:

>TA0001 some other information
GATAGGATTAGATCGATGATGATAGAGA
TAGATTABGATAGAGTATAGGATAGAGA
TAGATAGATATATAGATATAATATATA
>TA0002 some other information
GGATTAGATCGATGATGATATAGAGTAT
AGATTABGATAGAGTATAGGATAG
>TA0003 some other information
GATAGGATTAGATCGATGATGATAGAGA
TAGATTABGATAGAGTATAGGATAGAGA
TAGATAGATATATAGATATAATATATA
AGATTABGATAGAGTATAGGATAG
>TA0004 some other information
GATAGGATTAGATCGATGATGATAGAGA
TAGATAGATTABGA
and so on and so forth in numerical order for the tag lines.

What I want to do is, compare the individual lines from fileA (all start with '>') with the lines in fileB that start with '>'. If the line in fileA matches with $1 of a line in fileB starting with '>', I want to pick out that line from fileB and also the sequence information below that line in fileB until the next line that starts with the '>' character. I hope I am making sense.

Thanks for any help anyone can provide.


 
This seems to work with the example you posted.
I don't write much awk these days, so this code may not be the most elegant.
Code:
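#!/usr/bin/gawk -f
# Adjust the interpreter path above to wherever gawk lives on your system.
# Pass fileA first, then fileB.

# First file: store each line as an array key.
NR == FNR {
    arr[$0] = 1
    next
}
# Second file, header line: set mat to 1 if $1 is in the array, else to 0.
/^>/ {
    mat = ($1 in arr)
}
# Print the line while mat is 1.
mat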

Output:
>TA0001 some other information
GATAGGATTAGATCGATGATGATAGAGA
TAGATTABGATAGAGTATAGGATAGAGA
TAGATAGATATATAGATATAATATATA
>TA0002 some other information
GGATTAGATCGATGATGATATAGAGTAT
AGATTABGATAGAGTATAGGATAG
HTH




 
Try modifying PHV's script as follows:

BEGIN{while((getline<"File1.txt")>0)++a[$0]}
/^>/{flg=a[$1]}
flg
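
The change matters because the header lines in the second file carry extra text after the tag, so the whole line ($0) never equals an entry from the first file, while the first field ($1) does. A self-contained sketch on a shortened copy of the sample data follows. The variable name ids and the temporary directory are introduced here for illustration only, so the header file name is not hard-coded:

```shell
#!/bin/sh
# Sketch only: run the corrected logic on a shortened copy of the sample data.
tmp=$(mktemp -d)
cat > "$tmp/File1.txt" <<'EOF'
>TA0001
>TA0002
>TA0006
EOF
cat > "$tmp/File2.txt" <<'EOF'
>TA0001 some other information
GATAGGATTAGATCGATGATGATAGAGA
>TA0002 some other information
GGATTAGATCGATGATGATATAGAGTAT
>TA0003 some other information
TAGATAGATATATAGATATAATATATA
EOF

# "ids" is a hypothetical variable holding the path of the header file.
result=$(awk -v ids="$tmp/File1.txt" '
BEGIN { while ((getline line < ids) > 0) a[line] = 1 }  # load wanted headers
/^>/  { flg = ($1 in a) }  # header line: compare only the first field
flg                        # print the header and the lines that follow it
' "$tmp/File2.txt")
printf '%s\n' "$result"

rm -rf "$tmp"
```

Using ($1 in a) instead of a[$1] is a small design choice: it tests for membership without creating an empty array entry for every header that is not wanted.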

CaKiwi
 
Thanks everyone who provided suggestions ... the combination of them has helped me solve my little problem.

Cheers
 