compare lines in two files and extract common line plus more info 2

GiovanniC · Jul 29, 2004

Hi,

I'm trying to do the usual compare of two files and pull out the common field, but in addition I want some extra information from file two to be extracted as well. I've tried a few different approaches, but I think it's time to ask the experts.

I've placed an example of the two files at this link

http://wwwdev.scu.edu.au/research/cpcg/question/example.doc

Basically, I want to compare the two files, pull out the common heading (which commences with '>') and also the additional information which is listed in any number of rows below this heading until the next header.

Hope someone has a solution. Many thanks,

Giovanni

iribach · Jul 29, 2004

See the man pages for comm, diff, cmp and sort.

GiovanniC · Jul 29, 2004

iribach

Thanks for the very quick response and your confidence in my abilities ... but I'm afraid I'm not that far advanced to take the suggestion and put it into a workable script :-(

iribach · Jul 29, 2004

Giovanni, do you speak my language?
comm fileA fileB
gives you three-column output:
col1: items unique to file A
col2: items unique to file B
col3: items common to file A and file B
Are you on Unix?
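
A minimal sketch of the comm approach described above, assuming both inputs are already sorted (comm requires that). The temporary directory and the short sample files are hypothetical, made up for illustration only. Note that comm compares whole lines, so it only finds headers that are identical in both files.

```shell
# Hypothetical sample data; comm expects both inputs to be sorted.
tmp=$(mktemp -d)
printf '%s\n' '>TA0001' '>TA0002' '>TA0006' > "$tmp/fileA"
printf '%s\n' '>TA0001' '>TA0002' '>TA0003' > "$tmp/fileB"

# -1 and -2 suppress the columns of lines unique to each file,
# leaving only column 3: the lines common to both files.
common=$(comm -12 "$tmp/fileA" "$tmp/fileB")
printf '%s\n' "$common"

rm -rf "$tmp"
```

This only yields the shared header lines; it cannot pull out the rows that follow a header, which is why a small awk program is a better fit for the rest of the task.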

GiovanniC · Jul 29, 2004

iribach

I assume you asked me what language I speak ... only English unfortunately.

I can compare the two files and pull out the common item, but I don't know how to pull out the extra information from file2. I'm on UNIX.

Thanks

iribach · Jul 29, 2004

Yes, a Giovanni who speaks only English is rare.
You have two options:
use the native *nix tools, piped together,
e.g. aa | bb | cc | dd
or
write your own program in awk, Perl, C or something better.

GiovanniC · Jul 29, 2004

I'm using gawk and I already have the scripts to compare the two files to pull out the common field. The example that I have posted in the document file shows that file2.txt has additional rows of information below the common field. It is this additional information that I am trying to capture and do not know how to write a script for.

PHV · Jul 30, 2004

Something like this?
awk '
BEGIN{while((getline<"File1.txt")>0)++a[$0]}
/^>/{flg=a[$0]+0}
flg
' File2.txt
Feel free to explain the expected result in more detail.

Hope This Helps, PH.
Want to get great answers to your Tek-Tips questions? Have a look at FAQ219-2884 or FAQ222-2244

GiovanniC · Jul 30, 2004

Thanks PHV, I'll try it out when I get back to work in a couple of weeks, but for now I'm off on a long-awaited holiday.

GiovanniC · Aug 8, 2004

Hi PHV,

I've just tried your suggestion but the script doesn't run. Couldn't work out what the problem was. But basically, I have two files. fileA contains the lines:

>TA0001
>TA0002
>TA0006
>TA0008
>TA0012

fileB looks like this:

>TA0001 some other information
GATAGGATTAGATCGATGATGATAGAGA
TAGATTABGATAGAGTATAGGATAGAGA
TAGATAGATATATAGATATAATATATA
>TA0002 some other information
GGATTAGATCGATGATGATATAGAGTAT
AGATTABGATAGAGTATAGGATAG
>TA0003 some other information
GATAGGATTAGATCGATGATGATAGAGA
TAGATTABGATAGAGTATAGGATAGAGA
TAGATAGATATATAGATATAATATATA
AGATTABGATAGAGTATAGGATAG
>TA0004 some other information
GATAGGATTAGATCGATGATGATAGAGA
TAGATAGATTABGA
and so on and so forth in numerical order for the tag lines.

What I want to do is, compare the individual lines from fileA (all start with '>') with the lines in fileB that start with '>'. If the line in fileA matches with $1 of a line in fileB starting with '>', I want to pick out that line from fileB and also the sequence information below that line in fileB until the next line that starts with the '>' character. I hope I am making sense.

Thanks for any help anyone can provide.

mikevh · Aug 9, 2004

This seems to work with the example you posted.
I don't write much awk these days, so this code may not be the most elegant.

Code:

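#!/usr/bin/gawk -f
# Usage: gawk -f thisscript fileA fileB   (adjust the gawk path above for your system)

# Read the first file (fileA) into an array keyed on the tag.
NR == FNR {
    arr[$1] = 1
    next
}
# On each header line of fileB, set mat to 1 if its tag is in the array, else 0.
/^>/ {
    mat = ($1 in arr)
}
# Print the current line while mat is 1.
mat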

Output:
>TA0001 some other information
GATAGGATTAGATCGATGATGATAGAGA
TAGATTABGATAGAGTATAGGATAGAGA
TAGATAGATATATAGATATAATATATA
>TA0002 some other information
GGATTAGATCGATGATGATATAGAGTAT
AGATTABGATAGAGTATAGGATAG

HTH

CaKiwi · Aug 9, 2004

Try modifying PHV's script as follows:

BEGIN{while((getline<"File1.txt")>0)++a[$0]}
/^>/{flg=a[$1]}
flg

CaKiwi
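
For anyone following along, here is a self-contained sketch of why the change from $0 to $1 matters. The header lines in fileB carry extra text after the tag, so the whole line ($0) never equals an entry from fileA, whereas the first field ($1) does. The temporary directory, the shortened sample data and the variable names (wanted, keep, result) are assumptions made for illustration; this is one possible arrangement, not necessarily the script the original poster ended up with.

```shell
# Hypothetical sample data, shortened from the example in this thread.
tmp=$(mktemp -d)
printf '%s\n' '>TA0001' '>TA0002' '>TA0006' > "$tmp/fileA"
cat > "$tmp/fileB" <<'EOF'
>TA0001 some other information
GATAGGATTAGATCGATGATGATAGAGA
>TA0002 some other information
GGATTAGATCGATGATGATATAGAGTAT
>TA0003 some other information
TAGATAGATATATAGATATAATATATA
EOF

result=$(awk '
    NR == FNR { wanted[$1] = 1; next }   # first file: remember each tag
    /^>/      { keep = ($1 in wanted) }  # header line: keep this record only if its tag is wanted
    keep                                 # print the header and sequence lines of kept records
' "$tmp/fileA" "$tmp/fileB")
printf '%s\n' "$result"

rm -rf "$tmp"
```

With this sample data the records for >TA0001 and >TA0002 are printed and the >TA0003 record is skipped. Using `$1 in wanted` rather than `wanted[$1]` avoids creating empty array entries for tags that are not in fileA.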

GiovanniC · Aug 12, 2004

Thanks everyone who provided suggestions ... the combination of them has helped me solve my little problem.

Cheers
