Extracting a section based on a unique line

GiovanniC · Nov 5, 2001

Hi all,

I'm still struggling with Gawk and wouldn't mind some help on a 'difficult' (to me anyway) problem. I have a huge (thousands of pages) file that looks something like this:

Query=
>gi12345
Query:
Sbjct:

Query:
Sbjct:

>gi67859
etc
etc

Query=
>gi98765

What I'd like to do is compare the '>gi' lines and, if two are the same, delete the entire section from the repeated '>gi' up to the next '>gi'. BUT, as in the example above, if the repeated '>gi' is the last one in its 'Query=' block, then the section to delete runs from the repeated '>gi' up to the next 'Query='. And to make it worse, the repeated '>gi' sometimes does not fall in the same 'Query=' block as the first one. Is it possible to do this, or am I asking too much?

An easy way, I thought, would be to simply delete the repeated '>gi' and the six lines following it, but I haven't been able to come up with a script. Could someone help?

Giovanni

CaKiwi · Nov 6, 2001

Hi Giovanni,

I suggest that you post a little more of the input data, illustrating all the possibilities that you describe, along with the output that you require. In the meantime, here is some code which may help you get started. It stops printing when it finds a >gi line that is the same as the previous >gi line, and starts printing again at the next >gi line that differs.

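{
    if (/^>gi/) {
        gnum = substr($0, 4)    # text after the ">gi" prefix
        if (gnum == ghld) {
            flg = 1             # same as the previous >gi line: stop printing
        } else {
            flg = 0             # different >gi line: print again and remember it
            ghld = gnum
        }
    }
    if (!flg) print
}

CaKiwi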
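
The starting point above only compares each >gi line with the one just before it. Here is a rough sketch of how it might be extended to cover the other cases Giovanni describes: a repeat that turns up in a different 'Query=' block, and a repeat that is the last >gi in its block. It has not been run against the real file. The function name drop_repeats is made up for illustration, and using the first whitespace-separated field of the >gi line as the key is an assumption about the data.

```shell
#!/bin/sh
# drop_repeats: hypothetical helper name, not from the original thread.
# Prints its input, leaving out every section that starts at a >gi line
# whose ID was already seen anywhere earlier in the input.
drop_repeats() {
    awk '
        # A new Query= block always ends any section being skipped.
        /^Query=/ { skip = 0 }

        # Assumption: the first field of a >gi line identifies it.
        # "in" tests for the key without creating it in the array.
        /^>gi/ {
            skip = ($1 in seen)
            seen[$1] = 1
        }

        # Print every line that is not inside a skipped section.
        !skip
    ' "$@"
}
```

It could then be called as `drop_repeats input.txt > output.txt`, where both file names are placeholders. Because every ID is kept in the seen array, memory use grows with the number of distinct >gi lines, which should be fine even for a file of thousands of pages.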
