Extracting a section based on a unique line

Status
Not open for further replies.

GiovanniC

Technical User
Nov 5, 2001
15
AU
Hi all,

I'm still struggling with Gawk and wouldn't mind some help on a 'difficult' (to me anyway) problem. I have a huge (thousands of pages) file that looks something like this:

Query=
>gi12345
Query:
Sbjct:

Query:
Sbjct:

>gi67859
etc
etc

Query=
>gi98765

What I'd like to do is compare the '>gi' lines and, if one repeats an earlier one, delete the entire section from the repeated '>gi' up to the next '>gi'. But, as in the example above, if the repeated '>gi' is the last one in its 'Query=' block, the section to delete runs from the repeated '>gi' up to the next 'Query='. To make it worse, the repeated '>gi' sometimes does not fall in the same 'Query=' block as the original. Is it possible to do this, or am I asking too much?

An easy way, I thought, would be to simply delete the repeated '>gi' and the six lines following it, but I haven't been able to come up with a script. Could someone help?

Giovanni
 
Hi Giovanni,

I suggest that you post a little more of the input data, illustrating all the possibilities that you describe, along with the output that you require. In the meantime, here is some code which may help you get started. It stops printing when it finds a >gi line that is the same as the previous >gi line, and starts printing again at the next >gi line that differs.

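{
    # On each ">gi" line, compare its ID with the previous ">gi" ID.
    if (/^>gi/) {
        gnum = substr($0, 4)    # text after the ">gi" prefix
        if (gnum == ghld) {
            flg = 1             # same as previous: stop printing
        } else {
            flg = 0             # different ID: resume printing
            ghld = gnum         # remember it for the next comparison
        }
    }
    if (!flg) print
}

CaKiwi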
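
For the other cases raised in the question (a repeated '>gi' that is not adjacent to its first occurrence, or one that is the last entry in its 'Query=' block), something along these lines might be a starting point. It is only a sketch, and it assumes that every section starts with a line beginning '>gi', that every block starts with a line beginning 'Query=', and that two '>gi' lines count as the same only when the whole line matches. The function name dedup_gi and the variable names seen and skip are made up for illustration.

```shell
#!/bin/sh
# dedup_gi (hypothetical name): reads the report on stdin and writes it to
# stdout, dropping every section whose ">gi" line has already appeared.
dedup_gi() {
    awk '
        # A new "Query=" block always resumes printing, so a repeated
        # section that is last in its block ends here.
        /^Query=/ { skip = 0 }

        # On a ">gi" line, skip the section if this exact line was seen
        # before (in any block), then record it as seen.
        /^>gi/    { skip = ($0 in seen); seen[$0] = 1 }

        # Print every line that is not inside a skipped section.
        !skip
    '
}
```

It could be run as dedup_gi < input.txt > output.txt, where both file names are placeholders. Note that a 'Query=' line is still printed even if every section under it turns out to be a repeat, and that the seen array holds one entry per distinct '>gi' line, which should be manageable even for a very large file.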
 