Hi all,
I'm still struggling with Gawk and wouldn't mind some help on a 'difficult' (to me anyway) problem. I have a huge (thousands of pages) file that looks something like this:
Query=
>gi12345
Query:
Sbjct:
Query:
Sbjct:
>gi67859
etc
etc
Query=
>gi98765
What I'd like to do is compare the '>gi' lines and, if one repeats an earlier one, delete the entire section from the repeated '>gi' up to the next '>gi'. BUT, as in the example above, if the repeated '>gi' is the last one in its 'Query=' block, then the deleted section has to run from the repeated '>gi' up to the next 'Query='. And to make it worse, sometimes the repeated '>gi' does not fall in the same 'Query=' block as the first one. Is it possible to do this, or am I asking too much?
One easy way, I thought, would be to simply delete the repeated '>gi' and the six lines following it, but I haven't been able to come up with a script. Could someone help?
Giovanni
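
One possible approach, offered as a minimal sketch rather than something tested on the real file: remember every '>gi' identifier seen so far, and when one repeats, stop printing until the next '>gi' or 'Query=' line. The sketch assumes that the first whitespace-separated field of a '>gi' line is what identifies it, that 'Query=' and '>gi' always start at the beginning of a line, and that the first occurrence is the one to keep. The function name dedup_gi and the file names in the usage line are made up for illustration.

```shell
#!/bin/sh
# dedup_gi: hypothetical helper name. Reads the named files (or stdin) and
# prints them with every repeated ">gi" section removed.
dedup_gi() {
    awk '
        # A new Query= block always ends any section being skipped.
        /^Query=/ { skip = 0; print; next }

        /^>gi/ {
            # Assumption: the first field identifies the hit.
            if ($1 in seen) {
                skip = 1        # repeated: drop this line and what follows
                next
            }
            seen[$1] = 1        # first occurrence: remember it and keep it
            skip = 0
            print
            next
        }

        # Every other line is printed unless we are inside a skipped section.
        !skip
    ' "$@"
}
```

It could be run as `dedup_gi input.txt > output.txt`. Because the seen array is kept for the whole run and never reset at 'Query=', a repeat is caught even when the first occurrence was in an earlier 'Query=' block. This also avoids counting six lines, which would break if a section turned out to be longer or shorter.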