Greetings -
I have been trying to set up a simple routine that tries to accomplish the following:
1. Find line number in a large fasta file with a particular Scaffold/Contig ID, e.g.:
And of course this works just fine giving the file name in the python command statement.
I next want to go to this line number and move x lines down the file from this position, call this the 'target' position.
From the target line I want to capture, e.g., 3 lines upstream and 3 lines downstream of my target line. I've tried numerous ways to get this done but I haven't been able to crack this one after several hours of attempt.
To help explain this better what I am trying to do is grab a DNA sequence around a blast return. For example, consider the following:
> scaffold-1089
Length=522494
Score = 61.6 bits (148), Expect = 2e-09, Method: Compositional matrix adjust.
Identities = 29/31 (94%), Positives = 31/31 (100%), Gaps = 0/31 (0%)
Frame = +3
Sbjct 123273 QLTADSHFMKDLGLDSLDQVEIIMAMEDEFG 123365 << blast hit from Subject query (nucl position from Scaffold ID)
In this blast return I want to see both sides of this hit, generally a few hundred nucleotides on both sides. The fasta file is huge, and as you can see the scaffold is 522,494 bases. So, the idea is first to go to the line number where the scaffold ID is located, and then from there parse down x lines (50 bases at a time per line) until I reach the requisite line number that includes the beginning of my hit (123273). This can be found by dividing 123273/50 since there are 50 bases per line, e.g.,
Returns the line number where the blast hit begins. Now I want to print to stdout not only the blast hit but a few hundred bases on each side of it, so if I want 200 bases up- and downstream I'll capture 4 lines in each direction of the blast hit. I need to do this often as many of my searches requires one to go into large contigs and bring back a narrow range of lines.
In the meantime I am beginning to learn Biopython which I am sure must have something similar. Any help would be appreciated. Thanks in advance - know everyone is busy - but this should prove to be a useful routine for others trying to carry out the same process.
I have been trying to set up a simple routine that tries to accomplish the following:
1. Find line number in a large fasta file with a particular Scaffold/Contig ID, e.g.:
Python:
f = open((sys.argv[1]), 'r')
# Get line number where Scaffold/contig begins..
for num, line in enumerate(f):
if strScaf in line:
intLn = num
break
And of course this works just fine giving the file name in the python command statement.
I next want to go to this line number and move x lines down the file from this position, call this the 'target' position.
From the target line I want to capture, e.g., 3 lines upstream and 3 lines downstream of my target line. I've tried numerous ways to get this done but I haven't been able to crack this one after several hours of attempt.
To help explain this better what I am trying to do is grab a DNA sequence around a blast return. For example, consider the following:
> scaffold-1089
Length=522494
Score = 61.6 bits (148), Expect = 2e-09, Method: Compositional matrix adjust.
Identities = 29/31 (94%), Positives = 31/31 (100%), Gaps = 0/31 (0%)
Frame = +3
Sbjct 123273 QLTADSHFMKDLGLDSLDQVEIIMAMEDEFG 123365 << blast hit from Subject query (nucl position from Scaffold ID)
In this blast return I want to see both sides of this hit, generally a few hundred nucleotides on both sides. The fasta file is huge, and as you can see the scaffold is 522,494 bases. So, the idea is first to go to the line number where the scaffold ID is located, and then from there parse down x lines (50 bases at a time per line) until I reach the requisite line number that includes the beginning of my hit (123273). This can be found by dividing 123273/50 since there are 50 bases per line, e.g.,
Python:
intSeq = (intLn + (intHit/50))+2 # intLn is Scaffold ID line, intHit is beginning base of blast hit
In the meantime I am beginning to learn Biopython which I am sure must have something similar. Any help would be appreciated. Thanks in advance - know everyone is busy - but this should prove to be a useful routine for others trying to carry out the same process.