Greetings - I have been trying 1

Isadore · Jan 15, 2013

Greetings -

I have been trying to set up a simple routine that tries to accomplish the following:

1. Find line number in a large fasta file with a particular Scaffold/Contig ID, e.g.:

Python:

f = open((sys.argv[1]), 'r')
# Get line number where Scaffold/contig begins..
for num, line in enumerate(f):
	if strScaf in line:
                intLn = num
                break

And of course this works just fine giving the file name in the python command statement.

I next want to go to this line number and move x lines down the file from this position, call this the 'target' position.

From the target line I want to capture, e.g., 3 lines upstream and 3 lines downstream of my target line. I've tried numerous ways to get this done but I haven't been able to crack this one after several hours of attempt.

To help explain this better what I am trying to do is grab a DNA sequence around a blast return. For example, consider the following:

> scaffold-1089
Length=522494

Score = 61.6 bits (148), Expect = 2e-09, Method: Compositional matrix adjust.
Identities = 29/31 (94%), Positives = 31/31 (100%), Gaps = 0/31 (0%)
Frame = +3

Sbjct 123273 QLTADSHFMKDLGLDSLDQVEIIMAMEDEFG 123365 << blast hit from Subject query (nucl position from Scaffold ID)

In this blast return I want to see both sides of this hit, generally a few hundred nucleotides on both sides. The fasta file is huge, and as you can see the scaffold is 522,494 bases. So, the idea is first to go to the line number where the scaffold ID is located, and then from there parse down x lines (50 bases at a time per line) until I reach the requisite line number that includes the beginning of my hit (123273). This can be found by dividing 123273/50 since there are 50 bases per line, e.g.,

Python:

intSeq = (intLn + (intHit/50))+2  # intLn is Scaffold ID line, intHit is beginning base of blast hit

Returns the line number where the blast hit begins. Now I want to print to stdout not only the blast hit but a few hundred bases on each side of it, so if I want 200 bases up- and downstream I'll capture 4 lines in each direction of the blast hit. I need to do this often as many of my searches requires one to go into large contigs and bring back a narrow range of lines.

In the meantime I am beginning to learn Biopython which I am sure must have something similar. Any help would be appreciated. Thanks in advance - know everyone is busy - but this should prove to be a useful routine for others trying to carry out the same process.

Isadore · Jan 15, 2013

Sorry - forgot to put a header on this thread!

Bong · Jan 16, 2013

If we accept your proposition that the input file is "huge", let's say it's too big to read in to memory. So, you can read line by line until you find what you're looking for. It looks like the search string is "strScaf" so:

Code:

for strLine in f:
    if strScaf in strLine: break

Now you can figure out how many lines you need to go back. Let's say it's "n". And further, you need also n lines after so 2*n in total. Now it would help if you knew how many bites were in each line,lb, but alternatively, you can guess high and then read to the next linefeed.

Code:

f.seek(-n*lb)
lstBase=[]
for i in xrange(2*n): lstBase.append(f.readline())

or, you know, something like that.

_________________
Bob Rashkin

Isadore · Jan 17, 2013

Bob - thanks for the reply - still working on this - getting there... will post back when I get this solved - just wanted to say thanks the help...

Isadore · Jan 17, 2013

Bob -

Here is where I am with this - and I'll try to explain in detail what I am trying to do. Any help would be appreciated as I would use this routine quite a bit.

Below is my test file, a fasta file, with two contigs:

Python:

>contigA
AGCTGGGGAGAAAAAAAAGACTAGCTAGAGATTTTAGAAAAACAGGTAGAGTCTTGGGGAAGGAAATGAAATTTAGAAC
GTGCAAAGAATCCACTGGACTGAGGGAGTTTTGTCATGAACTGGGAAGCAGCAGTAAAGTCCAGTTCCACCTCAGCAGT
CCACGAGCTGGGCTTCACCGGCAGACCTGCTGTCCCCCACCCCAGGCCAGTACCTGTGCTGTCACCAGCAGGACAGGAT
CAAGTTCTTTTATTAGATTTTTATACTGGTCTGAGATAGTCAAGTAACCAAGTGACTGCCAGAATTCAGCAATGAGAAG
GTTCACCCAGATGCTGTGCACACCTGTCGTGGTCGGGACCAGCAGTTCCCAGAGCTGATGGCAAGGCACGGGGAGGGCA
GCACTCTGCACCCAGCTCTGTCTAAGGGAGAGCTCCAGAGGCGCTGAGCAGCTCTGCGGAGCGTGAAAGGGAAATGCTG
TTGGACACAGGAAGAAACGAACGGAGGGGTAGAAATGAACTTTGTGATGGTCTGATAGTCAGAGACTTTTCAAAGGAAA
AGGCCTTTACGCTCATTGAGTCCCCCCCGTTCCCCCAGCACTGCCAGGTCCCCACTGCCCCCTGTCCCTCAGCGCCACC
TCGACACAAGGGCTGGGAAACCCCTCCAGGGATGGGGACTCCACCACTGCCCTGGGCAGCCTGGGACACACCTGGACAA
CCCTAACGGGACAGAAATGGTTCCTCAGGGACAACCTCAACCTCCCCTGGTGCAAATTGAGGCTGTTCCCTCTTGTCC
>contigB
AGTGGAATGTTTAGGTGCTTGTGTAAACGCACCGATGGTACAAATCAATGACAATTACTACGTAAGTACGAGAACCATT
TTAACCCATTCCATTTCCTTGCTGGGTCACCTGTATCTCCTTTTAGGACCTCACCAAAGCGAACCACTTGTCTGATGAA
CTTTGGCCGCTGCAGTTAGTGCAGCTGCTGACTCTGTTTTGTCAATGATTTTTTTACAGGAAGATTTGA[highlight #FCE94F]CACCCAAAGA
TATTGAAGACATAATTGACGAGCTGAAGGC[/highlight][highlight #729FCF]TGGCAAAGTTCCCAAACCGG[/highlight][highlight #FCE94F]GCCCAAGGTAGGTGCTGTTTGGGTAGTGC
TGGCCAGCTGC[/highlight]TCTGCTGGAAGGCTGAGCTTCTGAGGTTCCTGTGCCACGGAGCTGAGCTCCCAACCTACAGCTAAGGA
AATTCATTTACTGACCAGGATGGCACTGCTCAGTGTCTTTCTAGAGAGTAAAGCAAAAAAAAATTGATAGAACTGAGGA
TTTCCATGTTCTTGATGAAATGTGAAAACACCAGTCCGGATTTGGTGCAGTACCTGACAGTTGACTTTTTTTTTTGTCA
AGATCATGGAAATCAACAAATGCCTGTTGTTGTGGGTTCTTTCTGACAAACACCTTGTGAATGGGTTCTTCTCCTGAAC
AGGCCTTTACGCTCATTGAGTCCCCCCCGTTCCCCCAGCACTGCCAGGTCCCCACTGCCCCCTGTCCCTCAGCGCCACC
TCGACACAAGGGCTGGGAAACCCCTCCAGGGATGGGGACTCCACCACTGCCCTGGGCAGCCTGGGACACACCTGGACAA
CCCTAACGGGACAGAAATGGTTCCTCAGGGACAACCTCAACCTCCCCTGGTGCAAATTGAGGCTGTTCCCTCTTGTCC

The above is my test file, croc.fa. So to exexcute the python script I type in: python exSeq.py croc.fa
The blue highlighted sequence is my blast hit. The yellow region around the blast his is the sequence I want to send to stdout.
I will be working with fasta files that are typically ~ 1.2 Gb.

The design plan is this:
1. First move to the line containing the contig ID that includes my blast hit (e.g., "contigB" in the above test file).
2. Next, using seek(), move from the current contig position followed by a move in the reverse direction (bytes in front of hit highlighted yellow).
3. Now read the bytes into a list to print (as you suggested) until you get to the end of the range, then stop.
4. print the range

Here the python file I an working with now, using your suggestions but abandoning the "line" approach for bytes.

Python:

#! /usr/bin/python

import sys

print "exSeq - extract DNA in the neighborhood of a blast hit"

# Enter the scaffold/contig in which the blast hit resides
strScaf = raw_input("Enter scaffold/contig ID containing blast hit: ")

# Enter value of the initial position of the blast hit
t = raw_input("Enter the initial bp position of blast hit: ")
t = int(t)

# Enter value of the final base position of the blast hit
u = raw_input("Enter the final bp position of blast hit: ")
u = int(u)

# Enter range of bases upstream and downstream of the hit
q = raw_input("Enter no. of bp's on each side of blast hit: ")
q = int(q)

#open file for reading
f = open((sys.argv[1]), 'r')

# Move to the line with the scaffold/contig ID
for strLine in f:
	if strScaf in strLine: 
             break

# move to position of initial blast hit position from the current position
f.seek(t-q,1)

# read bytes through range 2*q + (u-t)
lstBase=[]
for x in xrange(2*q+(u-t)): lstBase.append(f.read())
print lstBase

f.close()

I realize before this is over with I'll have to deal with the "/n" file endings, etc. Running this file (blast hit at 280, range on each side 35, hit length 20) produces the following output:

Python:

exSeq - extract DNA in the neighborhood of a blast hit

Enter scaffold/contig ID containing blast hit: scaffoldB
Enter the initial bp position of blast hit: 50
Enter the final bp position of blast hit: 80
Enter no. of bp's on each side of blast hit: 10
['', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '']

Not sure where all the [apostrophies] are coming from, I suppose at the end of the file? At any rate, this is where I am at the moment. Any ideas wd be appreciated - and again thanks for your help.

Isadore · Jan 17, 2013

Ok, found a working solution.

Python:

#! /usr/bin/env python

# A typical blast hit is shown below.  The program is designed to return
# the blast hit in a large fasta file along with bp's up and downstream
# of this hit.

#> scaffold-1089 Length=522494

#Score = 61.6 bits (148),  Expect = 2e-09, Method: Compositional matrix adjust.
#Identities = 29/31 (94%), Positives = 31/31 (100%), Gaps = 0/31 (0%)
#Frame = +3

#Query  85      KLTAESHFMKDLGLDSLDQVEIIMAMEDEFG  115
#               +LTA+SHFMKDLGLDSLDQVEIIMAMEDEFG
#Sbjct  123273  QLTADSHFMKDLGLDSLDQVEIIMAMEDEFG  123365

# The scaffold/contig ID is given, followed by the range of the blast hit
# here, 123365-123272, and up- and downstream bp's are added by user input
# to return a window of bases around the blast hit.

import sys
import fileinput
LineNumber = 0

print
print "exSeq - extract DNA in the neighborhood of a blast hit"
print
# Enter the scaffold/contig in which the blast hit resides
strScaf = raw_input("Enter scaffold/contig ID containing blast hit: ")

# Enter fasta file line length
s = raw_input("Enter fasta line length: ")
s = int(s)

# Enter value of the initial position of the blast hit
t = raw_input("Enter the initial bp position of blast hit: ")
t = int(t)

# Enter number of lines upstream and downstream of the hit you want to retrieve
q = raw_input("Enter no. of lines to retrieve on each side of blast hit: ")
q = int(q)

# Move to the line with the scaffold/contig ID
# print cursor position, lineno

#open file for reading
f = open((sys.argv[1]), 'r')

for line in fileinput.input():
    if strScaf in line:
       LineNumber = fileinput.lineno()
fileinput.close()

for line in fileinput.input():
    if fileinput.lineno() >= LineNumber+((t/s)+1-q) and fileinput.lineno() <= LineNumber+((t/s)+1+q): 
        print line
fileinput.close()

Bob - this works pretty well - takes about 1 min on a 2 gig file to return a half dozen lines of sequence. The only drawback but perhaps std in Python is that I first have to read through the file to grab the file line no. of the query scaffold/contig and then go from there with another round of iteration.

Isadore · Jan 17, 2013

Left 2 old comments in the file, these lines are not functional....

# Move to the line with the scaffold/contig ID
# print cursor position, lineno

Bong · Jan 18, 2013

You could probably simplify this a little if all contigs (I'll take your word for it that that's a word) are the same size. Then you can read in that number of bytes after the ID line. Then remove the "\n"s (<string>=<string>.replace("\n","")). Now you can just slice the string: i=<string>.find(strScaf); print <string>[i-n1:i+n2]

_________________
Bob Rashkin

Isadore · Jan 18, 2013

Bob -

I really appreciate your input. In both cases there was something there to learn. I really like python - but in this case perhaps I posted a little too early - but I was stuck in the beginning - not thinking python is chronological, etc as it reads through the fasta file. Contigs are de novo assembled DNA from 100 bp pieces into overlapping neighbors that, in the end, may be 100,000 bp large, varying quite a bit (Trinity is one of the current top level assemblers).

The code below seems to be behaving. The last code I posted had several errors (my fault on that one). I am going to copy your suggestion and give it a try.

Thanks again Bob - I really appreciate your time, does make a diff. I'm in the older crowd but I must say I love the interactions on the net where people support one another - great stuff.

Python:

#! /usr/bin/env python

# A typical blast hit is shown below.  The program is designed to return
# the blast hit in a large fasta file along with bp's up and downstream
# of this hit.

#> scaffold-1089 Length=522494

#Score = 61.6 bits (148),  Expect = 2e-09, Method: Compositional matrix adjust.
#Identities = 29/31 (94%), Positives = 31/31 (100%), Gaps = 0/31 (0%)
#Frame = +3

#Query  85      KLTAESHFMKDLGLDSLDQVEIIMAMEDEFG  115
#               +LTA+SHFMKDLGLDSLDQVEIIMAMEDEFG
#Sbjct  123273  QLTADSHFMKDLGLDSLDQVEIIMAMEDEFG  123365

# The scaffold/contig ID is given, followed by the range of the blast hit
# here, 123365-123272, and up- and downstream bp's are added by user input
# to return a window of bases around the blast hit.

import sys
import fileinput
w=0
i=0

print
print "exDna.py - extract DNA in the neighborhood of a blast hit"
print
# Enter the scaffold/contig in which the blast hit resides
strScaf = raw_input("Enter scaffold/contig ID containing blast hit: ")

# Enter fasta file line length
s = raw_input("Enter fasta line length: ")
s = int(s)

# Enter value of the initial position of the blast hit
t = raw_input("Enter initial bp position of blast hit: ")
t = int(t)

# Enter value of the final position of the blast hit
u = raw_input("Enter final bp position of blast hit: ")
u = int(u)

# create reference for upstream bases
w=u-t
w=int(w)

# Enter number of lines upstream and downstream of the hit you want to retrieve
q = raw_input("Enter no. of lines to retrieve on each side of hit: ")
q=int(q)

#open file for reading
f = open((sys.argv[1]), 'r')

# find the scaffold and capture Line no, use counter
for line in fileinput.input():
    if strScaf in line:
       LineNumber = fileinput.lineno()
       i+=1
fileinput.close()

# print lines found between requested bases
for line in fileinput.input():
    if fileinput.lineno() >= LineNumber+((t/s)+1-q) and fileinput.lineno() <= LineNumber+(((t+w)/s)+1+q): 
        print line
fileinput.close()

# close script and exit
raise SystemExit

Bong · Jan 20, 2013

Glad to help. By the way, I'm in the older crowd, too (born during the Truman administration).

_________________
Bob Rashkin

Isadore · Jan 21, 2013

Hey Bob - Thanks again. This routine was something I really needed because I do a lot of blasting against raw genomes. What that means is I have a lot of large fasta files, with hundreds of thousands of scaffolds and contigs with a file size > 2 gigs.

A typical fasta file has the following format:

> mySpecies chromosome 12:199898-188987 ...et al..info..
ACTCTTTGTTTAATGTGATAGCTGTTTGGCATGATTACGACTGGGCCAGAGCTCTG
TTTGCATTTCCACTGGGTGAGGATTTGTGGATAGGGAAGAGAGTAGCTACACTTCA
GACCTCCCTCTATGCTCATCCGGTCGAGCATGTGGTAAGCGGGGAGAGTAGGAGAG
....may go on for 7 megs worth of DNA
> mySpecies chromosome 1:8373734-943433
ACTCTTTGTTTAATGTGATAGCTGTTTGGCATGATTACGACTGGGCCAGAGCTCTG
TTTGCATTTCCACTGGGTGAGGATTTGTGGATAGGGAAGAGAGTAGCTACACTTCA
GACCTCCCTCTATGCTCATCCGGTCGAGCATGTGGTAAGCGGGGAGAGTAGGAGAG
...
...
100,000+ fastas, some small, others long
...
...

So, here's the deal Bob. When I use blast+ to query a particular protein or DNA sample I have I can find it with blast+ but many times I need to see what is up- and downstream of this hit. The alignment that comes back from the blast+ run looks like this:
...
...
...
Sbjct 123273 QLTADSHFMKDLGLDSLDQVEIIMAMEDEFG 123365

Now, these are amino acids which represent the underlying DNA. Most of the time I need to see upstream and downstream of this hit but as you can imagine it is very time consuming to try to do this the long way, hence I needed this script to yank out the local environment of the hit from within 2+ gigs of "....GGATTTGTGGATAGGGAAGAGAGTAGC....".

Here is my final script - works like a charm - and very surprisingly it is fast. It takes < 5 seconds to strip out the DNA around the hit + the hit so I can futher analyze it.

Python:

#! /usr/bin/env python

# This script extracts a local region around a blast+ hit from a fasta
# file containing large scaffolds or contigs.

# import classes
import sys
from sys import exit

# initialize
x=0
z=0
e=0
strQ=""
strdna=""
strID=""
strOp=""
str2Opt=""
strStp=""
strSC=""

print
print  "exDna.py - extract DNA in the neighborhood of a blast hit"
print
print  "The following routine will extract a specific scaffold or contig"
print  "from the underlying " + sys.argv[1] + " file. This procedure will remove"   
print  "all white space, headers and linebreaks leaving a single string"
print  "with DNA only. This will facilitate cursor movment during the"
print  "querying process."
print  

# loop for consecutive queries on same scaffold/contig
while True:
	if e==0:
		# Enter the scaffold/contig ID for blast hit resides
		strID = raw_input("Enter scaffold/contig ID containing blast hit: ")

        if e==0 or e==3:  # read and process new file/scaffold
		# read the scaffold into a string variable after stripping header and line returns                            
		f = open((sys.argv[1]), 'r')
		in_range = False
       		e=1  # by-pass ID entry for..if below
		for line in f:
			if strID in line:
				in_range = True
                 		pass  # drop header line                       
			elif in_range == True:
				if ">" in line:
					break
				else:
                       			strdna = line
		 			strdna = strdna.replace("\n", "")
                        		strSC = strSC + strdna
                	else:
				pass

	# Enter value of the initial position of the blast hit
	t = raw_input("Enter initial bp position of blast hit: ")
	t = int(t)

	# Enter value of the final position of the blast hit
	u = raw_input("Enter final bp position of blast hit: ")
	u = int(u)

	# Enter no. of bp's up- and downstream of the hit you want to retrieve
	q = raw_input("Enter no. bp's to get up- and downstream: ")
	q=int(q)
        print

	# trim and print..
        strQ = strSC                   # reload strQ with strSC holding scaffold
	z=len(strQ)                    # len of file for right trim
	z=z-u-q                        # right trim limit
	t=t-q                          # left trim limit
	strQ = strQ[:-z]               # trim right
	strQ = strQ[t:]                # trim left
	print strQ                     # send qy string to stdout
	print

	# Continue with more queries?"
        strStp = raw_input("Do you want to continue? [Y/n]: ")
        if strStp =="N" or strStp == "n":
                print
        	exit()                      # exit script
        print
	strOpt = raw_input("Would you like to run another query from " + sys.argv[1] + "? (Y/n): ")
        print
	if strOpt == "N" or strOpt == "n":
                e=3
		# Enter the name of new fasta file
		strFile = raw_input("Enter name/path to new fasta file: ")
                print
                # Enter the scaffold/contig ID for blast hit resides
		strID = raw_input("Enter scaffold/contig ID containing blast hit: ")
		sys.argv[1] = strFile
                print

Thanks for your time Bob, best luck to you and thanks for taking the time. During my building of the script I was able to test and re-test your suggestions. I loved the listbox but fileinput seemed to have some issues with it so I went with appending to a string. Talk to you later.

Tek-Tips is the largest IT community on the Internet today!

Members share and learn making Tek-Tips Forums the best source of peer-reviewed technical information on the Internet!

Greetings - I have been trying 1

Isadore

Technical User

Isadore

Technical User

Bong

Programmer

Isadore

Technical User

Isadore

Technical User

Isadore

Technical User

Isadore

Technical User

Bong

Programmer

Isadore

Technical User

Bong

Programmer

Isadore

Technical User

Similar threads

Part and Inventory Search

Sponsor