Tek-Tips is the largest IT community on the Internet today!

Members share and learn making Tek-Tips Forums the best source of peer-reviewed technical information on the Internet!

  • Congratulations Mike Lewis on being selected by the Tek-Tips community for having the most helpful posts in the forums last week. Way to Go!

Extracting DNA sequences from GenBank files

Status
Not open for further replies.

akreibich07

Programmer
Jun 23, 2009
5
US
Hi all,

I'm a programmer who uses mainly Java, but have started a research project that involves a lot of data extraction from files. I was told that Perl would be best for this, but I am having a tough time getting started with it.

Using Perl, I need to extract DNA bases from a GenBank file for a given plant species. A sample GenBank file is in the attachment. It is saved on my computer as NC_001666.gb. I also have a file that is saved on my computer as NC_001666.txt. This text file has a list of all the genes and their positions in the species corresponding to NC_001666 (corn). Here is a sample of how the text file is formatted...

rbcL (56874..58304)
atpB -(54618..56114)
atpE -(54208..54621)
trnM (54020..54092)
trnV -(53158..53834)

For example, if in my command prompt I give input of the program name, the species number that I want, and the specific gene from that species whose DNA sequence I want:

perl nucleotide_bases.pl NC_001666 trnM

The program would go into NC_001666.txt, find trnM, see that it has a range from 54020 to 54092 and is on the positive strand(no negative sign). The program then goes into NC_001666.gb, goes to the long list of DNA bases at the bottom and starts at position 54020 and returns all base letters through 54092 (inclusively). So for this specific trnM, the output would be:

gcctacttaactcagtggttagagtattgctttcatacggcgggagtcattggttcaaatccaatagtaggta

If a gene has a negative next to the position range (meaning it's on the negative strand of DNA), the output should be reversed, starting from the higher position, going to the lower. Also, when a negative is there, in that output, all A's should be switched to T's, and all G's to C's and vice versa.

Also, if a gene appears more than once in a text file, give an error message that it appears more than once, and end the program.

If I could get a Perl script to return this information for any species (NC_number) I want, and any gene from that species that I want, it would be a great help in the research I am conducting. Thank you all for your time, and any help on how to write this script would be appreciated.

-akreibich07
 
What about the reply on unix.com? Have you attempted to learn any perl yet to get you started on your project?

------------------------------------------
- Kevin, perl coder unexceptional! [wiggle]
 
Well, this is not a simple assignment, and we are not here to offer a free coding service...
My suggestion is to coopt a colleague knowing about perl or hire a consultant.
Just to let you grasp the depth of the involvement required, this is how I would write the routine for reading the txt file: it returns the range, incl.the minus if present, as a string(untested)
Code:
sub read_txt_file{
  my($plant,$gene)=@_;
  my($found,$range);
  local(*F,$_);
  die"Unable to open $plant.txt"unless open(F,"$plant.txt");
    #assumes .txt is in the current directory
  while(<F>){
    unless(index($_,$gene)){
        #if case insensitive match req'd use a regex instead
      die"Duplicate gene $gene in $plant.txt"if$found;
      $found=1;
      (undef,$range)=split;
    }
  }
  close F;
  return$range;
}

Franco
: Online engineering calculations
: Magnetic brakes for fun rides
: Air bearing pads
 
Sounded like fun so I had a go at writing it.
Naughty I know but couldn't resist. :)

Code:
#!/usr/bin/perl

use warnings;
use strict;

my ($species, $gene) = @ARGV;
my ($negative, $low, $high, $count);

open my $txtfile, '<', $species . '.txt' or die "Failed to open text file for species [$species]";
while(<$txtfile>) {
    chomp;                                           # Remove trailing linefeed from data
    if( m{
          \A          # Start of record
          $gene       # Gene name
          \s+         # Whitespace
          (-?)        # Optional negative sign
          \(          # Literal open bracket
          (\d+)       # First number
          \.\.        # Two literal dots
          (\d+)       # Second number
          \)          # Literal close bracket
         }xms ) {

        ($negative, $low, $high) = ($1, $2, $3);     # Store values
        $count++;
    }
}
close $txtfile;

# Check our data
die "Gene appeared $count times in the text file!" unless($count == 1);
die "Range of [$low .. $high] does not make sense" unless($low . $high =~ m{^\d+$} and $high > $low);
my $length = $high - $low + 1;

# Collect sequence
open my $gbfile, '<', $species . '.gb' or die "Failed to open gb file for species [$species]";
my $data_area;
my $sequence;
while(<$gbfile>) {
    next unless( $data_area or m{^ORIGIN\s*$});      # Skip file until we see data area
    $data_area = 1;                                  # Flag to indicate data area found
    next unless( m{^\s*\d+(?:\s*[acgt]{10})+\s*$} ); # Skip if the record does not look like gene data
    chomp;                                           # Remove trailing linefeed from data
    s{\s*\d+\s+}{};                                  # Remove leading number
    s{\s+}{}g;                                       # Remove spaces
    $sequence .= $_;                                 # Join to the end of sequence
}

my $dna_base = substr(' ' . $sequence, $low, $length);
die "Failed to collect sufficient data for range requested" unless(length($dna_base) == $length);
if( $negative ) {
    $dna_base = reverse $dna_base;                  # Reverse sequence
#    $dna_base =~ tr[agtc][TCAG];                   # Swap a/t and g/c
}
print lc $dna_base, "\n";                           # Output result.

Let me know if it works for you.

TWB.

Trojan.
 
Hi Trojan,

I had one problem with your code, and I've been trying to fix it myself all morning, but can't figure it out. Could you have a look at it again? I know it has something to do with the $count variable for how many times the gene appears in a text file. I inputted this....

perl DNA_sequence4.pl NC_001666 atpA

I'm sure that atpA only appears once, but this is the error message...

Use of uninitialized value $count in numeric eq <==> at DNA_sequence4.pl line 31.
Use of uninitialized value $count in concatenation <.> or string at DNA_sequence4.pl line 31.
Gene appeared times in the text file! at DNA_sequence.pl line 31.


I especially thought the lack of a number before "times" in the last error message was weird. Other than this, the code looks great.

Do you also know how I would be able to eventually change the code if I would want it to output the sequence order of all the occurences of a given gene instead of giving an error message if there was more than one?

Thanks again!
 
First things first, the blank "times" is because your "atpA" does not occur at all in the ".txt" file.
Please check that again to make sure.
I should have had an error case to handle that to at least make it more readable.

As for your second question, are you saying that you want to output all the matches now and not error if there is more than one?

I'll try to get back to you again soon.



Trojan.
 
Ok, changed as you asked.

Code:
#!/usr/bin/perl

use warnings;
use strict;

my ($species, $gene) = @ARGV;
my @matches = ();

open my $txtfile, '<', $species . '.txt' or die "Failed to open text file for species [$species]";
while(<$txtfile>) {
    chomp;                                           # Remove trailing linefeed from data
    if( m{
          \A          # Start of record
          $gene       # Gene name
          \s+         # Whitespace
          (-?)        # Optional negative sign
          \(          # Literal open bracket
          (\d+)       # First number
          \.\.        # Two literal dots
          (\d+)       # Second number
          \)          # Literal close bracket
         }xms ) {

        my ($negative, $low, $high) = ($1, $2, $3);
        push @matches, {negative => $negative, low => $low, high => $high}; # Add this match to our collection
        die "Range of [$low .. $high] does not make sense" unless($low . $high =~ m{^\d+$} and $high > $low);
    }
}
close $txtfile;

# Check our data
die "Gene of $gene does not appear at all in the text file!" unless(scalar @matches);

# Collect sequence
open my $gbfile, '<', $species . '.gb' or die "Failed to open gb file for species [$species]";
my $data_area;
my $sequence;
while(<$gbfile>) {
    next unless( $data_area or m{^ORIGIN\s*$});      # Skip file until we see data area
    $data_area = 1;                                  # Flag to indicate data area found
    next unless( m{^\s*\d+(?:\s*[acgt]{10})+\s*$} ); # Skip if the record does not look like gene data
    chomp;                                           # Remove trailing linefeed from data
    s{\s*\d+\s+}{};                                  # Remove leading number
    s{\s+}{}g;                                       # Remove spaces
    $sequence .= $_;                                 # Join to the end of sequence
}

foreach my $match (@matches) {
    my $length = $match->{high} - $match->{low} + 1;
    my $dna_base = substr(' ' . $sequence, $match->{low}, $length);
    die "Failed to collect sufficient data for range requested" unless(length($dna_base) == $length);
    if( $match->{negative} ) {
        $dna_base = reverse $dna_base;               # Reverse sequence
        $dna_base =~ tr[agtc][TCAG];                 # Swap a/t and g/c
    }
    print lc $dna_base, "\n";                        # Output result.
}

Please note that the last code I sent you had a line commented out that should not have been:

# $dna_base =~ tr[agtc][TCAG];

If you use the first code then please remove the hash character from the start of the line.

Let me know how you get on.



Trojan.
 
This code is giving me the error message that the gene doesn't appear in the .txt file. So, the program is now working, but I'm inputting genes that I know are for sure in the text file....here is the file NC_001666.txt that I'm using to test it...

Name: Zea mays
FileName: NC_001666
Bases: 140384
Genes: rps12 -(92301..93101)
rps7 -(91772..92242)
ndhB -(89236..91472)
trnL -(88584..88664)
trnI -(84881..84954)
rpl23 -(84433..84714)
rpl2 -(82930..84414)
trnH (82800..82874)
rps19 -(82388..82669)
rpl22 -(81849..82295)
rps3 -(81113..81787)
rpl16 -(79519..80971)
rpl14 -(79038..79409)
rps8 -(78488..78898)
infA -(78085..78408)
rpl36 -(77871..77984)
rps11 -(77252..77683)
rpoA -(76170..77189)
petD (74726..75952)
petB (73141..74543)
psbH (72790..73011)
psbN -(72555..72686)
psbT (72401..72502)
psbB (70706..72232)
clpP -(69554..70204)
rps12 (69307..69420)
rpl20 -(68268..68627)
rps18 (67532..68044)
rpl33 (66998..67198)
psaJ (66513..66641)
trnP -(66056..66129)
trnW -(65845..65918)
petG (65611..65724)
petL (65352..65447)
psbE -(63864..64115)
psbF -(63734..63853)
psbL -(63595..63711)
psbJ -(63347..63469)
petA (61485..62447)
cemA (60553..61245)
ycf4 (59666..60223)
psaI (59193..59303)
rbcL (56874..58304)
atpB -(54618..56114)
atpE -(54208..54621)
trnM (54020..54092)
trnV -(53158..53834)
ndhC -(51855..52217)
ndhK -(51118..51801)
ndhJ -(50535..51014)
trnF (49872..49944)
trnL (48966..49508)
trnT -(48081..48153)
rps4 -(47157..47762)
trnS (46790..46876)
ycf3 -(44213..46192)
psaA -(41352..43604)
psaB -(39119..41326)
rps14 -(38663..38974)
trnR -(38228..38299)
atpA (36573..38096)
atpF (35097..36479)
atpH (34381..34626)
atpI (32820..33563)
rps2 (31858..32568)
rpoC2 (26983..31566)
rpoC1 (24733..26784)
rpoB (21468..24695)
trnC -(20125..20195)
petN -(19081..19170)
psbM (18178..18282)
trnD (17053..17126)
trnY (16621..16704)
trnE (16487..16559)
trnT (15954..16025)
trnT (15887..15958)
trnG -(13245..14013)
trnfM -(13073..13146)
trnG (12509..12579)
psbZ (12017..12205)
trnS -(11669..11756)
psbC (10092..11513)
psbD (9083..10144)
trnS -(8012..8099)
psbI (7775..7885)
psbK (7199..7384)
trnQ -(6773..6844)
rps16 -(4491..5604)
matK -(1674..3215)
trnK -(1386..3946)
psbA -(89..1150)
rps19 (140068..140349)
trnH -(139863..139937)
rpl2 (138323..139807)
rpl23 (138023..138304)
trnI (137783..137856)
trnL (134073..134153)
ndhB (131265..133501)
rps7 (130495..130965)
trnV -(127807..127878)
trnI -(124764..125784)
trnA -(123821..124699)
trnR -(119925..119998)
trnN (119600..119671)
rps15 -(117772..118008)
ndhH -(116456..117637)
ndhA -(114343..116454)
ndhI -(113707..114249)
ndhG -(112993..113523)
ndhE -(112473..112778)
psaC -(111760..112005)
ndhD -(110138..111640)
ccsA (108995..109960)
trnL (108837..108917)
rpl32 (108127..108306)
ndhF -(105072..107288)
rps15 (104729..104965)
trnN -(103066..103137)
trnR (102739..102812)
trnA (98038..98916)
trnI (96953..97973)
trnV (94859..94930)

I tried atpA, rps12, atpB....genes that I know are in there. This NC_001666.txt is saved to my computer, so I think the program must be going into the wrong file...I don't know. Any suggestions, Trojan? Thanks.
 
Trojans code also does not appear to work when I tried it. It runs but does not find any matches. But I am not inclined to try and figure out why.

Nice to "see" you again TWB [smile]

------------------------------------------
- Kevin, perl coder unexceptional! [wiggle]
 
Just needs the regex modified slightly to handle the white space at the beginning of each line, and the special case of the first line which has the extra "Genes:" text on the beginning.

Annihilannic.
 
Adding the new parentheses meant the references in the subsequent assignment also had to change.

Code:
    if( m{
          \A          # Start of record
          [COLOR=red](Genes:)?   # Optional field name
          \s+         # Whitespace[/color]
          $gene       # Gene name
          \s+         # Whitespace
          (-?)        # Optional negative sign
          \(          # Literal open bracket
          (\d+)       # First number
          \.\.        # Two literal dots
          (\d+)       # Second number
          \)          # Literal close bracket
         }xms ) {

        my ($negative, $low, $high) = ($[COLOR=red]2[/color], $[COLOR=red]3[/color], $[COLOR=red]4[/color]);

Annihilannic.
 
akreibich07 you should look in the NCBI FTP folder (ftp://ftp.ncbi.nih.gov/genomes/). GenBank provides the information you are looking for, as ready to use files.

alternately, if you know the accession numbers for all your sequences, use the batch download feature.
 
Thank you Trojan and Annihilannic for the code, and to 3inen, I will look into that as a method....thank you.

-akreibich07
 
Sorry I was away for a while there guys.

Are we all fixed and working now then?

KevinADC: Good to see you again. :)



Trojan.
 
Status
Not open for further replies.

Part and Inventory Search

Sponsor

Back
Top