Tek-Tips is the largest IT community on the Internet today!

Members share and learn making Tek-Tips Forums the best source of peer-reviewed technical information on the Internet!

  • Congratulations strongm on being selected by the Tek-Tips community for having the most helpful posts in the forums last week. Way to Go!

lines to a hash: skipping doubles using loop control

Status
Not open for further replies.

Timtico

Technical User
Dec 20, 2008
8
NL
Hello guys,

On a bioinformatics related project I want to read in the contents of a file (fasta format) to a fastafile. The file looks for example as follows:

Code:
>1
sequence1
>2
sequence2
>3
sequence3
>1
sequence1

Now, I'm building a hash using the following code, works perfectly fine.. However, the problem is that if I read in files that contain duplicate headers with sequences (like in the sample above) the value of the second duplicate is added to the first. Which would lead to key:1 value:sequence1sequence1 (double the sequence1)

Code:
open (FASTA, $fasta_filename) or die ("\n\ncould not open $fasta_filename");

LINE: while ($line = <FASTA>)
{
next LINE if $line =~ m/^#/; #discard comments

if ($line =~ /^>/) #if line starts with a >
	{
   		$line =~ s/[\x0A\x0D]+//g; #removing those ugly whitelines and returns, bleh
	$line =~ s/(\n)(\r)//g; #remove them alllll!
	$line =~ m/(^>\w+)/i; #matches a word starting with > (Fasta)
	$hash_key = substr($1, 1) #removes first character from the string to yield the clean gen
    	}
			
	else #if the line doesnt contain a FASTA header the line is added to the value of the hash instead
{
	$line =~ s/[\x0A\x0D]+//g; 
	$line =~ s/(\s+)(\n)(\r)//g;
	$fasta_hash{$hash_key}.=$line; #if the line doesnt contain a FASTA header the line is added to the value of the hash
}
}

Now what I want to do:
- read in the lines, make a key if the line starts with a >, assign all the following lines until the next line that contains a > to the value... I know it should be relatively easy using loop controls, but I can't figure out how to do it as I do not fully understand loop controls.. Right after I read in a line that contains > I want to check if that value is already present in the hash using something like this:
Code:
if exists ($fasta_hash{$hash_key}) && do {next LINE} ;


Now I came up with something like this (but it doesnt work):

Code:
LINE: while ($line = <FASTA>)
{
next LINE if $line =~ m/^#/; #discard comments, seems to be working
$line =~ s/[\x0A\x0D]+//g; 
$line =~ s/(\s+)(\n)(\r)//g;

if ($line =~ m/(^>\w+)/i)
{
$hash_key = "";
$hash_key = substr($1, 1) #removes first character from the string to yield the clean gen
if exists ($fasta_hash{$hash_key}) && do {next LINE} ;
while ($line !~ m/(^>\w+)/i)
{
	$fasta_hash{$hash_key}.=$line; #if the line doesnt contain a FASTA header the line is added to the value of the hash
}}}

Can anybody help me out and enlighten me on how I should use loop control in this case?
 
Code:
open(FASTA,$fasta_filename)or die"\n\ncould not open $fasta_filename";
while($line=<FASTA>){
  chomp;
  $line=~s/^\s+//;
  if(ord$line==62){
        #if line starts with a >
    $line=~m/^>(\w+)/; #matches a word starting with > (Fasta)
    $hash_key=$1;
    $sequence='';
    while($line=<FASTA>){
      chomp;
      $line=~s/^\s+//;
      next if$line=~m/^#/; #discard comments
      last if ord$line==62;
      $line=~s/\s+$//;
      $sequence.=$line;
         #if the line doesnt contain a FASTA header the line is added to the value of the sequence
    }
    $fasta_hash{$hash_key}=$sequence unless exists$fasta_hash{$hash_key};
    redo unless eof(FASTA);
  }
}

Franco
: Online engineering calculations
: Magnetic brakes for fun rides
: Air bearing pads
 
Thanks alot! that code seems to work, I realized while planning the code that I needed to make a construction like you made, get into a loop after finding a ">" filling the hash and breaking out of the loop when finding the next ">". However, I failed at that since I didnt realize I can nest a while loop in a while loop with the same expression.. ofcourse that is possible!

I made some minor adjustments... your chomp didnt work since i did
Code:
while($line = FASTA)
instead of
Code:
while(FASTA)
, so I did
Code:
chomp $line

Also the
Code:
ord$line==62
, using the ASCII code for > was abit confusing and i replaced it with
Code:
 =~ m/^>/
. Why do you use ord()? so it can recognize the ">" in different text file formats?
 
Ah ok... by using ord() on a string, is the string converted to a scalar, resulting in only the first character of the string?. Ahwell, I could test it myself ofcourse :)

I will keep it's speedperformance in mind. Perhaps one day I want to search in the whole genome, and will build a hash using flatfiles of hundreds of MB.

Thanks for the help again, I know what place I will return to when I'm stuck.
 
Perl is a popular choice for manipulating these FASTA files, and there are several pages of modules already up on CPAN. It is entirely possible that someone has already done the legwork for you...

Steve

[small]"Every program can be reduced by one instruction, and every program has at least one bug. Therefore, any program can be reduced to one instruction which doesn't work." (Object::perlDesignPatterns)[/small]
 
Status
Not open for further replies.

Part and Inventory Search

Sponsor

Back
Top