lines to a hash: skipping doubles using loop control

Timtico · Dec 20, 2009

Hello guys,

On a bioinformatics related project I want to read in the contents of a file (fasta format) to a fastafile. The file looks for example as follows:

Code:

>1
sequence1
>2
sequence2
>3
sequence3
>1
sequence1

Now, I'm building a hash using the following code, works perfectly fine.. However, the problem is that if I read in files that contain duplicate headers with sequences (like in the sample above) the value of the second duplicate is added to the first. Which would lead to key:1 value:sequence1sequence1 (double the sequence1)

Code:

open (FASTA, $fasta_filename) or die ("\n\ncould not open $fasta_filename");

LINE: while ($line = <FASTA>)
{
next LINE if $line =~ m/^#/; #discard comments

if ($line =~ /^>/) #if line starts with a >
	{
   		$line =~ s/[\x0A\x0D]+//g; #removing those ugly whitelines and returns, bleh
	$line =~ s/(\n)(\r)//g; #remove them alllll!
	$line =~ m/(^>\w+)/i; #matches a word starting with > (Fasta)
	$hash_key = substr($1, 1) #removes first character from the string to yield the clean gen
    	}
			
	else #if the line doesnt contain a FASTA header the line is added to the value of the hash instead
{
	$line =~ s/[\x0A\x0D]+//g; 
	$line =~ s/(\s+)(\n)(\r)//g;
	$fasta_hash{$hash_key}.=$line; #if the line doesnt contain a FASTA header the line is added to the value of the hash
}
}

Now what I want to do:
- read in the lines, make a key if the line starts with a >, assign all the following lines until the next line that contains a > to the value... I know it should be relatively easy using loop controls, but I can't figure out how to do it as I do not fully understand loop controls.. Right after I read in a line that contains > I want to check if that value is already present in the hash using something like this:

Code:

if exists ($fasta_hash{$hash_key}) && do {next LINE} ;

Now I came up with something like this (but it doesnt work):

Code:

LINE: while ($line = <FASTA>)
{
next LINE if $line =~ m/^#/; #discard comments, seems to be working
$line =~ s/[\x0A\x0D]+//g; 
$line =~ s/(\s+)(\n)(\r)//g;

if ($line =~ m/(^>\w+)/i)
{
$hash_key = "";
$hash_key = substr($1, 1) #removes first character from the string to yield the clean gen
if exists ($fasta_hash{$hash_key}) && do {next LINE} ;
while ($line !~ m/(^>\w+)/i)
{
	$fasta_hash{$hash_key}.=$line; #if the line doesnt contain a FASTA header the line is added to the value of the hash
}}}

Can anybody help me out and enlighten me on how I should use loop control in this case?

prex1 · Dec 21, 2009

Code:

open(FASTA,$fasta_filename)or die"\n\ncould not open $fasta_filename";
while($line=<FASTA>){
  chomp;
  $line=~s/^\s+//;
  if(ord$line==62){
        #if line starts with a >
    $line=~m/^>(\w+)/; #matches a word starting with > (Fasta)
    $hash_key=$1;
    $sequence='';
    while($line=<FASTA>){
      chomp;
      $line=~s/^\s+//;
      next if$line=~m/^#/; #discard comments
      last if ord$line==62;
      $line=~s/\s+$//;
      $sequence.=$line;
         #if the line doesnt contain a FASTA header the line is added to the value of the sequence
    }
    $fasta_hash{$hash_key}=$sequence unless exists$fasta_hash{$hash_key};
    redo unless eof(FASTA);
  }
}

Franco

http://www.xcalcs.com

: Online engineering calculations

http://www.megamag.it

: Magnetic brakes for fun rides

http://www.levitans.com

: Air bearing pads

Timtico · Dec 21, 2009

Thanks alot! that code seems to work, I realized while planning the code that I needed to make a construction like you made, get into a loop after finding a ">" filling the hash and breaking out of the loop when finding the next ">". However, I failed at that since I didnt realize I can nest a while loop in a while loop with the same expression.. ofcourse that is possible!

I made some minor adjustments... your chomp didnt work since i did

Code:

while($line = FASTA)

instead of

Code:

while(FASTA)

, so I did

Code:

chomp $line

Also the

Code:

ord$line==62

, using the ASCII code for > was abit confusing and i replaced it with

Code:

 =~ m/^>/

. Why do you use ord()? so it can recognize the ">" in different text file formats?

prex1 · Dec 21, 2009

No, it is faster than a regexp, but of course this might be of little importance in your situation.

Franco

http://www.xcalcs.com

: Online engineering calculations

http://www.megamag.it

: Magnetic brakes for fun rides

http://www.levitans.com

: Air bearing pads

Timtico · Dec 23, 2009

Ah ok... by using ord() on a string, is the string converted to a scalar, resulting in only the first character of the string?. Ahwell, I could test it myself ofcourse

I will keep it's speedperformance in mind. Perhaps one day I want to search in the whole genome, and will build a hash using flatfiles of hundreds of MB.

Thanks for the help again, I know what place I will return to when I'm stuck.

stevexff · Dec 29, 2009

Perl is a popular choice for manipulating these FASTA files, and there are several pages of modules already up on CPAN. It is entirely possible that someone has already done the legwork for you...

Steve

[small]"Every program can be reduced by one instruction, and every program has at least one bug. Therefore, any program can be reduced to one instruction which doesn't work." (Object:erlDesignPatterns)[/small]

Tek-Tips is the largest IT community on the Internet today!

Members share and learn making Tek-Tips Forums the best source of peer-reviewed technical information on the Internet!

lines to a hash: skipping doubles using loop control

Timtico

Technical User

prex1

Programmer

Timtico

Technical User

prex1

Programmer

Timtico

Technical User

stevexff

Programmer

Similar threads

Part and Inventory Search

Sponsor