Hello guys,
For a bioinformatics-related project I want to read the contents of a file (FASTA format) into a hash. The file looks, for example, like this:
Code:
>1
sequence1
>2
sequence2
>3
sequence3
>1
sequence1
Now, I'm building a hash using the following code, which works perfectly fine. However, if I read in files that contain duplicate headers with sequences (like in the sample above), the value of the second duplicate is appended to the first. That leads to key: 1, value: sequence1sequence1 (sequence1 doubled).
Code:
open (FASTA, $fasta_filename) or die ("\n\ncould not open $fasta_filename");
LINE: while ($line = <FASTA>)
{
    next LINE if $line =~ m/^#/; #discard comments
    if ($line =~ /^>/) #if the line starts with a >
    {
        $line =~ s/[\x0A\x0D]+//g; #removing those ugly newlines and carriage returns
        $line =~ s/(\n)(\r)//g; #remove them all!
        $line =~ m/(^>\w+)/i; #matches a word starting with > (FASTA header)
        $hash_key = substr($1, 1); #removes the first character (the >) to yield the clean gene name
    }
    else #the line is not a FASTA header, so it is appended to the value of the hash instead
    {
        $line =~ s/[\x0A\x0D]+//g;
        $line =~ s/(\s+)(\n)(\r)//g;
        $fasta_hash{$hash_key}.=$line; #append the sequence line to the value for the current key
    }
}
Now what I want to do:
- Read in the lines, make a key if the line starts with a >, and assign all the following lines, up to the next line that contains a >, to the value. I know it should be relatively easy using loop controls, but I can't figure out how to do it, as I do not fully understand loop controls. Right after I read in a line that contains a >, I want to check whether that key is already present in the hash, using something like this:
Code:
if exists ($fasta_hash{$hash_key}) && do {next LINE} ;
Now I came up with something like this (but it doesn't work):
Code:
LINE: while ($line = <FASTA>)
{
    next LINE if $line =~ m/^#/; #discard comments, seems to be working
    $line =~ s/[\x0A\x0D]+//g;
    $line =~ s/(\s+)(\n)(\r)//g;
    if ($line =~ m/(^>\w+)/i)
    {
        $hash_key = "";
        $hash_key = substr($1, 1) #removes the first character (the >) to yield the clean gene name
        if exists ($fasta_hash{$hash_key}) && do {next LINE} ;
        while ($line !~ m/(^>\w+)/i)
        {
            $fasta_hash{$hash_key}.=$line; #if the line doesn't contain a FASTA header the line is added to the value of the hash
        }
    }
}
Can anybody help me out and enlighten me on how I should use loop control in this case?
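For reference, the behaviour described above could be sketched roughly as follows. This is only a minimal sketch under stated assumptions, not a definitive solution: the `read_fasta` helper, the `$skip` flag and the in-memory sample are hypothetical names introduced for illustration, and it assumes that the first occurrence of a duplicated header should be kept and later ones dropped.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the original post): reads FASTA records from
# an open filehandle and returns a hash reference of header => sequence.
# If a header shows up again, the first record wins and the repeat is skipped.
sub read_fasta {
    my ($fh) = @_;
    my %fasta_hash;
    my $hash_key;     # header of the record currently being collected
    my $skip = 0;     # true while inside a record whose header was already seen

    LINE: while (my $line = <$fh>) {
        next LINE if $line =~ /^#/;      # discard comment lines
        $line =~ s/[\x0A\x0D]+//g;       # strip LF and CR line endings
        next LINE if $line eq '';        # ignore blank lines

        if ($line =~ /^>(\w+)/) {        # header line: start a new record
            $hash_key = $1;
            $skip = exists $fasta_hash{$hash_key};
            $fasta_hash{$hash_key} = '' unless $skip;
            next LINE;
        }

        next LINE if $skip;                  # sequence line of a duplicate record
        next LINE unless defined $hash_key;  # sequence data before any header
        $fasta_hash{$hash_key} .= $line;     # append to the current record
    }
    return \%fasta_hash;
}

# Usage with the sample from the post, read from an in-memory filehandle.
my $sample = ">1\nsequence1\n>2\nsequence2\n>3\nsequence3\n>1\nsequence1\n";
open(my $fh, '<', \$sample) or die "could not open in-memory sample: $!";
my $fasta = read_fasta($fh);
close $fh;

print "$_ => $fasta->{$_}\n" for sort keys %$fasta;
# prints:
# 1 => sequence1
# 2 => sequence2
# 3 => sequence3
```

The flag is there because `next LINE` only skips the line currently being read. Calling it on a duplicate header alone would still let the sequence lines that follow be appended under the same key, so the sketch remembers that it is inside a duplicate record until the next header arrives. No inner `while` loop is needed, since the outer loop already reads one line at a time.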