Perl performance issues

smokinace · Jun 30, 2008

I am trying to run a program that takes two fasta sequencing files and compares them. the program has to compare 38,000 file together to make 1 billion comparisons. when i run the program it does about 12,000 comparisons a minute. this would take 60 days to complete!
i am noticing that the program is only using 5 or 6 percent of the processor. The machine has 4 processors so i have no ideas why it uses so little power. Is there any way to speed up performance?

code:

#print "Input number?\n";
$input = $ARGV [1];

#print "output number?\n";
$output = $ARGV [2];

#print "comparison number?\n";
$file = $ARGV [3];
chomp $file;
# print "$match, $mismatch, $gap, $ext \n";

if ($helic =~ /y/) {

open (FILE, "helicoselist.txt") or die "Couldn't open location file: $!\n";
@filename = <FILE>;
$scalefile = @filename;
close (FILE);

$helifile=">/home/wjones/Shared/usbdrive/sequence/Sorted/helicose/helisheet" . $file . ".txt";
#print "helifile = $helifile\n";
open MYFILE, $helifile;
print MYFILE "Database\t","Query\t","Best Score\t","Percent\t", "Fraction\t","Line of Query Match\t", "Line of Subject Match\t","\n";
close (MYFILE);

}

else {

open (FILE, "refseqlist.txt") or die "Couldn't open location file: $!\n";
@filename = <FILE>;
$scalefile = @filename;
close (FILE);

$seqfile=">/home/wjones/Shared/usbdrive/sequence/Sorted/refseq/refsheet" . $file . ".txt";
open MYFILE, $seqfile;
print MYFILE "Database\t","Query\t","Best Score\t","Percent\t", "Fraction\t","Line of Query Match\t", "Line of Subject Match\t","\n";
close (MYFILE);
}

LINE :for ($i=$input; $i<=$output; $i++) {

for ($j=0; $j<$i; $j++) {

$database = $filename [$j];
chomp $database;
$query = $filename [$i];
chomp $query;
#print "database=$database , query=$query \n";

#`/home/wjones/Shared/usbdrive/sequence/blast-2.2.18/bin/bl2seq -i d001/NM_001083111.1 -j d001/NM_001007214.1 -p blastn -r 1 -q -1 -G 20 -E 2 > compared.txt`;

`/home/wjones/Shared/usbdrive/sequence/blast-2.2.18/bin/bl2seq -i "$query" -j "$database" -p blastn -r 1 -q -2 -G 20 -E 2 > compared"$file".txt`;
$compare = "compared" . $file . ".txt";
#print $compare, "\n";
open (COMP, $compare) or die "Couldn't open compared file\n";

$bestscore = 0;

while (<COMP>) { # sets each line to $_ for each iteration

$newscore = 0;
chomp $_;
#print "Text line: $_ \n";

if ($_ =~ / Score/) {

($var1, $score, $var3) = split /[()]/, $_; #use square brackets to parse on parentheses
#print "Var1 is: $var1 \n";
#print "The score is: $var2 \n";
#print "Var3 is: $var3\n";
$line2 = readline *COMP; #Identities
$line3 = readline *COMP; #Strand
$line4 = readline *COMP;
$line5 = readline *COMP;
$line6 = readline *COMP; #query line
$line7 = readline *COMP;
$line8 = readline *COMP; #subject line

#print "$line2";
#print "$line3";

if ($line3 =~ /Strand = Plus \/ Plus/) {

$newscore = $score;
#print "$newscore\n";
chomp $line2;
my ($dat1, $dat2, $dat3, $dat4, $dat5, $dat6) = split /[ ]/, $line2;
#print "$dat3,$dat4,$dat5\n";
($fraction, $excess1 ) = split /\//, $dat4;
#print "$fraction\n";
my ($excess2, $perctemp, $excess3) = split /[()]/, $dat5;
#print "$perctemp\n"; #percent with sign
($percent, $excess4) = split /%/, $perctemp;
#print "$percent\n";
($qu1, $queryln, $qu2, $qu3) = split / /, $line6;
#print "$queryln\n";
($qu1, $sbjctln, $qu2, $qu3) = split / /, $line8;
#print "$sbjctln\n";

}

if ($newscore > $bestscore) {

$bestscore = $newscore;
$bestfrac = $fraction;
$bestperc = $percent;
#print "$percbest, $percent";

}
}

#print "The score still is: $bestscore \n";

}

# print "The new score is: $bestcore, \n";
#print " $database\n $bestscore\n $query\n";
#print " percent: $bestperc Fraction: $bestfrac\n";

if ($helic =~ /y/) {
$helifile2 = ">" . $helifile;
open MYFILE, $helifile2;
if ($bestscore == 0) {

#print "Not valid!!\n";
print MYFILE "$database\t","$query\t","0\t","0\t", "0\t",".\t", ".\t","\n";
print MYFILE "$query\t","$database\t","0\t","0\t", "0\t",".\t", ".\t","\n";
close (MYFILE);

}

else {

print MYFILE "$database\t","$query\t","$bestscore\t","$bestperc\t", "$bestfrac\t","$queryln\t", "$sbjctln\t","\n";
print MYFILE "$query\t","$database\t","$bestscore\t","$bestperc\t", "$bestfrac\t","$queryln\t", "$sbjctln\t","\n"; # relationship is communative
close (MYFILE);

}
}

if ($helic =~ /n/) {
$seqfile2 = ">" . $seqfile;
open MYFILE, $seqfile2;
if ($bestscore == 0) {

#print "Not valid!!\n";
print MYFILE "$database\t","$query\t","0\t","0\t", "0\t",".\t", ".\t","\n";
print MYFILE "$query\t","$database\t","0\t","0\t", "0\t",".\t", ".\t","\n";
close (MYFILE);

}

else {

print MYFILE "$database\t","$query\t","$bestscore\t","$bestperc\t", "$bestfrac\t","$queryln\t", "$sbjctln\t","\n";
print MYFILE "$query\t","$database\t","$bestscore\t","$bestperc\t", "$bestfrac\t","$queryln\t", "$sbjctln\t","\n"; # relationship is communative
close (MYFILE);

}
}

}
}

print "Comparison Complete!!!!!!

\n";

travs69 · Jun 30, 2008

Maybe I'm being stupid.. but it looks like your doing a lot of writing out to files/reading from the same files during your comparisons. That is going to drive your IO up (you can tell on a unix machine.. I don't think windows has a way to show you that). Do all your work in memory and then write it out. Lots of hashes

I don't really understand your comparisons so I won't bother trying to optimize your code.

It might be the bl2seq slowing it down to. If your code can support it you might try forking it out also.

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
[noevil]

Travis - Those who say it cannot be done are usually interrupted by someone else doing it; Give the wrong symptoms, get the wrong solutions;

stevexff · Jun 30, 2008

I/O is several orders of magnitude slower than memory access, as travis has noted. If you want to go fast, read everything into memory at the start, then just iterate over the array members. If you are doing this right you should be able to get up in the high nineties on one processor.

The machine has 4 processors so i have no ideas why it uses so little power.

a single process will only use one processor (the code can't be despatched on multiple processors simultaneously). You will have to write threaded code (and have an operating system that will allow the separate threads to be despatched on separate processors). Without belittling your achievements so far, if you are struggling to get above 5-6% CPU usage, I would suggest you leave the multithreading alone until you've nailed the I/O problem. This alone should net you about a 20-fold speed increase. If that still isn't enough, come back then and we'll talk.

Steve

[small]"Every program can be reduced by one instruction, and every program has at least one bug. Therefore, any program can be reduced to one instruction which doesn't work." (Object:erlDesignPatterns)[/small]

smokinace · Jul 1, 2008

i determined that the problem was the blast program. somewhere in the comparison there is a major holdup. without doing that comparison the program uses the entire processor. somehow i have to find a way to speed up the blast program. Also, we experimented with loading the prgoram and all the files on a ramdisk to speed up processing. We have over 4gb of ram, so there is plenty of space. Is this a good idea? Please excuse my poor programming skills. I just started programming in perl about a week ago and am still a big novice with hashes and such

stevexff · Jul 1, 2008

Right, regardless of what the blast program does, we can eliminate some I/O right off the bat by not writing the output to a file just so we can read it back in.

Perl:

my $blast = "/home/wjones/Shared/usbdrive/sequence/blast-2.2.18/bin/bl2seq -i \"$query\" -j "\$database\" -p blastn -r 1 -q -2 -G 20 -E 2";
my @blastout = qx{$blast};

foreach (@blastout) {
   # sets $_ to each entry in @blastout array
..
..
}

But the database query will be slowing it down also. Is the database on the same box? Do you have indexes on the data to speed up the queries? Does the database server have enough buffers to serve that data up quickly enough? It would probably be a good idea if you posted some sample output from the blast program, so we can find a better way to parse it.

Steve

[small]"Every program can be reduced by one instruction, and every program has at least one bug. Therefore, any program can be reduced to one instruction which doesn't work." (Object:erlDesignPatterns)[/small]

smokinace · Jul 1, 2008

Query= NM_032090.1
(2553 letters)

>NM_032603.2
Length = 3121

Score = 23.8 bits (12), Expect = 0.56
Identities = 12/12 (100%)
Strand = Plus / Minus

Query: 2320 ttcccccagccc 2331
||||||||||||
Sbjct: 464 ttcccccagccc 453

Score = 23.8 bits (12), Expect = 0.56
Identities = 12/12 (100%)
Strand = Plus / Minus

Query: 2107 gcggtctcctgc 2118
||||||||||||
Sbjct: 1697 gcggtctcctgc 1686

Score = 23.8 bits (12), Expect = 0.56
Identities = 12/12 (100%)
Strand = Plus / Minus

Query: 1048 gagctgaccatc 1059
||||||||||||
Sbjct: 2666 gagctgaccatc 2655

Score = 21.8 bits (11), Expect = 2.1
Identities = 11/11 (100%)
Strand = Plus / Plus

Query: 180 ccgggagctgg 190
|||||||||||
Sbjct: 321 ccgggagctgg 331

Score = 21.8 bits (11), Expect = 2.1
Identities = 11/11 (100%)
Strand = Plus / Plus

Query: 523 cagctcagccc 533
|||||||||||
Sbjct: 1758 cagctcagccc 1768

Lambda K H
1.33 0.621 1.12

Gapped
Lambda K H
1.33 0.621 1.12

Matrix: blastn matrix:1 -2
Gap Penalties: Existence: 20, Extension: 2
Number of Sequences: 1
Number of Hits to DB: 115
Number of extensions: 5
Number of successful extensions: 5
Number of sequences better than 10.0: 1
Number of HSP's gapped: 5
Number of HSP's successfully gapped: 5
Length of query: 2553
Length of database: 3121
Length adjustment: 13
Effective length of query: 2540
Effective length of database: 3108
Effective search space: 7894320
Effective search space used: 7894320
X1: 10 (19.2 bits)
X2: 15 (28.8 bits)
X3: 52 (100.0 bits)
S1: 10 (19.9 bits)
S2: 10 (19.9 bits)

The object is to get the highest score value for each comparison as long as the score is "plus/plus". I am doing a split on the score line to get the score inside the parentheses. The output is then put into a file. Each comparison lookes like this:

./d001/NM_001010883.2(database) ./d015/XR_015853.1(query)12(highest score) 100(percent)12(fraction)330(query matchline)980(database match line)

prex1 · Jul 1, 2008

I've rearranged a portion of your code as follows, for maximum performance. However, as noted by you and the responders, this won't change much in performance.
And please, next time you post some code or data, put them between [ignore]

Code:

[/ignore] tags for us to read!

Code:

    while (<COMP>) {
      $newscore = 0;
      unless(index($_,' Score')){
        $pp=index($_,'(');
        $uu=index($_,')',$pp);
        $score=substr($_,++$pp,$uu-$pp);
        $line2 = readline *COMP; #Identities
        unless(index(readline *COMP,'Plus / Plus')<0) {
          $newscore = $score;
          readline *COMP;
          readline *COMP;
          $line6 = readline *COMP; #query line
          readline *COMP;
          $line8 = readline *COMP; #subject line
          $pp=index($line2,'= ')+2;
          $uu=index($line2,'/',$pp);
          $fraction=substr($line2,$pp,$uu-$pp);
          $pp=index($line2,'(');
          $uu=index($line2,'%',$pp);
          $percent=substr($line2,++$pp,$uu-$pp);
          $pp=index($line6,' ');
          $uu=index($line6,' ',$pp);
          $queryln=substr($line6,++$pp,$uu-$pp);
          $pp=index($line8,' ');
          $uu=index($line8,' ',$pp);
          $sbjctln=substr($line8,++$pp,$uu-$pp);
        }
        if ($newscore > $bestscore) {
          $bestscore = $newscore;
          $bestfrac = $fraction;
          $bestperc = $percent;
        }
      }
    }

Franco

http://www.xcalcs.com

: Online engineering calculations

http://www.megamag.it

: Magnetic brakes for fun rides

http://www.levitans.com

: Air bearing pads

smokinace · Jul 2, 2008

forgive me for my ignorance but do i just put

Code:

 at the beginning in the message and then

at the end to put it into the code format?

smokinace · Jul 2, 2008

nevermind i figured it out!

Tek-Tips is the largest IT community on the Internet today!

Members share and learn making Tek-Tips Forums the best source of peer-reviewed technical information on the Internet!

Perl performance issues

smokinace

Programmer

travs69

MIS

stevexff

Programmer

smokinace

Programmer

stevexff

Programmer

smokinace

Programmer

prex1

Programmer

smokinace

Programmer

smokinace

Programmer

Similar threads

Part and Inventory Search

Sponsor