Any ideas on speeding up this simple script?

Status
Not open for further replies.

fumang

Technical User
May 14, 2002
7
US
I wrote a small script that reads a large file and breaks it up into smaller files whenever a line matches a certain pattern. (The pattern marks the beginning of every file.)

The problem is that when I run it on a really huge file, such as one containing 100,000 separate files, the script really slows down. Any ideas on how to speed up the script? Would threads help?

Here is the script below, thanks for any help:


# Split multiple NY/NE order files into single order files.
# Declare variables.

$num='0000000';    # zero-padded counter; ++ on this string keeps the padding
$count = 0;
$wramp='wramp';    # output file name prefix
$file='drsoap_sodata.txt';

open (FH, $file) or die "Can't open $file: $!";

while (<FH>) {
    # A line starting with the header pattern begins a new output file.
    $count++ if (/^(\d{3}\s\w{3}-\d{4}\s{5}\d|\d{3}\s\w{3}-\d{4}\w{5}\d)/);
    if ($count == 1) {
        $num++;
        $count=0;
        open NEW, "> $wramp-$num.txt" or die "could not open $wramp-$num.txt : $!";
    }
    print NEW;
}
close FH;
close NEW;

THANKS!!
 
Programmatically, this looks to be just about as tight as it can get. The only way I see to make it faster might be to rewrite it in C, and even that doesn't guarantee improvement since I/O is likely the cause of the slowdown.

If it's practical, I'd consider a systemic solution, ensuring that the input file and the output files are on different physical devices. That way, the disk read and write operations can occur concurrently.

Is there anything you know about the component files? If you know each one is at least N lines long, you could skip the regexp test for the lines that cannot yet be the start of the next file.
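A minimal sketch of that idea, assuming every component file really is at least N lines long. The sub name split_orders and its parameters are made up for illustration; the pattern is the one from the original script. If the assumption is wrong, a header that arrives early is missed and two orders end up in the same file.

```perl
use strict;
use warnings;

# Hypothetical helper: split $in_fh into one file per order, skipping the
# regex test for lines that cannot be a header because the current order
# is known to be at least $min_lines lines long.
sub split_orders {
    my ($in_fh, $prefix, $min_lines) = @_;
    my $pat   = qr/^(?:\d{3}\s\w{3}-\d{4}\s{5}\d|\d{3}\s\w{3}-\d{4}\w{5}\d)/;
    my $num   = '0000000';   # string increment keeps the zero padding
    my $count = 0;           # number of output files written
    my $skip  = 0;           # lines left that need no regex test
    my $out;

    while (my $line = <$in_fh>) {
        if ($skip > 0) {
            $skip--;                     # still inside the guaranteed body
        }
        elsif ($line =~ $pat) {
            $num++;
            $count++;
            close $out if $out;
            open $out, '>', "$prefix-$num.txt"
                or die "could not open $prefix-$num.txt: $!";
            $skip = $min_lines - 1;      # the header itself is line 1
        }
        print {$out} $line if $out;      # lines before the first header are dropped
    }
    close $out if $out;
    return $count;
}

# Usage sketch (file name from the original post, N = 10 is an assumption):
# open my $fh, '<', 'drsoap_sodata.txt' or die "Can't open file: $!";
# split_orders($fh, 'wramp', 10);
```

Whether this helps depends on how much of the run time the regex actually accounts for; if the disk is the bottleneck, the saving will be small.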
 
One way to speed it up a little is to precompile the regex like this:

Code:
$pat = qr/^(\d{3}\s\w{3}-\d{4}\s{5}\d|\d{3}\s\w{3}-\d{4}\w{5}\d)/;

while (<FH> ) { 
$count++ if (/$pat/); 
if ($count == 1) { 
...

Compiling the pattern into its internal representation once, at the point where qr// is evaluated, avoids recompiling it every time a match against /$pat/ is attempted. Note, though, that this mainly pays off when the pattern contains interpolated variables; a constant literal pattern like the one in the original script is already compiled only once, so the gain here is likely to be small.

Another option may be to push the lines onto an array (push(@arr, $_)) and write them out once instead of printing line by line. I don't know for sure that it helps, but it's worth a shot anyway.
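A rough sketch of the buffering idea, adapted so that each order is collected in an array and written with a single print when the next header (or end of input) is reached. The names flush_order and split_buffered are made up for illustration. Perl already buffers output on filehandles, so any gain may be modest.

```perl
use strict;
use warnings;

# Hypothetical helper: write all buffered lines of one order in one print,
# then empty the buffer. Does nothing if the buffer is empty.
sub flush_order {
    my ($path, $lines) = @_;
    return unless @$lines;
    open my $out, '>', $path or die "could not open $path: $!";
    print {$out} @$lines;
    close $out or die "could not close $path: $!";
    @$lines = ();
}

# Hypothetical helper: same splitting rule as the original script, but the
# lines of each order are pushed onto @buf and written once per file.
sub split_buffered {
    my ($in_fh, $prefix) = @_;
    my $pat   = qr/^(?:\d{3}\s\w{3}-\d{4}\s{5}\d|\d{3}\s\w{3}-\d{4}\w{5}\d)/;
    my $num   = '0000000';   # string increment keeps the zero padding
    my $count = 0;           # number of headers seen
    my @buf;

    while (my $line = <$in_fh>) {
        if ($line =~ $pat) {
            flush_order("$prefix-$num.txt", \@buf);   # previous order, if any
            $num++;
            $count++;
        }
        push @buf, $line if $count;   # ignore anything before the first header
    }
    flush_order("$prefix-$num.txt", \@buf);           # last order
    return $count;
}
```

Memory use is bounded by the largest single order rather than by the whole input file.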

---
cheers!
san


"The universe has been expanding, and Perl's kind of been expanding along with the universe" - Larry Wall
 
You might be able to pull the entire file into a scalar, then use the 'study' function to study the string, and then perform repeated match operations on the string.

The 'study' function studies a "scalar in anticipation of doing many pattern matches on the string..." (Programming Perl, Wall, Schwartz, Christiansen, p. 225)

If the matching is the slowdown, then using 'study' MAY speed it up. I don't know, as I haven't tried it, but it may be worth a try.
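A minimal sketch of the slurp-and-match approach, assuming the whole input fits in memory. The sub name split_slurped is made up for illustration. The call to study is left out here because, as of Perl 5.16, study is documented as doing nothing; on older Perls it could be added right after reading the file.

```perl
use strict;
use warnings;

# Hypothetical helper: take the whole file as one string and return a list
# of chunks, each starting with a header line.
sub split_slurped {
    my ($text) = @_;
    # /m lets ^ match at the start of every line, not only at the start
    # of the string.
    my $pat = qr/^(?:\d{3}\s\w{3}-\d{4}\s{5}\d|\d{3}\s\w{3}-\d{4}\w{5}\d)/m;

    # Zero-width lookahead: split in front of each header and keep the header.
    my @chunks = split /(?=$pat)/, $text;

    # Drop any leading text that does not begin with a header.
    shift @chunks if @chunks && $chunks[0] !~ /\A$pat/;
    return @chunks;
}

# Usage sketch (file name from the original post):
# open my $fh, '<', 'drsoap_sodata.txt' or die "Can't open file: $!";
# my $text = do { local $/; <$fh> };   # undef $/ reads the whole file at once
# close $fh;
# my @orders = split_slurped($text);   # then write each element to its own file
```

This replaces the per-line match with a single pass over one large string, but it still writes the same number of output files, so it does not change the I/O cost.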

Hope this helps.

If you are new to Tek-Tips, please use descriptive titles, check the FAQs, and beware the evil typo.
 
Thanks everyone for the help. I have tried all your suggestions; however, I believe that I/O is likely the cause of the slowdown. I'm thinking about doing this in C.
Thanks again
 
fumang,

With respect, I would suggest that if I/O is your problem, writing it in C will not help.
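One way to check where the time goes before rewriting anything is to time a pass that only reads the file against a pass that reads and matches, with no output in either case. This is only a rough sketch; the sub name timed_pass is made up for illustration, and the second pass may look faster than it really is because the operating system has cached the file by then.

```perl
use strict;
use warnings;
use Time::HiRes qw(time);

# Hypothetical helper: read every line of $path, optionally test it against
# the header pattern from the original script, and write nothing.
# Returns (elapsed seconds, number of headers matched).
sub timed_pass {
    my ($path, $do_match) = @_;
    my $pat     = qr/^(?:\d{3}\s\w{3}-\d{4}\s{5}\d|\d{3}\s\w{3}-\d{4}\w{5}\d)/;
    my $headers = 0;
    my $start   = time();

    open my $fh, '<', $path or die "Can't open $path: $!";
    while (my $line = <$fh>) {
        $headers++ if $do_match && $line =~ $pat;
    }
    close $fh;

    return (time() - $start, $headers);
}

# Usage sketch (file name from the original post):
# my ($read_secs)  = timed_pass('drsoap_sodata.txt', 0);
# my ($match_secs) = timed_pass('drsoap_sodata.txt', 1);
# printf "read only: %.2fs, read + match: %.2fs\n", $read_secs, $match_secs;
```

If the two numbers are close and both are much smaller than the full script's run time, the time is going into creating and writing the output files, and a C rewrite is unlikely to change that.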

Mike

Want to get great answers to your Tek-Tips questions? Have a look at faq219-2884

It's like this; even samurai have teddy bears, and even teddy bears get drunk.
 