Any ideas on speeding up this simple script?

fumang · Sep 17, 2003

I wrote a small script that reads a large file and breaks it up into smaller files whenever a line matches a certain pattern (the pattern marks the beginning of every file).

The problem is that when I run it on a really huge file, such as one containing 100,000 separate files, the script really slows down. Any ideas on how to speed up the script? Would threads help?

Here is the script below, thanks for any help:

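# Split a combined NY/NE order file into one file per order.

# Declare variables.
$num   = '0000000';            # kept as a string so ++ preserves the zero padding
$count = 0;
$wramp = 'wramp';              # prefix for the output file names
$file  = 'drsoap_sodata.txt';

open(FH, '<', $file) or die "Can't open $file: $!";

while (<FH>) {
    # A line matching this pattern marks the start of a new order.
    $count++ if (/^(\d{3}\s\w{3}-\d{4}\s{5}\d|\d{3}\s\w{3}-\d{4}\w{5}\d)/);
    if ($count == 1) {
        $num++;
        $count = 0;
        # Reopening NEW implicitly closes the previous output file.
        open(NEW, '>', "$wramp-$num.txt") or die "could not open $wramp-$num.txt: $!";
    }
    print NEW;
}
close FH;
close NEW;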

THANKS!!

sampsonr · Sep 19, 2003

Programmatically, this looks to be just about as tight as it can get. The only way I see to make it faster might be to rewrite it in C, and even that doesn't guarantee improvement since I/O is likely the cause of the slowdown.

If it's practical, I'd consider a systemic solution, ensuring that the input file and the output files are on different physical devices. That way, the disk read and write operations can occur concurrently.

Is there anything you know about the component files? If you know they are each at least N lines long, you could skip the regexp test for the first N lines after every match.
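A rough sketch of that idea, assuming every order really is at least N lines long. The names split_orders and $min_lines are made up for illustration, and the lines are kept in memory here only to keep the example short:

```perl
use strict;
use warnings;

# Same pattern as in the original script.
my $start = qr/^(\d{3}\s\w{3}-\d{4}\s{5}\d|\d{3}\s\w{3}-\d{4}\w{5}\d)/;

# Group lines into orders, testing the regex only where a header is possible.
# $min_lines is an assumed guaranteed minimum number of lines per order.
sub split_orders {
    my ($lines, $min_lines) = @_;
    my @orders;
    my $skip = 0;                       # lines left that cannot be a header
    for my $line (@$lines) {
        if ($skip > 0) {
            $skip--;                    # still inside the guaranteed body
        }
        elsif ($line =~ $start) {
            push @orders, [];           # header line: begin a new order
            $skip = $min_lines - 1;     # the next N-1 lines need no test
        }
        # Lines before the first header are dropped.
        push @{ $orders[-1] }, $line if @orders;
    }
    return \@orders;
}
```

This only saves time if the regexp is a noticeable part of the run time. If the assumption about N is wrong, headers inside the skipped window are silently merged into the previous order.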

133tcamel · Sep 19, 2003

one way to speed it up a little is to precompile the regex like this:

Code:

$pat = qr/^(\d{3}\s\w{3}-\d{4}\s{5}\d|\d{3}\s\w{3}-\d{4}\w{5}\d)/;

while (<FH> ) { 
$count++ if (/$pat/); 
if ($count == 1) { 
...

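Precompiling the pattern into an internal representation at the moment of qr() avoids the need to recompile it every time a match against /$pat/ is attempted. (Perl has many other internal optimizations, but none would be triggered in the above example if we did not use the qr() operator.) Note that this mainly pays off when the pattern is built from a variable; a literal pattern with no interpolated variables, like the one in the original script, is already compiled only once, so the gain here may be small.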

Another option might be to push the lines into an array (push (@arr, $_)) and write them out in one go instead of printing line by line. I don't know for sure, but it's worth a shot anyway.
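Here is a minimal sketch of the buffering idea, adapted so that each order is buffered and written with a single print rather than holding the whole input in memory. The sub names and the $dir parameter are assumptions for illustration, and the file number is built with sprintf instead of the string increment used in the original:

```perl
use strict;
use warnings;

# Same pattern as in the original script.
my $start = qr/^(\d{3}\s\w{3}-\d{4}\s{5}\d|\d{3}\s\w{3}-\d{4}\w{5}\d)/;

# Write the buffered lines of one order with a single print call.
sub flush_order {
    my ($dir, $n, $buf) = @_;
    return unless @$buf;                # nothing collected yet
    my $name = sprintf("%s/wramp-%07d.txt", $dir, $n);
    open(my $out, '>', $name) or die "could not open $name: $!";
    print {$out} @$buf;
    close($out) or die "could not close $name: $!";
}

# Read from the filehandle $in and write one file per order into $dir.
# Returns the number of orders written.
sub split_buffered {
    my ($in, $dir) = @_;
    my $n = 0;                          # orders seen so far
    my @buf;
    while (my $line = <$in>) {
        if ($line =~ $start) {
            flush_order($dir, $n, \@buf);   # finish the previous order
            @buf = ();
            $n++;
        }
        push @buf, $line if $n > 0;     # ignore anything before the first header
    }
    flush_order($dir, $n, \@buf);       # write the last order
    return $n;
}
```

Perl already buffers output internally, so whether this helps at all depends on the system; it would need to be timed against the original.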

---
cheers!
san


goBoating · Sep 19, 2003

You might be able to pull the entire file into a scalar, then use the 'study' function to study the string, and then perform repeated match operations on the string.

The 'study' function studies a "scalar in anticipation of doing many pattern matches on the string...". (Programming Perl, Wall, Schwartz, Christiansen, pg. 225)

If the matching is the slowdown, then using 'study' MAY speed it up. I don't know, I haven't tried it, but it may be worth a try.
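Something along these lines, as an untested-for-speed sketch. It assumes the whole file fits in memory; split_slurped and $header are names made up for this example, and the pattern is the original one with the capturing group made non-capturing so that split does not return the captures as extra fields:

```perl
use strict;
use warnings;

# Original pattern, minus the leading ^ and with a non-capturing group.
my $header = qr/(?:\d{3}\s\w{3}-\d{4}\s{5}\d|\d{3}\s\w{3}-\d{4}\w{5}\d)/;

# Take the whole input as one string and return one string per order.
sub split_slurped {
    my ($text) = @_;
    study $text;    # on recent Perl versions this may do nothing at all
    # /m lets ^ match at the start of every line; the lookahead keeps the
    # header line as part of the chunk that follows it.
    my @chunks = split /^(?=$header)/m, $text;
    # Drop any leading text that does not begin with a header.
    return grep { /\A$header/ } @chunks;
}
```

To use it on a real file, the contents could be read in one go (for example with local $/ set to undef) and each returned chunk written to its own output file.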

'hope this helps


fumang · Sep 22, 2003

Thanks everyone for the help. I have tried all your suggestions; however, I believe that I/O is likely the cause of the slowdown. I'm thinking about doing this in C.
Thanks again

MikeLacey · Sep 22, 2003

fumang,

With respect - I would suggest that if I/O is your problem, writing it in C will not help.
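One way to check before rewriting anything is to time the matching on its own, with the lines already in memory so that no disk I/O is involved. This is only a sketch; time_matching is a made-up name, and Time::HiRes is used for sub-second timing:

```perl
use strict;
use warnings;
use Time::HiRes qw(time);

# Same pattern as in the original script.
my $start = qr/^(\d{3}\s\w{3}-\d{4}\s{5}\d|\d{3}\s\w{3}-\d{4}\w{5}\d)/;

# Run the pattern over lines held in memory.
# Returns (elapsed seconds, number of matching lines).
sub time_matching {
    my ($lines) = @_;
    my $t0   = time();
    my $hits = 0;
    for my $line (@$lines) {
        $hits++ if $line =~ $start;
    }
    return (time() - $t0, $hits);
}
```

If this pass takes only a small fraction of the script's total run time on the same data, the rest is being spent reading and creating files, and a C version would spend it in the same place.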

Mike

