In need of a more efficient way to approach this task

Status
Not open for further replies.

sophisticatedpenguin

Technical User
Oct 12, 2005
31
GB
Hi,

I'm VERY new to Perl. All I've learnt so far is some regular expressions, opening and closing files, and simple 'while' loops and 'if' statements. I wrote the code below to try to accomplish the following:
-open an input file
-for each line:
  -find the text string before the tab character
  -set a variable to "false"
  -open a comparison file
  -for each line of the comparison file:
    -try to match the text string
    -if it matches, set the variable to "true" and end the loop
  -if the variable is still false at the end, write the text string to an output file
-then repeat for all the other lines of the input file

Here is the code (please try not to laugh too loudly):
Code:
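#!C:\Perl\bin
use strict;
use diagnostics;
# "use strict" requires the variables to be declared before use.
my ($phrase, $found);
open(IFILE, "source.txt") or die "Cannot open source.txt: $!";
open(OFILE, ">compare_output.txt") or die "Cannot open compare_output.txt: $!";
while (<IFILE>)
{
	/\t/;
	$phrase = $`;	# text before the first tab on this line
	$found = "false";
	# The comparison file is reopened and rescanned for every input line.
	open (CFILE, "compare_with.xml") or die "Cannot open compare_with.xml: $!";
	while (<CFILE>)
	{
		if (/$phrase/i)
		{
			$found = "true";
			last;
		}
	}
	if ($found eq "false")
	{
		print OFILE "$phrase \n";
	}
	close (CFILE);
}
close (IFILE);
close (OFILE);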
The thing is that it worked fine when each file only had about 10 lines in it. But the real files are much bigger, and hours later the program is still churning ... Is there a more efficient way to do the same task?

Many thanks in advance for your patience with someone who is well out of her depth :)

SP
 
So the first thing I noticed is the use of the variable $`:
perldoc perlvar said:
$` The string preceding whatever was matched by the last successful
pattern match (not counting any matches hidden within a BLOCK or
eval enclosed by the current BLOCK). (Mnemonic: "`" often
precedes a quoted string.) This variable is read-only.

The use of this variable anywhere in a program imposes a
considerable performance penalty on all regular expression
matches. See "BUGS".
I suspect that isn't as much of a problem as iterating through the "compare_with.xml" file for every line in "source.txt". What does compare_with.xml look like? Is there a specific place in the xml structure that would contain the match to the line from source.txt?
 
I notice that the file source.txt is opened once and the file compare_with.xml is opened many times.

Not knowing the size of each file, I would suggest scanning source.txt once and building an array of expressions from it. Then loop through compare_with.xml once, comparing each line with each expression in the array. This way, much of the work is done in memory. If you need the results per expression rather than per line of compare_with.xml, you can regroup them afterwards.
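The single-pass idea above could be sketched roughly as follows. This is only an illustration under some assumptions: the sub names read_phrases and unmatched_phrases are made up for the example, the phrases are matched as literal text rather than as regular expressions, and the demo uses small in-memory strings where the real script would open source.txt and compare_with.xml.

```perl
use strict;
use warnings;

# Collect the text before the first tab on each line of the source handle.
sub read_phrases {
    my ($fh) = @_;
    my @phrases;
    while (my $line = <$fh>) {
        chomp $line;
        my ($phrase) = split /\t/, $line;
        # Skip empty lines and lines that start with a tab.
        push @phrases, $phrase if defined $phrase && length $phrase;
    }
    return @phrases;
}

# Read the comparison handle once and return the phrases found on no line.
# Matching is a case-insensitive literal substring test.
sub unmatched_phrases {
    my ($phrases, $fh) = @_;
    my %pending = map { lc($_) => 1 } @$phrases;
    while (my $line = <$fh>) {
        my $lower = lc $line;
        for my $key (keys %pending) {
            delete $pending{$key} if index($lower, $key) >= 0;
        }
        last unless %pending;    # every phrase has matched; stop early
    }
    return grep { $pending{ lc($_) } } @$phrases;
}

# Demo on in-memory data (hypothetical sample content).
my $source_text  = "apple\t1\nBanana\t2\ncherry\t3\n";
my $compare_text = "<item>APPLE pie</item>\n<item>cherry tart</item>\n";
open(my $src, '<', \$source_text)  or die "Cannot open source data: $!";
open(my $cmp, '<', \$compare_text) or die "Cannot open comparison data: $!";
my @phrases = read_phrases($src);
print "$_\n" for unmatched_phrases(\@phrases, $cmp);    # prints "Banana"
```

The comparison file is read line by line, so only the phrases are held in memory. The worst case is still one comparison per phrase per line, but each file is opened and read only once, and a phrase is dropped from the pending set as soon as it has matched.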
 
A first step in improving efficiency is to use split in place of regexps, as below
Code:
($phrase)=split/\t/;
Using index in place of a regexp would also speed things up in the inner while loop. However, if you need the case-insensitive option, I'm not sure how much this would help, as you would need to uppercase both strings before calling index.
I also second the suggestion from PC888, but you should tell us the expected maximum size of both source files before going further.
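To make the split and index suggestion concrete, here is a minimal sketch. The sub name contains_phrase and the sample strings are invented for the example, and it lowercases both strings rather than uppercasing them, which gives the same comparison. One difference from /$phrase/i worth noting: index treats the phrase as literal text, so characters such as "." or "(" are not interpreted as regexp syntax.

```perl
use strict;
use warnings;

# True if $phrase occurs anywhere in $line, ignoring case.
# The phrase is treated as literal text, not as a pattern.
sub contains_phrase {
    my ($line, $phrase) = @_;
    return index(lc($line), lc($phrase)) >= 0;
}

my $record = "New York\t42";              # hypothetical line from source.txt
my ($phrase) = split /\t/, $record;       # text before the first tab: "New York"
print contains_phrase("<city>NEW YORK</city>", $phrase) ? "found\n" : "missing\n";    # prints "found"
```

Inside the inner loop it would be cheaper to lowercase the phrase once, before the loop starts, and only lowercase each comparison line as it is read.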

Franco
 
Thanks very much everyone. Sounds like the next things I need to try to learn are (1) how to do arrays and (2) how to use split. I'll go away and see if I can get my head round them!
SP
 