In need of a more efficient way to approach this task

sophisticatedpenguin · Apr 19, 2008

Hi,

I'm VERY new to Perl. All I've learnt so far is some regular expressions, opening and closing files, and simple 'while' loops and 'if' statements. I wrote the code below to try to do the following:
-open an input file
-for each line:
    -find the text string before the tab character
    -set a variable to "false"
    -open a comparison file
    -for each line of the comparison file:
        -try to match the text string
        -if it matches, set the variable to "true" and end the loop
    -if the variable is still false at the end, write the text string to an output file
-then repeat for all the other lines of the input file

Here is the code (please try not to laugh too loudly):

Code:

#!C:\Perl\bin
use strict;
use diagnostics;
open(IFILE,"source.txt");
open(OFILE, ">compare_output.txt");
while (<IFILE>)
{
	/\t/;
	my $phrase = $`;       # text before the tab; 'my' is required under 'use strict'
	my $found = "false";
	open (CFILE,"compare_with.xml");
	while (<CFILE>)
	{
		if (/$phrase/i)
		{
			$found = "true";
			last;
		}
	}
	if ($found eq "false")
		{
			print OFILE "$phrase \n";
		}
	close (CFILE);
}

The thing is that it worked fine when each file only had about 10 lines in it. But the real files are much bigger, and hours later the program is still churning ... Is there a more efficient way to do the same task?

Many thanks in advance for your patience with someone who is well out of her depth

SP

rharsh · Apr 19, 2008

So the first thing I noticed is the use of the variable $`:

perldoc perlvar said:
$` The string preceding whatever was matched by the last successful
pattern match (not counting any matches hidden within a BLOCK or
eval enclosed by the current BLOCK). (Mnemonic: "`" often
precedes a quoted string.) This variable is read-only.

The use of this variable anywhere in a program imposes a
considerable performance penalty on all regular expression
matches. See "BUGS".

I suspect that isn't as much of a problem as iterating through the "compare_with.xml" file for every line in "source.txt". What does compare_with.xml look like? Is there a specific place in the xml structure that would contain the match to the line from source.txt?
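If you want to avoid $` anyway, a capturing group gives the same text. This is only a minimal sketch: the helper name phrase_before_tab is made up for illustration, and it assumes each line holds at most one phrase before its first tab.

```perl
use strict;
use warnings;

# Hypothetical helper: return the text before the first tab,
# or undef when the line contains no tab at all.
sub phrase_before_tab {
    my ($line) = @_;
    return $line =~ /^([^\t]*)\t/ ? $1 : undef;
}

my $phrase = phrase_before_tab("some phrase\tother columns\n");
print "$phrase\n" if defined $phrase;
```

Unlike $`, this returns undef for a line with no tab instead of silently reusing the value left over from an earlier match.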

PC888 · Apr 19, 2008

I notice that the file source.txt is opened once and the file compare_with.xml is opened many times.

Not knowing the size of each file, I would suggest scanning source.txt once and building an array of expressions from it. Then loop through compare_with.xml once, comparing each line with each expression in the array. This way, much of the work is done in memory. If the results need to be listed per expression, you can reorganize the report afterwards.
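A minimal sketch of that idea, under some assumptions: the helper name unmatched_phrases is invented for illustration, the phrases are matched as literal text (hence \Q...\E) rather than as regular expressions, and both files are small enough to process this way. It still compares every unfound phrase against every line, but each file is opened and read only once.

```perl
use strict;
use warnings;

# Hypothetical helper: given array refs of source lines and comparison
# lines, return the phrases (text before the first tab) that never
# occur, ignoring case, in any comparison line.
sub unmatched_phrases {
    my ($source_lines, $compare_lines) = @_;

    # Pass 1: collect the phrases in memory.
    my @phrases;
    for my $line (@$source_lines) {
        chomp(my $copy = $line);
        my ($phrase) = split /\t/, $copy;
        push @phrases, $phrase if defined $phrase && length $phrase;
    }

    # Pass 2: walk the comparison lines once, ticking off phrases as found.
    my %found;
    for my $line (@$compare_lines) {
        for my $phrase (@phrases) {
            next if $found{$phrase};    # already matched earlier
            $found{$phrase} = 1 if $line =~ /\Q$phrase\E/i;
        }
    }
    return grep { !$found{$_} } @phrases;
}

# Usage: perl compare.pl source.txt compare_with.xml > compare_output.txt
if (@ARGV == 2) {
    open(my $src, '<', $ARGV[0]) or die "Cannot open $ARGV[0]: $!";
    my @source = <$src>;
    close($src);
    open(my $cmp, '<', $ARGV[1]) or die "Cannot open $ARGV[1]: $!";
    my @compare = <$cmp>;
    close($cmp);
    print "$_\n" for unmatched_phrases(\@source, \@compare);
}
```

If compare_with.xml is too large to hold in memory, the second pass could read it line by line with while (<$cmp>) instead of slurping it into an array; the phrase list is the part that benefits most from staying in memory.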

prex1 · Apr 19, 2008

A first step in improving efficiency is to use split in place of regexps, as below

Code:

($phrase)=split/\t/;

Using index in place of a regexp would also speed things up in the inner while loop. However, if you need the case-insensitive option I'm not sure how much this would help, as you would need to convert both strings to the same case before calling index.
I also second the suggestion from PC888, but you should specify the expected maximum size of both source files before going further.
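For illustration, a case-insensitive test with index might look like the sketch below. The helper name contains_ci and the sample strings are made up; whether this actually beats /$phrase/i on your data is something to measure rather than assume.

```perl
use strict;
use warnings;

# Hypothetical helper: true if $phrase occurs anywhere in $line,
# ignoring case. index returns -1 when the substring is absent.
sub contains_ci {
    my ($line, $phrase) = @_;
    return index(lc($line), lc($phrase)) >= 0;
}

# Made-up sample data, only to show the call.
my ($phrase) = split /\t/, "Some Phrase\tother columns\n";
if (contains_ci("<entry>some phrase here</entry>", $phrase)) {
    print "found\n";
}
else {
    print "not found\n";
}
```

Inside a real loop it would make sense to lowercase the phrase once, outside the inner loop, so that only the comparison line is converted on each pass. Note that index treats the phrase as literal text, whereas the original /$phrase/i treats it as a pattern.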

Franco

http://www.xcalcs.com : Online engineering calculations
http://www.megamag.it : Magnetic brakes for fun rides
http://www.levitans.com : Air bearing pads

sophisticatedpenguin · Apr 20, 2008

Thanks very much everyone. Sounds like the next things I need to try to learn are (1) how to do arrays and (2) how to use split. I'll go away and see if I can get my head round them!
SP
