
Optimizing PERL Program

Status
Not open for further replies.

benkmann

ISP
Jun 2, 2006
10
US
Hey guys,

I'm writing a program in PERL that opens up a large txt file that is made up of lines of names & urls. It needs to access each URL and pull another line from the URL.

Currently, the program uses LWP::Simple and calls getstore() to save each URL to a file. The program then reads the file, finds the necessary information, and deletes the file.

Problem is, the program dies even before it reaches the "B"s. Is it inherently going to be slow or die because of the sheer number of URLs it has to look through, or is there a way to speed it up?

It has to go through 6791 URLs, and each of those URLs is a txt file about 13060 lines long. Clearly, it is a work-intensive problem, but I believe it can be done more efficiently. Any ideas will be appreciated, thank you!

----------code----------------

#!/usr/bin/perl

# Tell perl to send a html header.
# So your browser gets the output
# rather then <stdout>(command line
# on the server.)
print "Content-type: text/html\n\n";

use LWP::Simple;

$base_url = "http://www.sec.gov/Archives/";
$data_file = '2005_q1.txt';

open DATA, "$data_file" or die "can't open $data_file $!";
@array_of_data = <DATA>;
#while (<DATA>) {
#}
close (DATA);

# start foreach loop, and assign each line,
# One at a time to the variable $line.
$count = 0;
MAIN: foreach $line (@array_of_data) {

if ($line =~ m/10-K/i) {

@sub_data = split(/ +/, $line);

foreach $i (@sub_data) {

if ($i =~ m/edgar/i) {
$url = $i;
$temp = $i;
$temp =~ s/edgar\/data\///;
$temp =~ s/\/.*//;
$cik = $temp;

}
if ($i =~ /^\d{4}-\d{2}-\d{2}/) {
$date = $i;
}
}

$line =~ s/$url//;
$line =~ s/$date//;
$line =~ s/$cik//;
$line =~ s/10-K//;
$url = $base_url . $url;


getstore($url, "temp.txt");
open DATA, "temp.txt" or die "can't open $data_file $!";
@data = <DATA>;

close (DATA);
$search_term = "fiscal year end";
$done = 0;
INNER: foreach $j (@data) {
$j =~ s/<.*>//g;
if ($done == 0) {
if($j =~ m/$search_term/i) {
$j =~ s/[^0-9]+//i;
$fiscal_date = $j;
$done = 1;
last INNER;
}
}


}
unlink("temp.txt");
$companyname = $line;
if (-e "data.txt") {
open(FILE, ">>data.txt");
print FILE "$companyname|$cik|$date|$url|$fiscal_date\n";
close(FILE);
}
else {
open(FILE, ">data.txt");
print FILE "$companyname|$cik|$date|$url|$fiscal_date\n";
close(FILE);
}
$count = $count + 1;
}

}
 
It never ceases to amaze me how people insist on reading large files into arrays just to process them record by record.
The fundamental issue here (accepting and agreeing with KevinADC's comment) is memory.
Why do this:
Code:
open DATA, "$data_file" or die "can't open $data_file $!";
@array_of_data = <DATA>;
close DATA;
foreach $line (@array_of_data) {
when you could do this:
Code:
open DATA, "$data_file" or die "can't open $data_file $!";
my $line;
while(<DATA>) {
    chomp;
    $line = $_;

Where the former can eat all your memory whereas the latter uses hardly any?
The same principle applies in other places in your code.
You must learn to think about not storing data that you don't need.
Have a rethink about your code and see if you can reduce its memory requirements. Also, I'd listen to KevinADC if I were you; he's a smart cookie and his suggestion is a good one. If this process needs to do this much work, you'll get HTTP timeouts if you try to run it as a CGI page.
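
For anyone who wants to try the line-by-line idea in isolation, here is a minimal, self-contained sketch. The sub name first_match and the sample text are made up for illustration; the point is only that one line is held in memory at a time and reading stops at the first hit.

```perl
use strict;
use warnings;

# Hypothetical helper: return the first line from $fh matching $pattern,
# reading one line at a time so memory use stays flat.
sub first_match {
    my ($fh, $pattern) = @_;
    while (my $line = <$fh>) {
        chomp $line;
        return $line if $line =~ $pattern;   # stop reading at the first hit
    }
    return;   # nothing matched
}

# Demo on an in-memory file handle; a real run would open a file instead.
my $sample = "header\nFISCAL YEAR END: 1231\ntrailer\n";
open my $fh, '<', \$sample or die "can't open in-memory handle: $!";
my $hit = first_match($fh, qr/fiscal year end/i);
close $fh;
print defined $hit ? "$hit\n" : "not found\n";
```

Because the sub returns as soon as it finds a match, the rest of the file is never read, which is the same effect the original INNER loop was aiming for.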



Trojan.
 
Is there a correlation between this and doing the same with SQL data records?

I use arrays of hashes to store records selected from a database.

so my @records=&getSQL("TABLE","COLUMNS","WHERE");

that calls the SQL routine, which makes the DB request, fetches the required records, stores them in an array of hashes and passes the array back.

Then I work with the array of hashes to do my processing.

Is this bad practice? I wrote this SQL module because I really didn't want to keep coding
Code:
# Open DB Connection

my $db = new Win32::ODBC("FILEDSN=$DSN;") || die "getSQL Error Connecting: " . Win32::ODBC::Error();

# Run SQL Command
if(!$db->Sql("$sel")) {

	# Loop SQL and do something
	while($db->FetchRow()){ 
	    my %dt = $db->DataHash();
	    # processing goes here
	}

	# Close DB Connection
	$db->Close();
} 
else{die "Error in getSQL ($sel)" . Win32::ODBC::Error();}
Doesn't this just allocate memory for all the records anyway, with the FetchRow method then moving through the data?

And doesn't this also keep a connection to the DB open while the processing is being done?

Whereas my solution opens the connection, grabs the records, stuffs them into an easy-to-use array of hashes, and shuts the DB connection.

It seems to run fine, but your comments leave me with a concern over using arrays to store DB data and thus eating memory.

I've never had a script time out on me because of memory usage, and my webhost would soon be on my case if my site started consuming too many resources.

What are your thoughts?


"In complete darkness we are all the same, only our knowledge and wisdom separates us, don't let your eyes deceive you.
 
Reading stuff into memory is OK if there's not too much of it, or you don't need to run too many concurrent processes. The crucial point here being the definition of 'not too much', which depends on how big your box is, and what you need to do with the data once you've read it.

Back to the OP. Apart from blowing out due to insufficient memory, I'd say the next big reason for slowness is the sheer volume of data you're pulling across the wire (approx 7K x however big each file is). Even with a fat pipe, that's going to take a while. Is there any way you can avoid retrieving the whole file for each URL (for example, if the information you need is available in the header)?

Steve

"Every program can be reduced by one instruction, and every program has at least one bug. Therefore, any program can be reduced to one instruction which doesn't work." (Object::perlDesignPatterns)
 
Yup I hear ya on the remote thing, and thankfully the amount of records fetched is less than a hundred if ever 1=1 is used for a select.

However, I was asking this with the idea that PERL and the DB are on the same server. OK, it's an MS Access backend, but it's on the same server, so the access from PERL to the DB is all local, not via URL requests.

Will that make a difference, so that it's only a problem if I run out of memory?

How much memory does a webhost allow a single website to grab when it executes a PERL script? Or do they only limit the execution time and let PERL use whatever it needs? Or is the PERL install somehow sandboxed for each website hosted?

"In complete darkness we are all the same, only our knowledge and wisdom separates us, don't let your eyes deceive you.
 
LOL. Trojan, I never insisted on anything, I told it like I wrote it.

That's why I come to these forums, someone may easily see a solution that I'm not seeing or was ever exposed to.

So, thanks for the while loop, I'll try that out and will update!

Now, do I need to process the whole file? No, the information I need *should* be near the top of the file, though in ~10% of cases it's further into the file. I tried to incorporate a break to stop reading the file once it found the information, but it didn't seem to help much.

I'm curious about that statement, "the information you need is available in the header", because that's pretty much my case. What ideas did you have, stevexff?

Thanks!
 
Would you mind posting your '2005_q1.txt' file? I would like to play with the program. At the least, paste a few lines from the file here so that I can set up my own .txt file and work with it.
 
Sure! (BTW: each URL needs to have http://www.sec.gov/Archives/ added to the beginning.)

------------2005_q1.txt---------------

10-K 1 800 CONTACTS INC 1050122 2005-03-17 edgar/data/1050122/0001047469-05-006925.txt
10-K 1ST CONSTITUTION BANCORP 1141807 2005-03-24 edgar/data/1141807/0001193125-05-059919.txt
10-K 1ST SOURCE CORP 34782 2005-03-16 edgar/data/34782/0000034782-05-000025.txt
10-K 21ST CENTURY HOLDING CO 1069996 2005-03-31 edgar/data/1069996/0001144204-05-009759.txt
10-K 21ST CENTURY INSURANCE GROUP 100331 2005-02-17 edgar/data/100331/0001015402-05-000854.txt
10-K 24/7 REAL MEDIA INC 1062195 2005-03-16 edgar/data/1062195/0001047469-05-006790.txt
10-K 3D SYSTEMS CORP 910638 2005-03-10 edgar/data/910638/0001047469-05-005988.txt
10-K 3M CO 66740 2005-02-24 edgar/data/66740/0001104659-05-008057.txt


--------------Processed file--------------

1 800 CONTACTS INC|1050122|2005-03-17|edgar/data/1050122/0001047469-05-006925.txt
1ST CONSTITUTION BANCORP|1141807|2005-03-24|edgar/data/1141807/0001193125-05-059919.txt
1ST SOURCE CORP|34782|2005-03-16|edgar/data/34782/0000034782-05-000025.txt
21ST CENTURY HOLDING CO|1069996|2005-03-31|edgar/data/1069996/0001144204-05-009759.txt
21ST CENTURY INSURANCE GROUP|100331|2005-02-17|edgar/data/100331/0001015402-05-000854.txt
24/7 REAL MEDIA INC|1062195|2005-03-16|edgar/data/1062195/0001047469-05-006790.txt
3D SYSTEMS CORP|910638|2005-03-10|edgar/data/910638/0001047469-05-005988.txt
3M CO|66740|2005-02-24|edgar/data/66740/0001104659-05-008057.txt

--------------------------------------------------------

Now it needs to open each of those URLs to look for the fiscal year end date, which *usually* appears in the first 20 lines or so.
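
As a rough illustration of how one of the index lines above could be pulled apart, here is a sketch that anchors on the CIK, date and path at the end of the line instead of counting spaces. The sub name parse_index_line is hypothetical, and it assumes every line follows the five-field layout of the sample (a form type without spaces, then company name, CIK, YYYY-MM-DD date and path).

```perl
use strict;
use warnings;

# Hypothetical helper: split one index line into its five fields.
# Assumes the layout shown in the sample: form type, company name, CIK,
# filing date (YYYY-MM-DD) and path, separated by one or more spaces.
sub parse_index_line {
    my ($line) = @_;
    my ($form, $company, $cik, $date, $path) =
        $line =~ /^(\S+)\s+(.+?)\s+(\d+)\s+(\d{4}-\d{2}-\d{2})\s+(\S+)\s*$/
        or return;   # line does not look like an index entry
    return { form => $form, company => $company, cik => $cik,
             date => $date, path => $path };
}

my $rec = parse_index_line(
    '10-K 1 800 CONTACTS INC 1050122 2005-03-17 edgar/data/1050122/0001047469-05-006925.txt');
print join('|', @{$rec}{qw(company cik date path)}), "\n";
```

Company names that contain digits and single spaces (like "1 800 CONTACTS INC") still come out whole, because the name is whatever sits between the form type and the trailing CIK, date and path.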
 
Update: I now run it locally, using Trojan's while (<DATA>) loops, and the program ran successfully in several hours.

One possibility for making it more efficient: can I read the file remotely, without having to copy the whole file to my disk? That is, access the file and read it line by line. Inherently, the getstore() function does this and prints it line by line into a new file.

Any ideas?
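
One possible way to do this, sketched under some assumptions: LWP::UserAgent (not LWP::Simple) lets you pass a ':content_cb' callback to get(), which receives the body in chunks, and dying inside that callback aborts the transfer. The helper names make_scanner and fetch_first_match are invented for this example, the sample text is made up, and nothing here has been run against the real server.

```perl
use strict;
use warnings;

# Hypothetical helper: build a callback that collects data chunk by chunk
# and remembers the first complete line matching $pattern.
sub make_scanner {
    my ($pattern) = @_;
    my ($buffer, $found) = ('', undef);
    my $feed = sub {
        my ($chunk) = @_;
        $buffer .= $chunk;
        # Only complete lines are checked; a partial last line stays buffered.
        while ($buffer =~ s/^(.*)\n//) {
            my $line = $1;
            if ($line =~ $pattern) {
                $found = $line;
                die "found\n";   # dying inside the callback aborts the download
            }
        }
    };
    return ($feed, sub { $found });
}

# Hypothetical wrapper: download $url and stop at the first matching line.
sub fetch_first_match {
    my ($url, $pattern) = @_;
    require LWP::UserAgent;      # loaded only when a download is requested
    my ($feed, $result) = make_scanner($pattern);
    my $ua = LWP::UserAgent->new(timeout => 30);
    $ua->get($url, ':content_cb' => $feed, ':read_size_hint' => 4096);
    return $result->();
}

# Offline demo: feed two chunks by hand instead of downloading.
my ($feed, $result) = make_scanner(qr/fiscal year end/i);
eval { $feed->($_) for "<SEC-HEADER>\nFISCAL YE", "AR END: 1231\nmore text\n" };
print $result->(), "\n";
```

If the line you want really is near the top, the transfer is cut off after the first chunk or two instead of pulling down all 13,000 lines.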
 
benkmann,

Glad the post helped, sorry if I was a little exasperated.
Suffice to say that you are not the first I've explained that to (I'm easily into 3 figures there!).

I guess we all have to learn somewhere so I'm glad you asked and got something useful back.

;-)



Trojan.
 
Trojan does it again!!!!!

"In complete darkness we are all the same, only our knowledge and wisdom separates us, don't let your eyes deceive you.
 
Thanks for posting your file. I worked it over and made some changes.

Code:
use warnings;
use strict;
use LWP::Simple;
# forget print buffering, let's see the results as they happen
$|++;
my $base_url = "http://www.sec.gov/Archives/";
my $data_file = 'c:/temp/2005_q1.txt';

open DATA, "$data_file" or die "can't open $data_file $!";
# count processed filings across the whole run, so declare it outside the loop
my $count = 0;
while (<DATA>) {
    # initialize variables - strict and warnings are good practice
    my ($company_name,$cik,$date,$url,$fiscal_date);
    chomp;
    my $line = $_;
    if ($line =~ m/10-K/i) {
        # why not get everything in one swoop?
        ($company_name,$cik,$date,$url) = (split(/\s{2,}/, $line))[1,2,3,4];
        $url = $base_url . $url;
        # feed the source into a variable instead of a file
        my $doc = get($url);
        $doc =~ s/.+?fiscal year end:\s+?\<b\>(\d{4}).*/$1/si;
        $fiscal_date = $doc;
        # let's check it
        print "$company_name|$cik|$date|$url|$fiscal_date\n";
        $count = $count + 1;
    }
}

The code really isn't that much different than before. The most significant change comes in your source of the fiscal date. Why grab the text file when you can just see an html summary? In your 2005_q1.txt file, do a search for .txt and replace it with -index.htm, like this:

10-K 1 800 CONTACTS INC 1050122 2005-03-17 edgar/data/1050122/0001047469-05-006925.txt

10-K 1 800 CONTACTS INC 1050122 2005-03-17 edgar/data/1050122/0001047469-05-006925-index.htm


I ran benchmark tests against parsing text and parsing html. Using the companies that you listed above, I got a result of 126 wallclock seconds using the text method. Parsing the html returns a result of 3 wallclock seconds. You could very possibly process your entire file in 30-40 minutes by making that change!
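
If you would rather not edit the data file by hand, the same swap can be done in code. A small sketch, with index_path being a made-up helper name; it assumes every path ends in .txt as in the sample lines.

```perl
use strict;
use warnings;

# Hypothetical helper: turn the path of a full-text filing into the path
# of its summary page by swapping the extension.
sub index_path {
    my ($path) = @_;
    # \. is a literal dot and $ anchors at the end, so only the extension changes
    (my $index = $path) =~ s/\.txt$/-index.htm/;
    return $index;
}

print index_path('edgar/data/1050122/0001047469-05-006925.txt'), "\n";
```

Escaping the dot matters: a bare . in a regex matches any character, so an unanchored /.txt/ could in principle hit somewhere other than the extension.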
 
Oh wow, that worked wonders! Here's my modified code:

use warnings;
use strict;
use LWP::Simple;
# forget print buffering, let's see the results as they happen
$|++;
my $base_url = "http://www.sec.gov/Archives/";
my $data_file = '2005_q1.txt';
my $no_date = 0;
my $total = 0;
open DATA, "$data_file" or die "can't open $data_file $!";
while (<DATA>) {
    # initialize variables - strict and warnings are good practice
    my ($company_name,$cik,$date,$url,$fiscal_date);
    chomp;
    my $line = $_;
    if ($line =~ m/10-K/i) {
        # why not get everything in one swoop?
        ($company_name,$cik,$date,$url) = (split(/\s{2,}/, $line))[1,2,3,4];
        $url = $base_url . $url;
        # escape the dot and anchor at the end so only the extension changes
        $url =~ s/\.txt$/-index.html/;
        # feed the source into a variable instead of a file
        my $doc = get($url);
        # get() returns undef on failure, so check before matching
        if (defined $doc && $doc =~ s/.+?fiscal year end:\s+?\<b\>(\d{4}).*/$1/si) {
            $fiscal_date = $doc;
        }
        else {
            $fiscal_date = "no fiscal date found";
            $no_date++;
        }
        $total++;
        # let's check it
        open FILE, ">>temp.txt" or die "can't open temp.txt $!";
        print FILE "$company_name|$cik|$date|$url|$fiscal_date\n";
        close(FILE);
        print "$company_name|$cik|$date|$url|$fiscal_date\n";
        print "No Date Found: $no_date\n";
        print "Total Entries: $total\n";
    }
}

So after the program ran, I had 516 entries out of 6791 that did not have a fiscal year end date in their summary page, or about a 7.6% rate.

I examined the root directory of a few companies, and the only things they have in common are the complete *.txt file and the summary *-index.html file. They also have several broken-down files, but the names are the company's choice: there is no standard for the other files.

So, would opening the complete txt file in these 516 cases, searching it for fiscal year, replacing months with corresponding numbers, and then pulling the date be in line with the efficiency goals, in your opinion?
 
If there is no other way to get the 516 dates than by parsing text, that is absolutely in line with efficiency goals. Write the errant lines out to another file which you can come back to and process later. In the meantime, run the rest and get them done. You would have to change the regex a bit to parse the text file. I think it should be as such:

if ($doc =~ s/.+?fiscal year end:\s+?(\d{4}).*/$1/si) {
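
For the leftover cases where the date is spelled out with a month name, something along these lines might work. This is only a sketch: the sub name fiscal_year_end is made up, and the second pattern assumes wording like "fiscal year ended December 31", which may not match what the filings actually say.

```perl
use strict;
use warnings;

# Month names mapped to their numbers.
my %MONTH = (
    january => 1, february => 2,  march => 3,     april => 4,
    may => 5,     june => 6,      july => 7,      august => 8,
    september => 9, october => 10, november => 11, december => 12,
);

# Hypothetical helper: return the fiscal year end as MMDD, or undef.
sub fiscal_year_end {
    my ($text) = @_;
    # Case 1: a four-digit value right after the label.
    return $1 if $text =~ /fiscal year end:\s*(\d{4})/i;
    # Case 2 (assumed wording): "fiscal year ended December 31"
    if ($text =~ /fiscal year ended\s+([a-z]+)\s+(\d{1,2})/i) {
        my $month = $MONTH{lc $1};
        return sprintf('%02d%02d', $month, $2) if $month;
    }
    return;   # neither form was found
}

print fiscal_year_end("FISCAL YEAR END: 1231"), "\n";
print fiscal_year_end("for the fiscal year ended June 30, 2004"), "\n";
```

Lines that match neither pattern come back as undef, so they can be written to the "come back later" file.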


Glad it worked out for you. Cheers.
 
Raklet,

Can you explain how that regex works? I'm especially confused about the /$1/, and the difference between .+ & .*.

Thanks!
 
Here it is broken down in pieces

$doc =~ s/.+?fiscal year end:\s+?(\d{4}).*/$1/si

s/ # start of regular expression with substitution

.+? # match one or more of any character in non-greedy mode.
# that means it matches as few characters as possible before
# whatever comes after the question mark. In this case
# it is "fiscal year end:"
# if you do not make the match non-greedy, it will suck
# up everything until it finds the last occurrence of
# 'fiscal year end:' (meaning it can skip over earlier
# occurrences)
# non-greedy mode is indicated by the question mark
# .+ is one or more of any character
# .* would mean match zero or more of any character

fiscal year end: # match the string "fiscal year end:"

\s+? # match one or more whitespace characters in non-greedy mode.
# that means match only the whitespace that occurs between
# "fiscal year end:" and the next occurring digits

(\d{4}) # match exactly four digits, save the digits
# into the built-in variable $1. saving is indicated
# by the use of (). Multiple sets of () will save
# multiple items - each in sequentially numbered built-in
# variables ($1 $2 $3 $4 etc)

.* # match zero or more of anything all the way to the end
# of the string. we want to get it all, so we can
# zap it all.

/ # end of match portion, start of substitution section

$1 # replace everything in the entire string with just
# the item found in () - in other words, the four
# digits

/ # end of substitution

si # s lets . match across line breaks, i ignores case. default
# behavior is that . stops at line breaks and
# matching is case sensitive.
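
To see the greedy versus non-greedy difference for yourself, here is a tiny experiment. The sample string is made up and deliberately contains the label twice.

```perl
use strict;
use warnings;

my $text = "x fiscal year end: 1231 y fiscal year end: 0630 z";

# Non-greedy .+? stops at the first "fiscal year end:".
my ($first) = $text =~ /.+?fiscal year end:\s+(\d{4})/si;

# Greedy .+ runs to the end and backtracks, so it finds the last one.
my ($last) = $text =~ /.+fiscal year end:\s+(\d{4})/si;

# The substitution form replaces the whole string with the captured digits.
(my $copy = $text) =~ s/.+?fiscal year end:\s+?(\d{4}).*/$1/si;

print "$first $last $copy\n";
```

Note that a plain match in list context hands back the captured digits directly, so the substitution is not strictly needed if all you want is the value in the parentheses.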

Regards,

Raklet
 