Hey guys,
I'm writing a program in Perl that opens a large text file made up of lines of names and URLs. For each line, it needs to fetch the URL and pull one more line of information out of the document it points to.
Currently, the program uses LWP::Simple and calls getstore() to save each URL's contents to a file. The program then reads the file, finds the necessary information, and deletes the file.
The problem is that the program dies before it even reaches the "B"s. Is it inherently going to be slow or die because of the sheer number of URLs it has to look through, or is there a way to speed it up?
It has to go through 6791 URLs, and each of those URLs is a txt file about 13060 lines long. Clearly, it is a work-intensive problem, but I believe it can be done more efficiently. Any ideas will be appreciated, thank you!
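A minimal sketch of one possible alternative to the temp-file round trip described above, assuming LWP::Simple's get() is acceptable here: it returns the document as a string (or undef on failure), so the content can be scanned in memory and a failed fetch can be skipped instead of killing the run. The helper names find_fiscal_line and fiscal_line_for_url and the sample text are made up for illustration, and the helper returns the whole matching line rather than just the digits.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use LWP::Simple qw(get);

# Hypothetical helper: return the first line of $content that mentions
# "fiscal year end" (with simple tags stripped), or undef if none matches.
sub find_fiscal_line {
    my ($content) = @_;
    return undef unless defined $content;
    for my $line (split /\n/, $content) {
        $line =~ s/<[^>]*>//g;    # non-greedy-style tag removal, one tag at a time
        return $line if $line =~ /fiscal year end/i;
    }
    return undef;
}

# Hypothetical helper: fetch a URL into memory and scan it.
# get() returns undef on failure, so a bad URL is reported and skipped.
sub fiscal_line_for_url {
    my ($url) = @_;
    my $content = get($url);
    unless (defined $content) {
        warn "could not fetch $url\n";
        return undef;
    }
    return find_fiscal_line($content);
}

# Offline check with made-up sample text (no network access needed):
my $sample = "<P>CONFORMED PERIOD</P>\nFISCAL YEAR END: <B>1231</B>\n";
print find_fiscal_line($sample), "\n";    # prints "FISCAL YEAR END: 1231"
```

In the main loop, fiscal_line_for_url($url) would stand in for the getstore/open/read/unlink sequence. Whether this alone fixes the crash is not certain, but it does remove one file write, one file read, and one delete per URL.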
----------code----------------
#!/usr/bin/perl
# Tell Perl to send an HTML header
# so the browser gets the output
# rather than <stdout> (the command line
# on the server).
print "Content-type: text/html\n\n";
use LWP::Simple;
$base_url = "$data_file = '2005_q1.txt';
open DATA, "$data_file" or die "can't open $data_file $!";
@array_of_data = <DATA>;
#while (<DATA>) {
#}
close (DATA);
# start foreach loop, and assign each line,
# One at a time to the variable $line.
$count = 0;
MAIN: foreach $line (@array_of_data) {
if ($line =~ m/10-K/i) {
@sub_data = split(/ +/, $line);
foreach $i (@sub_data) {
if ($i =~ m/edgar/i) {
$url = $i;
$temp = $i;
$temp =~ s/edgar\/data\///;
$temp =~ s/\/.*//;
$cik = $temp;
}
if ($i =~ /^\d{4}-\d{2}-\d{2}/) {
$date = $i;
}
}
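# \Q...\E keeps characters such as "." in the values from acting as regex metacharacters.
$line =~ s/\Q$url\E//;
$line =~ s/\Q$date\E//;
$line =~ s/\Q$cik\E//;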
$line =~ s/10-K//;
$url = $base_url . $url;
getstore($url, "temp.txt");
open DATA, "temp.txt" or die "can't open temp.txt $!";
@data = <DATA>;
close (DATA);
$search_term = "fiscal year end";
$done = 0;
INNER: foreach $j (@data) {
$j =~ s/<.*>//g;
if ($done == 0) {
if($j =~ m/$search_term/i) {
$j =~ s/[^0-9]+//i;
$fiscal_date = $j;
$done = 1;
last INNER;
}
}
}
unlink("temp.txt");
$companyname = $line;
# Append mode (>>) creates data.txt if it does not exist yet.
open(FILE, ">>data.txt") or die "can't open data.txt $!";
print FILE "$companyname|$cik|$date|$url|$fiscal_date\n";
close(FILE);
$count = $count + 1;
}
}
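For comparison, here is a rough sketch of how the outer loop above might be restructured so the index file is read one line at a time and data.txt is opened only once. It assumes, as the code above does, that fields are separated by runs of spaces, that the path contains "edgar/data/<cik>/", and that the date looks like YYYY-MM-DD. The names parse_index_line and process_index and the sample line are hypothetical, and the fetch step is left as a comment.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: pull the path, CIK, and date out of one index line.
# Returns a hash reference, or nothing if the line is not a usable 10-K entry.
sub parse_index_line {
    my ($line) = @_;
    chomp $line;
    return unless $line =~ /10-K/i;
    my ($path, $cik, $date);
    for my $field (split / +/, $line) {
        if ($field =~ m{edgar/data/(\d+)/}i) {
            ($path, $cik) = ($field, $1);
        }
        elsif ($field =~ /^\d{4}-\d{2}-\d{2}$/) {
            $date = $field;
        }
    }
    return unless defined $path && defined $date;
    return { path => $path, cik => $cik, date => $date };
}

# Read the index line by line and keep a single output handle open.
sub process_index {
    my ($in_file, $out_file) = @_;
    open my $in,  '<',  $in_file  or die "can't open $in_file: $!";
    open my $out, '>>', $out_file or die "can't open $out_file: $!";
    my $count = 0;
    while (my $line = <$in>) {
        my $rec = parse_index_line($line) or next;
        # The fetch-and-scan step for $rec->{path} would go here.
        print {$out} join('|', @{$rec}{qw(cik date path)}), "\n";
        $count++;
    }
    close $in;
    close $out or die "can't close $out_file: $!";
    return $count;
}

# Made-up sample line, for illustration only:
my $rec = parse_index_line(
    "10-K  ACME CORP  1234567  2005-03-15  edgar/data/1234567/0000000000-05-000001.txt\n");
print "$rec->{cik} $rec->{date}\n";    # prints "1234567 2005-03-15"
```

Reading with while instead of slurping into an array keeps memory use flat, and a single append handle avoids reopening data.txt for every record. Neither change is guaranteed to be the cause of the crash, since the network fetches are likely to dominate the run time.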