
Need time-out procedure for HTML Perl script


bulgin (IS-IT--Management)
Mar 17, 2005
I have the following Perl script, which works nicely: it imports a list of URLs, goes out to them, grabs the head data, and writes it to a file. The problem is, if it encounters a domain that is unresponsive or takes too long to load, it halts and just sits there. I'm wondering if there is a way to make the script go on to the next line if it's having difficulty connecting to the domain.

I'm also wondering: is it possible to create "threads" like the big-shot developers do? And at the extreme, this script may be looking at a list of thousands of URLs, so does anyone see any problem with it handling that much overhead without crashing out?

Here is the script, and thank you for any help you can suggest.


#!/usr/bin/perl
#print "Content-type: text/html\n\n";
use LWP::Simple;
use HTML::HeadParser;

open (OUTFILE, '>outfile.txt');            # results go here
open (MYFILE, 'url3.txt');                 # one URL per line
foreach $line (<MYFILE>) {
    chomp($line);
    $URL = get($line);                     # fetch the page; $URL actually holds the HTML
    $Head = HTML::HeadParser->new;
    $Head->parse("$URL");                  # parse everything up to </head>
    print OUTFILE $Head->header('X-Meta-Description') . ".";
}
close(MYFILE);
close(OUTFILE);
exit;
 
Hey,

Quote from LWP::Simple documentation:
You will not be able to examine the response code or response headers (like 'Content-Type') when you are accessing the web using this function. If you need that information you should use the full OO interface (see LWP::UserAgent).

I think you would need to examine the response code, so you should try using LWP::UserAgent instead.

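Something along these lines (an untested sketch of the standard LWP::UserAgent interface; $url here stands in for one line of your file):

Code:
use LWP::UserAgent;

my $ua = LWP::UserAgent->new;
$ua->timeout(10);                    # give up on slow hosts after 10 seconds

my $response = $ua->get($url);
if ($response->is_success) {
    print $response->content;        # the page HTML
} else {
    print "failed: ", $response->status_line, "\n";
}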

Regarding creating threads, it would depend on what system you are running the script on. I run similar scripts on a Linux shared web server, but it is unable to handle working through thousands of URLs.

Hope this helps.

Chris
 
See this thread for a simple example of using threads:

thread219-1600509
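
If your perl is built with ithreads, the general shape is a fixed pool of workers pulling URLs off a shared queue. A rough sketch (untested here, and not necessarily what the linked thread shows):

Code:
#!/usr/bin/perl
use strict;
use warnings;
use threads;
use Thread::Queue;
use LWP::UserAgent;

my $queue     = Thread::Queue->new;
my $n_workers = 5;

# each worker pulls URLs off the shared queue until it sees undef
my @workers = map {
    threads->create(sub {
        my $ua = LWP::UserAgent->new(timeout => 10);
        while (defined(my $url = $queue->dequeue)) {
            my $response = $ua->get($url);
            print $url, ' => ', $response->status_line, "\n";
        }
    });
} 1 .. $n_workers;

open my $in, '<', 'url3.txt' or die "cannot open url3.txt - $!";
while (my $url = <$in>) {
    chomp $url;
    $queue->enqueue($url);
}
close $in;

$queue->enqueue(undef) for @workers;   # one undef per worker signals "no more work"
$_->join for @workers;                 # wait for all workers to finish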

You can use the alarm() function to manage timeouts; see perldoc -f alarm for some example code.
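
For the LWP::Simple version, that would look something like this (untested; note that alarm's ability to interrupt a blocking network read varies by platform):

Code:
# inside the foreach loop, replacing the plain get($line)
my $URL = eval {
    local $SIG{ALRM} = sub { die "timeout\n" };
    alarm 10;                  # fire after 10 seconds
    my $html = get($line);     # LWP::Simple::get
    alarm 0;                   # fetched in time, cancel the alarm
    $html;
};
next if !defined $URL;         # timed out or failed - move on to the next URL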

Annihilannic.
 
Thanks for the information on LWP::UserAgent. I see that the is_success method will help me determine whether I have a page. The problem I'm grappling with now is how to enumerate through the list so that, when is_success fails, the foreach goes to the next line in the file. I'm stumped on that.

Thanks.
 
Thank you. I see the logic and understand the flow, but am having difficulty seeing where that code fits into my original script. I'm presuming some changes are needed to accommodate the new variable. Is that right?

I'm learning.

Thank you.
 
This is untested, and I've also made some improvements to your original code:

Code:
#!/usr/bin/perl
use strict;
use LWP::UserAgent;
use HTML::HeadParser;
#print "Content-type: text/html\n\n";

my $ua = LWP::UserAgent->new;
$ua->timeout(10);

open my $out, '>', "outfile.txt" or die "cannot open out - $!";
open my $in, '>', "url3.txt" or die "cannot open in - $!";
while (my $url = <$in>) {
	chomp $url;
	my $response = $ua->get($url);
	my $head = HTML::HeadParser->new;
	$head->parse($response);
	print $out $head->header('X-Meta-Description') . ".";
}
close $in;
close $out;

exit;

Chris
 
Ah, sorry, I forgot to include the most important part ;/:

Code:
#!/usr/bin/perl
use strict;
use LWP::UserAgent;
use HTML::HeadParser;
#print "Content-type: text/html\n\n";

my $ua = LWP::UserAgent->new;
$ua->timeout(10);

open my $out, '>', "outfile.txt" or die "cannot open out - $!";
open my $in, '>', "url3.txt" or die "cannot open in - $!";
while (my $url = <$in>) {
	chomp $url;
	my $response = $ua->get($url);
	next unless ($response->is_success);
	my $head = HTML::HeadParser->new;
	$head->parse($response);
	print $out $head->header('X-Meta-Description') . ".";
}
close $in;
close $out;

exit;

Chris
 
Really wish there were an edit button! Last fix:

">" should be "<"

Code:
open my $in, '<', "url3.txt" or die "cannot open in - $!";

Chris
 
That now wipes out the content of url3.txt (@_@)
 
Your last correction stopped url3.txt from being wiped out, but outfile.txt only contains a string of period marks, like this:

....................


about that many.

Hmmmm....
 
What you're printing to the file is:

$head->header('X-Meta-Description') . ".";

It looks as if this method is returning an empty string with a . on the end. What were you expecting this method to return?

This is LWP::UserAgent's alternative:
print $out $ua->default_header('X-Meta-Description') . "\n";

Or, just to test which URLs were successful:
print $out "$url\n";

Chris

 
The original script, given a list of URLs from url3.txt, one per line, printed out the meta-description of the website for each line.

That worked. Using your script and the same input file url3.txt, I'm getting a string of dots.....
 
The line:

next unless ($response->is_success);


would that make it NOT go to the next line upon success?

It needs to jump to the next line if NOT success.
 
I'm unable to test HTML::HeadParser, which is where your problem resides, so I will probably be of little help there. The following should show which parts of the header were found:

Code:
$head->parse("$response");
# list every header name the parser managed to extract
foreach (keys %{$head->header()}) {
    print "$_\n";
}

Chris
 
next unless ($response->is_success);

= go to the next line UNLESS it is successful.

In other words:

= go to the next line IF it is not successful.

You can see which URLs were successful by printing $url (print "$url\n") directly after this statement.
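
In context, that looks like this (the same loop as the script above):

Code:
my $response = $ua->get($url);
next unless ($response->is_success);
print "$url\n";    # only successfully fetched URLs reach this point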

Chris
 
Thanks, Chris, for your help and patience. Yes, print "$url\n" does indeed print out the successfully connected URLs.

My purpose in the script, though it may not have been obvious (I'm a newbie, as I'm sure you know by now, and just learning), was to extract the meta-keywords from the successfully connected websites and save those keywords into outfile.txt.

The original script I provided at the top of the posting did that, but it would hang on a failed connect. Thus I turned to LWP::UserAgent, which includes a nifty time-out feature. But it seems I am unable to retrieve head data (description, keywords, title) using that method.
 
It's odd that you are receiving different output. There are no real differences between our scripts, and $response is filled with identical content, which is what HTML::HeadParser is parsing. Something like print $head->header('x-meta-Keywords'); should print the meta keywords with no problem.

Hope you figure it out,

Chris
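
A likely culprit, for anyone hitting the same wall: $ua->get returns an HTTP::Response object, while HTML::HeadParser->parse expects raw HTML text, so interpolating the object hands the parser a string like "HTTP::Response=HASH(0x...)" instead of the page. Feeding it the response body should restore the original script's behaviour (a sketch, assuming an LWP recent enough to provide decoded_content; older versions would use content):

Code:
my $head = HTML::HeadParser->new;
$head->parse($response->decoded_content);   # parse the HTML body, not the object
print $out $head->header('X-Meta-Description') . "\n";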
 