Need time-out procedure for HTML perl procedure 1

Status
Not open for further replies.

bulgin

IS-IT--Management
Mar 17, 2005
13
US
I have the following perl script which works nicely - it imports a list of urls, goes out to them, grabs the Head data and writes it to a file. Problem is, if it encounters a domain that is unresponsive or takes long to load, it halts and just sits there. I'm wondering if there is a way to make the script go to the next line if it's having difficulty connecting to the domain.

I'm also wondering if it is possible to create "threads" like the big-shot developers do? And in the extreme, this script may be looking at a list of thousands of URLs, so does anyone see any problem with this script handling that much overhead and not crashing out?

Here is the script and thank you for any help you may suggest.


#!/usr/bin/perl
#print "Content-type: text/html\n\n";
use LWP::Simple;
use HTML::HeadParser;
open (OUTFILE, '>outfile.txt');
open (MYFILE, 'url3.txt');
foreach $line (<MYFILE>) {
chomp($line);
$URL = get($line);
$Head = HTML::HeadParser->new;
$Head->parse("$URL");

print OUTFILE $Head->header('X-Meta-Description') . ".";
}
close(MYFILE);
close(OUTFILE);
exit;
 
Hey,

Quote from LWP::Simple documentation:
You will not be able to examine the response code or response headers (like 'Content-Type') when you are accessing the web using this function. If you need that information you should use the full OO interface (see LWP::UserAgent).

I think you would need to examine the response code, therefore you should try using LWP::UserAgent instead:


With regard to creating threads, it depends on what system you are running the script on. I run similar scripts on a Linux shared web server, but it cannot handle looking through thousands of URLs.

Hope this helps.

Chris
 
See this thread for a simple example of using threads:

thread219-1600509
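
The linked thread is not reproduced here, so the following is only a minimal sketch of the worker-thread idea, not the code from that thread. It assumes a perl built with thread support (it falls back to a plain loop otherwise) and a Thread::Queue recent enough to provide end(). process_url is a hypothetical stand-in for "fetch one URL and return its description", and the example.com URLs are placeholders.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Config;

# Hypothetical worker: stands in for fetching one URL and
# returning whatever should be written to the output file.
sub process_url {
    my ($url) = @_;
    return "processed $url";
}

my @urls = map { "http://example.com/page$_" } 1 .. 6;
my @results;

if ($Config{useithreads}) {
    require threads;
    require Thread::Queue;

    # Put every URL on a shared queue, then mark the queue as finished
    # so that dequeue() returns undef once it is empty.
    my $queue = Thread::Queue->new(@urls);
    $queue->end;

    # Three workers pull URLs until the queue runs dry.
    my @workers = map {
        threads->create(sub {
            my @done;
            while (defined(my $url = $queue->dequeue)) {
                push @done, process_url($url);
            }
            return @done;
        });
    } 1 .. 3;

    # join() hands back each worker's list of results.
    push @results, $_->join for @workers;
}
else {
    # No thread support in this perl: process the list sequentially.
    @results = map { process_url($_) } @urls;
}

print "$_\n" for sort @results;
```

A fixed, small number of workers keeps memory use predictable even when the input list runs to thousands of URLs.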

You can use the alarm() function to manage timeouts; see perldoc -f alarm for some example code.

Annihilannic.
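
The alarm() suggestion above can be sketched roughly as follows. This is an illustration in the style of the perldoc example, not a drop-in fix: with_timeout is a hypothetical helper name, and a select() pause stands in for a slow fetch such as get($line). Whether the alarm can interrupt every blocking call made inside a real fetch depends on the platform, so treat it as a starting point.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Run a code reference, giving up after $seconds.
# Returns the code's scalar result, or undef if it timed out.
sub with_timeout {
    my ($seconds, $code) = @_;
    my $result;
    my $ok = eval {
        # die() in the handler unwinds out of the blocked call
        local $SIG{ALRM} = sub { die "timeout\n" };
        alarm $seconds;
        $result = $code->();
        alarm 0;    # cancel the pending alarm on success
        1;
    };
    alarm 0;        # make sure no alarm is left pending if $code died
    return $result if $ok;
    return undef if $@ eq "timeout\n";
    die $@;         # re-throw anything that was not our timeout
}

# select() with a timeout is used as the pause because mixing
# sleep() and alarm() is discouraged on some systems.
my $fast = with_timeout(2, sub { return "fetched" });
my $slow = with_timeout(1, sub { select(undef, undef, undef, 5); return "fetched" });

print defined $fast ? "fast: $fast\n" : "fast: timed out\n";
print defined $slow ? "slow: $slow\n" : "slow: timed out\n";
```

In the original loop, the call to get($line) would go inside the code reference, and an undef result would mean "skip this URL and move on".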
 
Thanks for the information on LWP::UserAgent. I see that the is_success method will tell me whether I got a page. The problem I'm grappling with now is how to step through the list so that when is_success fails, the foreach goes to the next line in the file. I'm stumped on that.

Thanks.
 
Thank you. I see the logic and understand the flow, but am having difficulty seeing where that code fits into my original code. I'm presuming there must be some changes needed to my original code for the new variable. Is that right?

I'm learning.

Thank you.
 
This is untested, and I've also made improvements to your original code:

Code:
#!/usr/bin/perl
use strict;
use LWP::UserAgent;
use HTML::HeadParser;
#print "Content-type: text/html\n\n";

my $ua = LWP::UserAgent->new;
$ua->timeout(10);

open my $out, '>', "outfile.txt" or die "cannot open out - $!";
open my $in, '>', "url3.txt" or die "cannot open in - $!";
while (my $url = <$in>) {
	chomp $url;
	my $response = $ua->get($url);
	my $head = HTML::HeadParser->new;
	$head->parse($response);
	print $out $head->header('X-Meta-Description') . ".";
}
close $in;
close $out;

exit;

Chris
 
Ah, sorry, I forgot to include the most important part ;/:

Code:
#!/usr/bin/perl
use strict;
use LWP::UserAgent;
use HTML::HeadParser;
#print "Content-type: text/html\n\n";

my $ua = LWP::UserAgent->new;
$ua->timeout(10);

open my $out, '>', "outfile.txt" or die "cannot open out - $!";
open my $in, '>', "url3.txt" or die "cannot open in - $!";
while (my $url = <$in>) {
	chomp $url;
	my $response = $ua->get($url);
	next unless ($response->is_success);
	my $head = HTML::HeadParser->new;
	$head->parse($response);
	print $out $head->header('X-Meta-Description') . ".";
}
close $in;
close $out;

exit;

Chris
 
Really wish there were an edit button! Last fix:

">" should be "<"

Code:
open my $in, '<', "url3.txt" or die "cannot open in - $!";

Chris
 
That now wipes out the content of url3.txt (@_@)
 
Your last correction stopped url3.txt from being wiped out but the outfile.txt only contains a string of period marks like this:

....................


about that many.

Hmmmm....
 
What you're printing to the file is:

$head->header('X-Meta-Description') . ".";

It looks as if this method is returning an empty string with a . on the end. What were you expecting this method to return?

With LWP::UserAgent you can also read the header straight from the response:
print $out $response->header('X-Meta-Description') . "\n";

Or just to test which URLs were successful:
print $out "$url\n";

Chris

 
The original script, given a list of URLs from url3.txt, one per line, printed out the meta-description of the website for each line.

That worked. Using your script and the same input file url3.txt, I'm getting a string of dots.....
 
the line:

next unless ($response->is_success);


Would that make it NOT go to the next line upon success?

It needs to jump to the next line if NOT successful.
 
I'm unable to test HTML::HeadParser, which is where your problem resides, so I will probably be of little help there. The following should show which header fields were found:

Code:
# parse the page body, not the stringified response object
$head->parse($response->decoded_content);
foreach ($head->header->header_field_names) {
	print "$_\n";
}

Chris
 
next unless ($response->is_success);

= go to the next line UNLESS it is successful.

In other words:

= go to the next line IF it is not successful.

You can see which URLs were successful by printing $url (print "$url\n") directly after this statement.

Chris
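
To make the "next unless" reading concrete, here is a small sketch that needs no network access. The list of hashes is a hypothetical stand-in for responses, with an ok flag playing the role of is_success.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical stand-ins for responses: a URL plus a success flag.
my @responses = (
    { url => 'http://example.com/', ok => 1 },
    { url => 'http://example.org/', ok => 0 },
    { url => 'http://example.net/', ok => 1 },
);

my @kept;
for my $r (@responses) {
    next unless $r->{ok};     # same as: next if !$r->{ok};
    push @kept, $r->{url};    # only reached for successful entries
}

print "$_\n" for @kept;
```

Only the entries whose flag is true reach the push; the failed one is skipped, which is the behaviour wanted in the fetch loop.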
 
Thanks, Chris, for your help and patience. Yes, the $url (print "$url\n") does indeed print out the successfully connected URLs.

My purpose in the script, which may not have been obvious since I'm a newbie and just learning, was to extract the meta keywords from the successfully connected websites and save them to outfile.txt.

The original script I provided at the top of the thread did that but would hang on a failed connection. So I turned to LWP::UserAgent, which includes a nifty timeout feature. But it seems I am unable to retrieve head data (descriptions, keywords, title) using that method.
 
It's odd that you are receiving different output. There are no real differences between our scripts, and $response is filled with identical content, which is what HTML::HeadParser is parsing. Something like print $head->header('x-meta-Keywords'); should print the meta keywords with no problem.

Hope you figure it out,

Chris
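
One likely cause of the string of dots, offered as a sketch rather than a confirmed diagnosis: LWP::Simple's get() returns the page text, whereas $ua->get() returns an HTTP::Response object, so HTML::HeadParser may have been handed the object instead of the HTML. The example below builds a response by hand (the HTML and the helper name meta_from are made up for illustration) and compares parsing the object with parsing its decoded_content.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTTP::Response;
use HTML::HeadParser;

# Hand-built response so the sketch runs without network access.
my $html = <<'HTML';
<html><head>
<title>Example</title>
<meta name="description" content="An example page">
<meta name="keywords" content="perl, lwp">
</head><body>Hello</body></html>
HTML

my $response = HTTP::Response->new(
    200, 'OK', [ 'Content-Type' => 'text/html' ], $html
);

# Parse a chunk of text and return one X-Meta-* header ('' if absent).
sub meta_from {
    my ($text, $name) = @_;
    my $head = HTML::HeadParser->new;
    $head->parse($text);
    my $value = $head->header("X-Meta-$name");
    return defined $value ? $value : '';
}

# "$response" stringifies the object itself, not the page.
my $from_object = meta_from("$response", 'Description');
# decoded_content returns the HTML body as text.
my $from_body = meta_from($response->decoded_content, 'Description');

print "object: '$from_object'\n";
print "body:   '$from_body'\n";
```

If that is indeed the problem, changing the parse call in the script to $head->parse($response->decoded_content) should bring the descriptions back. LWP::UserAgent also has a parse_head option, on by default as far as I know, that copies the same X-Meta-* fields onto the response, so $response->header('X-Meta-Description') may work without a separate parser.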
 