Problem with LWP::UserAgent when retrieving a page...

Status
Not open for further replies.

JimToupet

Programmer
Oct 20, 2005
CA
Hi,

I use LWP::UserAgent to "grab" a web page, save it to a temp file, and parse some data out of it with regular expressions.

But every page that I grab has a different format from what I see when I view the page's source. Every line has at least one "square" in it (when I open the file in Notepad).

So my regular expressions don't work anymore. I really don't know how to solve this issue.

Here's my code:
Code:
use strict;
use warnings;
use LWP::UserAgent;
use HTTP::Request;
$|++;

my $url = 'http://www.yahoo.com';
my $ua  = LWP::UserAgent->new(agent=>'Mozilla/5.0 (Windows; U; Windows NT 5.1; fr-FR; rv:1.7.12) Gecko/20050919 Firefox/1.0.7');

$ua->default_headers->header('Content-Type' => 'text/html');
# Build the GET request and save the response body straight to Temp.html.
my $req = HTTP::Request->new(GET => $url);
my $res = $ua->request($req, 'Temp.html');
die "Could not fetch $url: " . $res->status_line unless $res->is_success;

Thanks

Patrick
 
I usually find that Notepad doesn't understand \n\r.
You could try replacing all cases of \r with nothing.
That would get rid of the boxes in Notepad, I suspect.

Bruce
 
Well, it's definitely some kind of extraneous character. A Windows newline is "\r\n", so stripping out all \r's would likely cause line endings to go screwy on you. I generally see that when you try to read Mac (\r only) or *nix (\n only) files, though it could be any character not in the 7-bit ASCII set Notepad expects(?).

I'd first recommend updating your text editor, and maybe checking the file with a hex editor to see what the characters actually are. Then you can strip them out or work around them as your needs direct.
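
If a hex editor isn't handy, a few lines of Perl can do the same inspection. This is only a minimal sketch: the helper names count_line_endings and show_hidden_bytes are made up for illustration, and the sample string stands in for the real contents of Temp.html (which you would want to read with binmode so Perl doesn't translate the line endings first).

```perl
use strict;
use warnings;

# Count each kind of line ending in a raw string.
sub count_line_endings {
    my ($data) = @_;
    my %count;
    $count{crlf} = () = $data =~ /\r\n/g;       # DOS/Windows
    $count{lf}   = () = $data =~ /(?<!\r)\n/g;  # *nix (LF not preceded by CR)
    $count{cr}   = () = $data =~ /\r(?!\n)/g;   # old Mac (CR not followed by LF)
    return \%count;
}

# Render control and non-ASCII characters as \xNN so they become visible.
sub show_hidden_bytes {
    my ($data) = @_;
    (my $visible = $data) =~ s/([^\x20-\x7e])/sprintf('\\x%02x', ord $1)/ge;
    return $visible;
}

# Hypothetical sample standing in for the downloaded page.
my $sample = "<html>\n<body>\r\nhi\r</body>";
my $c = count_line_endings($sample);
print "CRLF=$c->{crlf} LF=$c->{lf} CR=$c->{cr}\n";
print show_hidden_bytes($sample), "\n";
```

If the LF count is high and the CRLF count is zero, the "squares" are most likely just *nix line endings.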

- Andrew
 
Hi,

Thank you guys.

I use this line to try to remove those weird characters, but it doesn't work. Maybe you can help with this.

Code:
$line =~ s/\r*\n$//g;

Here is where I'm confused: when I use IE to view the source of a web site, Notepad opens and I can see everything correctly. Why is it different when I "grab" the same page? Does it have something to do with the transfer with LWP? Is there something that can be set (e.g. a header) to correct this? I'm new to this and trying to understand.
 
Thanks icrf !

That works very well with this modification!

Once again, thanks a lot!

Patrick
 
You've converted the downloaded file from *nix format to DOS format and made it look prettier in Notepad, but it isn't necessary to do all that expensive substitution simply in order to parse the data.
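
For anyone reading along, a *nix-to-DOS conversion of this kind might look roughly like the sketch below. The exact substitution used earlier isn't shown in the thread, so this is an assumption, and to_dos_line_endings is just an illustrative name.

```perl
use strict;
use warnings;

# Normalize any mix of CRLF, lone CR, or lone LF to CRLF.
# \r\n comes first in the alternation so an existing CRLF is kept as one unit.
sub to_dos_line_endings {
    my ($text) = @_;
    $text =~ s/\r\n|\r|\n/\r\n/g;
    return $text;
}

# Hypothetical sample with *nix line endings.
my $unix = "<html>\n<body>\n</body>\n";
my $dos  = to_dos_line_endings($unix);
printf "%d bytes before, %d bytes after\n", length $unix, length $dos;
```

It touches every line ending in the page, which is the cost being referred to.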

As HTML is largely whitespace-blind (mixed sequences of line feeds, form feeds, carriage returns, spaces and tabs are all treated as equivalent to a single space), you should use the /s modifier on your regexes. This lets . match newlines, so your patterns can span multiple lines (as your HTML might).

The \s token matches \r and \n as well as the other whitespace characters, so if you use it wherever whitespace may occur, the substitution discussed earlier in the thread will not make any difference to the result.
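
To illustrate, here is a small sketch in which the same pattern gives the same result on *nix and DOS line endings, with no conversion step. The markup and the extract_title helper are invented for the example and are not taken from the page in question.

```perl
use strict;
use warnings;

# The same markup with *nix (\n) and DOS (\r\n) line endings.
my $unix_html = "<title>\n  Yahoo!\n</title>";
my $dos_html  = "<title>\r\n  Yahoo!\r\n</title>";

# \s* absorbs any whitespace, including \r and \n;
# the /s modifier lets . cross line boundaries.
sub extract_title {
    my ($html) = @_;
    return $html =~ m{<title>\s*(.*?)\s*</title>}s ? $1 : undef;
}

print extract_title($unix_html), "\n";
print extract_title($dos_html), "\n";
```

Both calls return the same captured text, so the line-ending style of the downloaded file stops mattering to the parse.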

Yours,

fish

 