Problem with LWP::UserAgent when retrieving a page...

JimToupet · Oct 20, 2005

Hi,

I use LWP::UserAgent to "grab" a web page, put it in a temp file, and parse some data out of it with regular expressions.

But every page I grab has a different format from what I see when I check the source code of the page. On every line there is at least one "square" (when I check with Notepad).

So my regular expressions don't work anymore. I really don't know what to do to solve this issue.

Here's my code:

Code:

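use strict;
use LWP::UserAgent;
use HTTP::Request;
$|++;    # unbuffer STDOUT

my $url = 'http://www.yahoo.com';
my $ua  = LWP::UserAgent->new(agent=>'Mozilla/5.0 (Windows; U; Windows NT 5.1; fr-FR; rv:1.7.12) Gecko/20050919 Firefox/1.0.7');

# Content-Type describes a request body; a GET has none, so this header is optional.
$ua->default_headers->header('Content-Type' => 'text/html');
my $req = HTTP::Request->new(GET => $url);
# Passing a file name makes LWP save the response body to that file.
my $res = $ua->request($req, 'Temp.html');
die "Failed to fetch $url: " . $res->status_line unless $res->is_success;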

Thanks

Patrick

Thorne44 · Oct 20, 2005

I usually find that Notepad doesn't understand \n\r.
You could try replacing all occurrences of \r with nothing.
That would get rid of the boxes in Notepad, I suspect.

Bruce

icrf · Oct 20, 2005

Well, it's definitely some kind of extraneous character. A Windows newline is "\r\n", so stripping out all the \r's would likely cause line endings to go screwy on you. I generally see that when you try to read Mac (\r only) or *nix (\n only) files, though it could be any character not in the 7-bit ASCII set Notepad expects(?).

I'd first recommend updating your text editor, and maybe checking the file with a hex editor to see what the characters actually are. Then you can strip them out or work around them as your needs direct.
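
If no hex editor is at hand, a few lines of Perl can do a similar check. This is only a sketch: count_line_endings and show_control_chars are hypothetical helper names (not part of LWP), and the sample string stands in for the contents of the downloaded file.

```perl
use strict;
use warnings;

# Count each line-ending style in a chunk of text.
sub count_line_endings {
    my ($text) = @_;
    my %count = (crlf => 0, lf => 0, cr => 0);
    $count{crlf}++ while $text =~ /\r\n/g;        # Windows: \r\n
    $count{lf}++   while $text =~ /(?<!\r)\n/g;   # *nix: \n not preceded by \r
    $count{cr}++   while $text =~ /\r(?!\n)/g;    # old Mac: \r not followed by \n
    return \%count;
}

# Replace control characters with visible \xNN escapes.
sub show_control_chars {
    my ($text) = @_;
    (my $visible = $text) =~ s/([\x00-\x1f\x7f])/sprintf('\\x%02X', ord $1)/ge;
    return $visible;
}

# Sample data with mixed line endings, standing in for the saved page.
my $sample = "<html>\n<body>\r\n</body>\r</html>\n";
my $c = count_line_endings($sample);
print "CRLF=$c->{crlf} LF=$c->{lf} CR=$c->{cr}\n";
print show_control_chars("a\r\nb\n"), "\n";
```

To run it on a real file, read the file with binmode set on the handle so that Perl on Windows does not translate the line endings before you get to look at them.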

- Andrew
Text::Highlight - A language-neutral syntax highlighting module in Perl
also on SourceForge including demo

JimToupet · Oct 20, 2005

Hi,

Thank you guys.

I used this line to try to remove those weird characters, but it doesn't work... maybe you can help with this.

Code:

$line =~ s/\r*\n$//g;

Here's where I'm confused: when I use IE and view the source of a web site, Notepad opens and I can see everything correctly. Why does the same page look different when I "grab" it? Does it have something to do with the transfer through LWP? Is there something I can set (e.g. a header) to correct this? I'm new to this and trying to understand.

icrf · Oct 20, 2005

maybe try the other way

s/\n/\r\n/g
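
One caveat with that substitution: if the text already contains some "\r\n" pairs, they become "\r\r\n". A slightly more defensive variant might look like the sketch below, where to_crlf is a hypothetical helper name and the input string is made-up sample data.

```perl
use strict;
use warnings;

# Normalize any line ending (\r\n, lone \r, or lone \n) to \r\n.
# \r\n comes first in the alternation so an existing pair is matched
# as one unit and is not doubled.
sub to_crlf {
    my ($text) = @_;
    $text =~ s/\r\n|\r|\n/\r\n/g;
    return $text;
}

my $mixed = "one\ntwo\r\nthree\rfour";
my $fixed = to_crlf($mixed);

# Print with visible escapes so the result can be checked by eye.
(my $shown = $fixed) =~ s/\r/\\r/g;
$shown =~ s/\n/\\n/g;
print "$shown\n";
```

When writing the result back to a file on Windows, call binmode on the output handle first. Otherwise Perl's text mode turns each "\n" into "\r\n" again on output and you end up with "\r\r\n".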

- Andrew

JimToupet · Oct 20, 2005

Thanks, icrf!

That works very well with this modification!

Once again, thanks a lot!

Patrick

fishiface · Oct 21, 2005

You've converted the downloaded file from *nix format to DOS format and made it look prettier in Notepad, but it isn't necessary to do all that expensive substitution simply in order to parse the data.

As HTML is whitespace-blind to a large extent (mixed sequences of line feeds, form feeds, carriage returns, spaces and tabs are all treated as equivalent to a single space), you should use the /s modifier on your regexes. This lets . match newlines, so your patterns can span multiple lines (as your HTML might).

The \s token matches \r and \n as well as the other whitespace characters, so if you use \s wherever the HTML may contain whitespace, the substitution discussed earlier in the thread will not make any difference to the result.
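
A minimal sketch of that idea, assuming you want the text of a title element. extract_title is a hypothetical helper and the pattern is only an illustration, not a general HTML parser.

```perl
use strict;
use warnings;

# Pull the text out of a <title> element, whatever the line endings are.
sub extract_title {
    my ($html) = @_;
    # /s lets . match newlines; /i ignores the case of the tag name.
    return unless $html =~ m{<title>\s*(.*?)\s*</title>}si;
    # Collapse every run of whitespace (including \r and \n) to one space.
    (my $title = $1) =~ s/\s+/ /g;
    return $title;
}

# The same markup with *nix and with DOS line endings (made-up sample data).
my $unix = "<title>\nYahoo!\nHome</title>";
my $dos  = "<title>\r\nYahoo!\r\nHome</title>";

print extract_title($unix), "\n";
print extract_title($dos), "\n";
```

Both calls give the same string, so the regex does not care whether the file was converted first.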

Yours,

fish

"As soon as we started programming, we found to our surprise that it wasn't as easy to get programs right as we had thought. Debugging had to be discovered. I can remember the exact instant when I realized that a large part of my life from then on was going to be spent in finding mistakes in my own programs."
--Maur
