Can't get Content from URL!

NashTrump · Jun 29, 2006

Hi guys,

I'm trying to look at the HTML behind a website by calling $mech->get($url), having loaded the module with

use WWW::Mechanize;

Below is the code I'm using and an example of the output.

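use strict;
use warnings;
use WWW::Mechanize;
my $url  = 'http://www.betfred.com/betting/?sID=194.1&nIDMK=245574.1&fbf=match';
my $mech = WWW::Mechanize->new( autocheck => 0 );   # success() is checked by hand below
$mech->get($url);
$mech->success or die "Can't open page\n";
my $content = $mech->content;                       # body of the page that was fetched
print "$url\n";
print $content;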

http://www.betfred.com/betting/?sID=194.1&nIDMK=245574.1&fbf=match

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"

"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">

<html xmlns="http://www.w3.org/1999/xhtml">

<head>

<title>Betting</title>

<script type="text/javascript">

/*<![CDATA[*/

function createCookie(name,value,minutes) {

if (minutes) {

var date = new Date();

date.setTime(date.getTime()+(minutes*60*1000));

var expires = "; expires="+date.toGMTString();

}

else var expires = "";

document.cookie = name+"="+value+expires+"; path=/";

}

/*]]>*/

</script>

</head>

<body>

<script type="text/javascript">

/*<![CDATA[*/

var nRfrPos = location.search.indexOf("saarfr=");

var nCntPos = location.search.indexOf("saacnt=");

var sRfr = location.search.substring(nRfrPos+7, nCntPos-1);

var nCnt = parseInt(location.search.substring(nCntPos+7), 10);

if (isNaN(nCnt)) nCnt = 0;

var sHref;

if (nRfrPos != -1) {

createCookie('jbhckrFix', 'fInSoFt', 1440);

createCookie('ok2prcd', '1', 5); // 5min should be enough

sHref = unescape( sRfr );

if (sHref.indexOf("?") == -1) sHref += "?"; else sHref += "&";

sHref += "saacnt=" + (nCnt+1);

location.href = sHref;

} else {

sHref = location.href.toLowerCase();

if (sHref.indexOf("index.html") != -1)

sHref = location.href.replace("index.html","index.asp");

else {

var sTmp = location.host + "/betting/";

sHref = location.href.replace(sTmp, sTmp + "index.asp");

}

if (sHref.indexOf("?") == -1) sHref += "?"; else sHref += "&";

sHref += "saacnt=" + (nCnt+1);

location.href = sHref;

}

/*]]>*/

</script>

</body>

</html>

However, the actual source at that link is nothing like this!

Does anyone know what I am doing wrong?

Kind regards

Nash

NashTrump · Jun 29, 2006

Hi there,

I also sometimes get this:

HTTP::Response=HASH(0x2a0c32c)

I have absolutely no idea what this means!

Regards

nash
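A note on that second output: a string of the form Class=HASH(0x...) is what Perl prints when an object reference is printed directly instead of one of its methods being called. $mech->get returns an HTTP::Response object, so the most likely cause is printing the return value of get rather than $mech->content. The sketch below shows the effect with a stand-in class, Demo::Response, which is hypothetical and exists only so the example runs without any network access or CPAN modules.

```perl
use strict;
use warnings;

# Hypothetical stand-in for HTTP::Response: a hash-based object with a
# content() accessor and no overloaded stringification.
package Demo::Response;

sub new {
    my ($class, %args) = @_;
    return bless {%args}, $class;
}

sub content { return $_[0]{content} }

package main;

my $response = Demo::Response->new( content => '<html>...</html>' );

# Using the object itself as a string gives "Demo::Response=HASH(0x...)".
my $printed = "$response";

# Calling the accessor gives the page body instead.
my $body = $response->content;

print "object as string: $printed\n";
print "content method:   $body\n";
```

With a real HTTP::Response the same applies: call ->decoded_content (or ->content) on it, or keep using $mech->content as in the first post.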

raklet · Jun 29, 2006

That is because the HTML is generated by executing JavaScript. The server does not send the HTML itself, only JavaScript. Once that JavaScript reaches the browser, the browser executes it and produces the page. Mechanize does not execute JavaScript, so the script is all it ever sees.

See

http://search.cpan.org/dist/WWW-Mechanize/lib/WWW/Mechanize/FAQ.pod#JavaScript

for a full description of the problem. The limitation is not specific to Mechanize: LWP, HTML::TokeParser::Simple and numerous other modules designed for fetching and parsing web pages share it. AFAIK nothing has been written yet to handle this.

Raklet
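Looking at the script in the first post, its else branch does not build any markup; it rewrites the address (inserting index.asp and appending a saacnt counter) and then redirects. One thing that can be tried without a browser is to compute that address in Perl and request it directly. The sketch below assumes the server returns the real page for the rewritten address, which nothing in this thread confirms. redirect_target and fetch_real_page are hypothetical names, and only the else branch (no "saarfr=" in the query) is mirrored.

```perl
use strict;
use warnings;

# Rebuild the address the JavaScript stub would send a browser to.
# Mirrors only the "else" branch of the script (no "saarfr=" in the query).
sub redirect_target {
    my ($url) = @_;

    my ($host) = $url =~ m{^https?://([^/?#]+)}i
        or die "Cannot find a host in $url\n";
    my $qpos   = index($url, '?');
    my $search = $qpos >= 0 ? substr($url, $qpos) : '';   # like location.search

    # The script reads the counter with substring(indexOf("saacnt=") + 7).
    # When "saacnt=" is absent, indexOf gives -1, so it parses from offset 6.
    my $start = index($search, 'saacnt=') + 7;
    my $tail  = $start <= length($search) ? substr($search, $start) : '';
    my $count = $tail =~ /^\s*([+-]?\d+)/ ? $1 : 0;       # parseInt; NaN becomes 0

    my $target = $url;
    if (index(lc $url, 'index.html') != -1) {
        $target =~ s/index\.html/index.asp/;
    }
    else {
        my $prefix = "$host/betting/";
        $target =~ s/\Q$prefix\E/${prefix}index.asp/;
    }

    $target .= index($target, '?') == -1 ? '?' : '&';
    return $target . 'saacnt=' . ($count + 1);
}

# Hypothetical helper: request the stub's target instead of the stub itself.
# WWW::Mechanize is loaded here so redirect_target can be used without it.
sub fetch_real_page {
    my ($url) = @_;
    require WWW::Mechanize;
    my $mech = WWW::Mechanize->new( autocheck => 0 );
    $mech->get( redirect_target($url) );
    $mech->success or die "Can't open page: ", $mech->status, "\n";
    return $mech->content;
}
```

For the URL in the first post, redirect_target returns http://www.betfred.com/betting/index.asp?sID=194.1&nIDMK=245574.1&fbf=match&saacnt=95. The 95 is deliberate: with no "saacnt=" in the query, the script's substring arithmetic starts reading at "94.1" inside "sID=194.1", and the sketch reproduces that. If the server also depends on the cookies set in the other branch, this will not be enough on its own.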

NashTrump · Jun 29, 2006

So, in short, is there anything I can do to guarantee that the HTML is returned?

I can use the other modules. However, with my current version I do get the HTML every now and then, just not every time...

raklet · Jun 30, 2006

I'm sorry, but I can't really say. At this point I am foundering in deep water. Do a Google search for "lwp javascript"; you may find something of interest. There are lots of articles that point to other things, but I never found any definitive answers, just lots of little hints that require more reading and more exploring. Two possibilities that seemed to stand out:

http://www.openqa.org/selenium/

http://search.cpan.org/dist/Win32-IE-Mechanize/lib/Win32/IE/Mechanize.pm

Not much more I can say about this topic.

Raklet
