Retrieving the source of a webpage

Status
Not open for further replies.

Supra

Programmer
Dec 6, 2000
I've tried using LWP to get the source of a webpage, but I keep getting an error. This makes me think my host (Netfirms.com) doesn't support LWP. Is there any way to retrieve the source of a webpage other than LWP? All I need is something simple like:

$URL = "$HTM = GET blah blah blah

I'm new to Perl so any advice one can give would be great.
 
Can you give us an example of the code you've tried?

--Paul
 
Is it Google you've tried to access by any chance?


Kind Regards
Duncan
 
The reason I ask is that Google checks whether the request comes from a browser. If you don't set a User-Agent header to fool it, Google throws up an error, because it knows it is being interrogated by a program.

this fools it though...

Code:
use LWP;

$browser = LWP::UserAgent->new();

$find = "perl";

$response = $browser->get(
  "http://www.google.com/search?hl=en&ie=ISO-8859-1&oe=ISO-8859-1&q=$find",
  'User-Agent' => 'Mozilla/4.76 [en] (win-98; U)',
);

print $response->content . "\n";


Kind Regards
Duncan
 
Duncan is partly right: if it's Google, you'll have a hard time parsing their pages. Actually, without setting up an API account with Google, there's no way you'll trick Google into giving you the source code.

No matter how many times you try, it's not going to work, and eventually you'll need to set up an API account, which is free.

cgimonk
 
No dice yet folks. Here's the code:
Code:
#!/usr/bin/perl

use LWP;

$browser = LWP::UserAgent->new();

$response = $browser->get(
  "http://www.yahoo.com",
  'User-Agent' => 'Mozilla/4.76 [en] (win-98; U)',
);

print $response->content . "\n";
I get the same 500 internal server error. The only thing I can think is that Netfirms doesn't allow LWP? You guys know better than me though ;)
 
I decided to run a script to check which modules are installed. Here's the output:
Code:
This site is hosted by Netfirms Web Hosting  

B
ByteLoader
Config
DB_File
DynaLoader
Errno
Fcntl
IO
NDBM_File
O
Opcode
POSIX
SDBM_File
Safe
Socket
XSLoader
attrs
ops
re
Sys::Hostname
Sys::Syslog
IPC::Msg
IPC::Semaphore
IPC::SysV
IO::Dir
IO::File
IO::Handle
IO::Pipe
IO::Poll
IO::Seekable
IO::Select
IO::Socket
File::Glob
Devel::DProf
Devel::Peek
Data::Dumper
B::Asmdata
B::Assembler
B::Bblock
B::Bytecode
B::C
B::CC
B::Concise
B::Debug
B::Deparse
B::Disassembler
B::Lint
B::Showlex
B::Stackobj
B::Stash
B::Terse
B::Xref
AnyDBM_File
AutoLoader
AutoSplit
Benchmark
CGI
CPAN
Carp
Cwd
DB
DirHandle
Dumpvalue
English
Env
Exporter
Fatal
FindBin
FileCache
FileHandle
SelectSaver
SelfLoader
Shell
Symbol
Test
UNIVERSAL
attributes
autouse
base
blib
bytes
charnames
constant
diagnostics
fields
filetest
integer
less
lib
locale
open
overload
sigtrap
strict
subs
utf8
vars
warnings
warnings::register
User::grent
User::pwent
Time::Local
Time::gmtime
Time::localtime
Time::tm
Tie::Array
Tie::Handle
Tie::Hash
Tie::RefHash
Tie::Scalar
Tie::SubstrHash
Text::Abbrev
Text::ParseWords
Text::Soundex
Text::Tabs
Text::Wrap
Test::Harness
Term::ANSIColor
Term::Cap
Term::Complete
Term::ReadLine
Search::Dict
Pod::Checker
Pod::Find
Pod::Functions
Pod::Html
Pod::InputObjects
Pod::LaTeX
Pod::Man
Pod::ParseUtils
Pod::Parser
Pod::Plainer
Pod::Select
Pod::Text
Pod::Usage
Pod::Text::Color
Pod::Text::Overstrike
Pod::Text::Termcap
Net::Ping
Net::hostent
Net::netent
Net::protoent
Net::servent
Math::BigFloat
Math::BigInt
Math::Complex
Math::Trig
IPC::Open2
IPC::Open3
IO::Socket::INET
IO::Socket::UNIX
I18N::Collate
Getopt::Long
Getopt::Std
File::Basename
File::CheckTree
File::Compare
File::Copy
File::DosGlob
File::Find
File::Path
File::Spec
File::Temp
File::stat
File::Spec::Epoc
File::Spec::Functions
File::Spec::Mac
File::Spec::OS2
File::Spec::Unix
File::Spec::VMS
File::Spec::Win32
ExtUtils::Command
ExtUtils::Embed
ExtUtils::Install
ExtUtils::Installed
ExtUtils::Liblist
ExtUtils::MM_Cygwin
ExtUtils::MM_OS2
ExtUtils::MM_Unix
ExtUtils::MM_VMS
ExtUtils::MM_Win32
ExtUtils::MakeMaker
ExtUtils::Manifest
ExtUtils::Miniperl
ExtUtils::Mkbootstrap
ExtUtils::Mksymlists
ExtUtils::Packlist
ExtUtils::testlib
Exporter::Heavy
Devel::SelfStubber
Class::Struct
Carp::Heavy
CPAN::FirstTime
CPAN::Nox
CGI::Apache
CGI::Carp
CGI::Cookie
CGI::Fast
CGI::Pretty
CGI::Push
CGI::Switch
CGI::Util
mod_perl
::usr::local::nf::lib::perl5::site_perl::5.6.1::i386-freebsd::mod_perl_hooks.pm.PL
mod_perl_hooks
Apache
DBI
Mysql
Digest::MD5
Mysql::Statement
DBD::Proxy
DBD::NullP
DBD::ADO
DBD::ExampleP
DBD::Sponge
DBD::Multiplex
DBD::mysql
DBI::ProxyServer
DBI::Format
DBI::Shell
DBI::FAQ
DBI::W32ODBC
DBI::DBD
Win32::DBIODBC
XML::Parser
XML::Parser::Expat
Bundle::Apache
Bundle::DBI
Bundle::DBD::mysql
Apache::Registry
Apache::PerlSections
Apache::PerlRun
Apache::Debug
Apache::MyConfig
Apache::ExtUtils
Apache::src
Apache::Symdump
Apache::Status
Apache::RedirectLogFix
Apache::Include
Apache::StatINC
Apache::RegistryBB
Apache::test
Apache::FakeRequest
Apache::SizeLimit
Apache::Resource
Apache::RegistryNG
Apache::httpd_conf
Apache::SIG
Apache::Options
Apache::Opcode
Apache::RegistryLoader
Apache::Connection
Apache::Constants
Apache::File
Apache::Leak
Apache::Log
Apache::ModuleConfig
Apache::PerlRunXS
Apache::Server
Apache::Symbol
Apache::Table
Apache::URI
Apache::Util
Apache::Constants::Exports
URI
MD5
URI::Escape
URI::Heuristic
URI::URL
URI::WithBase
URI::_foreign
URI::_generic
URI::_login
URI::_query
URI::_segment
URI::_server
URI::_userpass
URI::data
URI::file
URI::ftp
URI::gopher
URI::http
URI::https
URI::ldap
URI::mailto
URI::news
URI::nntp
URI::pop
URI::rlogin
URI::rsync
URI::snews
URI::telnet
URI::file::Base
URI::file::FAT
URI::file::Mac
URI::file::OS2
URI::file::QNX
URI::file::Unix
URI::file::Win32
NF::XML::Comm
NF::XML::Conv
As you can see, LWP is not in this list. However, IO::Socket is. So if I'm just trying to grab the source of a webpage to get a few values from it, which module from the list above would be the best way to go about it?
 
I copied this script right from a website (DevArticles) and it gives the same 500 Internal Server Error:
Code:
#!/usr/bin/perl -w

use strict;
use Socket;

# initialize host and port 
my $host = shift || 'http://www.yahoo.com';
my $port = shift || 80;
my $proto = getprotobyname('tcp');

# get the port address 
my $iaddr = inet_aton($host);
my $paddr = sockaddr_in($port, $iaddr);

# create the socket, connect to the port 
socket(SOCKET, PF_INET, SOCK_STREAM, $proto);
connect(SOCKET, $paddr);

my $line;

while ($line = <SOCKET>) {
	print $line;
}

close SOCKET;
What must I do to get the source of a webpage?!
 
That's really interesting, because LWP comes with Perl and I don't see why they wouldn't allow it. If they didn't have LWP and you tried to use it, you WOULD get an error saying the module cannot be located, not just a 500 Internal Server Error.

Check your error logs and switch on warnings if they aren't on.

This is what I personally use to grab the code from a page, with minimal, if any, error handling.

Code:
#!/usr/bin/perl

use warnings;
use strict;

use CGI qw/:standard/;
use LWP::Simple;

print header, start_html();

my $to_get = "www.sulfericacid.com";
my @caught = get($to_get);

foreach (@caught)
{
  print;
}
 
I copied and pasted your source exactly into a new file, but it still gives me the 500. Here's the URL:


At this point I can only assume that Netfirms has some sort of block against these kinds of modules. I guess it's time to give up. Thanks for your continued effort guys.
 
Before you give up, can you put this near the top of your script and let us know if it gives you an error?

Code:
use CGI::Carp 'fatalsToBrowser';

cgimonk
 
Yup:
Code:
Can't locate LWP/Simple.pm in @INC (@INC contains: /usr/local/nf/lib/perl5/5.6.1/i386-freebsd /usr/local/nf/lib/perl5/5.6.1 /usr/local/nf/lib/perl5/site_perl/5.6.1/i386-freebsd /usr/local/nf/lib/perl5/site_perl/5.6.1 /usr/local/nf/lib/perl5/site_perl .) at test.pl line 8.
BEGIN failed--compilation aborted at test.pl line 8.
Just as I thought - no LWP :(
 
Yes, you were totally right. That is very odd; it comes with all packages from 5.6+ (and further back than that, I believe). So I have no idea why it's not there.

cgimonk
 
You can still install it in your own local directory and reference it from there.

How? I can't remember, though I've seen it in this forum. Have a look at the FAQs.

--Paul
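For anyone landing here later, a minimal sketch of what "reference it differently" usually looks like: point Perl at a private module directory with `use lib`, then probe whether a module can actually be loaded. The `./modules` path and the `module_available` helper are made up for illustration, not something from Netfirms or from this thread's FAQ.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical private module directory (an assumption); replace it with
# wherever the modules were actually uploaded on your account.
use lib './modules';

# Return 1 if the named module can be loaded from @INC, 0 otherwise.
sub module_available {
    my ($module) = @_;
    (my $file = "$module.pm") =~ s{::}{/}g;   # LWP::Simple -> LWP/Simple.pm
    return eval { require $file; 1 } ? 1 : 0;
}

# Report on a few modules mentioned in this thread.
for my $module (qw(strict IO::Socket LWP::Simple)) {
    printf "%-12s %s\n", $module,
        module_available($module) ? 'found' : 'missing';
}
```

Run from a shell or as a CGI (after printing a Content-type header), this reports which modules Perl can find once the extra directory is on the search path.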
 
You can do it with IO::Socket, but it's not nearly as good.

Code:
use IO::Socket;

$socket = IO::Socket::INET->new
(
  Proto    => "tcp",
  PeerAddr => "www.yahoo.com",
  PeerPort => 80,
);

$socket->autoflush(1);

print $socket "GET /index.html HTTP/1.0\015\012\015\012";

while (<$socket>) {
  print;
}
close $socket;


Kind Regards
Duncan
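One thing to watch with the raw socket approach: what comes back is the whole HTTP response, status line and headers included, not just the page source. Here is a rough sketch of separating the two, assuming the response was collected into one string instead of printed line by line. The `split_http_response` name and the canned sample response are made up for illustration.

```perl
use strict;
use warnings;

# Split a raw HTTP response into (status code, headers hashref, body).
sub split_http_response {
    my ($raw) = @_;
    # Headers end at the first blank line (CRLF CRLF, or bare LF LF).
    my ($head, $body) = split /\r?\n\r?\n/, $raw, 2;
    $body = '' unless defined $body;
    my ($status_line, @header_lines) = split /\r?\n/, $head;
    my ($code) = $status_line =~ m{^HTTP/\d+\.\d+\s+(\d{3})};
    my %headers;
    for my $line (@header_lines) {
        # Skip anything that does not look like "Name: value".
        my ($name, $value) = $line =~ /^([^:]+):\s*(.*)$/ or next;
        $headers{lc $name} = $value;
    }
    return ($code, \%headers, $body);
}

# Canned sample standing in for what the socket loop would have read.
my $raw = "HTTP/1.0 200 OK\015\012Content-Type: text/html\015\012\015\012<html>hi</html>\n";
my ($code, $headers, $body) = split_http_response($raw);
print "status: $code\n";
print "type:   $headers->{'content-type'}\n";
print $body;
```

In the loop above you would append each line to a string (`$raw .= $_`) rather than print it, then hand that string to the function once the socket is closed.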
 
I don't understand why you are having such bad luck... I would try to install LWP locally as Paul suggested. I have not done this before, so I can't help, I'm afraid.

Even better, contact Netfirms and ask them to install LWP. Apparently it is very quick and easy.


Kind Regards
Duncan
 
Ok I've installed LWP locally as suggested by PaulTEG. Now I don't get any errors but nothing prints out on the page except "Printing source..". I'm using cgimonk's code, slightly modified:
Code:
#!/usr/bin/perl

use CGI::Carp 'fatalsToBrowser';

use warnings;
use strict;

use lib '/mnt/web_i/d26/s15/b01eb8dc/www/modules/';
use CGI qw/:standard/;
use LWP::Simple;

print header, start_html();

my $to_get = "www.yahoo.com";
my @caught = get($to_get);

print "Printing source..\n";

foreach (@caught)
{
  print;
}
I can only assume it isn't printing the correct variable? I'm getting closer! ;)
 
Don't ask me why, but you have to put 'http://' before the web address and then it works.

HTH
--Paul
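A small sketch of that fix, on the assumption that the missing piece is the http:// scheme: LWP wants an absolute URL, so a bare "www.yahoo.com" fetches nothing. The `with_scheme` helper is a made-up name for illustration, not part of LWP.

```perl
use strict;
use warnings;

# Prepend http:// when the address has no scheme of its own.
sub with_scheme {
    my ($address) = @_;
    return $address =~ m{^[a-z][a-z0-9+.-]*://}i
        ? $address
        : "http://$address";
}

print with_scheme('www.yahoo.com'), "\n";         # http://www.yahoo.com
print with_scheme('http://www.yahoo.com'), "\n";  # http://www.yahoo.com
```

With LWP::Simple you would then call get(with_scheme($to_get)) and check the result with defined before printing, since get returns undef when the fetch fails.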
 