Can't extract links!!

Status
Not open for further replies.

NashTrump

Technical User
Jul 23, 2005
38
GB
Hi there,

I'm extracting links from two betting websites.
I can do one of them with the code I have written, but the other doesn't return any data when I parse the website.

The two websites, which are the values of $url, are:


Here's my code:

#!/usr/bin/perl

use strict;
use warnings;
use LWP::UserAgent;
use HTML::LinkExtor;
use DBI;
use URI::URL;

# Connect once and keep the handle open for every query
my $dbh = DBI->connect("dbi:mysql:database=mybetting;host=localhost", "root", "password")
    or die "Couldn't connect to database: $DBI::errstr\n";

my $statement_main = AccessDatabase("SELECT * FROM bookies");

while (my $ref = $statement_main->fetchrow_arrayref) {
    # Columns 0, 1 and 2 hold the site id, the URL and the title
    my ($siteid, $url, $title) = @$ref;
    print "$_ = $ref->[$_] " for 0 .. $#$ref;

    #$url = " # for instance
    my $ua = LWP::UserAgent->new;
    $ua->env_proxy;

    # Set up a callback that collects the anchor links
    my @links = ();
    my $callback = sub {
        my ($tag, %attr) = @_;
        return if $tag ne 'a';    # we only look closer at <a ...>
        push(@links, values %attr);
    };

    # Make the parser. Unfortunately, we don't know the base yet
    # (it might be different from $url)
    my $p = HTML::LinkExtor->new($callback);

    # Request document and parse it as it arrives
    my $res = $ua->request(HTTP::Request->new(GET => $url),
                           sub { $p->parse($_[0]) });

    # Expand all link URLs to absolute ones
    my $base = $res->base;
    @links = map { url($_, $base)->abs } @links;

    # Keep only the links containing sID and print them out
    @links = grep(/sID/, @links);
    print join("\n", @links), "\n";

    foreach my $link (@links) {
        # Placeholders avoid quoting problems in the URL or title
        AccessDatabase("insert into Links(url, title) values (?, ?)", $link, $title);
    }
}

$dbh->disconnect();

# Prepare and execute a statement, then return the live statement handle
sub AccessDatabase {
    my ($sql, @bind) = @_;

    my $sth = $dbh->prepare($sql);
    $sth->execute(@bind) or die "Couldn't execute query '$sql': $DBI::errstr\n";

    return $sth;
}

******************************

I've been looking at this for about a month and can't sort it out!!

If anyone knows a way to do this, let me know!!

Kind regards

Nash
 
Try this:

Code:
use LWP::Simple qw/get $ua/;
use HTML::Parser;

# Identify as a browser; some sites refuse the default libwww-perl agent
$ua->agent( "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)" );
my $data = get( 'http://google.com/' );

# Print the href of every <a> start tag
HTML::Parser->new(
    start_h => [ sub { print shift->{href}, "\n" if ( shift eq 'a' ) }, 'tagname, attr' ],
)->parse( $data ) || die $!;
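
For anyone working out the snippet above, here is a longer sketch of the same idea with the handler arguments named. It parses an inline HTML string instead of fetching a page; the sample markup and the helper name href_of are made up for illustration and are not part of HTML::Parser.

```perl
use strict;
use warnings;
use HTML::Parser;

# Hypothetical markup standing in for a fetched page
my $html = '<p><a href="/odds?sID=1">Match betting</a> '
         . '<a name="top">no href</a> <img src="x.gif"></p>';

# Return the href of an <a> tag, or undef for any other tag.
# (href_of is an illustrative helper, not an HTML::Parser function.)
sub href_of {
    my ($tagname, $attr) = @_;
    return undef unless $tagname eq 'a';
    return $attr->{href};    # undef if the anchor has no href
}

my $parser = HTML::Parser->new(
    api_version => 3,
    # The argspec string 'tagname, attr' decides what the handler receives:
    # the tag name first, then a hash reference of its attributes.
    start_h => [
        sub {
            my $href = href_of(@_);
            print "$href\n" if defined $href;
        },
        'tagname, attr',
    ],
);
$parser->parse($html);
$parser->eof;    # tell the parser there is no more input
```

The two shift calls in the one-liner do the same job as the named arguments here: the condition after "if" runs first and takes the tag name, then the print takes the attribute hash.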

HTH

---
cheers!
san

print length "The answer to life, universe & everything!
 
That was great...

Now all I have to do is work out your script!! (I'm fairly basic at Perl at the moment.)

It worked perfectly though!!

Thanks very much!! You're a star.....
 
One more thing though...

Instead of printing them out, how do I add them to an array called @links?

I tried replacing the print shift->{href} with push (@links values shift->(href))

Obviously I don't know what I'm talking about, because that didn't work!! :)

Any idea how to do it?

Thanks

nash
 
Yep, for that you need to make it a global array. Add the domain using a pattern match instead of hardcoding it: while pushing in the values, if the link doesn't match ^http://, then push it like " where $dom holds the value of $1 of your match.
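
A rough sketch of that advice, assuming the page address is already in $url. The sample address, the markup, and the prefix format ($dom followed by the link) are assumptions for illustration. The simple prefix only covers links starting with "/"; the URI module's new_abs method is one option for paths such as "../x".

```perl
use strict;
use warnings;
use HTML::Parser;

# Hypothetical page address and markup, standing in for a real fetch
my $url  = 'http://www.example.com/football/index.html';
my $html = '<a href="/odds?sID=1">A</a> <a href="http://other.example.org/x">B</a>';

# Capture the scheme and host once; $dom then holds $1 of the match
my ($dom) = $url =~ m{^(https?://[^/]+)}i
    or die "Can't find a domain in $url\n";

my @links;    # declared outside the handler so it survives the parse

my $parser = HTML::Parser->new(
    api_version => 3,
    start_h     => [
        sub {
            my ($tagname, $attr) = @_;
            return unless $tagname eq 'a' && defined $attr->{href};
            my $link = $attr->{href};
            # Prefix root-relative links with the domain (assumed format)
            $link = $dom . $link if $link =~ m{^/};
            push @links, $link;
        },
        'tagname, attr',
    ],
);
$parser->parse($html);
$parser->eof;

print "$_\n" for @links;
```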

HTH

---
cheers!
san

print length "The answer to life, universe & everything!
 
Ok got that sorted....
Thanks for that...

Any idea how I add the link text to my URL?

I.e. if it said Match betting... I would like to extract that along with my ' for example...

I'm really unfamiliar with HTML::Parser and there's nowhere in the CPAN documentation that explains it properly...

Thanks if you can explain this to me..
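
One possible way to do this, as a sketch only: add a text handler and an end handler next to the start handler, and collect whatever text arrives between <a ...> and </a>. The sample markup and the url/text hash layout are assumptions for illustration; start_h, text_h, end_h and the 'dtext' argspec are regular HTML::Parser features.

```perl
use strict;
use warnings;
use HTML::Parser;

# Hypothetical markup standing in for a fetched page
my $html = '<a href="/odds?sID=1">Match <b>betting</b></a> text outside '
         . '<a href="/odds?sID=2">Outrights</a>';

my @links;      # one { url => ..., text => ... } hash per link
my $current;    # the link we are currently inside, or undef

my $parser = HTML::Parser->new(
    api_version => 3,
    start_h => [ sub {
        my ($tagname, $attr) = @_;
        return unless $tagname eq 'a' && defined $attr->{href};
        $current = { url => $attr->{href}, text => '' };
    }, 'tagname, attr' ],
    text_h => [ sub {
        my ($text) = @_;
        # Only keep text that appears inside <a>...</a>
        $current->{text} .= $text if $current;
    }, 'dtext' ],
    end_h => [ sub {
        my ($tagname) = @_;
        return unless $tagname eq 'a' && $current;
        push @links, $current;
        undef $current;
    }, 'tagname' ],
);
$parser->parse($html);
$parser->eof;

print "$_->{url}\t$_->{text}\n" for @links;
```

Text may arrive in several chunks (here "Match " and "betting"), which is why the text handler appends rather than assigns.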
 