You should be able to see what I'm trying to do here: I want to generate an anonymous hash reference for each key, which is a URL; each anonymous hash will then contain keys for all the URLs extracted from that page, each of which in turn opens up into another anonymous hash for the links on that page, and so on ad infinitum. The result should be a tree-like structure mapping out all the URLs crawled by the bot. Please help, this is urgent as I need to impress my new boss.
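To make it concrete, the final structure I'm picturing would look something like this (the URLs are just made-up placeholders):

my %url = (
    "http://base-url" => {
        "http://base-url/page1" => {
            "http://base-url/page1/sub" => {},
        },
        "http://base-url/page2" => {},
    },
);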
use strict;
use warnings;
use LWP::Simple;
use HTML::LinkExtor;
use Sort::Array qw(Discard_Duplicates);

my %url = (
    "" => {},    # base URL goes here
);

my $p = HTML::LinkExtor->new();

while ( my ($key, $value) = each %url ) {
    my $content = get($key);
    next unless defined $content;    # skip pages that fail to download
    $p->parse($content)->eof;
    my @links = $p->links;

    # drop links to scripts, stylesheets and media files
    my @links1;
    foreach my $link (@links) {
        next if $$link[2] =~ m/\.(js|css|png|wav|mp3|rm|mpg|bmp|jpg|rar|tar|zip|tif|gif|mp4)$/;
        push @links1, $$link[2];
    }

    # strip trailing slashes and keep only absolute http links
    my @links2;
    foreach my $link1 (@links1) {
        $link1 =~ s/\/$//;
        push @links2, $link1 if $link1 =~ /^http/;
    }

    @links2 = Discard_Duplicates(
        empty_fields => 'delete',
        data         => \@links2,
    );

    # this is the part that doesn't do what I want: $value just gets
    # reassigned on every pass instead of the tree growing under $url{$key}
    $value = {};
    foreach my $link2 (@links2) {
        $value = { $link2 => {} };
    }
    print %url;
}
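My best guess is that the last loop should attach each link under the current key in %url rather than reassigning $value, something along these lines, but I'm not sure it's right:

    foreach my $link2 (@links2) {
        # hang a fresh anonymous hash off the page we just crawled
        $url{$key}{$link2} = {};
    }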
Also, eventually I want to store this script on a server so AJAX can make requests and build single-column tables from the URL values, each with an arrow next to it; when you click the arrow it will bring up the corresponding list of URLs extracted from that page, and of course each table will have a "back" button below it, and probably a "back to base URL" button as well. The user will also be able to set the base URL and the number of URLs crawled. I managed to do this easily with a simpler robot that used a one-dimensional URL array, where new URLs were simply pushed onto the list, like this:
while (scalar(@URL) < 100) {
    # perform crawl
}
But I get the feeling it's going to be a lot more difficult with this multidimensional bot. Perhaps I will have to set a crawl depth (the number of levels in the hash) rather than a number of URLs, but then it will be almost impossible to predict how many URLs will be extracted, and therefore how long the robot will be crawling for. Perhaps I could keep a log of every URL in an array as well as using them as hash keys, and then the array would determine when the while loop stops, as with my first effort.
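Roughly this is the counting idea I have in mind (the @seen name is just a placeholder, and this is only a sketch of the stop condition, not the full crawl loop):

    my @seen;                          # flat log of every URL added to the tree
    while (@seen < 100) {
        # ... fetch and parse the next page into @links2 ...
        foreach my $link2 (@links2) {
            $url{$key}{$link2} = {};   # grow the tree
            push @seen, $link2;        # and log it so the loop knows when to stop
        }
    }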
I guess these are secondary considerations to actually getting the thing working, so first I need to know how to generate those new hashes!