
Robot: storing URLs in a multidimensional hash


Tules (Programmer), Aug 31, 2008
You should be able to see what I'm trying to do here: I want to generate anonymous hash references for each key, which is a URL. Each anonymous hash will then contain a list of keys for all the URLs extracted from that page, each of which in turn opens up into another anonymous hash for the links on that page, and so on ad infinitum. The result should be a tree-like structure mapping out all the URLs crawled by the bot. Please help, this is urgent as I need to impress my new boss :)
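For example (these URLs are just made up to show the shape), I'm imagining something that would dump out roughly like this:
Code:
use Data::Dumper;

# made-up example of the tree I'm after: every URL is a key whose value is
# an anonymous hash of the URLs extracted from that page, and so on down
my %url = (
    'http://www.example.com' => {
        'http://www.example.com/about' => {
            'http://www.example.com/contact' => {},
        },
        'http://www.example.org/news' => {},
    },
);

print Dumper(\%url);
Here is what I have so far: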



use Sort::Array qw(Discard_Duplicates);
use LWP::Simple;
use HTML::LinkExtor;

my %url = (
"http://www.myspace.com" => {}
);

$p = HTML::LinkExtor->new();

while (($key,$value) = each(%url))

{my $content = get($key);

$p->parse($content)->eof;

@links = $p->links;

foreach $link (@links)
{if ($$link[2] =~ m/\.(js|css|png|wav|mp3|rm|mpg|bmp|jpg|rar|tar|zip|tif|gif|mp4)$/g) {next;}
push (@links1, $$link[2]);
}

foreach $link1 (@links1)
{$link1=~ s/\/$//g;
if ($link1 =~ /^(http:)/g)
{push (@links2, $link1);}
}
@links2 = Discard_Duplicates
(
empty_fields => 'delete',
data => \@links2,
);

$value = {};

foreach $link2 (@links2)
{$value = {$link2 => {}};}
print %url;
};

Also, eventually I want to store this script on a server so AJAX can make requests and build single-column tables from the URL values, each with an arrow next to it; when you click on the arrow it will bring up the corresponding list of URLs extracted from that page. Of course each table will have a "back" button below it, and probably a "back to base URL" button as well. The user will also be able to set the base URL and the number of URLs crawled. I managed to do this easily with a simpler robot that used a single-dimensional URL array, where new URLs were simply pushed onto the list, like this:

while (scalar(@URL)<100)

{#perform crawl}

But I get the feeling it's going to be a lot more difficult with this multidimensional bot. Perhaps I will have to set a crawl depth (the number of dimensions in the hash) as opposed to a number of URLs, but then it will be almost impossible to determine how many URLs will be extracted, and therefore how long the robot will be crawling for. Perhaps I could keep a log of every URL in an array as well as using them as hash keys, and then the array would determine when the while loop stops, as with my first effort.

I guess these are secondary considerations to actually getting the thing working, so first I need to know how to generate those new hashes!
 
Perl lets you define hashes of hashes of hashes of ... (theoretically) ad infinitum.
You should study Perl's documentation (particularly perldata and perllol, and also perlref, the latter being quite complex) to understand how to use them. In short, and using my favourite syntax among the various available:
-to create a new subhash: [tt]$links{aaa}{bbb}={};[/tt]
-to create a reference to the newly created subhash (useful for handling the subhash into a presumably recursive routine): [tt]my$refl1=\%{$links{aaa}{bbb}};[/tt]
-to add link 'ccc' to that referenced subhash: [tt]$$refl1{ccc}={};[/tt]
-to do something for all the keys of that referenced subhash: [tt]for(keys%$refl1){}[/tt]
Note that with hashes you don't need to check for duplicates.
However, you'll need to avoid closed circuits (I didn't check your code to see whether you have already resolved this issue), and I guess the only way is to also keep a list of all the URLs used so far.
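Putting those pieces together, here is a minimal self-contained sketch (the keys aaa, bbb, ccc are just placeholders):
Code:
use strict;
use warnings;

my %links;

# create a new subhash two levels down
$links{aaa}{bbb} = {};

# take a reference to the newly created subhash
my $refl1 = \%{$links{aaa}{bbb}};

# add link 'ccc' to the referenced subhash
$$refl1{ccc} = {};

# do something for all the keys of the referenced subhash
for (keys %$refl1) {
    print "key: $_\n";    # prints "key: ccc"
}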

Franco
 
Hey, sorry, I'm finding it very difficult to get my head round this. How exactly would I generate anonymous hash references on the fly? The documentation showed me how to declare hashes of hashes and their references, but I really need to be able to generate them automatically via a foreach loop or something similar.
 
The following is an (untested) [tt]sub recursiveurls[/tt] that receives a reference to a hash of URLs like your %url (or one of its subhashes). It calls a sub [tt]geturl[/tt] (which you should provide) that gets all the links contained in the URL passed as the argument, filters them and does any required housekeeping, and returns a reference to that list. The hash [tt]%urlsofar[/tt] contains all the URLs encountered so far; any URL returned from that routine which is already in it is not expanded (but still appears in the structure).
The sub [tt]recursiveurls[/tt] is recursive, and I think the only way to stop it going deeper and deeper is to define a depth level.
Code:
my %url = ('http://www.myspace.com' => {});

{
    my %urlsofar;
    my $DEPTHLEVEL = 5;

    sub recursiveurls {
        my ($refurl) = @_;
        return if $DEPTHLEVEL <= 0;   # stop when the depth budget is used up
        $DEPTHLEVEL--;
        for my $u (keys %$refurl) {
            unless (exists $urlsofar{$u}) {
                $urlsofar{$u} = 1;
                my $reflist = geturl($u);
                for my $v (@$reflist) {
                    $$refurl{$u}{$v} = {};
                }
                recursiveurls(\%{$$refurl{$u}});
            }
        }
        $DEPTHLEVEL++;                # restore the budget on the way back up
    }
}

recursiveurls(\%url);
There are of course many other issues with your planned operation. E.g.: two different urls may well refer to the same page, so you could have duplicates in your hash structure (and probably still a possibility of closed circuits).
Also, I haven't tested the handling of [tt]$DEPTHLEVEL[/tt] above (no time at the moment), so check it carefully.
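As for [tt]geturl[/tt], you can probably reuse the LWP::Simple / HTML::LinkExtor code you already have; a rough, untested sketch might be:
Code:
use LWP::Simple;
use HTML::LinkExtor;

# rough sketch: fetch the page, extract its links, drop media/binary files
# and non-http links, remove duplicates and return a reference to the list
sub geturl {
    my ($url) = @_;
    my $content = get($url);
    return [] unless defined $content;    # page could not be fetched
    my $p = HTML::LinkExtor->new();
    $p->parse($content)->eof;
    my (%seen, @out);
    for my $link ($p->links) {
        my $href = $$link[2];
        next if $href =~ m/\.(js|css|png|wav|mp3|rm|mpg|bmp|jpg|rar|tar|zip|tif|gif|mp4)$/;
        $href =~ s/\/$//;
        next unless $href =~ /^http:/;
        push @out, $href unless $seen{$href}++;
    }
    return \@out;
}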


Franco
 
Hi, thank you very much for taking the time to do this, but I've decided to take a different route on the advice of a programming friend. I now store hash references in an array; each hash has a URL, an ID number and a parent ID, making it easier to trace a path all the way back to the base URL. If you're interested, have a look:

use strict;
use Data::Dumper;
use warnings;
use Sort::Array qw(Discard_Duplicates);
use LWP::Simple;
use HTML::LinkExtor;

my @collected_stuff;

my %hash = (url => "http://www.myspace.com",
id => 1,
parent => 0
);

push (@collected_stuff, {%hash});

my $p = HTML::LinkExtor->new();

my $page = 0;
open(LOG, ">log.txt");

my $i = 1; my $i2 = ($i + 1);

while (scalar(@collected_stuff) < 250)

{my $ref = \$collected_stuff[$page]{url};

my $key = $$ref;

my $content = get($key);

$p->parse($content)->eof;

my @links = $p->links;

my @links1 = ();

foreach my $link (@links)
{if ($$link[2] =~ m/\.(js|css|png|wav|mp3|rm|mpg|bmp|jpg|rar|tar|zip|tif|gif|mp4)$/g) {next;}
push (@links1, $$link[2]);
}

my @links2 = ();

foreach my $link1 (@links1)
{$link1=~ s/\/$//g;
if ($link1 =~ /^(http:)/g)
{push (@links2, $link1);}
}

@links2 = Discard_Duplicates
(
empty_fields => 'delete',
data => \@links2,
);

$page++;

foreach my $link2 (@links2)
{$collected_stuff[$i] = { url => $link2,
id => $i2,
parent => $page
};
$i++; $i2++;
};

foreach my $idx (0..$#collected_stuff)
{my $ref_hash = $collected_stuff[$idx];
foreach my $name (keys %$ref_hash)
{print LOG $name . " " . $$ref_hash{$name} . "\n";
}
print LOG "\n\n";
};


};
close (LOG);

There are still a couple of issues with keeping the IDs correct and with duplicate links, but I will iron these out gradually; overall this seems a much more forgiving approach!
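For example, once the IDs are straightened out, tracing any entry back to the base URL should just be a matter of following the parent IDs; something like this hypothetical trace_to_base helper (untested, and it assumes each id ends up being its array index plus one):
Code:
# hypothetical helper: walk the parent ids back up to the base URL and
# return the chain of URLs, base first
sub trace_to_base {
    my ($idx) = @_;
    my @path;
    while (defined $collected_stuff[$idx]) {
        unshift @path, $collected_stuff[$idx]{url};
        my $parent_id = $collected_stuff[$idx]{parent};
        last if $parent_id == 0;      # parent 0 marks the base URL
        $idx = $parent_id - 1;        # ids are 1-based, the array is 0-based
    }
    return @path;
}

print join(" -> ", trace_to_base(10)), "\n";    # e.g. the 11th collected URL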
 
This is, in my opinion, an old-style solution, from the times when programming languages had no list-processing capabilities and the only structure-handling capability they had was linear arrays, mirroring the linear structure of computer memory. Then C pointers came with indirect addressing, and all the rest followed, up to arrays of arrays and hashes of hashes that effectively handle tree-like and other complex structures for you.
Apart from that, these are some other comments of varying importance on your code:
-you should put code lines between [tt][code][/code][/tt] tags (and use [tt][ignore][/ignore][/tt] tags to avoid the annoying interpretation of something like $link1 as a link to another forum)
-the line
[tt]push (@collected_stuff, {%hash});[/tt]
is worth a remark: you [tt]push[/tt] a reference to a new hash that duplicates [tt]%hash[/tt], so subsequent changes to [tt]%hash[/tt] would not be reflected in it (though [tt]%hash[/tt] is never reused in your code; see the short example after this list)
-closing semicolons after [tt]foreach[/tt] and [tt]while[/tt] blocks are not required, and it is a good habit not to use them (see a recent discussion on that in this forum)
-the use of hash references for the array [tt]@collected_stuff[/tt] is not very efficient, as the contained hashes always have the same three keys; it is much more efficient to use a reference to an array of three elements:
Code:
$collected_stuff[$i] = [$link2, $i2, $page];
-in lines like
[tt]$link1 =~ s/\/$//g;[/tt]
the [tt]g[/tt] option is unnecessary (and may even carry a small performance penalty)
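To illustrate the copy-versus-reference remark above, a tiny example with dummy data:
Code:
use strict;
use warnings;

my %hash = (url => 'http://www.example.com', id => 1, parent => 0);

my $copy  = {%hash};    # reference to a NEW anonymous hash copied from %hash
my $alias = \%hash;     # reference to %hash itself

$hash{id} = 99;

print $copy->{id},  "\n";    # still 1: the copy does not see the change
print $alias->{id}, "\n";    # 99: the reference follows %hash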

Franco
 