Spidering website

Status
Not open for further replies.

forces1

Programmer
Apr 5, 2007
29
NL
Hi all,

I've designed a simple spider for a search engine, which works like this:
Code:
$url = $q->param("url");
$sp_url = $url;
$content = get($url);
$modifylink = 'new';

  if ($content) {
    #Get the title
    $content =~ /<title>(.*)<\/title>/ig;
    $sp_title = $1;
    $sp_title =~ s/\"//g; #remove double quotes
    $sp_title =~ s/\'//g;  #remove single quotes
    #Get the description
    $content =~ /<META name=\"description\" content=\"(.*?)\">/i;
    $sp_desc = $1; 
    $sp_desc =~ s/\"//g; #remove double quotes
    $sp_desc =~ s/\'//g;  #remove single quotes
    #Get the keywords
    $content =~ /<META name=\"keywords\" content=\"(.*?)\">/i;
    $sp_keys = $1;
    $sp_keys =~ s/\"//g; #remove double quotes
    $sp_keys =~ s/\'//g;  #remove single quotes
  }
So it gets the title, description and keywords for me, but I also want to index the content of the page, i.e. everything between the body tags. But this won't work:
Code:
    #Get the body
    $content =~ /<body>(.*)<\/body>/ig;
    $sp_body = $1;
    $sp_body =~ s/\"//g; #remove double quotes
    $sp_body =~ s/\'//g;  #remove single quotes
When I try to write it to the database, I get an empty result.
Can anyone help me with this please? Thank you!
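
A likely cause, assuming the fetched page has a body that spans several lines or a body tag with attributes: without the /s modifier, "." does not match a newline, and the literal <body> will not match something like <body class="main">. Below is a minimal sketch of a pattern that tolerates both. The sample page in $content is hypothetical and stands in for what get($url) returns. For real-world HTML a parser is usually more robust than a regex.

```perl
use strict;
use warnings;

# Hypothetical sample page; in the spider this would come from get($url).
my $content = <<'HTML';
<html><head><title>Demo</title></head>
<body class="main">
<p>First line</p>
<p>It's "quoted"</p>
</body></html>
HTML

my $sp_body = '';

# <body\b[^>]*> allows attributes on the opening tag,
# /s lets "." match newlines, and .*? stops at the first </body>.
if ($content =~ /<body\b[^>]*>(.*?)<\/body>/is) {
    $sp_body = $1;
    $sp_body =~ s/["']//g;   # remove double and single quotes
}

print $sp_body;
```

Testing the match in an if also avoids reusing a stale $1 from an earlier successful match when this one fails.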
 
Hi Guys,

You've been a wonderful help, and thanks to you I'm making very good progress; it's almost finished. But I have another question:
When HTML::Parser gets a menu, for example one built like this:
Code:
<table><tr><td><a href="option1.html">Option1</a></td></tr>
<tr><td><a href="option2.html">Option2</a></td></tr></table>
Then HTML::Parser filters out the table tags and the a tags, just as I told it to, so that only the raw text remains. But because of the filtering, what remains is:
Code:
Option1Option2
while I would like it to be:
Code:
Option1 Option2
How can I fix this? Can I make HTML::Parser replace the closing tag </a> with a space, for example?

Thanks a lot, guys. You've been a great help!
 
You could, but I would advise against it. Those two options look like a list, so I suggest that you treat them as such: push the text values onto an array, and once the array is built, do a join.

- Miller

PS,
It's easier to help someone when they provide their current code.
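
The push-and-join suggestion above could look roughly like this. This is only a sketch, since the poster's actual HTML::Parser code was not shown: the @chunks array, the $sp_body name, and the handler body are assumptions. It uses the version 3 API with a text handler.

```perl
use strict;
use warnings;
use HTML::Parser;

# The menu markup from the question.
my $html = <<'HTML';
<table><tr><td><a href="option1.html">Option1</a></td></tr>
<tr><td><a href="option2.html">Option2</a></td></tr></table>
HTML

my @chunks;   # one entry per non-empty text segment

my $parser = HTML::Parser->new(
    api_version => 3,
    # Only text events are handled, so all tags are dropped.
    # 'dtext' passes the text with entities decoded.
    text_h => [
        sub {
            my ($text) = @_;
            $text =~ s/^\s+|\s+$//g;             # trim surrounding whitespace
            push @chunks, $text if length $text; # skip whitespace-only segments
        },
        'dtext',
    ],
);
$parser->unbroken_text(1);   # deliver each text run as a single event

$parser->parse($html);
$parser->eof;

my $sp_body = join ' ', @chunks;
print "$sp_body\n";   # prints "Option1 Option2"
```

Keeping the segments in an array means the separator is chosen once, at join time, and whitespace-only text between tags never produces doubled spaces.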
 