Spidering website

Status
Not open for further replies.

forces1

Programmer
Apr 5, 2007
29
NL
Hi all,

I've designed a simple spider for a search engine, which works like this:
Code:
$url = $q->param("url");
$sp_url = $url;
$content = get($url);
$modifylink = 'new';

  if ($content) {
    #Get the title
    $content =~ /<title>(.*)<\/title>/ig;
    $sp_title = $1;
    $sp_title =~ s/\"//g; #remove double quotes
    $sp_title =~ s/\'//g;  #remove single quotes
    #Get the description
    $content =~ /<META name=\"description\" content=\"(.*?)\">/i;
    $sp_desc = $1; 
    $sp_desc =~ s/\"//g; #remove double quotes
    $sp_desc =~ s/\'//g;  #remove single quotes
    #Get the keywords
    $content =~ /<META name=\"keywords\" content=\"(.*?)\">/i;
    $sp_keys = $1;
    $sp_keys =~ s/\"//g; #remove double quotes
    $sp_keys =~ s/\'//g;  #remove single quotes
So it gets the description, keywords and title for me, but I also want to index the content of the page, i.e. everything between the body tags. But this won't work:
Code:
#Get the body
    $content =~ /<body>(.*)<\/body>/ig;
    $sp_body = $1;
    $sp_body =~ s/\"//g; #remove double quotes
    $sp_body =~ s/\'//g;  #remove single quotes
When I try to write it to the database, it gives zero result.
Can anyone help me with this please? Thank you!
 
Without the 's' modifier on the regex for the body tag, the '.' will not match newline characters. Simply change your code to:

Code:
$content =~ m{<body>(.*)</body>}isg;

- Miller
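
To see the effect of the 's' modifier in isolation, here is a minimal sketch. The sample HTML string and the names $no_s and $with_s are made up for illustration, and the [^>]* after <body is an addition (it is not in the code above) so that a body tag carrying attributes such as bgcolor still matches.

```perl
use strict;
use warnings;

# Made-up sample page with a multi-line body and an attribute on <body>.
my $html = "<html><body bgcolor=\"white\">\nline one\nline two\n</body></html>";

# Without /s, '.' stops at newlines, so a multi-line body cannot match.
my ($no_s) = $html =~ m{<body[^>]*>(.*)</body>}i;

# With /s, '.' also matches newlines and the whole body is captured.
my ($with_s) = $html =~ m{<body[^>]*>(.*)</body>}is;

print defined $no_s ? "without /s: matched\n" : "without /s: no match\n";
print "with /s: captured ", length($with_s), " characters\n";
```

A plain <body> in the pattern would also fail on this sample, because the tag has a bgcolor attribute.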
 
Also, it would be a good idea to add error checking to your regular expressions instead of assuming that they actually match:

Code:
my $url = $q->param("url");
my $sp_url = $url;
my $content = get($url);
my $modifylink = 'new';

sub unquote {
    my ($string) = @_;
    $string =~ s/[\"\']//g; #remove quotes
    return $string;
}

my ($sp_title, $sp_desc, $sp_keys);

if ($content) {
    #Get the title
    if ($content =~ m{<title>(.*)</title>}is) {
        $sp_title = unquote($1);
    } else {
        die "No Title found";
    }

    #Get the description
    if ($content =~ m{<META name="description" content="(.*?)">}is) {
        $sp_desc = unquote($1);
    } else {
        die "No Description found";
    }

    #Get the keywords
    if ($content =~ m{<META name="keywords" content="(.*?)">}is) {
        $sp_keys = unquote($1);
    } else {
        die "No keywords found";
    }
}

- Miller
 
As a matter of interest, why remove quote characters?

Also, when parsing HTML, it's generally a good idea to use a proper tag-aware HTML parser, such as HTML::TokeParser or HTML::TokeParser::Simple, rather than trying to construct a whole pile of regexps that may not work (particularly on badly-formed HTML). For instance, many people use single quotes (or none at all) around attribute values in their HTML tags. If they do, your regexps won't match. The HTML::Parser-based modules are designed to degrade gracefully in the presence of bad code.
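
To make that concrete, here is a small sketch that runs the description regex from the earlier posts against a few equivalent ways of writing the same meta tag. The sample tags, the labels and the %result hash are made up for illustration.

```perl
use strict;
use warnings;

# The pattern from the earlier posts: name first, content second, double quotes.
my $re = qr{<META name="description" content="(.*?)">}is;

# Four ways of writing the same tag (made-up samples).
my %variants = (
    'as expected'    => '<meta name="description" content="My page">',
    'single quotes'  => "<meta name='description' content='My page'>",
    'reversed order' => '<meta content="My page" name="description">',
    'XHTML style'    => '<meta name="description" content="My page" />',
);

my %result;
for my $label (sort keys %variants) {
    # Record either the captured description or the fact that nothing matched.
    $result{$label} = $variants{$label} =~ $re ? "matched: $1" : 'no match';
    print "$label => $result{$label}\n";
}
```

Only the first variant matches; a tag-aware parser reads the same attributes from all four.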
 
Thanks all for helping, I will try your suggestions as soon as possible. Great, guys!

Question for Ishnid:
Thanks for thinking along with me. I would love to follow your advice, but where can I find an HTML parser like HTML::TokeParser, and how do I use it?

Thank you!
 

If you're working on Windows, you should have a ppm program that will allow you to install modules.

If you're on Linux, you can run:
Code:
$ perl -MCPAN -e 'install HTML::TokeParser'
 
Thanks a lot, you are such a great help. Ok, I've installed the HTML::TokeParser, it shows up in my Installed Perl Modules. So what do I do now? How does it work?
 
Also, instead of using HTML::TokeParser, you could use the module from which it is derived: HTML::Parser.


It is annoyingly complex in its interface, which is why there are so many subclasses with simpler usage. However, once mastered, it can accomplish most parsing problems quite well. There is even an example included in the distribution for extracting the title from an html page:


Again, not for everyone. But it might serve to at least check it out.

- Miller
 
Hi all,
first of all, many thanks to all of you for your help and your patience with me ;) I've tried the code from MillerH and it worked. Now I also want to try the HTML parser from ishnid.
I've installed HTML::TokeParser and opened the URL ishnid sent me. This is the code I use now to extract the title, as mentioned on the site:
Code:
    #Get the title
  use HTML::TokeParser;
  $p = HTML::TokeParser->new(shift||"index.html");
  if ($p->get_tag("title")) {
      my $title = $p->get_trimmed_text;
      print "Title: $title\n";
  }

But when I run it, I get the error:
Can't call method "get_tag" on an undefined value at metaspider.cgi line 57.

It's probably a problem with the installation of HTML::TokeParser, or a problem with the
Code:
$p = HTML::TokeParser->new(shift||"index.html");

@ MillerH: If this doesn't work, I will try your option, with HTML::Parser.

Thanks all for your great help!
 
That would be to get the parser to parse the file called 'index.html' in the current working directory. Is that the file you're trying to parse?
 
Hi,
no it's not. I want to parse the file that's called in the
Code:
my $url = $q->param("url");
But
Code:
$p = HTML::TokeParser->new(shift||"$url");
doesn't work. Then I still get the error.
 
When you pass a string to HTML::TokeParser, it treats it as the name of a file that's located on the local filesystem (see the documentation for the "new" method). It won't read it from a URL.

If you've already read from the URL, you can pass a reference to a scalar containing the actual HTML code itself, i.e.
Code:
my $content = get($url);
my $p = HTML::TokeParser->new( \$content );
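
Building on that, here is a minimal sketch that pulls the title and the meta tags out of an in-memory string in a single pass. The sample HTML and the %meta hash are made up for illustration; in the spider, $content would come from get($url) instead.

```perl
use strict;
use warnings;
use HTML::TokeParser;

# Stand-in for the page that get($url) would return (made-up sample).
my $content = <<'HTML';
<html><head>
<META CONTENT='Perl, spider' NAME="keywords">
<title>
  My "Test" Page
</title>
<meta name="description" content="A small example page">
</head><body bgcolor="white"><p>Hello</p></body></html>
HTML

# A scalar reference tells the parser to treat the string as the document itself.
my $p = HTML::TokeParser->new(\$content)
    or die "Could not create parser";

my ($title, %meta);
# Ask for both tag types at once so the order in which they appear does not matter.
while (my $tag = $p->get_tag('title', 'meta')) {
    my ($name, $attr) = @$tag;
    if ($name eq 'title') {
        # Collect the text up to </title>, with whitespace collapsed and trimmed.
        $title = $p->get_trimmed_text('/title');
    }
    elsif (defined $attr->{name}) {
        # Attribute names arrive lowercased; lowercase the value of name as well.
        $meta{ lc $attr->{name} } = $attr->{content};
    }
}

print "Title: $title\n";
print "Description: $meta{description}\n";
print "Keywords: $meta{keywords}\n";
```

Note that the keywords tag uses single quotes, uppercase attribute names and reversed attribute order, none of which matter to the parser.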
 
Thanks ishnid,
your code worked perfectly. But now I have another problem, haha. Keep getting new problems, but that may be the charm of programming.

When I use this:
Code:
my $url = $q->param("url");
my $sp_url = $url;
my $content = get($url);
my $modifylink = 'new';

sub unquote {
    my ($string) = @_;
    $string =~ s/[\"\']//g; #remove quotes
    return $string;
}

my ($titel, $sp_desc, $sp_keys);

my $content1 = get($url);

if ($content) {
    #Get the title
  use HTML::TokeParser;
  my $p = HTML::TokeParser->new( \$content1 );
  if ($p->get_tag("title")) {
      $titel = $p->get_trimmed_text;
	  }
    
    #Get the description
    if ($content =~ m{<META name="description" content="(.*?)">}is) {
        $sp_desc = unquote($1);
    }
    
    #Get the keywords
    if ($content =~ m{<META name="keywords" content="(.*?)">}is) {
        $sp_keys = unquote($1);
    }
	
	#Get the body
	if ($content =~ m{<body>(.*)</body>}is) {
        $sp_body = unquote($1);
    }

my $sth = $dbh->prepare("INSERT INTO metalink VALUES (\"\", \"$url\", \"$sp_body\", \"$titel\")") || &error("Could not insert new row.");
      $sth->execute();
      $sth->finish;
&start;
	}
it won't write the $title to the database, it stays empty. But when I change the place of the $sth, like:
Code:
if ($content) {
    #Get the title
  use HTML::TokeParser;
  my $p = HTML::TokeParser->new( \$content1 );
  if ($p->get_tag("title")) {
      $titel = $p->get_trimmed_text;
      my $sth = $dbh->prepare("INSERT INTO metalink VALUES (\"\", \"$url\", \"$sp_body\", \"$titel\")") || &error("Could not insert new row.");
      $sth->execute();
      $sth->finish;
	  }
it will write the $title to the database, but then the rest of the values are not yet set, so it will only write the $title. How can that be fixed?
 
Here is working code that uses HTML::Parser to pull out the different elements that you want to parse. As you can see, it is more complicated than a regex solution. However, it is much more stable, since it relies on the structure of the HTML rather than on the exact formatting that a strict regex expects. Feel free to use this or not.

I've also included a cleaned up version of your database operations, although it is commented out currently. There are a couple of things that I want to note:

  • Never rely on an assumed column list for an INSERT statement. Instead, always explicitly state which columns are being assigned. This is for two reasons. One, it is easier to debug SQL statements later on when the column names are explicitly stated. More importantly, if you should ever add columns to your table, implicit statements will break whereas explicit statements will continue to work. Therefore always use the INSERT syntax of "INSERT INTO tablename SET field1=?, field2=?, field3=?" (a MySQL extension; the portable equivalent is "INSERT INTO tablename (field1, field2, field3) VALUES (?, ?, ?)").
  • Always use placeholders. Never interpolate values into a SQL statement. Interpolation opens you up not only to bugs where values have not been escaped, but also to security risks if someone were to submit malicious values. With placeholders, DBI quotes the values for you and so protects against malicious input.
  • Finally, include your "or die" with the execute statement and not the prepare statement. I've got no reason for this other than just 'cuz.

Code:
use HTML::Parser ();

use strict;

my $url = "http://foo.bar";
my $content = do {local $/; <DATA>};

my ($title, $sp_desc, $sp_keys, $sp_body);

if ($content) {
    $title = get_html_title($content);
    $sp_desc = unquote(get_html_description($content));
    $sp_keys = unquote(get_html_keywords($content));
    $sp_body = get_html_body($content);

=comment
    my $sth = $dbh->prepare(qq{INSERT INTO metalink SET url=?, body=?, title=?});
    $sth->execute($url, $sp_body, $title) or error("Can't execute: " . $dbh->errstr);
    $sth->finish; undef $sth;

    &start;
=cut
}

print <<"END_DEBUG";
Url = '$url'
Title = '$title'
Desc = '$sp_desc'
Keys = '$sp_keys'
Body = '$sp_body'
END_DEBUG


#############################
### Supporting Functions

sub get_html_title {
    my ($html) = @_;

    my $title = undef;

    my $p = HTML::Parser->new(
        start_h => [sub {
            my ($self) = @_;
            $title = '';
            $self->handler(text => sub { $title .= $_[0]; }, 'dtext');
        }, 'self'],
        end_h => ['eof', 'self'],
        report_tags => ['title'],
    );
    $p->parse($html);

    return $title;
}

sub get_html_meta {
    my ($html, $meta_name) = @_;

    my $content = undef;

    my $p = HTML::Parser->new(
        start_h => [sub {
            my ($attr) = @_;
            if (exists $attr->{name} && $attr->{name} eq $meta_name) {
                $content = $attr->{content};
            }
        }, 'attr'],
        report_tags => ['meta'],
    );
    $p->parse($html);

    return $content;
}

sub get_html_description {
    my ($html) = @_;
    return get_html_meta($html, 'description');
}

sub get_html_keywords {
    my ($html) = @_;
    return get_html_meta($html, 'keywords');
}

sub get_html_body {
    my ($html) = @_;

    my $body = undef;

    my $p = HTML::Parser->new(
        start_h => [sub {
            my ($self) = @_;
            $body = '';
            $self->report_tags();
            $self->handler(start => undef);
            $self->handler(default => sub { $body .= shift; }, 'text');
            $self->handler(end => sub {
                my ($self, $tagname, $text) = @_;
                if ($tagname eq 'body') {
                    $self->eof();
                } else {
                    $body .= $text;
                }
            }, 'self,tagname,text');
        }, 'self'],
        report_tags => ['body'],
    );
    $p->parse($html);

    return $body;
}

sub unquote {
    my ($string) = @_;
    return if ! defined $string;
    $string =~ s/[\"\']//g; #remove quotes
    return $string;
}

1;

#############################
### Data Block

__DATA__
<html>
<head>
<title>My Title</title>
<meta name="description" content="My Description">
<META CONTENT="My Keywords" NAME="keywords">
</head>
<body bgcolor="white">
<b>Mary</b> had a little lamb<br />
Then the lamb ate <b>Mary</b><br />
And it was little no more<br />
</body>
</html>

------------------------------------------------------------
Pragmas (perl 5.8.8) used:
  • strict - Perl pragma to restrict unsafe constructs
Other Modules used:
  • HTML::Parser

- Miller
 
Hi Miller,
Thanks, this works perfectly! I've even filled in that it will spider a link instead of the Data Block. With the great help of you guys this script will work eventually.

The only problem is that it doesn't write to the database. So I must have filled in something wrong in this part:
Code:
=comment
    my $sth = $dbh->prepare(qq{INSERT INTO metalink VALUES SET url=$url, inhoud=$sp_body, title=$title});
    $sth->execute($url, $sp_body, $title) or error("Can't execute: " . $dbh->errstr);
    $sth->finish; undef $sth;

    &start;
=cut

Could you please tell me what I've done wrong?

Thanks a lot, Guys..
 
Yep. You need to create your placeholders with '?' characters, rather than the values that will go into them, like so:
Code:
    my $sth = $dbh->prepare(qq{INSERT INTO metalink SET url=?, inhoud=?, title=?});
    $sth->execute($url, $sp_body, $title) or error("Can't execute: " . $dbh->errstr);
    $sth->finish; undef $sth;

See this article on using placeholders properly.
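
As a self-contained illustration of placeholders, here is a minimal sketch. It assumes DBD::SQLite is installed and uses an in-memory database as a stand-in for the real one; the table layout is a guess based on the column names in the posts above, and the sample values are made up. Because SQLite does not accept the MySQL-only "INSERT ... SET" form, the sketch uses the portable column-list syntax.

```perl
use strict;
use warnings;
use DBI;

# Assumption: an in-memory SQLite database stands in for the real database.
my $dbh = DBI->connect('dbi:SQLite:dbname=:memory:', '', '',
                       { RaiseError => 1, PrintError => 0 });

# Assumption: table layout guessed from the column names used in the thread.
$dbh->do('CREATE TABLE metalink (id INTEGER PRIMARY KEY, url TEXT, inhoud TEXT, title TEXT)');

# Made-up values that contain both kinds of quote characters.
my ($url, $sp_body, $title) =
    ('http://foo.bar', q{He said "hi"; it's fine}, q{Bob's "Page"});

# One '?' per value; the values themselves are only passed to execute().
my $sth = $dbh->prepare('INSERT INTO metalink (url, inhoud, title) VALUES (?, ?, ?)');
$sth->execute($url, $sp_body, $title);

# Read the row back to confirm that the quotes survived untouched.
my ($stored_title) = $dbh->selectrow_array(
    'SELECT title FROM metalink WHERE url = ?', undef, $url);
print "Stored title: $stored_title\n";
```

Since the driver handles the quoting, there is no need to strip quote characters from the values before storing them.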
 
The database code that I provided to you was already correct except for the made up names for the fields. The only thing you had to do was give the fields their proper names and then remove the pod comment. ie: "=comment" and "=cut".

- Miller
 
Hi all,
the tips you gave me all worked out fine and it works great now. But now I want to add a new function that extracts the links from a page, so that I can spider those too. First I tried it with the code you gave me, but I've found out that it's easier with HTML::LinkExtor. Is that true? I've found two code samples for HTML::LinkExtor:
Code:
require HTML::LinkExtor;
 $p = HTML::LinkExtor->new(\&cb, "http://www.perl.org/");
 sub cb {
     my($tag, %links) = @_;
     print "$tag @{[%links]}\n";
 }
 $p->parse_file("index.html");
and
Code:
use LWP::UserAgent;
  use HTML::LinkExtor;
  use URI::URL;

  $url = "http://www.perl.org/";  # for instance
  $ua = LWP::UserAgent->new;

  # Set up a callback that collect image links
  my @imgs = ();
  sub callback {
     my($tag, %attr) = @_;
     return if $tag ne 'img';  # we only look closer at <img ...>
     push(@imgs, values %attr);
  }

  # Make the parser.  Unfortunately, we don't know the base yet
  # (it might be different from $url)
  $p = HTML::LinkExtor->new(\&callback);

  # Request document and parse it as it arrives
  $res = $ua->request(HTTP::Request->new(GET => $url),
                      sub {$p->parse($_[0])});

  # Expand all image URLs to absolute ones
  my $base = $res->base;
  @imgs = map { $_ = url($_, $base)->abs; } @imgs;

  # Print them out
  print join("\n", @imgs), "\n";
HTML::LinkExtor is already installed, and both snippets should work. Can you please help me decide which of the two is better and, more importantly, how do I add it as a function to my working script?

Thanks guys, you've been a wonderful help and I really really appreciate your help! Without you guys it would have never worked.


-Forces1
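 
Neither snippet has to be used verbatim. Since the spider already holds the page in $content, one option is to wrap HTML::LinkExtor in a small function that takes the HTML and the page URL and returns absolute links. This is only a sketch: the function name extract_links, the sample HTML and the example.com URLs are made up for illustration.

```perl
use strict;
use warnings;
use HTML::LinkExtor;

# Hypothetical helper: returns the absolute URLs of all <a href> links
# found in $html, resolved against $base_url, without duplicates.
sub extract_links {
    my ($html, $base_url) = @_;
    my (%seen, @links);
    my $p = HTML::LinkExtor->new(sub {
        my ($tag, %attr) = @_;
        # Skip images, stylesheets and other non-anchor link tags.
        return unless $tag eq 'a' && defined $attr{href};
        my $link = "$attr{href}";    # stringify the URI object
        push @links, $link unless $seen{$link}++;
    }, $base_url);
    $p->parse($html);
    $p->eof;
    return @links;
}

# Made-up sample page; in the spider this would be $content from get($url).
my $html = <<'HTML';
<a href="/about.html">About</a>
<a href="http://example.org/x">Elsewhere</a>
<img src="logo.png">
<a href="/about.html">About again</a>
HTML

my @links = extract_links($html, 'http://www.example.com/dir/index.html');
print "$_\n" for @links;
```

In the spider it could be called as extract_links($content, $url) right after the page has been fetched, and each returned link could then be queued for spidering in turn.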
 