LWP with subroutines - little correction needed

Status
Not open for further replies.

rimbaud1964 (Programmer, DE), Aug 25, 2006
Hello all,

Good day, dear folks at Tek-Tips. It took me more than 20 minutes to sign up, but I am finally here and happy.

I am pretty new to Perl, and I want to solve some tasks for a study I am currently working on. For that I need a spider that runs against a forum, in order to get a "copy" of the board's categories 17 and 3.

The spider code in this thread is obviously a great idea. The demonstration is very impressive and makes me think that Perl is very, very powerful. It was written for a single category, so my question is: can I apply it to both categories of the board? Those two categories are all I am interested in, nothing more.

I want to discuss a small change here. The minimal change consists of changing
Code:
my $url = "http://www.nukeforums.com/forums/viewforum.php?f=17";
my $ua = LWP::RobotUA->new('forum-spider/0.1', 'you@example.com'); # placeholder agent name and contact address; RobotUA requires both
my $lp = HTML::LinkExtor->new(\&wanted_links);
my @links;
get_threads($url);
foreach my $page (@links) {
    ...
}
to

Code:
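my $ua = LWP::RobotUA->new('forum-spider/0.1', 'you@example.com'); # placeholder agent name and contact address; RobotUA requires both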
my $lp = HTML::LinkExtor->new(\&wanted_links);
my @links;
foreach my $forum_id (17, 3) {
    my $url = "http://www.nukeforums.com/forums/viewforum.php?f=$forum_id";
    @links = ();  # yuck!
    my $links = get_threads($url);
    foreach my $page (@$links) {
        ...
    }
}
As this shows, the change exposes the problem with the global variable @links:
we are forced to declare and reset a variable that should be local to get_threads. Here's the fix:

Code:
#!/usr/bin/perl
use strict;
use warnings;
use LWP::RobotUA;
use HTML::LinkExtor;
use HTML::TokeParser;
use URI::URL;
use Data::Dumper; # for show and troubleshooting
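my $ua = LWP::RobotUA->new('forum-spider/0.1', 'you@example.com'); # placeholder agent name and contact address; RobotUA requires both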
foreach my $forum_id (17, 3) {
    my $url = "http://www.nukeforums.com/forums/viewforum.php?f=$forum_id";
    my $links = get_threads($url);
    foreach my $page (@$links) {
        ...
    }
}
sub get_thread {
    ...
}
sub get_threads {
    my $url = shift;    # index page to fetch
    my @links;          # now local to this sub
    my $lp = HTML::LinkExtor->new(sub {
        my($tag, %attr) = @_;
        return unless exists $attr{'href'};
        return if $attr{'href'} !~ /^viewtopic\.php\?t=/;
        push @links, values %attr;
    });
    my $request = HTTP::Request->new(GET => $url);
    my $response = $ua->request($request, sub {$lp->parse($_[0])});
    # Expand URLs to absolute ones
    my $base = $response->base;
    return [ map { url($_, $base)->abs } @links ];
}
Discussion:
With these changes the code should run against both categories in full.

Question: can I get the results for the two forum categories mentioned above, including the threads stored in them? I guess that I need subroutines.

I look forward to hearing from you, and from all the other readers here.

Regards
 
Same reply as in the other forum:

Your post is really much too long; probably nobody is going to read all that. My question to you is: have you tried the code? Did it work or not? Were there any error messages if it did not work?
 
Hello KevinADC,

Many thanks for the reply. I appreciate your help and am very happy that you answered! To be frank, I have not tested the script so far. I want to test-run it on Sunday, since I have no Linux machine here and no Perl installed on this Windows box.

I have to admit that I am a Perl novice without much experience, but I am willing to learn. For now I have to solve some tasks for college: I have to investigate a board where I have no access to the database.

First, let me explain. I have to grab some data out of a phpBB board in order to do some field research. The forum is run by a user community, and I need the data to analyze the discussions.

To give an example, let us take this forum. How can I grab all the data out of it, store it locally, and afterwards put it into the database of a local phpBB forum? Is this possible?
http://www.nukeforums.com/forums/viewforum.php?f=17

Nothing harmful, nothing bad, nothing serious or dangerous. But I have to get the data, and in an almost complete form. So I need all the data, such as:

username
forum
thread
topic
text of the posting, and so on.

How can I do that? These are the two forums:

http://www.nukeforums.com/forums/viewforum.php?f=3
http://www.nukeforums.com/forums/viewforum.php?f=17




Code:
#!/usr/bin/perl
use strict;
use warnings;

use LWP::RobotUA;
use HTML::LinkExtor;
use HTML::TokeParser;
use URI::URL;

use Data::Dumper; # for show and troubleshooting

my $url = "http://www.nukeforums.com/forums/viewforum.php?f=17";
my $ua = LWP::RobotUA->new('forum-spider/0.1', 'you@example.com'); # placeholder agent name and contact address; RobotUA requires both
my $lp = HTML::LinkExtor->new(\&wanted_links);

my @links;
get_threads($url);

foreach my $page (@links) { # this loops over each link collected from the index
	my $r = $ua->get($page);
	if ($r->is_success) {
		my $stream = HTML::TokeParser->new(\$r->content) or die "Parse error in $page: $!";
		# just printing what was collected
		print Dumper get_thread($stream);
		# would instead have database insert statement at this point
	 } else {
		warn $r->status_line;
	 }
}

sub get_thread {
	my $p = shift;
	my ($title, $name, @thread);
	while (my $tag = $p->get_tag('a','span')) {
		if (exists $tag->[1]{'class'}) {
			if ($tag->[0] eq 'span') {
				if ($tag->[1]{'class'} eq 'name') {
					$name = $p->get_trimmed_text('/span');
				} elsif ($tag->[1]{'class'} eq 'postbody') {
					my $post = $p->get_trimmed_text('/span');
					push @thread, {'name'=>$name, 'post'=>$post};
				}
			} else {
				if ($tag->[1]{'class'} eq 'maintitle') {
					$title = $p->get_trimmed_text('/a');
				}
			}
		}
	}
	return {'title'=>$title, 'thread'=>\@thread};
}

sub get_threads {
	my $page = shift;
	my $r = $ua->request(HTTP::Request->new(GET => $url), sub {$lp->parse($_[0])});
	# Expand URLs to absolute ones
	my $base = $r->base;
	return [map { $_ = url($_, $base)->abs; } @links];
}

sub wanted_links {
	my($tag, %attr) = @_;
	return unless exists $attr{'href'};
	return if $attr{'href'} !~ /^viewtopic\.php\?t=/;
	push @links, values %attr;
}



If you have the necessary modules installed and run it from the command line, you'll see output such as the following:



Code:
$VAR1 = {
          'thread' => [
                        {
                          'post' => 'Hello, I\'m pretty new to PHPNuke. I\'ve got my site up and running great! I\'m now starting to make modifications, add modules etc. I\'m using the most recent RavenPHP76. I want to display the 5 most recent forum posts at the top of the forum page. I\'m not sure if this functionality is built in, if so, how to activate. Or if there is a module or block made to do this. I looked at Raven\'s Collapsing Forum block but wasn\'t crazy about the format, and I don\'t want it to be collapsable. Thanks! mopho',
                          'name' => 'mopho'
                        },
                        {
                          'post' => 'hi there',
                          'name' => 'sail'
                        },
                        {
                          'post' => 'thanks for asking this; :not very sure if i got you right; Do you want to have a feed of the last forumthreads? guess the easiest way is to go to raven and ask how he did it. hth sail.',
                          'name' => 'sail'
                        },
                        {
                          'post' => 'Thanks. i found what I was looking for. It wasn\'t so easy to find! It\'s called glance_mod. mopho',
                          'name' => 'mopho'
                        },
                        {
                          'post' => 'hi there thx',
                          'name' => 'sail'
                        },
                        {
                          'post' => 'it sound interesting - i will have also a look i google after it - and try to find out more regards sailor',
                          'name' => 'sail'
                        }
                      ],
          'title' => 'Recent Forum Posts Module'
        };



To be honest, I think the script only looped over the first index page here:
http://www.nukeforums.com/forums/viewforum.php?f=17

But I need it to loop over all of the more than 50 index pages. Therefore I need a routine here:




Code:
#!/usr/bin/perl
use strict;
use warnings;

use LWP::RobotUA;
use HTML::LinkExtor;
use HTML::TokeParser;
use URI::URL;

use Data::Dumper; # for show and troubleshooting

my $url = "http://www.nukeforums.com/forums/viewforum.php?f=17";
my $ua = LWP::RobotUA->new('forum-spider/0.1', 'you@example.com'); # placeholder agent name and contact address; RobotUA requires both
my $lp = HTML::LinkExtor->new(\&wanted_links);

my @links;
get_threads($url);

foreach my $page (@links) { # this loops over each link collected from the index
	my $r = $ua->get($page);
	if ($r->is_success) {
		my $stream = HTML::TokeParser->new(\$r->content) or die "Parse error in $page: $!";
		# just printing what was collected
		print Dumper get_thread($stream);
		# would instead have database insert statement at this point
	 } else {
		warn $r->status_line;
	 }
}


This has to go into a subroutine, doesn't it? It needs one so that the script can
loop over all the pages of the forum http://www.nukeforums.com/forums/viewforum.php?f=17


The version above does not set up a loop to grab each of the index pages, though some may consider that trivial.
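
A minimal sketch of how the list of index pages could be built, assuming the board pages its topic lists phpBB-style with a "start" offset in the URL. The helper name index_pages, the page count, and the number of topics per page are assumptions for illustration; they need to be checked against the real board before use.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: build the index-page URLs for one forum.
# Assumes paging via "&start=<offset>"; $pages and $per_page are
# guesses that must be verified against the actual board.
sub index_pages {
    my ($forum_id, $pages, $per_page) = @_;
    my $base = "http://www.nukeforums.com/forums/viewforum.php?f=$forum_id";
    return map { "$base&start=" . ($_ * $per_page) } 0 .. $pages - 1;
}

# Print the first three index pages of forum 17, assuming 50 topics per page.
my @pages = index_pages(17, 3, 50);
print "$_\n" for @pages;
```

Each of these URLs could then be handed to get_threads in turn, so that the thread links from every index page are collected and not only those from the first.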

The demonstration is very impressive and makes me think that Perl is very, very powerful. I will try to harvest these two categories of the forum (only those two are of interest to me, nothing more):

http://www.nukeforums.com/forums/viewforum.php?f=3
http://www.nukeforums.com/forums/viewforum.php?f=17


Question: can I get the results for the two forum categories mentioned above, including the threads stored in them?

KevinADC, I would love to hear from you.

Best regards,
rimbaud

PS: If I need to write a better question, let me know and I will do that.
 
Question - am I able to get the results of the above-mentioned forum categories, and can I get the forum threads that are stored in the two above forums?

In theory the answer is: probably yes. In practice the answer could be complicated, and it is really a bit much for a help forum like this.

Try to work on the problem one section at a time instead of trying to get the overall solution all at once. Start by determining whether the necessary modules are installed:

LWP::RobotUA
HTML::LinkExtor
HTML::TokeParser
URI::URL
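
One way to run that check is sketched below. It tries to load each module and reports the ones that fail; the helper name missing_modules is made up for this example.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: return the names of the modules that cannot be loaded.
sub missing_modules {
    my @missing;
    for my $module (@_) {
        # String eval so that require sees a bareword module name.
        push @missing, $module unless eval "require $module; 1";
    }
    return @missing;
}

my @missing = missing_modules(qw(LWP::RobotUA HTML::LinkExtor HTML::TokeParser URI::URL));
print @missing ? "Missing: @missing\n" : "All modules found.\n";
```

A single module can also be checked straight from the command line, for example with perl -MLWP::RobotUA -e 1, which prints nothing when the module is installed.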
 
Hello KevinADC,

Many thanks for the reply. I will follow your advice and go step by step to see which results I get.

I will come back when I know more.

Meanwhile, best regards,
rimbaud
 
Hi KevinADC,

I solved the issue with the subroutine.

If we want to loop over the URLs, we could either run the spider multiple times or put a foreach loop around the main body of the program:

my @urls = ("");  # list of forum index URLs
foreach my $url (@urls) {
    # main code
}

What do you think?

Again, many thanks for the ideas and the support.
By the way: great site here.

rimbi
 
The loop is fine. That's exactly what you need to do.
 
Hello KevinADC,

Many thanks for your great support; I really appreciate your help. I feel encouraged to dig deeper into Perl,
since you are here with your help and, let me say, "supervision" of my humble ideas.

Again, many thanks. This is a great place.

rimbi

Keep up this great place and forum.
@KevinADC - two thumbs up: you are a very, very good helper!
 
Hi all,

In order to get the data into a phpBB database, I am currently ironing out the methods that can do that.

Does it make sense to combine the Perl scripts mentioned above with some sort of PHP processing, such as the following?

preg_match, see http://www.php.net/preg_match
int preg_match ( string pattern, string subject [, array &matches [, int flags [, int offset]]] )
Searches subject for a match to the regular expression given in pattern. If matches is provided, it is filled with the results of the search: $matches[0] will contain the text that matched the full pattern, $matches[1] the text that matched the first captured parenthesized subpattern, and so on.

See the following thread: http://www.tek-tips.com/viewthread.cfm?qid=1272464&page=1

Goal: I want to store the data in a local database.
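
As a possible intermediate step that stays in Perl, the structure returned by get_thread could be flattened into one row per posting before anything is written to a database. The sketch below assumes the title/thread/name/post layout shown in the Dumper output earlier in the thread. The helper name thread_rows and the column order are illustrative only and do not follow phpBB's own tables.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: flatten one get_thread() result into rows,
# one array reference per posting, in the order they were posted.
sub thread_rows {
    my ($forum_id, $thread) = @_;
    my @rows;
    my $position = 0;
    for my $post (@{ $thread->{thread} }) {
        push @rows, [ $forum_id, $thread->{title}, ++$position,
                      $post->{name}, $post->{post} ];
    }
    return @rows;
}

# Sample data in the shape get_thread() returns (two postings only).
my $sample = {
    title  => 'Recent Forum Posts Module',
    thread => [
        { name => 'sail', post => 'hi there' },
        { name => 'sail', post => 'hi there thx' },
    ],
};

# Print each row as forum id, title, position, name, post.
for my $row (thread_rows(17, $sample)) {
    print join(' | ', @$row), "\n";
}
```

Each row could then be passed to a prepared INSERT statement through DBI placeholders, for example $sth->execute(@$row). That database step is an assumption about how the import might be written, not something settled in this thread.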

I look forward to hearing from you. Regards,

rimbi
 