Opening a File on a Website in Text Format?

sdslrn123 · Jun 12, 2006

This is more of a thought exercise than anything, as I am trying to understand how Perl interacts with the web.

If I have a website with a list of film titles
-each film title, when clicked, opens a new page
--each page contains info about the film as well as a link to a text document with the script of the film (which can be opened or saved)

I want to grab all the text from one specific actor in a specific film.

Usually, I would just ask the user to save the files to the same folder as the program. But is there a way, if the user just inputs a film title at the command line, to automatically check whether such a FILM_NAME exists by having the program check:

http://www.**************.com/FILM_NAME

If it does exist, the program would then automatically open the text file on that web page, search through it and extract the necessary data for a specific actor?

Sorry, if this is a weird problem!

PaulTEG · Jun 12, 2006

Search CPAN and look for LWP and WWW::Mechanize
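
A minimal sketch of the idea with LWP::Simple, under two assumptions that do not come from the real site: the script for a film is served as plain text at BASE_URL/FILM_NAME.txt, and each speech in it starts with the character's name followed by a colon. The sub names fetch_script and lines_for_actor are made up for illustration. LWP::Simple's get returns undef when the page cannot be fetched, which doubles as the "does this film exist" check.

```perl
use strict;
use warnings;

# Fetch the script text for a film. The URL layout is an assumption:
# $base/$film.txt. Returns undef if the page cannot be fetched.
sub fetch_script {
    my ($base, $film) = @_;
    require LWP::Simple;    # loaded only when a fetch is actually requested
    return LWP::Simple::get("$base/$film.txt");
}

# Collect every speech by one character. Assumed script format:
# "NAME: what they say", one speech per line.
sub lines_for_actor {
    my ($text, $actor) = @_;
    my @speeches;
    for my $line (split /\n/, $text) {
        push @speeches, $1 if $line =~ /^\s*\Q$actor\E\s*:\s*(.*)$/i;
    }
    return @speeches;
}

# Usage: perl film.pl BASE_URL FILM_NAME ACTOR
if (@ARGV == 3) {
    my ($base, $film, $actor) = @ARGV;
    my $text = fetch_script($base, $film);
    die "No script found for $film\n" unless defined $text;
    print "$_\n" for lines_for_actor($text, $actor);
}
```

Keeping the fetch and the text search in separate subs means the search can be tried on a saved file first, before any web access is involved.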

Paul
------------------------------------
Spend an hour a week on CPAN, helps cure all known programming ailments ;-)

raklet · Jun 12, 2006

Not a weird problem at all. As Paul suggested, LWP and WWW::Mechanize are the way to go - or for an even simpler approach, you could probably use HTML::TokeParser::Simple. For example, I wanted to retrieve all of the quotes found on

http://www.quotationspage.com

I wrote a script that would browse through all of the pages on the site and extract the text of the quotes. Here is my script for instructional use.

Code:

use strict;
use HTML::TokeParser::Simple;

my @letters = qw(A B C D E F G H I J K L M N O P Q R S T U V W X Y Z);
my $savePath = "C:/temp/quotes.txt";
open (OUT, ">>$savePath") or die "Can't open $savePath: $!";

foreach my $letter (@letters) {
    my $baseUrl = "http://www.quotationspage.com/quotes/$letter.html";
    my $parent_parser = HTML::TokeParser::Simple->new( url => $baseUrl );
    my $parent_pr;
    while ( my $parent_token = $parent_parser->get_token ) {
        if (   $parent_token->is_tag('div')
            && $parent_token->get_attr('class') eq 'authorrow' )
        {
            $parent_pr = 1;
            next;
        }
        if ( $parent_pr && $parent_token->is_tag('a') ) {
            my $authorUrl =
              "http://www.quotationspage.com" . $parent_token->get_attr('href');
            my $author = $parent_token->get_attr('href');
            $author =~ /\/quotes\/(.*?)\//;
            $author = $1;
            $author =~ s/_/ /g;
            my $child_parser =
              HTML::TokeParser::Simple->new( url => $authorUrl );
            my $child_pr;
            my $quote;
            while ( my $child_token = $child_parser->get_token ) {
                if (   $child_token->is_tag('dt')
                    && $child_token->get_attr('class') eq 'quote' )
                {
                    $child_pr = 1;
                    next;
                }
                if ( $child_pr && $child_token->is_text ) {
                    $quote .= $child_token->as_is;
                    next;
                }
                else {
                    if ( $child_token->is_end_tag('dt') ) {
                        $child_pr = 0;
                        print "$quote|| $author\n\n";
                        print OUT "$quote|| $author\n";
                        $quote = undef;
                        next;
                    }
                }
            }

        }
        else {
            if ( $parent_token->is_end_tag('div') ) {
                $parent_pr = 0;
            }
        }
    }
}

sdslrn123 · Jun 12, 2006

Thanks Guys. I appreciate the info and just the knowledge that it is possible. I hate running down a tunnel at warp speed only to be told it is blocked at the other end and I should have tried the other tunnel!

I'll let you know how I get on tomorrow.

My problem is I am using ActiveState. Do I have to save the modules in a certain way?

kre1973 · Jun 12, 2006

raklet, I don't have HTML::TokeParser::Simple installed at my company. I'm trying to use HTML::TokeParser instead on a similar site on my company's intranet, but I'm getting the following error:
Can't call method "get_token" on an undefined value at C:\TESTING\PERL\dailyopsWEB.pl at line 14

Here is my code:

Code:

#!c:/perl/bin/Perl.exe
use strict;
use HTML::TokeParser;

my @letters = qw(a b c d e f g h i j k l m n o p q r s t u v w x y z);
my $savePath = "C:/temp/quotes.txt";
open (OUT, ">>$savePath");

foreach my $letter (@letters) {
    my $baseUrl = "http://mutualnet.nml.com/cs/csquickguide.htm/#$letter";
    my $parent_parser = HTML::TokeParser->new($baseUrl );
	print $parent_parser;
    my $parent_pr;
    while ( my $parent_token = $parent_parser->get_token ) {
        if (   $parent_token->is_tag('div')
            && $parent_token->get_attr('class') eq 'authorrow' )
        {
            $parent_pr = 1;
            next;
        }
        if ( $parent_pr && $parent_token->is_tag('a') ) {
            my $authorUrl =
              "http://mutualnet.nml.com" . $parent_token->get_attr('href');
            my $author = $parent_token->get_attr('href');
            $author =~ /\/quotes\/(.*?)\//;
            $author = $1;
            $author =~ s/_/ /g;
            my $child_parser =
              HTML::TokeParser->new( $authorUrl );
            my $child_pr;
            my $quote;
            while ( my $child_token = $child_parser->get_token ) {
                if (   $child_token->is_tag('dt')
                    && $child_token->get_attr('class') eq 'quote' )
                {
                    $child_pr = 1;
                    next;
                }
                if ( $child_pr && $child_token->is_text ) {
                    $quote .= $child_token->as_is;
                    next;
                }
                else {
                    if ( $child_token->is_end_tag('dt') ) {
                        $child_pr = 0;
                        print "$quote|| $author\n\n";
                        print OUT "$quote|| $author\n";
                        $quote = undef;
                        next;
                    }
                }
            }

        }
        else {
            if ( $parent_token->is_end_tag('div') ) {
                $parent_pr = 0;
            }
        }
    }
}

Here are some samples of the URLs:

http://mutualnet.nml.com/cs/csquickguide.htm#a

http://mutualnet.nml.com/cs/csquickguide.htm#b

http://mutualnet.nml.com/cs/csquickguide.htm#c

http://mutualnet.nml.com/cs/csquickguide.htm#d

thanks
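
A likely cause of that error, for what it's worth: HTML::TokeParser->new expects a file name, a filehandle, or a reference to a string of HTML. Given a URL it tries to open a local file of that name, fails, and returns undef, hence "Can't call method get_token on an undefined value". The url => ... form is a convenience of HTML::TokeParser::Simple only. Two other things differ as well: plain HTML::TokeParser tokens are array references rather than objects, so methods such as is_tag are not available, and the #a part of a URL is a fragment that is never sent to the server, so all 26 URLs return the same page. Below is a sketch under those assumptions; the sub name links_in and the HTML snippet are made up and stand in for the intranet page.

```perl
use strict;
use warnings;
use HTML::TokeParser;

# Return [href, link text] pairs found in a string of HTML.
sub links_in {
    my ($html) = @_;
    # Pass a REFERENCE to the document; a plain string is treated as a file name.
    my $parser = HTML::TokeParser->new(\$html)
        or die "Can't create parser: $!";
    my @links;
    # get_tag('a') skips ahead to the next <a ...> start tag and returns
    # an array reference: [$tag, $attr_hashref, $attr_order, $original_text].
    while (my $tag = $parser->get_tag('a')) {
        my $href = $tag->[1]{href};
        next unless defined $href;    # skip anchors such as <a name="b">
        my $text = $parser->get_trimmed_text('/a');
        push @links, [$href, $text];
    }
    return @links;
}

# Made-up sample. For a real page, fetch the HTML first, for example:
#   use LWP::Simple;  my $html = get($url);
my $sample = '<div><a href="/cs/a.htm">Accounts</a> <a name="b">B</a></div>';
print "$_->[0] => $_->[1]\n" for links_in($sample);
```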

sdslrn123 · Jun 12, 2006

Hi Guys. I worked it out.
oo Used ppm
oo Installed Package Name HTML-TokeParser-Simple NOT HTML::TokeParser::Simple
oo Worked!!

sdslrn123 · Jun 12, 2006

Okay, I just need one push...
Suppose I wanted the contents of Wikipedia (I don't really, but I am just trying to work out how this works with other websites!). I have tried the following but it does not work.

Code:

use strict;
use HTML::TokeParser::Simple;

my @letters = qw(A B C D E F G H I J K L M N O P Q R S T U V W X Y Z);
my $savePath = "C:/temp/quotes.txt";
open (OUT, ">>$savePath");

foreach my $letter (@letters) {
    my $baseUrl = "http://en.wikipedia.org/wiki/$letter.html";
    my $parent_parser = HTML::TokeParser::Simple->new( url => $baseUrl );
    my $parent_pr;
    while ( my $parent_token = $parent_parser->get_token ) {
        if (   $parent_token->is_tag('div')
            && $parent_token->get_attr('class') eq 'authorrow' )
        {
            $parent_pr = 1;
            next;
        }
        if ( $parent_pr && $parent_token->is_tag('a') ) {
            my $authorUrl =
              "http://en.wikipedia.org/wiki" . $parent_token->get_attr('href');
            my $author = $parent_token->get_attr('href');
            $author =~ /\/quotes\/(.*?)\//;
            $author = $1;
            $author =~ s/_/ /g;
            my $child_parser =
              HTML::TokeParser::Simple->new( url => $authorUrl );
            my $child_pr;
            my $quote;
            while ( my $child_token = $child_parser->get_token ) {
                if (   $child_token->is_tag('dt')
                    && $child_token->get_attr('class') eq 'quote' )
                {
                    $child_pr = 1;
                    next;
                }
                if ( $child_pr && $child_token->is_text ) {
                    $quote .= $child_token->as_is;
                    next;
                }
                else {
                    if ( $child_token->is_end_tag('dt') ) {
                        $child_pr = 0;
                        print "$quote|| $author\n\n";
                        print OUT "$quote|| $author\n";
                        $quote = undef;
                        next;
                    }
                }
            }

        }
        else {
            if ( $parent_token->is_end_tag('div') ) {
                $parent_pr = 0;
            }
        }
    }
}

Thanks again!

sdslrn123 · Jun 12, 2006

If anyone can help I would be very grateful.

raklet · Jun 12, 2006

Well, you can't just use my code out of the box. It was written specifically for quotationspage.com. If you read it carefully, you will see that it is set up to parse the tags found in the source of that site. I just gave the code as an example of how to use HTML::TokeParser::Simple. To get the info you want, you have to go to the specific page, view the source, and then write a program that parses through it according to your specifications. If you will give me an actual URL with an example of what you want extracted, I will be glad to help you with it.

sdslrn123 · Jun 12, 2006

Thanks for your help. I am more than happy to work on this, I just need another example, soooo if you are offering

If I just wanted to print the links found on the
(A,B,C,D,E....Z).html pages of WIKIPEDIA (or any other site)

e.g.

http://en.wikipedia.org/wiki/A.html

I just want the main links at the front of each page.

If you can help me with this, that would be perfect as I promise I will endeavour to work on it tonight... no sleep!

raklet · Jun 12, 2006

The link you provided wasn't a valid one, and the explanation of what you want isn't very clear, so I can't offer you a specific script. However, I will post code for another script I wrote that takes weather data from a page and builds a png image out of the data. First, look at the link in a web page to see how it looks. Then, look at the source code to see what it looks like. Finally, compare the web source to my program to see how it is parsing it.

Raklet

PaulTEG · Jun 12, 2006

rak, shine like a star, the star that you are ... ;-)

Paul
------------------------------------
Spend an hour a week on CPAN, helps cure all known programming ailments ;-)

raklet · Jun 12, 2006

sdslrn123,

Just to show you how easy this stuff is... I taught myself a new module tonight that I had no previous experience with (WWW::Mechanize) and used it to recreate the quote-retrieving script. View the web page (quotationspage.com/quotes), view the source for the web pages, run the code I post here and watch what it does. You should be able to get the hang of it. If you want me to comment the code, let me know and I will give you comment information overload.

Code:

#!C:/Perl/bin/perl.exe -w

use strict;
use Data::Dumper;
use WWW::Mechanize;
use HTML::TokeParser::Simple;
use Text::Wrap;
$| = 1;
my $mech = WWW::Mechanize->new;
$mech->get("http://www.quotationspage.com/quotes");
$mech->success or die "Can't get the quotes page";
my @index_links = $mech->find_all_links(url_regex => qr[^/quotes/.\.html])
    or die "Can't find the A-Z quotation links";
foreach my $index_link (@index_links) {
    my ($url, $index_letter) = @{$index_link};
    print "Checking authors found in the \"$index_letter\" index\n";
    $mech->get($url);
    my $html = $mech->content;
    $html =~ s/.+?(<div class=\"authorrow)/$1/si;
    $html =~ s/<br>.+?$//si;
    $mech->update_html($html);
    my @author_links = $mech->links;
    my $count = @author_links;
    print "Found $count authors for letter \"$index_letter\"\n\n";
    foreach my $author_link (@author_links) {
        my ($url, $author_name) = @{$author_link};
        print "************* $author_name **************\n";
        $mech->get($url);
        my $html = $mech->content;
        $html =~ s/.+?(<dl>)/$1/si;
        $html =~ s/<\/dl>.+?$//si;
        my $parser = HTML::TokeParser::Simple->new(\$html);
        my $pr;
        while (my $token = $parser->get_token) {
            if ($token->is_start_tag('dt')) {
                $pr = 1;
                next;
            }
            if ($pr && $token->is_text) {
                my $quote = $token->as_is;
                $Text::Wrap::columns = 72;
                $quote = wrap( '', '    ', $quote );
                print "$quote\n";
            }
            elsif ($token->is_end_tag('dt')) {
                $pr = 0;
            }
        }
    }
    last if ($index_letter eq "Z");
}

sdslrn123 · Jun 12, 2006

e.g. http://en.wikipedia.org/wiki/A.html

I don't understand why this is not a valid page?
It does not give a PAGE CANNOT BE DISPLAYED error?

sdslrn123 · Jun 13, 2006

Another thing (believe it or not, I have printed out the code on paper and am manually going through it line by line): the line I find really confusing is:

Code:

$parent_pr = 1;

and then in the next line this value is used:

Code:

if ( $parent_pr && $parent_token->is_tag('a') ) {

To me this just means 1 and parent-token.
Would it not be the same if I did

Code:

if ( $parent_token->is_tag('a') ) {

raklet · Jun 13, 2006

Sorry, I got confused by the absence of content. Yes, a page is displayed, but it tells me there is no content found for the page entered. I see that you don't care about that; you just want the links on the page. But when you say "just the main links", what do you mean by that? There are lots of links there. What qualifies as a main link?
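
One way to pin down "main link" is to write the rule as a pattern and keep only the hrefs that match it. The sketch below assumes, purely for illustration, that a main link is an article path of the form /wiki/Something with no colon in it; the sub name matching_links and the sample HTML are made up too.

```perl
use strict;
use warnings;
use HTML::TokeParser::Simple;

# Return the href of every <a> start tag whose href matches $pattern.
sub matching_links {
    my ($html, $pattern) = @_;
    my $parser = HTML::TokeParser::Simple->new(\$html);
    my @hrefs;
    while (my $token = $parser->get_token) {
        next unless $token->is_start_tag('a');
        my $href = $token->get_attr('href');
        push @hrefs, $href if defined $href && $href =~ $pattern;
    }
    return @hrefs;
}

# Assumed rule for a "main" link: an article path with no colon in it.
my $main_link = qr{^/wiki/[^:]+$};

# Made-up sample standing in for a fetched page.
my $sample = '<a href="/wiki/Alphabet">Alphabet</a>'
           . '<a href="/wiki/Special:Search">Search</a>'
           . '<a href="http://example.com/">Ad</a>';
print "$_\n" for matching_links($sample, $main_link);
```

Changing the rule then only means changing the pattern, not the parsing loop.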

raklet · Jun 13, 2006

If you just did

if ( $parent_token->is_tag('a') ) {

Then you would capture all text found for ALL 'a' tags. So, let's say that you only want to capture the quotes, but the block of html you are evaluating not only has 'a' tags with the quotes, but also 'a' tags with links to advertisements and any amount of other garbage. With just the line of code that you suggested, your script would capture all of the text associated with 'a' tags - not just the quotes.

To fix this you have to find some way to uniquely identify the quotes. For example, you find that the quote is wrapped in a tag <dt class="quote"> and that no other item in the html uses this identifier. So now you have a unique way to get at the quotes. You have to parse the html until you find this opening tag <dt class="quote">. Once you find it, you want to tell the program to start capturing text. But you do not want to just go on capturing text for the rest of the program. When you find the end tag </dt> that corresponds to the previous opening tag, you want to turn off capture until the next opening <dt> is encountered. That way you only get what you want and not the rest of the junk.

Code:

            if ($token->is_start_tag('dt')) { # quotes are inside the 'dt' tag
                $pr = 1; # found a 'dt' so start capture
                next;
            }
            if ($pr && $token->is_text) { # did we find text?  is capturing enabled?
                my $quote = $token->as_is; # we have a quote, so get it
                # do some nice formatting to the text
                $Text::Wrap::columns = 72;
                $quote = wrap( '', '    ', $quote );
                print "$quote\n";
            }
            elsif ($token->is_end_tag('dt')) { # are we at the end of the quote?
                $pr = 0;  # yep, so turn off the capture until the next quote
            }

sdslrn123 · Jun 13, 2006

Thanks again Raklet. After printing out your and Paul's code it definitely makes a lot of sense.

$pr = 1;
$pr = 0;
I still do not understand why these numbers are used. Is this a special case where 1 means something like true and 0 means something like false?

But you'll be pleased that I have made some progress... I think. Apart from one issue...

Code:

#!/usr/bin/perl

use strict;
use HTML::TokeParser::Simple;

my $savePath = "C:/temp/quotes.txt";
open (OUT, ">>$savePath") or die "Can't open $savePath: $!";

my $baseUrl = "http://www.ietf.org/rfc/rfc2396.txt"; #stating baseurl
print "$baseUrl\n";

my $parent_parser = HTML::TokeParser::Simple->new( url => $baseUrl ); #grabbing url straight from baseurl

my $parent_token = $parent_parser->get_token; #reads only the first token
print OUT $parent_token->as_is,"\n";
close OUT;

The website was the first I could find, so sorry if it is random. How do I grab all the text?
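
A single get_token call returns only the first token of the document. The usual pattern is to keep calling it until it returns undef at the end of the input. A minimal sketch on an inline string (the sub name all_text and the sample text are made up); for a real page the parser would be created with url => $baseUrl instead.

```perl
use strict;
use warnings;
use HTML::TokeParser::Simple;

# Concatenate the text of every text token in a document.
sub all_text {
    my ($html) = @_;
    my $parser = HTML::TokeParser::Simple->new(\$html);
    my $text = '';
    # get_token returns undef at the end of the input, which ends the loop.
    while (my $token = $parser->get_token) {
        $text .= $token->as_is if $token->is_text;
    }
    return $text;
}

print all_text("<p>Uniform Resource <b>Identifiers</b></p>\n");
```

Because as_is returns the text exactly as it appeared in the source, line breaks are already included and no extra "\n" is needed when printing.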

sdslrn123 · Jun 13, 2006

LOL. Sorry, my mistake. It works!! Woo Hoo.
My next challenge is creating a CGI form.
THANK YOU EVERY1 and especially Raklet and PaulTEG!
Stars All Around!

raklet · Jun 13, 2006

$pr = 1;
$pr = 0;
I still do not understand why these numbers are used. Is this a special case where 1 means something like true and 0 means something like false?

Yes, you have the idea.

if $pr is true (equal to 1 or 100, it doesn't matter; 0 and undef both count as false)
capture text
else
do not capture text
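
To make that concrete: Perl has no separate true/false type, and a condition simply tests whether a value is "true". The values undef, 0, the string '0' and the empty string are false; any other scalar is true. A small sketch that prints how a few sample values are treated (the helper name is_true is made up):

```perl
use strict;
use warnings;

# Report whether a value counts as true in a Perl condition.
sub is_true { return $_[0] ? 1 : 0 }

my @samples = (1, 100, 'yes', '0.0', 0, '0', '', undef);
for my $value (@samples) {
    my $shown = defined $value ? "'$value'" : 'undef';
    printf "%-7s is %s\n", $shown, is_true($value) ? 'true' : 'false';
}
```

This is why $pr = 1 switches capturing on and $pr = 0 switches it off, and why a $pr that was declared but never assigned (undef) also leaves capturing off.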
