
How to copy pdf file from webpage locally

Status
Not open for further replies.

qjade

Programmer
Jun 7, 2004
42
US
Hello friends,
I am trying to copy a webpage (which is just a link to a PDF file) on our intranet. I need to copy it locally so that I can schedule a task to email it to a specific recipient daily. I have done a similar job in Perl where I crawl a page and save its content, but only as plain text.
Can something like this be done even though the file is in the "funky" PDF format? Please advise on how this can be done in Perl. Should I look at another option or language instead?
Thank you for reading and I appreciate any help I can get.
 
Use WWW::Mechanize. See the following from the examples section of the Mechanize documentation. All documentation can be found here:
Steve McConnell, author of the landmark Code Complete, has put up the chapters for the 2nd edition in PDF format on his website. I needed to download them to take to Kinko's to have printed. This little script did it for me.

Code:
    #!/usr/bin/perl -w

    use strict;
    use WWW::Mechanize;

    my $start = "http://www.stevemcconnell.com/cc2/cc.htm";

    my $mech = WWW::Mechanize->new( autocheck => 1 );
    $mech->get( $start );

    my @links = $mech->find_all_links( url_regex => qr/\d+.+\.pdf$/ );

    for my $link ( @links ) {
        my $url = $link->url_abs;
        my $filename = $url;
        $filename =~ s[^.+/][];

        print "Fetching $url";
        $mech->get( $url, ':content_file' => $filename );

        print "   ", -s $filename, " bytes\n";
    }
 
Thank you for responding, raklet!

Your code seems to work perfectly (after I changed the $start value). However, it does not seem to work for my case. Does it have anything to do with the fact that my *.pdf file is located on our intranet and not on the Internet? I tried hardcoding the $url value, but still no luck.
If you have any other suggestions or advice, I am all ears. Again, just to restate: there is only one *.pdf file in this directory, and I do not need to crawl any additional sub-levels.

Thanks a million again.
 
It may have something to do with your server running on a nonstandard port (888). I found some developer threads discussing how LWP only works over port 80 in certain instances (Mechanize is built on LWP and uses the same underlying code). However, I did not find any answers that solve the problem. Good luck.
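
Since there is only one PDF and no crawling is needed, it may be easier to fetch the file directly with LWP::UserAgent and print the status line when the request fails, which should at least show whether the problem is the connection itself or an HTTP error from the server. The following is only a minimal sketch, not a confirmed fix for the port issue: the script name, the host intranet.example, the path, the 30-second timeout, and the helper names filename_from_url and fetch_pdf are all made up for illustration.

Code:
```perl
#!/usr/bin/perl
use strict;
use warnings;

# Derive a local filename from the last path segment of a URL,
# ignoring any query string or fragment. (Hypothetical helper.)
sub filename_from_url {
    my ($url) = @_;
    (my $path = $url) =~ s/[?#].*\z//s;    # drop query string / fragment
    my ($name) = $path =~ m{([^/]+)\z};    # text after the last slash
    return defined $name ? $name : 'download.pdf';
}

# Fetch one URL straight to disk and die with the status line on failure.
# (Hypothetical helper.)
sub fetch_pdf {
    my ( $url, $filename ) = @_;

    # Loaded here so the helper above can be used without LWP installed.
    require LWP::UserAgent;
    my $ua = LWP::UserAgent->new( timeout => 30 );    # timeout is an assumption

    # ':content_file' saves the response body directly to the named file,
    # so the binary PDF is never handled as text.
    my $response = $ua->get( $url, ':content_file' => $filename );

    die "Failed to fetch $url: ", $response->status_line, "\n"
        unless $response->is_success;

    return -s $filename;
}

# Example (hypothetical URL):
#   perl fetch_pdf.pl http://intranet.example:888/reports/daily.pdf
if ( my $url = shift @ARGV ) {
    my $filename = filename_from_url($url);
    my $bytes    = fetch_pdf( $url, $filename );
    print "Saved $filename ($bytes bytes)\n";
}
```

If the request fails, the message after "Failed to fetch" is the place to start: a connection error and a 401 or 404 from the server point to quite different causes.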

Raklet
 