
web_downloader.py python tool for downloading entire/portions of a website

Status: Not open for further replies.

jbrearley

Programmer
Nov 26, 2012
CA
Here is a robust utility, web_downloader.py, that will recursively walk either a single directory of the specified website or the entire website, downloading the files it finds.
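To illustrate the recursive-walk idea (this is a minimal sketch, not the actual code from web_downloader.py; the `fetch` callback and all names here are hypothetical stand-ins for the real HTTP download logic):

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkParser(HTMLParser):
    """Collect href/src attribute values from an HTML page."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name in ("href", "src") and value:
                self.links.append(value)

def walk(url, fetch, max_levels=2, seen=None):
    """Recursively collect URLs starting at `url`, descending at most
    `max_levels` subdirectory levels (compare the -levels option).
    `fetch(url)` must return the page's HTML text."""
    seen = set() if seen is None else seen
    if url in seen or max_levels < 0:
        return seen
    seen.add(url)
    parser = LinkParser()
    parser.feed(fetch(url))
    for link in parser.links:
        # Resolve relative links against the current page before recursing.
        walk(urljoin(url, link), fetch, max_levels - 1, seen)
    return seen
```

A history set (`seen`) is what prevents the walker from looping on pages that link back to each other, which is the same problem the real script's loop detection addresses.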

Options supported:
python web_downloader.py -h
usage: web_downloader.py [-h] [-all] [-checkurl] [-data] [-file FILE]
                         [-levels LEVELS] [-match MATCH] [-new]
                         [-outdir OUTDIR] [-prefix PREFIX] [-query]
                         [-replace REPLACE] [-size SIZE]
                         [-timedelay TIMEDELAY] [-website WEBSITE]
                         [-username USERNAME] [-verbose]

optional arguments:
  -h, --help            show this help message and exit
  -all                  Scrape ALL text files for more image files and
                        embedded links. This option adds significant extra
                        processing and downloading time. By default only the
                        subdirectory index.html listings are scraped.
  -checkurl             By default, the requested URL is checked against the
                        received URL to help detect loops. Errors are logged
                        in case of mismatch and recovery action is taken. This
                        option will log the issues as warnings only, with no
                        recovery action taken.
  -data                 Don't use history & queue data files. Default uses
                        these data files for faster processing and faster
                        restart of the script.
  -file FILE            Specific file to download. This option is rarely used.
                        Typically you want the entire contents of a specific
                        website directory, or possibly the entire website
                        contents.
  -levels LEVELS        Maximum website subdirectory levels to search.
  -match MATCH          Regexp matching pattern for selecting files.
  -new                  Get only new files by downloading the latest directory
                        listing for scraping; don't use the local disk copy.
                        Default is to use the local disk copy if available.
  -outdir OUTDIR        Directory to save files in. default=c:\saved_docs
  -prefix PREFIX        If needed for SSL, specify: https:// Default is:
                        http://
  -query                Don't use HTTP GET strings as part of links to be
                        followed. Default uses the GET strings in the links.
  -replace REPLACE      Replace / overwrite existing files if older than 48
                        hours. 0=always replace.
  -size SIZE            Maximum file size to download. 0=unlimited.
                        5MB=default.
  -timedelay TIMEDELAY  Maximum random time delay, in seconds, between
                        downloading each file. 0=default.
  -website WEBSITE      Website path to access. If you want just a specific
                        directory from a website, then include the directory
                        as part of the website path. localhost=default.
  -username USERNAME    Optional username to log in to the website with. The
                        script will prompt for the associated password.
  -verbose              Shows optional verbose details. Logfile will be
                        HUGE!!!
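The -match option selects files by regular expression. The filtering idea behind it can be sketched like this (a hypothetical illustration, not the script's actual code):

```python
import re

def match_filter(urls, pattern):
    """Keep only URLs that match the given regexp, mimicking the
    kind of selection the -match option performs (sketch only)."""
    rx = re.compile(pattern)
    return [u for u in urls if rx.search(u)]
```

For example, a pattern like r"\.pdf$" would restrict the download to PDF files; adding the (?i) flag makes the match case-insensitive.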

For reading large log files (> 50MB), you may want to use the utility file_tail.py, which by default will copy the last 20MB to a separate file.

web_downloader.py stores state information in .hist and .queue files. These files allow a fast restart of the script, so it can quickly pick up where it left off should it get interrupted. If you need to compact or edit these large files, the utility web_comp_files.py will be of use.
 
