
Wget problem ...


dbeez (Technical User), Aug 15, 2005

I'm trying to run wget against a web page but it seems to be having difficulties ...
Code:
wget -r -l 2 -v -np -O raw.txt http://www.webpage.com
The command returns only this ...
Code:
root@ubuntu:/home/babo/Desktop/spider_proj # ./spider
--15:09:01--  http://www.website.com/xx/xx
           => `raw.txt'
Resolving www.website.com... 207.171.1.2
Connecting to www.website.com[207.171.1.2]:80... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: http://www.website.com/xxx/xxx/ [following]
--15:09:02--  http://www.website.com/xxx/xxx/
           => `raw.txt'
Connecting to www.website.com[207.171.1.2]:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [text/html]

    [      <=>                            ] 98,655        59.67K/s

15:09:04 (59.61 KB/s) - `raw.txt' saved [98,655]

Loading robots.txt; please ignore errors.
--15:09:04--  http://www.website.com/robots.txt
           => `raw.txt'
Connecting to www.website.com[207.171.1.2]:80... connected.
HTTP request sent, awaiting response... 404 Not Found
15:09:05 ERROR 404: Not Found.

www.website.com/xxx/xxx/index.html: No such file or directory

FINISHED --15:09:05--
Downloaded: 98,655 bytes in 1 files

It doesn't matter which website I point it at; I still get the same problem ...

Help!
 
As it says, "Loading robots.txt; please ignore errors." It successfully downloaded the web page, but, as described in the man page, it attempts to download robots.txt as well:

[tt]Wget can follow links in HTML pages and create local versions of remote web sites, fully recreating the directory structure of the original site. This is sometimes referred to as ``recursive downloading.'' While doing that, Wget respects the Robot Exclusion Standard (/robots.txt). Wget can be instructed to convert the links in downloaded HTML files to the local files for offline viewing.[/tt]

And fails.
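
For reference, a minimal recursive fetch along the lines the man page describes might look like the sketch below (example.com and the mirror/ directory are placeholders, not taken from this thread):
Code:
# Sketch only: -r recurse, -l 2 limit depth to two levels, -np never
# ascend above the starting URL, -k convert links for offline viewing,
# -P save the recreated directory tree under mirror/.
wget -r -l 2 -np -k -P mirror/ http://www.example.com/
The robots.txt request you see in your output is part of that same recursive machinery; a 404 there just means the site publishes no spider policy.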

Annihilannic.
 
Errr ... OK then.

Any idea what I can do to make it work?
Why does it fail on every site?

I've changed it to
Code:
wget -r -l 2 -np -x -R txt,jpg,gif -O raw.txt http://www.website.com

Thanks A
 
robots.txt's job is to declare a site's policy for automated spiders: a webmaster can prevent spiders from accessing certain content, or serve the content in a different format that is better suited to spiders. You should respect that convention rather than trying to circumvent it. You can safely ignore those messages.
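
As a purely illustrative sketch (not the policy of any site mentioned here), a robots.txt is just a plain-text file at the site root, something like:
Code:
# Hypothetical robots.txt: ask all spiders to stay out of /private/
# and to leave the rest of the site alone.
User-agent: *
Disallow: /private/
When wget gets a 404 for that file, it simply assumes there are no restrictions and carries on, which is why the error is harmless.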

Annihilannic.
 