
Wget problem ...


dbeez (Technical User), Aug 15, 2005

I'm trying to run wget against a web page but it seems to be having difficulties ...
Code:
wget -r -l 2 -v -np -O raw.txt http://www.webpage.com
The command returns only this ...
Code:
root@ubuntu:/home/babo/Desktop/spider_proj # ./spider
--15:09:01--  http://www.website.com/xx/xx
           => `raw.txt'
Resolving www.website.com... 207.171.1.2
Connecting to www.website.com[207.171.1.2]:80... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: http://www.website.com/xxx/xxx/ [following]
--15:09:02--  http://www.website.com/xxx/xxx/
           => `raw.txt'
Connecting to www.website.com[207.171.1.2]:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [text/html]

    [      <=>                            ] 98,655        59.67K/s

15:09:04 (59.61 KB/s) - `raw.txt' saved [98,655]

Loading robots.txt; please ignore errors.
--15:09:04--  http://www.website.com/robots.txt
           => `raw.txt'
Connecting to www.website.com[207.171.1.2]:80... connected.
HTTP request sent, awaiting response... 404 Not Found
15:09:05 ERROR 404: Not Found.

www.website.com/xxx/xxx/index.html: No such file or directory

FINISHED --15:09:05--
Downloaded: 98,655 bytes in 1 files

It doesn't matter which website I point it at; I still get the same problem ...

Help!
 
As it says, "Loading robots.txt; please ignore errors." It successfully downloaded the web page, but, as described in the man page, it attempts to download robots.txt as well:

[tt]Wget can follow links in HTML pages and create local versions of remote web sites, fully recreating the directory structure of the original site. This is sometimes referred to as ``recursive downloading.'' While doing that, Wget respects the Robot Exclusion Standard (/robots.txt). Wget can be instructed to convert the links in downloaded HTML files to the local files for offline viewing.[/tt]

And fails.
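
For reference, a minimal recursive fetch along the lines the man page describes might look like the sketch below (example.com and the mirror/ directory are placeholders, not taken from this thread):
Code:
# Sketch only: -r recurse, -l 2 limit depth to two levels, -np never
# ascend above the starting URL, -k convert links for offline viewing,
# -P save the recreated directory tree under mirror/.
wget -r -l 2 -np -k -P mirror/ http://www.example.com/
The robots.txt request you see in your output is part of that same recursive machinery; a 404 there just means the site publishes no spider policy.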

Annihilannic.
 
Errr ... OK then.

Any idea what I can do to make it work?
Why does it fail on every site?

I've changed it to
Code:
wget -r -l 2 -np -x -R txt,jpg,gif -O raw.txt http://www.website.com

Thanks A
 
robots.txt's job is to declare a site's policy for automated spiders: a webmaster can prevent spiders from accessing certain content, or serve the content in a different format that is better suited to spiders. You should respect that convention rather than trying to circumvent it. You can safely ignore those messages.
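
As a purely illustrative sketch (not the policy of any site mentioned here), a robots.txt is just a plain-text file at the site root, something like:
Code:
# Hypothetical robots.txt: ask all spiders to stay out of /private/
# and to leave the rest of the site alone.
User-agent: *
Disallow: /private/
When wget gets a 404 for that file, it simply assumes there are no restrictions and carries on, which is why the error is harmless.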

Annihilannic.
 