
How to ensure wget output is not garbled


arunrr

Programmer
Oct 2, 2009
I am looking for a way to ensure that the wget output is not garbled.

The wget man page refers to the use of the -c option, which I am not using. I have tried with -o and without it.

Thanks for your help...
Arun
 
Hi

Could you define "garbled"? Could you show a sample? Could you tell us what the [tt]file[/tt] command says about that file? Could you tell us the URL you downloaded it from?


Feherke.
 
I have attached a sample of the output HTML.

The URL is...

The wget command I use is...
wget -nv -nd -l 1 $URL -O $DIR/tt.html -o $DIR/wget.log

Unfortunately, in this shared hosting environment, with my SSH access, I don't have access to the 'file' command.

Thanks.
 
Hi

That is weird. I was absolutely sure that [tt]wget[/tt] is able to handle content encoding, but it seems it is not. You can do one of two things:
[ul]
[li]Ask Apache to not encode the content
Code:
wget --header='Accept-encoding: identity' -O $DIR/tt.html [URL unfurl="true"]http://www.espncricinfo.com/ci/content/records/223646.html[/URL]
[/li]
[li]Decode it yourself after downloading
Code:
wget -O $DIR/tt.html.gz [URL unfurl="true"]http://www.espncricinfo.com/ci/content/records/223646.html[/URL]
gunzip $DIR/tt.html.gz
[/li]
[/ul]
For details about what happens there, read the HTTP compression article on Wikipedia.
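
If it helps, the second option can also be collapsed into a single pipeline, streaming the download straight through [tt]gunzip[/tt] (just a sketch with the same URL; it only works when the server really sends gzip, otherwise [tt]gunzip[/tt] will complain about the input):
Code:
wget -q -O - [URL unfurl="true"]http://www.espncricinfo.com/ci/content/records/223646.html[/URL] | gunzip > $DIR/tt.html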


Feherke.
 
Hello Feherke,

Great input. Thanks!!

In the first option, wouldn't using...

--header='Accept: text/html'

ensure that I get a plain HTML file every time?

Thanks,
Arun
 
Hi

Arun said:
In the first option, wouldn't using...

--header='Accept: text/html'

ensure that I get a plain HTML file every time?
[tt]Accept[/tt] says what [tt]Content-type[/tt] the user agent accepts.

[tt]Accept-encoding[/tt] says what [tt]Content-encoding[/tt] the user agent accepts. In other words, it says how the content can be delivered.

(The [tt]Content-encoding[/tt] header name is quite misleading, as it means how the content is transferred, not what the content is. The [tt]Transfer-encoding[/tt] name would be much more appropriate, but that name is already taken and denotes a completely different thing.)

See RFC 2616 for the explanation of all those.
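
A quick way to see the difference in practice is to compare the response headers for the two requests (just a sketch using the same URL; [tt]-S[/tt] makes [tt]wget[/tt] print the server's headers on stderr):
Code:
# headers when we refuse compressed delivery
wget -S -O /dev/null --header='Accept-encoding: identity' [URL unfurl="true"]http://www.espncricinfo.com/ci/content/records/223646.html[/URL] 2>&1 | grep -i 'content-'
# headers when we only state which Content-type we accept
wget -S -O /dev/null --header='Accept: text/html' [URL unfurl="true"]http://www.espncricinfo.com/ci/content/records/223646.html[/URL] 2>&1 | grep -i 'content-'
The [tt]Content-type[/tt] line should be the same in both cases; if the server compresses, only the second request is expected to come back with a [tt]Content-encoding: gzip[/tt] line.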

Feherke.
 
One more quick question. In your second option, how can I test if the file is gzipped and then...

mv tt.html tt.html.gz
gunzip tt.html.gz

The reason for this is that, on occasion, the resulting wget output is normal (not compressed).

Thanks,
Arun
 
Hi

Arun said:
how can I test if the file is gzipped
You have two options:
[ul]
[li]See what Apache says
Code:
enc="$( wget -S -O $DIR/tt.html [URL unfurl="true"]http://www.espncricinfo.com/ci/content/records/223646.html[/URL] 2>&1 | grep -i 'content-encoding' )"
expr "$enc" : '.*gzip' > /dev/null && {
  mv $DIR/tt.html $DIR/tt.html.gz
  gunzip $DIR/tt.html.gz
}
[/li]
[li]Ask [tt]file[/tt]
Code:
wget -O $DIR/tt.html http://www.espncricinfo.com/ci/content/records/223646.html
enc="$( file $DIR/tt.html )"
expr "$enc" : '.*gzip' > /dev/null && {
  mv $DIR/tt.html $DIR/tt.html.gz
  gunzip $DIR/tt.html.gz
}
[/li]
[/ul]
Note that your [tt]file[/tt] syntax may be a bit different.
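
For reference, when the download is compressed, the check in the second snippet would be looking for output roughly like this (the exact wording varies between [tt]file[/tt] versions, so treat it as an illustration only):
Code:
$ file tt.html
tt.html: gzip compressed data, from Unix
$ file tt.html   # after gunzip, or when the server did not compress
tt.html: HTML document text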

Feherke.
 
Hi

Or just see if you have [tt]curl[/tt]; it seems to handle [tt]Content-encoding[/tt]:
Code:
curl -L -o $DIR/tt.html [URL unfurl="true"]http://www.espncricinfo.com/ci/content/records/223646.html[/URL]


Feherke.
 
Hi Feherke,

The option to ask Apache works best for me.

THANKS,
Arun
 
Hi

Please note that my code does a dumb check. If you really want to use that, better implement a stricter check.

For example, if a redirect occurs (as it does in your URL's case), the HTTP headers of two (or even more) requests will be displayed one after another, so [tt]grep[/tt] will pick up any [tt]Content-encoding[/tt] header listed. So it is better to check only the headers of the response with status code 200:
Code:
enc="$( wget -S -O $DIR/tt.html [URL unfurl="true"]http://www.espncricinfo.com/ci/content/records/223646.html[/URL] 2>&1 | sed -n '/HTTP.* 200  *OK/I,${/Content-encoding/Ip}' )"
expr "$enc" : '.*gzip' > /dev/null && {
  mv $DIR/tt.html $DIR/tt.html.gz
  gunzip $DIR/tt.html.gz
}
Tested with GNU [tt]sed[/tt].

Note that handling HTTP headers with simple Unix command-line tools is not trivial. Even the above enhanced approach will fail if an HTTP header is wrapped onto multiple lines.
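
If parsing the headers gets too fragile, another possibility (just a sketch, not something from the thread, and it assumes [tt]head -c[/tt] and [tt]od[/tt] are available on that host) is to ignore the headers completely and test the first two bytes of the downloaded file for the gzip magic number [tt]1f 8b[/tt]:
Code:
wget -q -O $DIR/tt.html [URL unfurl="true"]http://www.espncricinfo.com/ci/content/records/223646.html[/URL]
# every gzip file starts with the two bytes 1f 8b
magic="$( head -c 2 $DIR/tt.html | od -An -tx1 | tr -d ' ' )"
[ "$magic" = "1f8b" ] && {
  mv $DIR/tt.html $DIR/tt.html.gz
  gunzip $DIR/tt.html.gz
}
This sidesteps redirects and wrapped headers entirely, since it only looks at the file that actually arrived.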


Feherke.
 