
How to ensure wget output is not garbled


arunrr

Programmer
Oct 2, 2009
I am looking for a way to ensure that the wget output is not garbled.

The wget man page refers to the use of the -c option, which I am not using. I have tried with -o and without it.

Thanks for your help...
Arun
 
Hi

Could you define "garbled"? Could you show a sample? Could you tell us what the [tt]file[/tt] command says about that file? Could you tell us the URL you downloaded it from?


Feherke.
 
I have attached a sample of the output HTML.

The URL is...

The wget command I use is...
wget -nv -nd -l 1 $URL -O $DIR/tt.html -o $DIR/wget.log

Unfortunately, in this shared hosting environment, with my SSH access, I don't have access to the 'file' command.

Thanks.
 
Hi

That is weird. I was absolutely sure that [tt]wget[/tt] is able to handle content encoding, but it seems it is not. You can do one of two things:
[ul]
[li]Ask Apache to not encode the content
Code:
wget --header='Accept-encoding: identity' -O $DIR/tt.html [URL unfurl="true"]http://www.espncricinfo.com/ci/content/records/223646.html[/URL]
[/li]
[li]Decode it yourself after downloading
Code:
wget -O $DIR/tt.html.gz [URL unfurl="true"]http://www.espncricinfo.com/ci/content/records/223646.html[/URL]
gunzip $DIR/tt.html.gz
[/li]
[/ul]
For details about what happens there, read the HTTP compression article on Wikipedia.
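
If it helps, the second option can also be collapsed into a single pipeline, streaming the download straight through [tt]gunzip[/tt] (just a sketch with the same URL; it only works when the server really sends gzip, otherwise [tt]gunzip[/tt] will complain about the input):
Code:
wget -q -O - [URL unfurl="true"]http://www.espncricinfo.com/ci/content/records/223646.html[/URL] | gunzip > $DIR/tt.html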


Feherke.
 
Hello Feherke,

Great input. Thanks!!

In the first option, wouldn't using...

--header='Accept: text/html'

ensure that I get a plain HTML file every time?

Thanks,
Arun
 
Hi

Arun said:
In the first option, wouldn't using...

--header='Accept: text/html'

ensure that I get a plain HTML file every time?
[tt]Accept[/tt] says what [tt]Content-type[/tt] the user agent accepts.

[tt]Accept-encoding[/tt] says what [tt]Content-encoding[/tt] the user agent accepts. In other words, it says how the content can be delivered.

(The [tt]Content-encoding[/tt] header name is quite misleading, as it means how the content is transferred, not what the content is. The [tt]Transfer-encoding[/tt] name would be much more appropriate, but that name is already taken and denotes a completely different thing.)

See RFC 2616 for the explanation of all those.
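
A quick way to see the difference in practice is to compare the response headers for the two requests (just a sketch using the same URL; [tt]-S[/tt] makes [tt]wget[/tt] print the server's headers on stderr):
Code:
# headers when we refuse compressed delivery
wget -S -O /dev/null --header='Accept-encoding: identity' [URL unfurl="true"]http://www.espncricinfo.com/ci/content/records/223646.html[/URL] 2>&1 | grep -i 'content-'
# headers when we only state which Content-type we accept
wget -S -O /dev/null --header='Accept: text/html' [URL unfurl="true"]http://www.espncricinfo.com/ci/content/records/223646.html[/URL] 2>&1 | grep -i 'content-'
The [tt]Content-type[/tt] line should be the same in both cases; if the server compresses, only the second request is expected to come back with a [tt]Content-encoding: gzip[/tt] line.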

Feherke.
 
One more quick question. In your second option, how can I test if the file is gzipped and then...

mv tt.html tt.html.gz
gunzip tt.html.gz

The reason for this is that, on occasion, the resulting wget output is normal (not compressed).

Thanks,
Arun
 
Hi

Arun said:
how can I test if the file is gzipped
You have two options:
[ul]
[li]See what Apache says
Code:
enc="$( wget -S -O $DIR/tt.html [URL unfurl="true"]http://www.espncricinfo.com/ci/content/records/223646.html[/URL] 2>&1 | grep -i 'content-encoding' )"
expr "$enc" : '.*gzip' > /dev/null && {
  mv $DIR/tt.html $DIR/tt.html.gz
  gunzip $DIR/tt.html.gz
}
[/li]
[li]Ask [tt]file[/tt]
Code:
wget -O $DIR/tt.html http://www.espncricinfo.com/ci/content/records/223646.html
enc="$( file $DIR/tt.html )"
expr "$enc" : '.*gzip' > /dev/null && {
  mv $DIR/tt.html $DIR/tt.html.gz
  gunzip $DIR/tt.html.gz
}
[/li]
[/ul]
Note that your [tt]file[/tt] syntax may be a bit different.
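
For reference, when the download is compressed, the check in the second snippet would be looking for output roughly like this (the exact wording varies between [tt]file[/tt] versions, so treat it as an illustration only):
Code:
$ file tt.html
tt.html: gzip compressed data, from Unix
$ file tt.html   # after gunzip, or when the server did not compress
tt.html: HTML document text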

Feherke.
 
Hi

Or just see if you have [tt]curl[/tt]; it seems to handle [tt]Content-encoding[/tt]:
Code:
curl -L -o $DIR/tt.html [URL unfurl="true"]http://www.espncricinfo.com/ci/content/records/223646.html[/URL]


Feherke.
 
Hi Feherke,

The option to ask Apache works best for me.

THANKS,
Arun
 
Hi

Please note that my code does a dumb check. If you really want to use that, better implement a stricter check.

For example, if a redirect occurs (as it does in your URL's case), the HTTP headers of two (or even more) requests will be displayed one after another, so [tt]grep[/tt] will pick up any [tt]Content-encoding[/tt] header listed. So it is better to check only the headers of the response with status code 200:
Code:
enc="$( wget -S -O $DIR/tt.html [URL unfurl="true"]http://www.espncricinfo.com/ci/content/records/223646.html[/URL] 2>&1 | sed -n '/HTTP.* 200  *OK/I,${/Content-encoding/Ip}' )"
expr "$enc" : '.*gzip' > /dev/null && {
  mv $DIR/tt.html $DIR/tt.html.gz
  gunzip $DIR/tt.html.gz
}
Tested with GNU [tt]sed[/tt].

Note that handling HTTP headers with simple Unix command-line tools is not trivial. Even the above enhanced approach will fail if an HTTP header is wrapped onto multiple lines.
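
If parsing the headers gets too fragile, another possibility (just a sketch, not something from the thread, and it assumes [tt]head -c[/tt] and [tt]od[/tt] are available on that host) is to ignore the headers completely and test the first two bytes of the downloaded file for the gzip magic number [tt]1f 8b[/tt]:
Code:
wget -q -O $DIR/tt.html [URL unfurl="true"]http://www.espncricinfo.com/ci/content/records/223646.html[/URL]
# every gzip file starts with the two bytes 1f 8b
magic="$( head -c 2 $DIR/tt.html | od -An -tx1 | tr -d ' ' )"
[ "$magic" = "1f8b" ] && {
  mv $DIR/tt.html $DIR/tt.html.gz
  gunzip $DIR/tt.html.gz
}
This sidesteps redirects and wrapped headers entirely, since it only looks at the file that actually arrived.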


Feherke.
 