HTML to CSV 1

Status
Not open for further replies.

YOUNGCODGER

Programmer
Jul 9, 2007
102
GB
Hello,

I have written a script, using “wget”, to download a number of web pages, each of which features one or more spreadsheet-style tables. An example is:-


What I want to do is extract the first two or three columns into a CSV file, with or without the headers.

Part of the HTML is:-

datacelllefttopborder
"><span><P>0551 100</P></span></Td><Td class="datacell content
datacelllefttopborder
"><span><P>g6</P></span></Td><Td class="datacell content
datacelllefttopborder
"><span><P>2</P></span></Td><Td class="datacell content
datacelllefttopborder
"><span><P>N</P></span></Td><Td class="rightborder datacell content
rightborder datacelllefttopborder
"><span><P>N</P></span></Td></TR><TR class="datarow"><Td class="datacell content
datacelllefttopborder
"><span><P>0551 107</P></span></Td><Td class="datacell content
datacelllefttopborder
"><span><P>g21</P></span></Td><Td class="datacell content

And what I want to achieve is:-

‘0500,no fee
‘0551100,g6
‘0551107,g21
etc.
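
For illustration only, here is one way the extraction described above might be sketched in Python, using just the standard library's html.parser and csv modules. It is not taken from the thread. The names TableRows and html_to_csv are hypothetical, and it assumes that every data row is a <TR> holding <Td> cells as in the fragment above, that the space inside the first column should be dropped, and that the leading quote mark in the wanted output is a plain apostrophe. Header cells marked up as <th> would be ignored by this sketch.

```python
import csv
import io
from html.parser import HTMLParser


class TableRows(HTMLParser):
    """Collect the text of each <td> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []      # finished rows, each a list of cell strings
        self._row = None    # cells of the row being read, or None
        self._cell = None   # text pieces of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        # html.parser lower-cases tag names, so <Td>/<TR> arrive as td/tr.
        if tag == "tr":
            self._row = []
        elif tag == "td":
            if self._row is None:   # cell seen before any <tr>
                self._row = []
            self._cell = []

    def handle_endtag(self, tag):
        if tag == "td" and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            if self._row:           # ignore rows without any <td> cells
                self.rows.append(self._row)
            self._row = None
            self._cell = None

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)


def html_to_csv(html, columns=2, skip_rows=0, prefix="'"):
    """Return the first `columns` cells of each table row as CSV text.

    `prefix` is put in front of the first column (assumed here to be a
    plain apostrophe); `skip_rows` drops leading rows such as headers.
    """
    parser = TableRows()
    parser.feed(html)
    parser.close()
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    for row in parser.rows[skip_rows:]:
        cells = row[:columns]
        if cells:
            # Assumption: "0551 100" should become "0551100".
            cells[0] = prefix + cells[0].replace(" ", "")
            writer.writerow(cells)
    return out.getvalue()


if __name__ == "__main__":
    # Made-up, well-formed sample in the style of the fragment above.
    sample = (
        '<table><TR class="datarow">'
        '<Td class="datacell content"><span><P>0551 100</P></span></Td>'
        '<Td class="datacell content"><span><P>g6</P></span></Td>'
        '<Td class="datacell content"><span><P>2</P></span></Td>'
        '</TR></table>'
    )
    print(html_to_csv(sample), end="")
```

Passing columns=3 would keep the third column as well, and skip_rows=1 would leave out a first row if the headers turn out to be ordinary <Td> cells.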

Thanks in anticipation,

YoungCodger [bigglasses]
 
Hi

What have you tried so far ?

Personally I would take a look at the output of the various text-mode browsers ( [tt]lynx[/tt], [tt]links[/tt], [tt]elinks[/tt], [tt]w3m[/tt] ) with the -dump option. It may be easier to parse.

Feherke.
 
Many thanks for this. "Links" seems to work best on this family of sites. The only irritation is that a few cells that are OK on screen wrap around in the dump, creating extra lines that are otherwise blank! There is another site that I want data from, and it seems to mess up whichever browser one uses.

Regards,

YoungCodger [bigglasses]
 
Hi

YoungCodger said:
The only irritation is that a few cells that are OK on screen wrap around on the dump creating extra lines that are blank otherwise!
You can reduce the wrapping somewhat with the -width option. Sadly my [tt]links[/tt] accepts width values only up to 512.

Feherke.
 