Converting HTML special characters to ASCII

Status
Not open for further replies.

arcticpuppy

Technical User
Jan 14, 2003
8
US
I am parsing an HTML file that has special characters for &, !, etc., and I would like to convert these automatically for ASCII output (a tab-delimited file). Any hints?
 
How about gsub?

You may need to escape the char, but so what?
Won't that work for you?
ex:
gsub(/&/,"\t",$0)..etc..


If this doesn't work there are plenty of facilities
available in gawk (the index function, for one) that will
allow you to write user functions for the changes.

Post a data sample for more help.
 
great news about gawk but what do you call

...gsub would be tedious and would require every special character to be translated (i.e. a gsub for every accent, &, !, etc.)

sample data
Data            = Resulting Data
----              --------------
Amélie          = Amelie (with accents)
Lilo & Stitch   = Lilo & Stitch
Alien&#17       = Alien3 (cubed 3)
 
There are going to be problems with this because the characters are special.

One thing to try first is feeding gsub() the special characters in some manner:
Code:
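BEGIN {
    # Characters to replace, written as one bracket expression so that
    # $, ^, + and ? are taken literally. "-" goes last to stay literal.
    special = "[@#!$%^&_+=?:;-]"
    printable = "\t"
}

{
    # gsub() returns the number of substitutions; the changed text is in $0.
    gsub(special, printable, $0)
    print
}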
If that does not work, you are left with only a few other alternatives.
1) Trying again with hex translations of the unprintables,
which I think you are going to have to use in some cases.
sprintf() is very helpful with this.

2) Use either index or a method like this to replace every bad character:
Code:
 while ((p = index($0, badchar)) > 0) {
       # Keep everything before and after position p; dropping the length
       # argument makes substr() return the rest of the line.
       $0 = substr($0, 1, p - 1) printable substr($0, p + 1)
    }
This is totally untested, and I have found index() to
be very flaky when trying some of these solutions.

3) C might be worth looking into for this.
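
Alternative 2 can be sketched as a small awk function, wrapped here in a shell function so it can be run directly. The names replace_literal and replace_all are made up for this illustration. Unlike gsub(), index() treats the search text literally, so characters such as $ or + need no escaping. The scan also continues after each replacement, so a replacement that contains the search text does not loop forever.

```shell
# Hypothetical helper: replace every literal occurrence of $1 with $2 in string $3.
replace_literal() {
    printf '%s\n' "$3" | awk -v bad="$1" -v good="$2" '
    # Literal (non-regex) replace built on index() and substr().
    function replace_all(s, bad, good,    out, p) {
        out = ""
        while ((p = index(s, bad)) > 0) {
            out = out substr(s, 1, p - 1) good   # text before the match, then the replacement
            s = substr(s, p + length(bad))       # continue scanning after the match
        }
        return out s
    }
    { print replace_all($0, bad, good) }'
}

replace_literal '&' 'and' 'Lilo & Stitch'   # prints: Lilo and Stitch
```

Note that awk processes backslash escapes in -v assignments, so passing '\t' as the replacement yields a real tab.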

 
Sorry about the last post of data; I did not preview the post.

Unfortunately, the web nature of the forum is interpreting the ASCII text as HTML.

So note that this is what I meant to send, but ignore the spaces in the first field (I have spaced the data out so that the forum will not interpret it).

Initial html data = ascii data
D u m b & # 3 8 ; D u m b e r = Dumb & Dumber
A m & # 2 3 3 ; l i e = Amélie

What I have is HTML data and I want it as ASCII data.
gsub does not work only because there are too many variations.
 
Does each sequence you want to replace have the same
regex pattern?
That is: (&#.*;)?
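
One thing worth checking about that pattern: awk regexes match leftmost-longest, so &#.*; runs to the last semicolon on the line when a line holds more than one entity. A pattern that stops at the first semicolon, such as &#[^;]*;, avoids that. The sketch below shows the difference; first_match is a made-up helper name, and the input line is invented for the demonstration.

```shell
# Hypothetical helper: print the first match of awk regex $1 in string $2.
first_match() {
    printf '%s\n' "$2" | awk -v re="$1" '{
        # match() sets RSTART and RLENGTH for the leftmost-longest match.
        if (match($0, re)) print substr($0, RSTART, RLENGTH)
    }'
}

first_match '&#.*;'    'Dumb &#38; Dumber &#51;'   # prints: &#38; Dumber &#51;
first_match '&#[^;]*;' 'Dumb &#38; Dumber &#51;'   # prints: &#38;
```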

If so one way to handle it in awk is by building a db of
translation names in a flat file for every movie that
you have translations for.
This done, load the db_file into an awk array, gsub out
the regex pattern above and do relative comparisons
against the db_file movie names.

It won't be 100% all the time but I think it could be
close if you anticipate the removal of special
characters in your database naming scheme.
Dumb & Dumber would be Dumb and Dumber and you could
match by field to present a % match to get the best
match given identical fields.

In any case it's a large undertaking IMO.
 
Here's my attempt. The only way I could think of to convert an ascii value into the corresponding ascii character is to create a table. I just created a 2 element table to take care of the 2 values in your test data. You would have to expand it to all possible values.

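# Lookup table: numeric character code -> replacement character.
BEGIN { a[38] = "&"; a[233] = "é" }
{
    k = 1       # position in $0 just past the last entity handled
    s0 = ""     # converted output built so far
    s = $0      # part of the line still to be scanned
    j = match(s, /&#[^;]*;/)
    while (j) {
        # The digits sit between "&#" and ";".
        ix = substr(s, j+2, RLENGTH-3)
        k += j+RLENGTH-1
        # ix+0 forces a numeric index; codes missing from the table become "".
        s0 = s0 substr(s, 1, j-1) a[ix+0]
        s = substr(s, j+RLENGTH)
        j = match(s, /&#[^;]*;/)
    }
    # Append whatever follows the last entity.
    s0 = s0 substr($0, k)
    print s0
}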
~
~ CaKiwi
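
For the plain ASCII range, the table does not have to be typed by hand: awk's sprintf("%c", n) returns the character whose code is the number n. The sketch below is a variation on the script above, not a replacement for it. It fills the table for codes 32 to 126 in a loop and leaves any other entity untouched instead of dropping it. The name decode_entities is made up, and codes above 126 (such as 233) are deliberately left alone because their output depends on the awk implementation and locale.

```shell
# Hypothetical helper: decode &#NN; entities for printable ASCII on stdin.
decode_entities() {
    awk '
    BEGIN {
        # Build the lookup table for printable ASCII (32..126).
        for (n = 32; n <= 126; n++) chr[n] = sprintf("%c", n)
    }
    {
        out = ""
        s = $0
        while (match(s, /&#[0-9]+;/)) {
            code = substr(s, RSTART + 2, RLENGTH - 3) + 0
            # Codes outside the table are copied through unchanged.
            rep = (code in chr) ? chr[code] : substr(s, RSTART, RLENGTH)
            out = out substr(s, 1, RSTART - 1) rep
            s = substr(s, RSTART + RLENGTH)
        }
        print out s
    }'
}

printf '%s\n' 'Dumb &#38; Dumber' | decode_entities   # prints: Dumb & Dumber
```

Accented characters would still need explicit table entries, as in the a[233] line of the script above.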
 