Converting latin characters

cfsponge · Sep 15, 2006

I'm having some difficulty getting Latin characters, converted from either HTML entities or Word, to stay converted. I'm using a Replace that works the first time, but subsequent form submissions corrupt the characters. A few letters, like á and é, always work. I'll use the letter ú for my example.

I have 2 possible filters to apply when converting for database storage:
sText = Replace(sText, "ú", "&uacute;")
sText = Replace(sText, "Ãº", "&uacute;")

However, the character never keeps its HTML entity value when I retrieve it from the database. I get lots of additional question marks and other odd characters. Can anyone provide some assistance with this?

Oh, and I'm using UTF-8 so the charset isn't the problem.
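
For readers wondering where the "Ãº" in the second filter comes from, here is a minimal sketch in Python (not ASP, used only to show the bytes). It assumes the server's default code page is Windows-1252, which the thread does not state; the helper name `repair` is hypothetical.

```python
# Assumption: the server reads incoming bytes as Windows-1252 (cp1252).
original = "ú"

# The browser sends the character as UTF-8: two bytes, 0xC3 0xBA.
utf8_bytes = original.encode("utf-8")

# Reading those two bytes as Windows-1252 yields two characters instead of one.
misread = utf8_bytes.decode("cp1252")


def repair(text):
    """Undo one round of UTF-8-read-as-Windows-1252 damage.

    Raises UnicodeError if the text was not damaged in this way.
    """
    return text.encode("cp1252").decode("utf-8")


repaired = repair(misread)
print(utf8_bytes, misread, repaired)
```

Seen this way, the second Replace is a per-character repair of text that was decoded with the wrong code page; fixing the code page removes the need for it.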

tsuji · Sep 18, 2006

This is an issue full of intricacy. Here are some notes for further investigation.

[1] Stability of ú (as an instance)
Stability here means that a form element containing ú can be sent to the server and then written back to the user agent (with response.write or some equivalent device) without being altered.

[1.1] On the client side, this is done by giving the form's HTML page the meta tag
[tt]<meta http-equiv="content-type" content="text/html; charset=utf-8">[/tt]

[1.2] On the server side, the page that the form data is submitted to has its code page set to 65001 (the Windows code page identifier for UTF-8)
[tt]session.codepage=65001[/tt]

[1.3] Before it does response.write request("input_name"), the ASP page (or the equivalent in another server-side technology) should set
[tt]response.charset="utf-8"[/tt]

[1.4] With the above client-->server-->client communication, the ú should be stable and show up on the user agent correctly.
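
The round trip in [1.1] to [1.4] can be sketched in Python as a simplified model, not as a trace of what IIS does internally. The function names `round_trip` and `submit_n_times` are hypothetical, and Windows-1252 is only an assumed example of a mismatched server code page.

```python
def round_trip(text, server_codepage):
    """Model one form submission: the browser sends UTF-8 bytes and the
    server decodes them using server_codepage."""
    sent = text.encode("utf-8")
    # errors="replace" substitutes U+FFFD for bytes the code page cannot map.
    return sent.decode(server_codepage, errors="replace")


def submit_n_times(text, server_codepage, n):
    """Feed the server's view of the text back into the form n times."""
    for _ in range(n):
        text = round_trip(text, server_codepage)
    return text


# Matching code page (the session.codepage=65001 case): the text is stable.
stable = submit_n_times("ú", "utf-8", 3)

# Mismatched code page (assumed cp1252): each submission adds more debris.
once = submit_n_times("ú", "cp1252", 1)
twice = submit_n_times("ú", "cp1252", 2)

print(repr(stable), repr(once), repr(twice))
```

In this model the text grows with every mismatched submission, which is consistent with a replace that appears to work once and then fails on later submissions.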

[2] If the server side also does some processing with a database, that adds a new complication.

[2.1] In that case, the request("input_name") data containing ú should be mapped to UCS-2. My naive way of doing it is this.
[tt]x=unescape(replace(escape(request("input_name")),"%FA","%U77C7"))[/tt]
x is then ready for querying or otherwise interacting with most databases.

[2.2] If the ú is something we pull out of one of the databases described in the article in [3], the same mapping is applied in the reverse direction.
[tt]y=unescape(replace(escape(rs("some_field")),"%U77C7","%FA"))[/tt]
y is then ready to be written with response.write to the page served to the client, with the charset again set to utf-8.

[2.3] If it all sounds tangled, that is only because I'm not good enough with the high-powered jargon to say it better. It probably sounds more involved than it really is.
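
The same pattern of converting on the way into the database and reversing on the way out also applies to the entity Replace in the original question. As a hedged sketch in Python (not ASP), the per-character filters could be generalized as follows; the function names `to_entities` and `from_entities` are hypothetical, and the sketch assumes the text has already been decoded correctly.

```python
import html
from html.entities import codepoint2name


def to_entities(text):
    """Replace every non-ASCII character with an HTML entity.

    Named entities are used where HTML 4 defines one, numeric references
    otherwise. ASCII characters, including & and <, are left untouched.
    """
    out = []
    for ch in text:
        cp = ord(ch)
        if cp < 128:
            out.append(ch)
        elif cp in codepoint2name:
            out.append("&" + codepoint2name[cp] + ";")
        else:
            out.append("&#%d;" % cp)
    return "".join(out)


def from_entities(text):
    """Reverse direction: turn entities back into characters."""
    return html.unescape(text)


stored = to_entities("Perú")
restored = from_entities(stored)
print(stored, restored)
```

Because the stored form is pure ASCII, it passes through any code page unchanged, at the cost of storing markup rather than text in the database.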

[3] This MSDN article may help you see the big picture for the case where a database is involved on the server side.

http://support.microsoft.com/default.aspx?scid=kb;en-us;232580

You are probably using CF rather than ASP, but the essentials should still be the same.

cfsponge · Sep 18, 2006

Thank you for the excellent resource. When you say "dbase", are you referring to any database used to store the information? I want to make sure you don't specifically mean the dBase type.

tsuji · Sep 18, 2006

I just meant those listed in the reference article:
[tt]"Microsoft Windows NT, SQL Server, Java, COM, and the SQL Server ODBC driver and OLEDB provider all internally represent Unicode data as UCS-2."[/tt]
It is impossible to know every possible tool, and I just don't have enough resources at my disposal... somebody else may know better.

cfsponge · Sep 18, 2006

After doing all the above research, all I had to do was add session.codepage=65001 to the ASP pages that receive the form submissions. Since I was already using utf-8 for my character set, this was great. Again, thanks for your help.
