
Problem with Eastern European languages in reports 2

Status
Not open for further replies.

mibosoft (Programmer)
Jul 23, 2002
Hi,
I have a VFP9 web application where I want to add support for the Czech language. It all works fine except for the reports. If Czech words are entered via the web interface, they are stored in the table like this:
[Image vfp_table_wxu9sw.jpg: the Czech text as stored in the table]


When I list these words again in the web browser, it looks perfectly fine, like this:
[Image vfp_web_browser_cimoec.jpg: the same words rendered correctly in the web browser]


The problem is in the reports, where the table contents come out garbled, apart from a couple of letters, such as "Š", that do show correctly:
[Image vfp_report_hoft2d.jpg: the garbled report output]


My current codepage is 1252 and I have tried other codepages, STRCONV(), etc, but I can never get VFP to show the original texts again. Is information lost once the words are written to the table? If so, how come that the web browser is able to show it correctly? Any hints how to get the reports to show the original texts?

BR,
Micael
 
The web page is likely UTF-8. It doesn't matter what codepage your DBF uses: looking into it you see some garbled characters, but they are still the bytes of the correct UTF-8 characters, so nothing is lost.
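Editorial aside: the byte-level behaviour Olaf describes is language-agnostic, so here is a hedged sketch of it in Python (not VFP). UTF-8 bytes viewed through codepage 1252 look garbled, but round-trip losslessly:

```python
# UTF-8 bytes stored in a 1252 table look garbled in BROWSE,
# but the original text survives because the bytes are untouched.
utf8_bytes = "Plzeň".encode("utf-8")       # what the web form sends
as_cp1252 = utf8_bytes.decode("cp1252")    # what a 1252 session displays
print(as_cp1252)                           # PlzeÅˆ  (garbled, but lossless)

restored = as_cp1252.encode("cp1252").decode("utf-8")
print(restored)                            # Plzeň
```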

The report output shows that your web page input also contains some HTML entities; those will only be displayed correctly in a browser.

If you want to print the data, it has to be limited to the DBF encoding, and that encoding has to match the current codepage. Alternatively, use report controls that are capable of showing your data correctly, or don't print normal reports at all but output HTML pages to print; there's a lot you can do in CSS to make HTML printable.
If you intend to stay with VFP reports, you have to clean your data so that it a) contains no HTML entities and b) conforms to the codepage you choose. 1252 is Western European; you have a better choice with 1250, the Eastern European codepage supported on Eastern European Windows versions.

Bye, Olaf.


Olaf Doschke Software Engineering
 
Aha, I didn't realize that the data stored in the tables contains HTML entities. That's probably because the data is HTML-encoded when it is sent via POST from the browser. I guess I need to make sure that the original text is stored. I will try the 1250 codepage. The languages I need to support are Swedish, Finnish, Norwegian, English and Czech.

I need to stay with the VFP reports, as they are pretty complex and would take a lot of time to convert to something else.

Thank you for this input.
 
I guess the HTML entities already come from the browser's POST request; restricting HTML input controls to only the characters of a given codepage may not be strictly possible. When users paste in something containing UTF-8 and the web page uses, say, latin1, some characters may be converted to HTML entities so that the web page can still display them. FoxPro doesn't share that strategy.

Essentially, numeric HTML entities have the form &#xyz; where xyz is a decimal Unicode code point. It's not guaranteed to be the character that was really entered when the encoding differs from the web page's. Besides, there are named HTML entities, like &gt; for the greater-than character. Since there are many places where the codes can change, this surely will not be straightforward. If your server-side script uses functions for sanitizing user input, the HTML entities might come from there, to defuse characters used in SQL injection attacks, at the price of the data only being usable for web page output; in that case you should parameterize your insert queries instead.
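Editorial aside: a quick sketch in Python (not VFP) of what resolving those entities means. `html.unescape` turns both numeric entities (&#328;) and named ones (&gt;) back into the characters they stand for:

```python
import html

# Numeric entities carry decimal Unicode code points; named entities
# like &gt; and &aacute; are symbolic aliases.
raw = "FBŠ SLAVIA Plze&#328; &gt; n&aacute;pov&#283;da"
print(html.unescape(raw))  # FBŠ SLAVIA Plzeň > nápověda
```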

You also don't say whether your web scripts store directly into DBFs or whether this goes through a MySQL database; in the latter case there can be two further encoding changes, from web to MySQL and from MySQL to DBF.

It may be best to work with UTF-8 all the way, including how you get the text into VFP variables, and then use STRCONV() to convert from UTF-8 to the current codepage with STRCONV(utf8text, 11).
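Editorial aside: a hedged Python analogue of that STRCONV(utf8text, 11) step under codepage 1250. The incoming UTF-8 bytes are decoded, then re-encoded into the single-byte ANSI codepage:

```python
# UTF-8 POST body -> ANSI (cp1250) bytes for the DBF.
utf8_in = "začátečníkům".encode("utf-8")        # 17 bytes on the wire
ansi = utf8_in.decode("utf-8").encode("cp1250")  # 12 single-byte characters
print(len(utf8_in), len(ansi))                   # 17 12
```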

Bye, Olaf.

Olaf Doschke Software Engineering
 
I use Winsock, parse the POST buffer myself, and store directly into a VFP table. Here you can see how the text začátečníkům looks in the debugger, as an example:
[Image vfp_debug_zav2nf.png: the stored text in the debugger]


How can I treat the texts as UTF-8 all the way, as you suggest? The text above is stored as "začátečníkům" as it is now. Should I sanitize the input before storing?
 
You have to make the web page work in UTF-8 encoding. If that's not under your control, you have to dig into decoding this yourself. You need to know what encoding the web page uses.

Bye, Olaf.

Olaf Doschke Software Engineering
 
I have full control over the web page, and when I look in Chrome, the encoding is interpreted as 1252 even though I specify UTF-8 in the HTTP header:
[Image vfp_web_encoding_urvmd7.jpg: Chrome showing the detected encoding]


Changing to storing strings as UTF-8 is too risky, as I have about 1500 strings and probably do string matching here and there. I'm thinking of storing the Czech strings including the HTML entities (as that works fine in the web environment) and stripping the entities on the fly only when producing reports.
 
Mibosoft,

You can post-process the HTML entities and prepare the VFP data to be displayed using an Eastern European encoding.

This is a small function to help you with that (not fully tested, but I think it's fairly ok):

Code:
LOCAL Encoding AS Integer

m.Encoding = 238

? HTMLNumEntityToANSI("FBŠ SLAVIA Plze&#328;", m.Encoding) Font "Arial", 12, m.Encoding
? HTMLNumEntityToANSI("nápov&#283;d&#283;", m.Encoding) Font "Arial", 12, m.Encoding

FUNCTION HTMLNumEntityToANSI (Source AS String, ANSIEncoding AS Integer)

	LOCAL Encoded AS String
	LOCAL NumEntity AS String
	LOCAL Codepoint AS String

	m.Encoded = m.Source
	m.NumEntity = STREXTRACT(m.Encoded, "&#", ";", 1, 6) && flag 6 = include both delimiters, so STRTRAN removes the whole entity
	DO WHILE !EMPTY(m.NumEntity)

		m.Codepoint = BINTOC(VAL(SUBSTR(m.NumEntity, 3)), "2RS") && this is the UNICODE codepoint for the entity
		m.Encoded = STRTRAN(m.Encoded, m.NumEntity, STRCONV(m.Codepoint, 6, m.ANSIEncoding, 2)) && make it ANSI, if possible

		m.NumEntity = STREXTRACT(m.Encoded, "&#", ";", 1, 6)
	ENDDO

	RETURN m.Encoded

ENDFUNC
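Editorial aside: the same algorithm can be cross-checked in Python (a sketch, not atlopes' code): extract each &#...; token, take its decimal code point, and re-encode the result into the target ANSI codepage (cp1250 here, matching FontCharSet 238 / Eastern European):

```python
import re

def html_num_entity_to_ansi(source, codepage="cp1250"):
    # Replace each numeric entity with the character at its code point,
    # then encode the whole string into the target ANSI codepage.
    decoded = re.sub(r"&#(\d+);", lambda m: chr(int(m.group(1))), source)
    return decoded.encode(codepage, errors="replace")

print(html_num_entity_to_ansi("FBŠ SLAVIA Plze&#328;").decode("cp1250"))
# FBŠ SLAVIA Plzeň
```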
 
This is awesome! Then I can store the strings including HTML entities and use this function to get the strings used in my reports. Thanks a lot!
 
The encoding isn't just specified by the HTML; the server sends a header specifying Windows-1252, obviously.

This is something you'd need to address in the web server configuration, by defining the HTTP Content-Type header sent with every HTTP response.


Likewise, look into the enctype attribute of HTML forms.

Bye, Olaf.



Olaf Doschke Software Engineering
 
I have written the server myself, using Winsock directly, so I can change the header to UTF-8. But if I do that, texts stored in tables containing the Swedish letters åäö are no longer shown correctly.

I'm a bit confused here. What do I need to do to use UTF8 all the way from storing texts in tables, using in the web application, in reports, etc?
 
Of course you also need to convert your ANSI output to UTF-8; just sending the header does no automatic conversion. The server has no idea what the original codepage is, so you have to feed it UTF-8 and expect UTF-8 back.

The big advantage of using UTF-8 for the complete roundtrip on the browser side is that when users paste in texts or enter anything, it is UTF-8 anyway. UTF-8 allows the widest range of input.

Only part of it can be converted to codepage 1250, but you won't have HTML entities (&#999;) in the incoming data, so you are spared that conversion; you only need STRCONV(input, 11) to the then-current codepage 1250.
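Editorial aside: "only part of it can be converted" can be illustrated in Python (a sketch, not VFP). Czech letters map cleanly into codepage 1250; characters outside it do not:

```python
# Czech letters are all present in cp1250.
ok = "Plzeň ůě".encode("cp1250")
# Greek letters are not; with errors="replace" they degrade to '?'.
bad = "Ωλ".encode("cp1250", errors="replace")
print(ok, bad)  # second value is b'??'
```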

You should program in 1250; your application will run in that codepage automatically on Eastern European Windows, or otherwise via a CODEPAGE=1250 setting in config.fpw.
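Editorial aside: the config.fpw override mentioned here is a single line (a sketch; place the file where the runtime will find it, e.g. next to the EXE, and verify the result with CPCURRENT()):

```
CODEPAGE=1250
```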

Bye, Olaf.

Olaf Doschke Software Engineering
 
So I need to store all of my current strings as STRCONV(ANSI-STRING, 9) in the table to get UTF-8, and use STRCONV(TABLE-STRING, 11) in reports etc. A problem is that LOWER()/UPPER(), which I sometimes use today in web layouts, does not work on a UTF-8 string; the result is not shown correctly in the browser.
 
Mibosoft,

You can use the Windows API to perform such transformations on UTF-8 strings; just make them Unicode before calling the relevant API functions.

This example takes Dostoevsky's name in Russian in UTF-8, displays it using an ANSI encoding, then in upper case, and finally in lower case.

Code:
CLEAR

DECLARE INTEGER CharUpperBuffW IN User32 AS W32_Unicode_Upper ;
	STRING @ UnicodeString, INTEGER StringLength
DECLARE INTEGER CharLowerBuffW IN User32 AS W32_Unicode_Lower ;
	STRING @ UnicodeString, INTEGER StringLength

LOCAL Dostoevsky AS String

LOCAL DostoevskyUTF8 AS String
LOCAL DostoevskyUNICODE AS String

m.DostoevskyUTF8 = STRCONV("d094d0bed181d182d0bed0b5d0b2d181d0bad0b8d0b92c20d0a4d191d0b4d0bed18020d09cd0b8d185d0b0d0b9d0bbd0bed0b2d0b8d187", 16)

? m.DostoevskyUTF8

m.Dostoevsky = STRCONV(m.DostoevskyUTF8, 11, 204, 2)

? m.Dostoevsky Font "Arial", 12, 204

m.DostoevskyUNICODE = STRCONV(m.DostoevskyUTF8, 12)

? STRCONV(Unicode_Upper(m.DostoevskyUNICODE), 6, 204, 2) Font "Arial", 12, 204
? STRCONV(Unicode_Lower(m.DostoevskyUNICODE), 6, 204, 2) Font "Arial", 12, 204

FUNCTION Unicode_Upper (Source AS String)

	LOCAL StrBuffer AS String

	m.StrBuffer = m.Source
	W32_Unicode_Upper(@m.StrBuffer, INT(LEN(m.StrBuffer) / 2))

	RETURN m.StrBuffer

ENDFUNC

FUNCTION Unicode_Lower (Source AS String)

	LOCAL StrBuffer AS String

	m.StrBuffer = m.Source
	W32_Unicode_Lower(@m.StrBuffer, INT(LEN(m.StrBuffer) / 2))

	RETURN m.StrBuffer

ENDFUNC
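Editorial aside: the WinAPI round-trip above can be sketched in Python (not VFP), where strings are Unicode natively. Case mapping is defined per code point, not per byte, which is why UPPER()/LOWER() on raw UTF-8 bytes cannot work:

```python
# Case mapping on Unicode text, including Cyrillic, just works
# once the string is treated as code points rather than bytes.
name = "Достоевский, Фёдор Михайлович"
print(name.upper())  # ДОСТОЕВСКИЙ, ФЁДОР МИХАЙЛОВИЧ
print(name.lower())  # достоевский, фёдор михайлович
```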
 
No, you don't have to store the data in UTF-8; you just have to convert on the way from DBF to the web and convert back to ANSI when data comes in. You avoid HTML entities, so that's one problem less, and the web world works best in UTF-8.

Atlopes also gave you some workarounds for working with UTF-8 in VFP, but you don't need to go that far.

If you think HTML entities are the minor problem: the character in your DBF for a given ASC() value coming in via an HTML entity may NOT be what the user actually entered. You can't really force the web browser to work in your DBF's ANSI codepage; yes, there are those content encodings, and browsers still support them, but you can already see the problems you have with that.

If users had really entered only valid characters of the charset the HTML page is set to, the form submission would send them 1:1 and not convert anything to HTML entities. Do you really think users type the entities in that way?

Bye, Olaf.

Olaf Doschke Software Engineering
 
I have a languages table with one column per language (swe,eng,fin,nor and cze). I just did this test:
1) I transformed all these strings for all languages to UTF-8 with STRCONV(<lang>, 9).
2) I changed my server to put UTF-8 in the HTTP header (Content-Type: text/html; charset=utf-8).
3) In my reports I use STRCONV(<text from lang table>, 11) when I fetch texts from the languages table, to display them right.

Isn't this the cleanest solution?

I'm thinking of using a boolean to decide if the text fetching function shall return the string as it is (UTF8) or as strconv(<text from lang table>,11). Then I don't need to change all my reports, just set this boolean before generating them.
 
If your app has to handle multiple languages that can't be displayed with the same ANSI codepage, you're confronted with further problems. Then working as best you can in Unicode throughout may be the solution.

I assume that in step 1) you mean you actually store UTF-8 in each language column; a BROWSE will then only show correctly the parts of the texts that are plain Latin letters. If you make no conversion when you output this to the web and convert to ANSI in reports, that works, but you have the string-handling disadvantages you mentioned yourself.

I don't know how I'd work with this; the string functions for double-byte characters don't handle UTF-8 or any Unicode variant, they only handle some ANSI codepages with double-byte characters. Steven Black has described very well what an effort it is to avoid going full Unicode and do it the ANSI way, for example to enable Japanese as the language for non-Unicode programs in Windows; the Eastern European languages at least are still single-byte character sets.

You're using the right STRCONV() parameters, nothing against that. But with UTF-8 you introduce some double-byte characters that don't translate into the most general 1252 codepage, so you lose the advantage of not needing the double-byte string functions AT_C(), SUBSTRC(), etc., while you also can't use them on UTF-8 strings. And you will have no fun with BROWSE when viewing or editing the texts.

Staying with ANSI, you'd have to go the harder route of separate tables for the languages in different codepages, and your forms would still use just one current codepage, reports likewise. Regarding point 3): STRCONV() works on the assumption that <text from lang table> is in the DBF's codepage if you specify the text as table.field; when you first copy the text into a string variable, it works on the assumption that the variable contains text in the current application codepage. So for the conversion into the codepage you need for a given language, you have to ensure the application has the right codepage setting, and then it won't support all languages, just the ones reports can print with the current codepage. STRCONV() has additional parameters to set the source and target codepages, which helps especially when the source string is Unicode or UTF-8, but the report will still work in the current codepage. Strings carry no marker or metadata saying which codepage they are in.

Because of that, you cannot support all languages in a single application session. You can only override the usage of the Windows system codepage by specifying CODEPAGE=... in a config.fpw, and CPCURRENT() will then tell you that, but there is no SET CODEPAGE to let reports run in different codepages.

For that reason, overall, I think I'd split the languages DBF into separate DBFs, one per language, using the appropriate DBF codepage for each, and then at startup offer switching between the languages CPCURRENT() supports, i.e. the codepage your application process is set to. And then don't store UTF-8; work normally inside VFP, including reports, and only convert and reconvert when transitioning to the web, even if the web is your main frontend.

There is a slightly different situation, which I recently tried for the first time: ISAPI (see thread184-1797409). When you embed your web output into Apache or IIS via foxisapi.dll and write an EXE COM server for the web page output, foxisapi.dll creates a new instance of your COM server for each request. If you can manage for that to happen with the correct codepage (I have no idea how; for example, a PHP helper script would need to swap in a different config.fpw for the COM server EXE before it gets called), you could switch the codepage used by VFP and VFP reports on every web request.

Bye, Olaf.

Olaf Doschke Software Engineering
 
Thank you, Olaf, for all the input on this. Now I'm leaning toward keeping my strings as ANSI anyway and storing the Czech strings with the HTML entities included. I thought of using the HTMLNumEntityToANSI() function written by atlopes above for the reports (the non-web environment).

atlopes, could you explain:
Your example uses it like this:
? HTMLNumEntityToANSI("FBŠ SLAVIA Plze&#328;", m.Encoding) Font "Arial", 12, m.Encoding

If I use just the function call in my reports, without the "Font "Arial", 12, m.Encoding" part, some letters are not transformed correctly; for example, "ě" is displayed as "ì". How do I solve this? I use Times New Roman in all of my reports. Do I have to hard-code that?
 
I should also mention that the languages table is only for the strings that are part of the application/system. Users are also entering texts, for example team and player names. This means that Swedish, English, Finnish, Norwegian and Czech users are entering strings into the very same database tables, and these texts shall of course look OK together. For the web part, this is no problem with all strings in ANSI and the Czech strings carrying HTML entities. To make the Swedish åäö display correctly, I let the server use "Content-Type: text/html; charset=iso-8859-1" in the HTTP header.
 
Mibosoft,

The key factor is the character set value. It can be set as part of the FONT clause in a ? statement (the m.Encoding in the example you presented), or via the FontCharSet property in any control that uses a font to display data, or indirectly via the assigned script in a font / GETFONT() selection.

You may store data in your tables that matches different ANSI character sets, as long as you a) are able to identify the correct character set to display or process a particular character-based field of your table; b) do not expect to mix character sets in the same field; and c) inhibit codepage translation in the affected fields.

In the image below you can see a grid that represents these data:

Code:
CREATE CURSOR test (col1 Varchar(200) NOCPTRANS, col2 Integer)

INSERT INTO test VALUES ("Hugo, Victor", 0)
INSERT INTO test VALUES ("ÇÈæ ÇáØíÈ ãÊäÈí", 178)
INSERT INTO test VALUES ("Äîñòî¼åâñêè, Ô¼îäîð", 204)

[Image Clipboard01_uzcc6u.png: a grid displaying all three rows, each in its own script]


The simultaneous display of the different scripts that you can observe rests on the Grid's dynamic capabilities: each script has its own textbox control, set with its own FontCharSet property.
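Editorial aside: atlopes' approach can be sketched in Python (an illustration, not his code): each row keeps raw single-byte data plus its charset tag, and is decoded with its own codepage for display. The FontCharSet-to-codepage mapping below (204 = Russian → cp1251, 178 = Arabic → cp1256, 0 = ANSI → cp1252) is the standard Windows one:

```python
CHARSET_TO_CODEC = {0: "cp1252", 178: "cp1256", 204: "cp1251"}

# Raw bytes as they would sit in a NOCPTRANS field, tagged per row.
rows = [
    (b"Hugo, Victor", 0),
    ("Достоевский, Фёдор".encode("cp1251"), 204),
]
for raw, charset in rows:
    # Decode each row with its own codepage, as the grid's per-row
    # FontCharSet does visually.
    print(raw.decode(CHARSET_TO_CODEC[charset]))
```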
 