
Character set compare problems

Status
Not open for further replies.

BobbaFet

Programmer
Feb 25, 2001
Hi all,

Thanks for reading my question!

First of all, some basic stats: I am using Delphi 7 Enterprise along with the Indy 9 component set. (The version 10 build you can download from the Indy website seems glitchy, hence I've switched back to version 9.)

My program uses idHTTP to collect data from various websites and puts it in one page. But now I have run into the following problem:

I have a stringlist with data that I wish to remove prior to displaying the result. So I use your basic for...to...do loop to go through both the entire mass of collected data and the list, doing case-insensitive compares with the CompareText() function. But it wouldn't remove the data even though it was clearly there. So after some research I have come to the conclusion that this is due to the code pages or character sets the various websites use, which are:

UTF8, ISO-8859-1 and ISO-8859-15

I have gotten this info from reading out the <meta http-equiv="Content-Type" content="text/html; charset=utf-8"> information. (This site for example uses UTF-8).
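For readers following along, here is a minimal sketch of pulling the charset name out of such a Content-Type value so that each page can later be converted according to its own encoding. The function name ExtractCharset is hypothetical (it is not part of Delphi or Indy), and the sketch assumes the content attribute has already been isolated from the HTML.

```delphi
program CharsetFromMeta;
{$APPTYPE CONSOLE}
uses
  SysUtils;

// Hypothetical helper: returns the lower-cased charset name found in a
// Content-Type value, or '' when no "charset=" part is present.
function ExtractCharset(const ContentValue: string): string;
const
  Key = 'charset=';
var
  P: Integer;
begin
  Result := '';
  P := Pos(Key, LowerCase(ContentValue));
  if P = 0 then
    Exit;
  // Take everything after "charset=".
  Result := Copy(ContentValue, P + Length(Key), MaxInt);
  // Cut off any further parameters.
  P := Pos(';', Result);
  if P > 0 then
    Result := Copy(Result, 1, P - 1);
  // Drop stray quotes and surrounding whitespace, then normalise the case.
  Result := StringReplace(Result, '"', '', [rfReplaceAll]);
  Result := LowerCase(Trim(Result));
end;

begin
  Writeln(ExtractCharset('text/html; charset=utf-8'));       // utf-8
  Writeln(ExtractCharset('text/html; charset=ISO-8859-15')); // iso-8859-15
end.
```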

So that is part of the problem solved, at least I know what I am dealing with.

So then I started by diving into the Delphi help and I came across idHTTP's AcceptCharSet request-header thingamajig. Having run trials with it, all the websites still return the "everything's cool n froody" status code 200 and apparently still deliver the pages in whatever encoding they originally had, instead of returning the "406 Not Acceptable" status code as they are supposed to when they can't deliver in the requested charset (according to the specification, even though it also says that supplying an unacceptable character set is acceptable :S). So that is pretty much useless to me.

Of course, I also came across the various character set switching utilities provided by Delphi such as UTF8Encode() for example.

Now I am facing the following problems:
1. What code page is Delphi using? I cannot seem to find this information.
2. What would the best way be to handle this problem? I know which website delivers the HTML-document in which charset, but now how to move ahead?

I hope I explained my problem thoroughly enough and I hope you guys can help me out because I am having a serious case of cerebral flatulence here.

Kind regards,

[bobafett] BobbaFet [bobafett]
Code:
if not Programming = 'Severe Migraine' then
                       ShowMessage('Eureka!');
 
1. What code page is Delphi using? I cannot seem to find this information.

Whatever the OS is using. Reading the Delphi online help for "code pages" will be instructive. As well, see GetOEMCP. Of course, this will vary depending on the choice of character set used in the Delphi app as well. SysUtils.TEncoding (http://docs.embarcadero.com/products/rad_studio/delphiAndcpp2009/HelpUpdate2/EN/html/delphivclwin32/SysUtils_TEncoding.html) should be of some interest as well if you have access to it. You'll have to do the conversion manually if you don't have access to that.
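As a quick way to see what "whatever the OS is using" means on a given machine, the sketch below prints the active code pages. GetOEMCP is the call named above; GetACP is its Win32 counterpart for the ANSI code page and is my addition here, on the assumption that the ANSI code page is the one that matters for Delphi 7's single-byte strings.

```delphi
program ShowCodePages;
{$APPTYPE CONSOLE}
uses
  Windows;

begin
  // ANSI code page: used by Delphi 7 AnsiString routines such as AnsiCompareText.
  Writeln('ANSI code page: ', GetACP);
  // OEM code page: used by the console and legacy DOS-style text.
  Writeln('OEM code page:  ', GetOEMCP);
end.
```

On a typical Western European Windows install this would be expected to report 1252 for the ANSI code page, but the values depend entirely on the system's regional settings.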

2. What would the best way be to handle this problem? I know which website delivers the HTML-document in which charset, but now how to move ahead?

Convert it (you can't force anything to present in a way that it isn't made to). A good start on it should be above.

It is not possible for anyone to acknowledge truth when their salary depends on them not doing it.
 
I'm not convinced your problem stems from character set differences.

One more thing to check before you go down that path:
does your for loop start at 0 and count up, or start at the last element of the stringlist and count down?

If you start at 0, then each time you delete a string, the next element moves into the deleted slot and gets skipped (i.e. remove row 3, and what was row 4 is now your row 3). This would also cause an access violation.
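A minimal sketch of the count-down approach, using made-up list contents purely for illustration. Because the loop walks from the last index to 0, deleting an entry never shifts an element that has yet to be visited.

```delphi
program DeleteBackwards;
{$APPTYPE CONSOLE}
uses
  SysUtils, Classes;

var
  Lines: TStringList;
  i: Integer;
begin
  Lines := TStringList.Create;
  try
    // Illustrative data only.
    Lines.Add('keep me');
    Lines.Add('Related Doc');
    Lines.Add('related doc');
    Lines.Add('keep me too');

    // Count down so that Delete(i) only moves items already checked.
    for i := Lines.Count - 1 downto 0 do
      if CompareText(Lines[i], 'Related Doc') = 0 then
        Lines.Delete(i);

    Writeln(Lines.Count); // 2
  finally
    Lines.Free;
  end;
end.
```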
 
It's not the loop that gives an error; I would receive an "index out of bounds" error if it was. The problem is that the string "Related Doc" in UTF8 is not the same as the OS/Delphi code page "Related Doc". Therefore I cannot do the following:

Code:
if CompareText(PseudoUTF8('Hello world!'), PseudoANSI('Hello world!')) = 0 then
    StrLst.Delete(i);

Even though the web browser, thanks to its charset support, displays them the same, they aren't the same for CompareText until after I convert them to the same charset.

[bobafett] BobbaFet [bobafett]
 
Of course I pressed post too early... I did check it with for...downto...do loops but the result is the same, hence, after said research, I came to the conclusion that this has to be a code page issue.

[bobafett] BobbaFet [bobafett]
 
Thanks btw for the advice Glenn, but unfortunately I do not have access to TEncoding :(

And,

"It is not possible for anyone to acknowledge truth when their salary depends on them not doing it."

The CO2/climate crap, I presume?

[bobafett] BobbaFet [bobafett]
 
Thanks btw for the advice Glenn, but unfortunately I do not have access to TEncoding :(

Then it can be done manually. I don't know what you're seeing (that would be helpful), so I can't make any more suggestions. As you may have gathered, there are ANSI, OEM, and Unicode characters that can be worked with in the OS & Delphi. The idea is to take the text you get from the web server (which is copied faithfully, the UTF-8 or whatever is just telling you what it is) and then convert it to whatever you are using.
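To illustrate the idea, here is a minimal sketch that assumes Delphi 7 (where string is a single-byte AnsiString) and a charset name already known per website. The helper ToAnsi and the sample strings are hypothetical. Utf8ToAnsi ships with Delphi 7, and AnsiCompareText is used instead of CompareText because CompareText only folds the case of ASCII letters. Treating ISO-8859-1/-15 text as already being ANSI is an approximation.

```delphi
program NormaliseBeforeCompare;
{$APPTYPE CONSOLE}
uses
  SysUtils;

// Hypothetical helper: bring downloaded text into the system ANSI code page.
function ToAnsi(const Raw, Charset: string): string;
begin
  if SameText(Charset, 'utf-8') then
    Result := Utf8ToAnsi(Raw)  // decode UTF-8 bytes to ANSI
  else
    Result := Raw;             // assumption: ISO-8859-x is close enough to ANSI
end;

var
  FromUtf8Page, FromLatinPage: string;
begin
  // "Café menu" as raw bytes: é is #$C3#$A9 in UTF-8 and #$E9 in ISO-8859-1.
  FromUtf8Page  := ToAnsi('Caf'#$C3#$A9' menu', 'utf-8');
  FromLatinPage := ToAnsi('caf'#$E9' MENU', 'iso-8859-1');

  // Compare only after both sides share one character set.
  if AnsiCompareText(FromUtf8Page, FromLatinPage) = 0 then
    Writeln('same text')
  else
    Writeln('different text');
end.
```

The result depends on the machine's ANSI code page. On a system using Windows-1252 the two strings would be expected to compare as equal.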

The first thing I would try is to just convert between the different Delphi character sets, depending on what the Delphi you are using supports. What I find is that most things are either Unicode or ANSI and you can go with that. Researching those char-set designators also leads me to think that. UTF-8 is Unicode, ISO-8859-? is ANSI (the two you posted differ in only eight characters, such as the euro sign, so the difference is of little concern for plain Western European text).

Code:
function WideCharToString(Source: PWideChar): string;
function StringToWideChar(const Source: string; Dest: PWideChar; DestSize: Integer): PWideChar;

Or typecasting WideString/AnsiString might get it too.
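Another route along the same lines is to decode every page to a WideString and compare there. The sketch below uses the Win32 MultiByteToWideChar call together with WideCompareText from SysUtils. The helper DecodeToWide and the constant names are hypothetical. The numeric code page identifiers (65001, 28591, 28605) are the Windows identifiers for UTF-8, ISO-8859-1 and ISO-8859-15, and the sketch assumes those code pages are installed on the machine.

```delphi
program CompareAsWide;
{$APPTYPE CONSOLE}
uses
  Windows, SysUtils;

const
  // Windows code page identifiers (constant names are illustrative).
  CPID_UTF8        = 65001;
  CPID_ISO_8859_1  = 28591;
  CPID_ISO_8859_15 = 28605;

// Hypothetical helper: decode raw bytes in the given code page to UTF-16.
function DecodeToWide(const Raw: AnsiString; CodePage: UINT): WideString;
var
  Len: Integer;
begin
  Result := '';
  if Raw = '' then
    Exit;
  // First call: ask how many wide characters the result needs.
  Len := MultiByteToWideChar(CodePage, 0, PAnsiChar(Raw), Length(Raw), nil, 0);
  if Len = 0 then
    RaiseLastOSError;
  SetLength(Result, Len);
  // Second call: perform the actual conversion into the buffer.
  MultiByteToWideChar(CodePage, 0, PAnsiChar(Raw), Length(Raw),
    PWideChar(Result), Len);
end;

var
  A, B: WideString;
begin
  A := DecodeToWide('Related Doc', CPID_UTF8);
  B := DecodeToWide('RELATED DOC', CPID_ISO_8859_1);
  // Case-insensitive comparison on the decoded text.
  if WideCompareText(A, B) = 0 then
    Writeln('match')
  else
    Writeln('no match');
end.
```

Passing CPID_ISO_8859_15 instead handles the third charset mentioned in the thread in the same way.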

If this doesn't work, you are dealing with something much more complex, and you would need to try what is described in one of these links. This is likely what TEncoding is implementing.


"It is not possible for anyone to acknowledge truth when their salary depends on them not doing it."

The CO2/climate crap, I presume?

Anything in human nature and endeavor (business, science, you name it), actually, and this has been shown evident for most of human history. If a lie or deception enables money to be made, no one generally will acknowledge the truth if it means the money goes away by doing it.

It is not possible for anyone to acknowledge truth when their salary depends on them not doing it.
 