Detect Asian Characters 2

Status
Not open for further replies.
Hi

The first document is delivered with this HTTP header:

[tt]Content-Type: text/html; charset=big5[/tt]

The Big5 character set makes it clear that this is Asian text (Big5 is an encoding for Traditional Chinese).

The second document is delivered with this HTTP header:

[tt]Content-Type: text/html; charset=utf-8[/tt]

UTF-8 is only an encoding of Unicode, so it says nothing about the language by itself. The CJK Unified Ideographs in Unicode have code points in the range U+4E00..U+9FFF. You will have to check whether the document contains any such characters.

Or perhaps I have misunderstood your problem.
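
A minimal sketch of that check in Python, assuming the page has already been fetched as raw bytes and really is UTF-8. The function name `contains_cjk_ideograph` is made up for illustration; it only looks at the U+4E00..U+9FFF block mentioned above, so kana, hangul and the CJK extension blocks are not covered.

```python
def contains_cjk_ideograph(raw: bytes) -> bool:
    """Return True if the UTF-8 document has a code point in U+4E00..U+9FFF."""
    # errors="replace" turns malformed byte sequences into U+FFFD,
    # which lies outside the range, so the scan simply continues.
    text = raw.decode("utf-8", errors="replace")
    return any(0x4E00 <= ord(ch) <= 0x9FFF for ch in text)
```

For example, `contains_cjk_ideograph("Hello \u4e2d\u6587".encode("utf-8"))` should be true, while plain ASCII input should not match.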

Feherke.
 
You do understand my problem. Language headers are not always reliable. So I would check whether the page is UTF-8 and then search for characters in the range 4E00-9FFF? Also, what does 'CJK' stand for?
 
[1] Guessing the character encoding of a page is no small task. The open-sourced Mozilla code base includes a well-regarded piece of code for it, chardet.dll by Frank Y F Tang. It has been ported to other languages (Java, Python, ...?).

[2] You can check out jchardet, the Java port, a SourceForge project available as a free download.

[3] With jchardet, the "guess" is made for you by running a command line like this, with the jar properly located in the current directory.
[tt] java -classpath chardet.jar org.mozilla.intl.chardet.HtmlCharsetDetector [ignore][/ignore][/tt]
 
The characters of the Vietnamese alphabet are spread across several noncontiguous Unicode ranges:

Basic Latin {U+0000..U+007F}
Latin-1 Supplement {U+0080..U+00FF}
Latin Extended-A, -B {U+0100..U+024F}
Latin Extended Additional {U+1E00..U+1EFF}
Combining Diacritical Marks {U+0300..U+036F}

Is there some simple way, short of regexes, to search for these Unicode characters?
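
One regex-free possibility, sketched in Python on the assumption that the page has already been decoded to a Unicode string: walk the code points and count how many fall into each of the ranges listed above. `RANGES` and `count_by_range` are hypothetical names. Basic Latin hits alone say little, since almost every page has them; hits in Latin Extended Additional are probably the more telling signal.

```python
from collections import Counter

# The ranges listed above: (name, first code point, last code point).
RANGES = [
    ("Basic Latin", 0x0000, 0x007F),
    ("Latin-1 Supplement", 0x0080, 0x00FF),
    ("Latin Extended-A, -B", 0x0100, 0x024F),
    ("Combining Diacritical Marks", 0x0300, 0x036F),
    ("Latin Extended Additional", 0x1E00, 0x1EFF),
]

def count_by_range(text: str) -> Counter:
    """Count how many characters of text fall into each listed range."""
    counts = Counter()
    for ch in text:
        cp = ord(ch)
        for name, first, last in RANGES:
            if first <= cp <= last:
                counts[name] += 1
                break  # the ranges do not overlap
    return counts
```

Calling `count_by_range("Vi\u1ec7t")` counts three Basic Latin characters and one Latin Extended Additional character.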
 
A regex would not be the appropriate tool for the search. The search must be conducted at the binary level, not at the PHP string level, which already presupposes some decoding scheme.
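
As a rough illustration of a byte-level search, here is a Python sketch that scans the raw bytes without decoding them. It assumes the bytes are UTF-8; in that encoding the block U+1E00..U+1EFF is exactly the three-byte sequences E1 B8 80 through E1 BB BF. The function name is hypothetical, and in another encoding such as Big5 the same bytes can mean something else, so the encoding still has to be settled first.

```python
def has_latin_extended_additional(raw: bytes) -> bool:
    """Scan raw UTF-8 bytes for any code point in U+1E00..U+1EFF."""
    # Indexing a bytes object yields integers, so the bytes are compared
    # directly against the UTF-8 bounds E1 B8 80 .. E1 BB BF.
    for i in range(len(raw) - 2):
        if (raw[i] == 0xE1
                and 0xB8 <= raw[i + 1] <= 0xBB
                and 0x80 <= raw[i + 2] <= 0xBF):
            return True
    return False
```

The other ranges can be handled the same way with their own byte bounds; for instance U+0300..U+036F covers CC 80 through CD AF in UTF-8.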
 
Any advice on how to do a binary search of an HTML page?
 