converting utf-8 (unicode) characters to an extended ascii

Status
Not open for further replies.

jjohnn

Technical User
Feb 11, 2003
43
US
I want to go through a UTF-8 encoded text file, find the lines and words that contain non-ASCII characters (codes > 127), and build a table/hash of each such character and how many times it appears.
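A minimal sketch of the counting step described above, assuming the input is read through a UTF-8 decoding layer so that each non-ASCII character is seen as one character rather than several bytes. The helper name count_non_ascii and the sample text are made up for illustration.

```perl
use strict;
use warnings;

# count_non_ascii is a hypothetical helper: it reads from a filehandle that
# already decodes UTF-8 and returns a hash of character counts plus a list
# of [line number, words containing non-ASCII characters].
sub count_non_ascii {
    my ($fh) = @_;
    my (%count, @hits);
    while (my $line = <$fh>) {
        chomp $line;
        my @found = $line =~ /([^\x00-\x7F])/g;   # characters with code > 127
        next unless @found;
        $count{$_}++ for @found;
        my @words = grep { /[^\x00-\x7F]/ } split ' ', $line;
        push @hits, [ $., \@words ];
    }
    return (\%count, \@hits);
}

# In-memory sample written as raw UTF-8 bytes. For a real file, use
#   open my $fh, '<:encoding(UTF-8)', $path or die "$path: $!";
my $sample = "plain line\ncaf\xC3\xA9 na\xC3\xAFve\nr\xC3\xA9sum\xC3\xA9\n";
open my $fh, '<:encoding(UTF-8)', \$sample or die "cannot open sample: $!";
my ($count, $hits) = count_non_ascii($fh);
close $fh;

# Report code points and line numbers only, so nothing non-ASCII is printed.
printf "U+%04X seen %d time(s)\n", ord($_), $count->{$_} for sort keys %$count;
printf "line %d has %d such word(s)\n", $_->[0], scalar @{ $_->[1] } for @$hits;
```

The words themselves are kept in the returned list; printing them would need an output layer such as binmode(STDOUT, ':encoding(UTF-8)').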

Then I want to convert the file to an extended ASCII encoding. I need three separate conversions, to:

ISO 8859-1 (Latin-1)
CP1252 (ANSI Windows)
OEM-DOS 437

All three are extensions of ASCII, but they differ from each other.
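To illustrate how the three targets differ, here is a small sketch that uses Perl's core Encode module instead of a hand-written substitution. It assumes the encoding names 'iso-8859-1', 'cp1252' and 'cp437' as Encode spells them; the sample string is made up.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# "caf" followed by e-acute, written as raw UTF-8 bytes
my $utf8_bytes = "caf\xC3\xA9";
my $text = decode('UTF-8', $utf8_bytes);   # bytes -> Perl characters

for my $target ('iso-8859-1', 'cp1252', 'cp437') {
    my $out = encode($target, $text);      # characters -> bytes in the target encoding
    printf "%-10s %s\n", $target,
        join ' ', map { sprintf '%02X', ord($_) } split //, $out;
}
# iso-8859-1 63 61 66 E9
# cp1252     63 61 66 E9
# cp437      63 61 66 82
```

Latin-1 and CP1252 agree on this character, while CP437 puts it at a different byte value. For whole files the same conversion can be done with I/O layers, for example reading with '<:encoding(UTF-8)' and writing with '>:encoding(cp437)'.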

I found a regex substitution that makes a conversion, but I don't understand it:

s/([\xC2\xC3])([\x80-\xBF])/chr(ord($1)<<6&0xC0|ord($2)&0x3F)/eg;

What is the "<<" doing?
What is the "&" doing?

Thank you.
 
<< is the bitwise shift-left operator. For example, binary 1011 shifted left by one bit is 10110.

& is the bitwise and operator. It takes two binary values and returns a third, which has a 1 in each position where the corresponding bits of BOTH arguments are set, and 0 otherwise.

10111
11100 &
=====
10100

| is the bitwise or operator, which is like &, but sets each bit if EITHER value's bit is set.

00111
01100 |
=====
01111
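The three operators can be checked directly in Perl. This is just the examples above typed in as binary literals, printed back in binary:

```perl
use strict;
use warnings;

# Each result is printed in binary so it can be compared with the diagrams.
printf "%b\n",   0b1011  << 1;        # shift left by one bit: 10110
printf "%05b\n", 0b10111 & 0b11100;   # and: 10100
printf "%05b\n", 0b00111 | 0b01100;   # or:  01111
```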

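ord($2) & 0x3F keeps only the low 6 bits of the byte value of $2 and clears the top two (0x3F == 111111 binary).

ord($1) << 6 shifts the byte value of $1 six bits to the left. The & 0xC0 that follows (0xC0 == 11000000 binary) then keeps only bits 7 and 6 of the shifted value, which are the low two bits of $1. Since << binds tighter than &, and & binds tighter than |, no parentheses are needed.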

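Essentially, this line matches each two-byte UTF-8 sequence whose first byte is 0xC2 or 0xC3 (these encode the characters U+0080 through U+00FF) and replaces it with a single byte: the low six bits come from the second byte, and the high two bits come from the low two bits of the first byte. The result is the character's Latin-1 byte, so this covers only the ISO 8859-1 conversion, and any other multi-byte UTF-8 sequences are left untouched.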
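Working through one character by hand may make this clearer. The sketch below takes e-acute, which is the two bytes C3 A9 in UTF-8, splits the expression into its two halves, and then runs the original substitution on a short made-up byte string.

```perl
use strict;
use warnings;

# e-acute (U+00E9) is the two bytes C3 A9 in UTF-8
my ($lead, $cont) = (0xC3, 0xA9);

my $high = ($lead << 6) & 0xC0;   # low two bits of the lead byte, moved to bits 7 and 6
my $low  = $cont & 0x3F;          # low six bits of the continuation byte
printf "%02X | %02X = %02X\n", $high, $low, $high | $low;   # C0 | 29 = E9

# The same substitution applied to a byte string: "caf", e-acute, space, pound sign
my $bytes = "caf\xC3\xA9 \xC2\xA3";
(my $latin1 = $bytes) =~ s/([\xC2\xC3])([\x80-\xBF])/chr(ord($1)<<6&0xC0|ord($2)&0x3F)/eg;
printf "%s\n", join ' ', map { sprintf '%02X', ord($_) } split //, $latin1;
# 63 61 66 E9 20 A3
```

Note that the string must still be raw bytes for this to work. If the file was read through a UTF-8 decoding layer, the C2/C3 lead bytes are already gone and the pattern has nothing to match.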
 