Converting UTF-8 (Unicode) characters to an extended ASCII

jjohnn · Feb 27, 2003

I want to go through a text file encoded in UTF-8 and find the lines and words that contain non-ASCII characters (codes > 127), and build a table/hash of each such character and how many times it appears.

Then I want to convert the file to an extended ASCII encoding. The three separate conversions are to:

ISO 8859-1 (Latin-1)
CP1252 (ANSI Windows)
OEM-DOS 437

Each of these three is an extension of ASCII, but they differ from each other.

I found a regex substitution that makes a conversion, but I don't understand it:

s/([\xC2\xC3])([\x80-\xBF])/chr(ord($1)<<6&0xC0|ord($2)&0x3F)/eg;

What is the "<<" doing?
What is the "&" doing?

Thank you.

stelph · Jul 14, 2003

<< is the bitwise shift-left operator. For example, binary 1011 shifted left by one bit is 10110.

& is the bitwise AND operator. It takes two binary values and returns a third, in which each bit is 1 if the corresponding bits of BOTH arguments are set, and 0 otherwise.

10111
11100 &
=====
10100

| is the bitwise OR, which is like &, but sets each bit if EITHER value's bit is set.

00111
01100 |
=====
01111
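
The three examples above can be checked directly in Perl. This is only a small sketch; the variable names are made up for illustration, and the 0b prefix is Perl's notation for binary literals.

```perl
use strict;
use warnings;

# Shift left: 1011 moved one place to the left becomes 10110.
my $shifted = 0b1011 << 1;

# AND: a result bit is 1 only where both inputs have a 1.
my $anded = 0b10111 & 0b11100;

# OR: a result bit is 1 where either input has a 1.
my $ored = 0b00111 | 0b01100;

printf "%b\n",   $shifted;   # 10110
printf "%05b\n", $anded;     # 10100
printf "%05b\n", $ored;      # 01111
```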

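ord($2) & 0x3F == keeps the low 6 bits of the byte value of $2 and clears the top two. (0x3F == 111111 binary)

ord($1) << 6 == shifts the bits of $1 6 bits to the left. The & 0xC0 that follows then keeps only bits 6 and 7 of the shifted value. (0xC0 == 11000000 binary)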
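
To see the shift and the masks working on real data, here is a step-by-step sketch for one character. It assumes the input is the letter e with an acute accent (U+00E9), which UTF-8 stores as the two bytes 0xC3 0xA9; the variable names are illustrative only. Perl evaluates << before &, and & before |, which is why the original expression needs no parentheses.

```perl
use strict;
use warnings;

# UTF-8 encodes U+00E9 (e with acute accent) as the two bytes 0xC3 0xA9.
my ($lead, $trail) = (0xC3, 0xA9);

# Shift the lead byte left by 6, then keep only bits 6 and 7.
# Those are the lead byte's two lowest bits, moved to the top of a byte.
my $high = ($lead << 6) & 0xC0;    # 0x30C0 & 0xC0 = 0xC0

# Keep the low six bits of the second (continuation) byte.
my $low = $trail & 0x3F;           # 0xA9 & 0x3F = 0x29

# Combine the two parts into a single byte.
my $byte = $high | $low;           # 0xC0 | 0x29 = 0xE9

printf "%02X\n", $byte;            # E9
```

0xE9 is the code for the same letter in ISO 8859-1, so the two UTF-8 bytes have become one Latin-1 byte.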

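Essentially, this line matches each two-byte sequence whose first byte is 0xC2 or 0xC3 (a UTF-8 lead byte) and replaces it with a single byte: the low six bits come from the second byte, and the high two bits are taken from the lower two bits of the first byte. Those two lead bytes cover exactly the code points 128 to 255, so the result is ISO 8859-1 (Latin-1); characters above 255 are left untouched.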
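
As a rough check of the whole substitution, the sketch below runs it on a short sample and compares the result with Perl's core Encode module, which can also produce the other two encodings asked about. The helper name utf8_to_latin1_regex and the sample string are made up for this example, and it assumes a Perl that ships Encode (5.8 or later).

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical helper wrapping the substitution from this thread.
# It only handles code points 128..255 (lead bytes 0xC2 and 0xC3).
sub utf8_to_latin1_regex {
    my ($bytes) = @_;
    $bytes =~ s/([\xC2\xC3])([\x80-\xBF])/chr(ord($1)<<6&0xC0|ord($2)&0x3F)/eg;
    return $bytes;
}

# "cafe" with an acute accent on the e, as raw UTF-8 bytes (0xC3 0xA9).
my $utf8 = "caf\xC3\xA9";

my $via_regex = utf8_to_latin1_regex($utf8);

# The same conversion with Encode: decode the UTF-8 bytes to characters,
# then encode the characters into each target encoding.
my $text   = decode('UTF-8', $utf8);
my $latin1 = encode('iso-8859-1', $text);
my $cp1252 = encode('cp1252', $text);
my $cp437  = encode('cp437', $text);

printf "regex:  %s\n", unpack('H*', $via_regex);   # 636166e9
printf "latin1: %s\n", unpack('H*', $latin1);      # 636166e9
printf "cp1252: %s\n", unpack('H*', $cp1252);      # 636166e9
printf "cp437:  %s\n", unpack('H*', $cp437);       # 63616682
```

The regex agrees with Latin-1, and with CP1252 for this letter, but OEM-DOS 437 puts the same letter at 0x82, so the bit trick alone is not enough for that conversion.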
