I want to go through a text file encoded in UTF-8 and find the lines and words that contain non-ASCII characters (codes > 127), and build a table/hash of each such character and how many times it appears.
Then I want to convert the file to an extended ASCII encoding; the three separate conversions are to:
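To make the first part concrete, here is a minimal sketch of what I have in mind. The helper name scan_non_ascii and the sample text are made up for illustration, and the sketch assumes the text has already been decoded from UTF-8 into Perl characters.

```perl
use strict;
use warnings;

# Hypothetical helper: takes already-decoded text, returns
#   - a hash ref of non-ASCII character => number of occurrences
#   - an array ref of { line => N, words => [...] } for each affected line
sub scan_non_ascii {
    my ($text) = @_;
    my (%count, @hits);
    my $line_no = 0;
    for my $line (split /\n/, $text) {
        $line_no++;
        next unless $line =~ /[^\x00-\x7F]/;             # skip pure-ASCII lines
        my @words = grep { /[^\x00-\x7F]/ } split ' ', $line;
        push @hits, { line => $line_no, words => \@words };
        $count{$_}++ for $line =~ /([^\x00-\x7F])/g;     # tally each character
    }
    return (\%count, \@hits);
}

# Sample text: \x{E9} is e-acute, \x{EF} is i-diaeresis.
my $sample = "caf\x{E9} au lait\nplain ascii\nna\x{EF}ve caf\x{E9}\n";
my ($count, $hits) = scan_non_ascii($sample);

binmode STDOUT, ':encoding(UTF-8)';
printf "line %d: %s\n", $_->{line}, join(', ', @{ $_->{words} }) for @$hits;
printf "U+%04X %s x%d\n", ord($_), $_, $count->{$_} for sort keys %$count;
```

For a real file, the text would presumably come from a handle opened with open my $fh, '<:encoding(UTF-8)', $path so that each character arrives decoded rather than as separate bytes.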
ISO 8859-1 (Latin-1)
CP1252 (ANSI Windows)
OEM-DOS 437
Each of these three is an extension of ASCII, but they differ from each other.
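As a sketch of the conversion step, assuming the core Encode module is acceptable and that 'iso-8859-1', 'cp1252' and 'cp437' are the right encoding names for these three targets, something like this shows how the same characters come out as different bytes. The sample bytes and the %out hash are only for illustration.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Raw UTF-8 bytes for e-acute, space, u-diaeresis (U+00E9, U+0020, U+00FC),
# as they would arrive when the file is read without an encoding layer.
my $utf8_bytes = "\xC3\xA9 \xC3\xBC";
my $text = decode('UTF-8', $utf8_bytes);     # bytes -> Perl characters

my %out;
for my $enc ('iso-8859-1', 'cp1252', 'cp437') {
    my $bytes = encode($enc, $text);         # characters -> target bytes
    $out{$enc} = join ' ', map { sprintf '%02X', ord($_) } split //, $bytes;
    printf "%-10s %s\n", $enc, $out{$enc};
}
```

Latin-1 and CP1252 agree on these two characters (E9 and FC), while CP437 puts them at 82 and 81. Characters that do not exist in the target encoding are replaced with a substitution character by default; passing Encode::FB_CROAK as a third argument to encode makes it die instead.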
I found a regex substitution that makes a conversion, but I don't understand it:
s/([\xC2\xC3])([\x80-\xBF])/chr(ord($1)<<6&0xC0|ord($2)&0x3F)/eg;
What is the "<<" doing?
What is the "&" doing?
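Here is my attempt at splitting the expression into steps for one character, assuming the usual Perl precedence (<< binds tighter than &, which binds tighter than |) and the two-byte UTF-8 layout 110xxxxx 10yyyyyy. The variable names are mine, and I may be misreading the intent.

```perl
use strict;
use warnings;

# The bytes C3 A9 are the UTF-8 encoding of U+00E9 (e-acute).
my ($lead, $cont) = (0xC3, 0xA9);

my $shifted = $lead << 6;        # 0x30C0: lead byte moved 6 bits to the left
my $high    = $shifted & 0xC0;   # 0xC0: keep only bits 7 and 6 of the result
my $low     = $cont & 0x3F;      # 0x29: drop the "10" marker, keep the low 6 bits
my $code    = $high | $low;      # 0xE9: combine the two parts into one byte

printf "shifted=%#x high=%#x low=%#x code=%#x\n", $shifted, $high, $low, $code;

# The substitution from the post, applied to a raw byte string.
my $bytes = "caf\xC3\xA9";
(my $latin1 = $bytes) =~ s/([\xC2\xC3])([\x80-\xBF])/chr(ord($1)<<6&0xC0|ord($2)&0x3F)/eg;
```

If this reading is right, the substitution only handles lead bytes C2 and C3, that is, code points 0x80 to 0xFF, which would explain why it produces Latin-1 and nothing else.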
Thank you.