
Determine encoding of text 2

Status
Not open for further replies.

srwatson

Programmer
Dec 13, 2011
Is there a way to determine the encoding of text in a PCL file? Let's say I have two PCL files. The first one prints 'Hello World!' and its encoding is ASCII. The second one prints 'I love pie!' and its encoding is EBCDIC.

(esc) *p99XHello World!
(esc) *p99XÉ@“—‰…Z

Is there a way to determine which file has the EBCDIC text and which has the ASCII text?

~ Thanks
 
Look at it with an ASCII editor, which will display the files much as you have shown above. If you can read the text, it is ASCII. If not, and you know the only alternative is EBCDIC, then that is what it is.

To verify it, you need an editor that can display EBCDIC, or hex. For hex, you need an EBCDIC character chart to translate the hex values to their EBCDIC meanings. Note that in EBCDIC the alphabet is not contiguous: there are other characters between I & J, as well as between R & S. The lower-case alphabet has the same gaps.

Note also that (esc)*p99X is a PCL command to indent 99 units, which would usually be about 1/3".
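The non-contiguous EBCDIC alphabet mentioned above can be seen directly with Python's built-in cp037 codec (code page 037 is one common EBCDIC variant; the exact code page in a given file may differ):

```python
# Show the EBCDIC (code page 037) code points for the upper-case alphabet.
# Note the jumps between I (0xC9) and J (0xD1), and between R (0xD9) and S (0xE2).
for letter in "ABCDEFGHIJKLMNOPQRSTUVWXYZ":
    code = letter.encode("cp037")[0]
    print(f"{letter} -> 0x{code:02X}")

# The same gaps exist in the lower-case range: i is 0x89 but j jumps to 0x91.
assert "i".encode("cp037")[0] == 0x89
assert "j".encode("cp037")[0] == 0x91
```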
 
The PCL commands above were used as an example. I'm looking for a programmatic solution to determine the encoding of a PCL file. The application I'm working on currently parses and processes ASCII and EBCDIC files. After the file has been manipulated, it's written back to the file system as ASCII if it was originally an EBCDIC file. Currently the application looks at the first page for a certain word/phrase that should appear in all documents; if it can't be read, the file is assumed to be EBCDIC. I was wondering how printers/viewers are able to distinguish between the two encodings. I really don't want the parser portion of the application to be aware of any business logic, as that would limit the use cases of the parser.
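One generic alternative to looking for a known word/phrase is a statistical check: decode the candidate text both ways and keep the interpretation that yields the higher proportion of printable characters. A minimal sketch, assuming Python and the cp037 EBCDIC code page (the function names and the scoring rule are illustrative, not part of any standard):

```python
import string

PRINTABLE = set(string.printable)

def printable_ratio(data: bytes, encoding: str) -> float:
    """Fraction of characters that are printable ASCII when decoded as `encoding`."""
    text = data.decode(encoding, errors="replace")
    if not text:
        return 0.0
    return sum(ch in PRINTABLE for ch in text) / len(text)

def guess_encoding(data: bytes) -> str:
    """Return 'ascii' or 'ebcdic', whichever decoding looks more like plain text."""
    ascii_score = printable_ratio(data, "latin-1")  # ASCII-compatible superset
    ebcdic_score = printable_ratio(data, "cp037")
    return "ascii" if ascii_score >= ebcdic_score else "ebcdic"

print(guess_encoding(b"Hello World!"))                # ascii
print(guess_encoding("I love pie!".encode("cp037")))  # ebcdic
```

This only works on the raw text runs, of course; as the rest of this thread shows, a custom symbol set or download font can defeat any such heuristic.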
 
Your sample does not appear to make logical sense:

(a) The string 'I love pie!' is 11 characters long;

(b) The string 'É@"—‰...Z' is only 9 characters long.

(c) Assuming that the quoted string 'É@"—‰...Z' is what is displayed on a device which assumes an extended-ASCII-based encoding (such as 'Windows ANSI', or 'ISO 8859-1 Latin-1'), this would be the (hexadecimal) character codes:
C9 40 22 97 89 2E 2E 2E 5A

which is unlikely to be plain text (upper-case and lower-case alphabetic characters, and simple punctuation) encoded in any of the various 'standard' EBCDIC encodings; closest would be 'I..pi...!' where the '.' characters represent non-graphic characters.

(d) Perhaps much more likely is that the text is encoded using the 'obfuscation' techniques associated with downloaded soft fonts, so recovering the 'plain text' is impossible (or, at least, very difficult, without a very good knowledge of the downloaded soft font).

(e) ... or it could be, as Jim Asman suggests, using an obscure, or user-defined, symbol set (the HP PCL name for 'coded character set').

Without analysing the whole PCL file, it is impossible to be sure just how the text is encoded.
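The decoding described in (c) can be reproduced with Python's cp037 codec (assuming code page 037; '.' stands for a non-graphic character, as in the post above):

```python
# The nine byte values quoted in the post, decoded as EBCDIC (code page 037).
data = bytes([0xC9, 0x40, 0x22, 0x97, 0x89, 0x2E, 0x2E, 0x2E, 0x5A])
decoded = data.decode("cp037")

# Replace non-graphic characters with '.' for display.
display = "".join(ch if ch.isprintable() else "." for ch in decoded)
print(display)
```

The recognisable fragments are the 'I', the 'pi', and the trailing '!', with control characters in between, which is why the bytes are unlikely to be plain text in a standard EBCDIC encoding.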
 
... and if (because of private data) you don't want to post samples of your PCL files here for analysis, you can analyse them yourself using the 'PRN File Analyse' tool in the PCL Paraphernalia application (which you can obtain via ).
 
I've attached two small files. I replaced all of the original text with asterisks for security reasons. I used asterisk-ebcdic.pcl as the input file. I'm wondering if there is a way to scan the input file programmatically and determine its encoding. Currently the parser has to be told the encoding of the file. Here is pseudocode of what I'm doing.


Code:
  def isEbcdic = true
  new PCLParser(file, isEbcdic).eachCommand { cmd ->
      if (cmd.isText()) {
          cmd.setData(new String(cmd.data, "cp037").bytes)   // where cp037 is the charset name
      }
  }
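The transcoding step inside that loop corresponds to the following (a Python sketch, assuming the text runs are cp037 and the target is ASCII; unmappable characters are replaced rather than raising an error):

```python
def ebcdic_to_ascii(data: bytes) -> bytes:
    """Re-encode an EBCDIC (cp037) text run as ASCII bytes."""
    return data.decode("cp037").encode("ascii", errors="replace")

ebcdic_run = "Hello World!".encode("cp037")
print(ebcdic_to_ascii(ebcdic_run))  # b'Hello World!'
```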
 
 https://github.com/downloads/born2snipe/street/encoding.zip
HP does not have an EBCDIC symbol set. So most files that start off as EBCDIC run through some type of protocol conversion, and the EBCDIC is mapped to a download font with a custom symbol set. This is the file we use in our product to map those characters.

32 64
à 68
¢ 74
. 75
< 76
( 77
+ 78
| 79
& 80
é 81
è 84
! 90
$ 91
* 92
) 93
\ 94
¬ 95
- 96
/ 97
Ñ 105
, 107
% 108
_ 109
> 110
? 111
: 122
# 123
@ 124
' 125
= 126
" 127
;
; US ASCII
;
a 129
b 130
c 131
d 132
e 133
f 134
g 135
h 136
i 137
° 144
j 145
k 146
l 147
m 148
n 149
o 150
p 151
q 152
r 153
s 162
t 163
u 164
v 165
w 166
x 167
y 168
z 169
] 181
` 185
A 193
B 194
C 195
D 196
E 197
F 198
G 199
H 200
I 201
ô 203
J 209
K 210
L 211
M 212
N 213
O 214
P 215
Q 216
R 217
S 226
T 227
U 228
V 229
W 230
X 231
Y 232
Z 233
0 240
1 241
2 242
3 243
4 244
5 245
6 246
7 247
8 248
9 249

However, there are many other ways to get from EBCDIC to ASCII PCL, so trying to solve these types of problems without a sample file is painful. You should generate a mock-up file for analysis.
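For what it's worth, the numeric values in the mapping file above agree with the standard cp037 EBCDIC code page, which can be spot-checked directly (assuming Python):

```python
# Spot-check entries from the mapping file against EBCDIC code page 037.
# Keys are the characters, values are the EBCDIC code points listed above.
mapping = {" ": 64, "&": 80, "!": 90, "a": 129, "j": 145, "s": 162,
           "A": 193, "J": 209, "S": 226, "0": 240, "9": 249}

for char, ebcdic_code in mapping.items():
    assert char.encode("cp037")[0] == ebcdic_code
print("all sampled entries match cp037")
```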
 
It appears as though you replaced all the printable text in the files with asterisks. Are you just trying to make it as difficult as possible for someone to help you?

Aside from that, your ASCII file has a partial, temporary download font bound to the undefined default symbol set.

The EBCDIC file has a partial, temporary download font bound to a custom symbol set. The mapping file that I provided shows that a "B" is remapped to cell 194 in the ISO 8859/1 Latin 1 (E1) character set, and so on.

So you're in luck: the characters are not "scrambled". The EBCDIC characters are just using a custom symbol set, because HP does not have one for EBCDIC.

If you change printer drivers, fonts, point sizes ... you could be back in the soup.
 
As pcltools has already advised:

(a) The characters to be printed (in both the 'ASCII' and 'EBCDIC' samples) are using a custom 'symbol set' which is effectively defined by the characters downloaded in the custom (bitmap) soft font download.

(b) So the character mapping from the original source documents to the values used in the PCL files is effectively defined by the process that generates the downloaded soft font files.

(c) With the data characters in your (doctored) samples replaced by asterisks (ASCII sample) or backslash (EBCDIC sample), it is difficult to see what characters are used - and you'd need the original documents to work out the mapping.

Attached are analyses of your two .pcl files
 
 http://www.mediafire.com/?l8dlzp5ntrvlkrs
... and using the two soft fonts to (attempt to) print all characters (range 0x32 - 0xff) appears to show the following mappings between code-point (given as a hexadecimal value) and the ASCII character:

ASCII font:
Code:
0x2a *
0x2f /
0x30 0
0x31 1
0x32 2
0x34 4
0x37 7
0x38 8
0x39 9
0x3a :
0x41 A
0x42 B
0x43 C
0x44 D
0x45 E
0x46 F
0x47 G
0x49 I
0x4a J
0x4c L
0x4d M
0x4e N
0x4f O
0x50 P
0x51 Q
0x52 R
0x53 S
0x54 T
0x55 U
0x56 V
0x57 W
0x58 X
0x59 Y

EBCDIC font:
Code:
0x5c *
0x61 /
0x7a :
0xc1 A
0xc2 B
0xc3 C
0xc4 D
0xc5 E
0xc6 F
0xc7 G
0xc9 I
0xd1 J
0xd3 L
0xd4 M
0xd5 N
0xd6 O
0xd7 P
0xd8 Q
0xd9 R
0xe2 S
0xe3 T
0xe4 U
0xe5 V
0xe6 W
0xe7 X
0xe8 Y
0xf0 0
0xf1 1
0xf2 2
0xf4 4
0xf7 7
0xf8 8
0xf9 9

Note that (in both cases) some of the alphabetic characters and digits do not appear to be defined.
 
So to return to your original question:

>> Is there a way to determine which file has the EBCDIC text and which file has the ASCII text?

The answer is 'not very easily', since you'd have to be able to interpret the downloaded soft fonts (and with a custom symbol set you're in the realm of working with 'shapes', rather than defined mappings).

Of course, if the two soft fonts (one for ASCII, the other for EBCDIC) were always the same, you could perhaps recognise which one was in use by the 'signature' of its header.
... but it seems unlikely that they WILL always be the same for each file, since the sample ones you've provided don't include all the alphabetic characters or digits - although perhaps the header may always be the same?

Note that the fonts are the old format-0 bitmap fonts, which may, or may not, be supported on modern LaserJet devices.
... and use of a 'unit of measure' of 300 PCL units-per-inch perhaps indicates the age of the generated PCL.
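If the soft-font headers do turn out to be stable per source system, fingerprinting them is straightforward: locate each font-descriptor download (the Esc )s#W sequence) and hash the descriptor bytes. A rough sketch, assuming Python; the regex and the choice of MD5 are illustrative only, not part of any PCL standard:

```python
import hashlib
import re

# A PCL font-descriptor download is introduced by:  ESC ) s <count> W <data>
FONT_HEADER = re.compile(rb"\x1b\)s(\d+)W")

def font_fingerprints(pcl: bytes) -> list[str]:
    """Return an MD5 fingerprint for each downloaded font descriptor in a PCL stream."""
    prints = []
    for match in FONT_HEADER.finditer(pcl):
        size = int(match.group(1))
        descriptor = pcl[match.end():match.end() + size]
        prints.append(hashlib.md5(descriptor).hexdigest())
    return prints
```

Comparing the fingerprints from known-ASCII and known-EBCDIC jobs would show whether the header really is a reliable signature across files.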
 
I wasn't aware that HP didn't have a concept of EBCDIC. The information provided above was useful.

You mentioned previously that the fonts used in the file are format-0 bitmap. Do you mind elaborating a bit on that? Is it possible the PCL in this file is PCL4?
 
>> fonts used in the file are format-0 bitmap. Do you mind elaborating a bit on that?

There are a number of PCL soft font formats:

0 - original bitmap format; now deprecated; "not recommended for LaserJet 4 and later printers".

10 - Intellifont Bound scalable
11 - Intellifont Unbound scalable

Intellifont format has fallen out of favour, and may not be supported on modern devices.

15 - TrueType scalable (bound and unbound)
16 - Universal: as TrueType scalable (but capable of 'large font' support).

20 - Resolution-specified bitmap; replaced format 0 fonts.
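The format number is carried in the font descriptor itself: in the PCL 5 font header, bytes 0-1 are the Font Descriptor Size (big-endian) and byte 2 is the Header Format field, taking the values listed above. A small sketch for pulling it out (assuming Python; the fabricated example descriptor is for illustration only):

```python
import struct

def font_header_format(descriptor: bytes) -> int:
    """Return the Header Format field from a PCL font descriptor.

    Bytes 0-1: Font Descriptor Size (big-endian); byte 2: Header Format
    (0 = old bitmap, 10/11 = Intellifont, 15 = TrueType, 16 = Universal,
    20 = resolution-specified bitmap).
    """
    size, fmt = struct.unpack_from(">HB", descriptor, 0)
    return fmt

# A minimal fabricated descriptor: size 64, format 0 (old bitmap).
example = struct.pack(">HB", 64, 0) + bytes(61)
print(font_header_format(example))  # 0
```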


>> Is it possible the PCL in this file is PCL4?

Possibly, although since PCL5 is backwards compatible, it is difficult to say from your small sample.
 
Thanks, I'm interested in knowing more about some of this. Do you know where I could learn more about the changes from PCL4 to PCL5? I'd also be interested in knowing what other PCL is now deprecated. Is this documented somewhere?
 
... I forgot to mention that format 16 fonts can be used to define bitmap fonts, as well as TrueType scalable.
 