
Determine encoding of text 2

Status
Not open for further replies.

srwatson

Programmer
Dec 13, 2011
16
CA
Is there a way to determine the encoding of text in a PCL file? Let's say I have two PCL files. The first one prints 'Hello World!' and its encoding is ASCII. The second file prints 'I love pie!' and its encoding is EBCDIC.

(esc) *p99XHello World!
(esc) *p99XÉ@“—‰…Z

Is there a way to determine which file has the EBCDIC text and which file has the ASCII text?

~ Thanks
 
Look at it with an ASCII editor, which will display the files much as you have shown above. If you can read it, it is ASCII. If not, and you know the only alternative is EBCDIC, then that is what it is.

To verify it, you need to use an editor that can display EBCDIC, or hex. For hex, you need an EBCDIC character chart to translate the hex values to their EBCDIC characters. Note that in EBCDIC the alphabet is not contiguous, but has other characters between I & J, as well as R & S. The lower-case alphabet has the same gaps.
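The gaps in the EBCDIC alphabet can be checked without a printed chart. A minimal sketch, assuming the text uses EBCDIC code page 037 (Python's "cp037" codec); other EBCDIC code pages may place some characters differently:

```python
import string

# Code point of each upper-case letter in EBCDIC code page 037.
codes = {ch: ch.encode("cp037")[0] for ch in string.ascii_uppercase}

# Find every pair of neighbouring letters whose codes are not consecutive.
letters = string.ascii_uppercase
gaps = [
    (a, b, codes[b] - codes[a])
    for a, b in zip(letters, letters[1:])
    if codes[b] - codes[a] != 1
]

for a, b, step in gaps:
    print(f"{a}=0x{codes[a]:02X} -> {b}=0x{codes[b]:02X} (step {step})")
```

Running the same loop over string.ascii_lowercase should report the matching gaps between i & j and r & s.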

Note also that (esc)*p99X is a PCL command to indent 99 units, which usually would be about 1/3".
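As a rough check of that arithmetic, here is a small sketch that pulls the value out of the escape sequence and converts it to inches. It assumes 300 PCL units per inch; if the file sets a different unit of measure, the result changes accordingly:

```python
import re

command = b"\x1b*p99X"  # (esc)*p99X from the sample

# Match "ESC * p # X" and capture the numeric value field.
match = re.fullmatch(rb"\x1b\*p(\d+)X", command)
units = int(match.group(1))

UNITS_PER_INCH = 300  # assumption: 300 PCL units per inch
inches = units / UNITS_PER_INCH
print(f"{units} units = {inches:.2f} inch")
```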
 
The PCL commands above were used as an example. I'm looking for a programmatic solution to determine the encoding of a PCL file. The application I'm working on currently parses and processes ASCII and EBCDIC files. After a file has been manipulated, it is written back to the file system as ASCII if it was originally an EBCDIC file. Currently the application looks at the first page for a certain word/phrase that should appear in all documents. If it can't be read, the file is assumed to be EBCDIC. I was wondering how printers/viewers are able to distinguish between the two encodings. I really don't want the parser portion of the application to be aware of any business logic, as that would limit the use cases of the parser.
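One generic alternative to searching for a known phrase is to score the text bytes under each candidate encoding and pick the more plausible one. The following is only a sketch of that idea: the function names are made up for illustration, it assumes the text really is plain ASCII or EBCDIC code page 037, and it would not help where a downloaded font uses some other mapping.

```python
import string

# Characters expected in ordinary printed text.
PLAIN = set(string.ascii_letters + string.digits + " .,:;!?'\"-/()*")

def plausibility(data: bytes, codec: str) -> float:
    """Fraction of bytes that decode to a plain-text character under codec."""
    if not data:
        return 0.0
    text = data.decode(codec, errors="replace")
    return sum(ch in PLAIN for ch in text) / len(text)

def guess_encoding(data: bytes) -> str:
    """Return 'ebcdic' or 'ascii', whichever decoding looks more like text."""
    ascii_score = plausibility(data, "ascii")
    ebcdic_score = plausibility(data, "cp037")
    return "ebcdic" if ebcdic_score > ascii_score else "ascii"

print(guess_encoding(b"Hello World!"))
print(guess_encoding("I love pie!".encode("cp037")))
```

The score should only be computed over the data of text commands, not over the whole file, since binary font downloads would distort it.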
 
Your sample does not appear to make logical sense:

(a) The string 'I love pie!' is 11 characters long;

(b) The string 'É@"—‰...Z' is only 9 characters long.

(c) Assuming that the quoted string 'É@"—‰...Z' is what is displayed on a device which assumes an extended-ASCII-based encoding (such as 'Windows ANSI', or 'ISO 8859-1 Latin-1'), this would be the (hexadecimal) character codes:
C9 40 22 97 89 2E 2E 2E 5A

which is unlikely to be plain text (upper-case and lower-case alphabetic characters, and simple punctuation) encoded in any of the various 'standard' EBCDIC encodings; closest would be 'I..pi...!' where the '.' characters represent non-graphic characters.

(d) Perhaps much more likely is that the text is encoded using the 'obfuscation' techniques associated with downloaded soft fonts, so recovering the 'plain text' is impossible (or, at least, very difficult, without a very good knowledge of the downloaded soft font).

(e) ... or it could be, as Jim Asman suggests, using an obscure, or user-defined, symbol set (the HP PCL name for 'coded character set').

Without analysing the whole PCL file, it is impossible to be sure just how the text is encoded.
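The byte-level reasoning in (c) can be reproduced with a short sketch, assuming EBCDIC code page 037 and a Windows-1252 display. It decodes the quoted codes as EBCDIC, then shows what 'I love pie!' would look like if it had been encoded in EBCDIC and displayed as Windows ANSI:

```python
# The hexadecimal codes quoted in (c).
quoted = bytes.fromhex("C9 40 22 97 89 2E 2E 2E 5A")

def show(data: bytes) -> str:
    """Decode as cp037, replacing non-printable characters with '.'."""
    return "".join(ch if ch.isprintable() else "." for ch in data.decode("cp037"))

print(show(quoted))

# For comparison: the 11 bytes of 'I love pie!' in cp037,
# and how an editor assuming Windows-1252 would display them.
expected = "I love pie!".encode("cp037")
print(expected.hex(" ").upper())
print(expected.decode("cp1252"))
```

Note that 0x40 is the EBCDIC space, so the second position decodes to a blank rather than a non-graphic character. The comparison also suggests that the sample as pasted into the forum lost or altered some bytes, which may account for the length mismatch in (a) and (b).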
 
... and if (because of private data) you don't want to post samples of your PCL files here for analysis, you can analyse them yourself using the 'PRN File Analyse' tool in the PCL Paraphernalia application (which you can obtain via ).
 
I've attached two small files. I replaced all of the original text with asterisks for security reasons. I used asterisk-ebcdic.pcl as the input file. I'm wondering if there is a way to scan the input file programmatically and determine its encoding. Currently the parser has to be told the encoding of the file. Here is pseudocode of what I'm doing.


Code:
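  def isEbcdic = true   // the parser currently has to be told the encoding
  new PCLParser(file, isEbcdic).eachCommand { cmd ->
     if (cmd.isText()) {
        // Decode the EBCDIC text (cp037 is the charset name) and
        // store it back using the platform default charset.
        cmd.setData(new String(cmd.data, "cp037").bytes)
     }
  }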
 
 https://github.com/downloads/born2snipe/street/encoding.zip
HP does not have an EBCDIC symbol set. So, most files that start off as EBCDIC are run through some type of protocol conversion, and the EBCDIC is mapped to a download font in a custom symbol set. This is a file we use in our product to map those characters.

32 64
à 68
¢ 74
. 75
< 76
( 77
+ 78
| 79
& 80
é 81
è 84
! 90
$ 91
* 92
) 93
\ 94
¬ 95
- 96
/ 97
Ñ 105
, 107
% 108
_ 109
> 110
? 111
: 122
# 123
@ 124
' 125
= 126
" 127
;
; US ASCII
;
a 129
b 130
c 131
d 132
e 133
f 134
g 135
h 136
i 137
° 144
j 145
k 146
l 147
m 148
n 149
o 150
p 151
q 152
r 153
s 162
t 163
u 164
v 165
w 166
x 167
y 168
z 169
] 181
` 185
A 193
B 194
C 195
D 196
E 197
F 198
G 199
H 200
I 201
ô 203
J 209
K 210
L 211
M 212
N 213
O 214
P 215
Q 216
R 217
S 226
T 227
U 228
V 229
W 230
X 231
Y 232
Z 233
0 240
1 241
2 242
3 243
4 244
5 245
6 246
7 247
8 248
9 249

However, there are many other ways to get from EBCDIC to ASCII PCL. So, trying to solve these types of problems without a sample file is painful. You should generate a mock-up file for analysis.
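For anyone wanting to apply a mapping file like the one above, here is a minimal sketch of a reader. It assumes each data line is "character, whitespace, decimal code" and that lines starting with ";" are comments; the first entry (32 64, presumably the space character) and the "\ 94" entry would need special handling that is not shown. The function names are illustrative only.

```python
# A few lines in the same layout as the mapping file above.
SAMPLE = """\
. 75
! 90
;
; US ASCII
;
a 129
b 130
A 193
B 194
0 240
"""

def parse_mapping(text: str) -> dict:
    """Return {code: character} from 'character decimal-code' lines."""
    table = {}
    for line in text.splitlines():
        if not line.strip() or line.startswith(";"):
            continue  # skip blank lines and comments
        char, code = line.rsplit(None, 1)
        table[int(code)] = char
    return table

def translate(data: bytes, table: dict) -> str:
    """Map each byte through the table; unmapped bytes become '?'."""
    return "".join(table.get(b, "?") for b in data)

table = parse_mapping(SAMPLE)
print(translate(bytes([194, 129, 130, 75]), table))
```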
 
It appears as though you replaced all the printable text in the files with asterisks. Are you just trying to make it as difficult as possible for someone to help you?

Aside from that, your ASCII file has a partial, temporary download font bound to the undefined default symbol set.

The EBCDIC file has a partial, temporary download font bound to a custom symbol set. The mapping file that I provided shows you that, for example, a "B" is remapped to cell 194 in the ISO 8859/1 Latin I (E1) character set, and so on.

So, you're in luck: the characters are not "scrambled". The EBCDIC characters are just using a custom symbol set because HP does not have one for EBCDIC.

If you change printer drivers, fonts, point sizes ... you could be back in the soup.
 
As pcltools has already advised:

(a) The characters to be printed (in both the 'ASCII' and 'EBCDIC' samples) are using a custom 'symbol set' which is effectively defined by the characters downloaded in the custom (bitmap) soft font download.

(b) So the character mapping from the original source documents to the values used in the PCL files is effectively defined by the process that generates the downloaded soft font files.

(c) With the data characters in your (doctored) samples replaced by asterisks (ASCII sample) or backslash (EBCDIC sample), it is difficult to see what characters are used - and you'd need the original documents to work out the mapping.

Attached are analyses of your two .pcl files
 
 http://www.mediafire.com/?l8dlzp5ntrvlkrs
... and using the two soft fonts to (attempt to) print all characters (range 0x20 - 0xff) appears to show the following mappings between code-point (given as a hexadecimal value) and the ASCII character:

ASCII font:
Code:
0x2a *
0x2f /
0x30 0
0x31 1
0x32 2
0x34 4
0x37 7
0x38 8
0x39 9
0x3a :
0x41 A
0x42 B
0x43 C
0x44 D
0x45 E
0x46 F
0x47 G
0x49 I
0x4a J
0x4c L
0x4d M
0x4e N
0x4f O
0x50 P
0x51 Q
0x52 R
0x53 S
0x54 T
0x55 U
0x56 V
0x57 W
0x58 X
0x59 Y

EBCDIC font:
Code:
0x5c *
0x61 /
0x7a :
0xc1 A
0xc2 B
0xc3 C
0xc4 D
0xc5 E
0xc6 F
0xc7 G
0xc9 I
0xd1 J
0xd3 L
0xd4 M
0xd5 N
0xd6 O
0xd7 P
0xd8 Q
0xd9 R
0xe2 S
0xe3 T
0xe4 U
0xe5 V
0xe6 W
0xe7 X
0xe8 Y
0xf0 0
0xf1 1
0xf2 2
0xf4 4
0xf7 7
0xf8 8
0xf9 9

Note that (in both cases) some of the alphabetic characters and digits do not appear to be defined.
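The two lists can be compared against standard encodings with a few lines of code. This sketch uses a subset of the code points listed above and assumes EBCDIC code page 037 for the comparison; the helper name is made up for illustration.

```python
# Subsets of the code-point lists above: {code point: character printed}.
ascii_font = {0x2A: "*", 0x2F: "/", 0x30: "0", 0x3A: ":", 0x41: "A", 0x59: "Y"}
ebcdic_font = {0x5C: "*", 0x61: "/", 0xF0: "0", 0x7A: ":", 0xC1: "A", 0xE8: "Y"}

def matches_codec(font: dict, codec: str) -> bool:
    """True if every defined code point equals the codec's byte for that character."""
    return all(ch.encode(codec)[0] == code for code, ch in font.items())

print(matches_codec(ascii_font, "ascii"))
print(matches_codec(ebcdic_font, "cp037"))
print(matches_codec(ebcdic_font, "ascii"))
```

For these samples the defined characters appear to sit at their normal ASCII and cp037 positions, which fits the earlier remark that the characters are not scrambled.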
 
So to return to your original question:

>> Is there a way to determine which file has the EBCDIC text and which file has the ASCII text?

The answer is 'not very easily', since you'd have to be able to interpret the downloaded soft fonts (and with a custom symbol set you're in the realm of working with 'shapes', rather than defined mappings).

Of course, if the two soft fonts (one for ASCII, the other for EBCDIC) were always the same, you could perhaps recognise which one was in use by the 'signature' of its header.
... but it seems unlikely that they WILL always be the same for each file, since the sample ones you've provided don't include all the alphabetic characters or digits - although perhaps the header may always be the same?
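If the header does turn out to be constant, recognising it could look something like the sketch below. It assumes the font header is sent with the PCL font header command (esc))s#W followed by # bytes of header data, and it simply hashes those bytes; the function name and the idea of keeping a table of known hashes are illustrative, not something taken from the sample files.

```python
import hashlib
import re

# (esc))s#W introduces a font header of # bytes.
FONT_HEADER = re.compile(rb"\x1b\)s(\d+)W")

def font_header_signatures(pcl: bytes) -> list:
    """Return a SHA-256 hex digest for each font header found in the stream."""
    signatures = []
    pos = 0
    while True:
        m = FONT_HEADER.search(pcl, pos)
        if m is None:
            break
        length = int(m.group(1))
        start = m.end()
        header = pcl[start:start + length]
        signatures.append(hashlib.sha256(header).hexdigest())
        pos = start + length  # skip the binary header before searching again
    return signatures

# Synthetic stream: reset, a 64-byte font header, then some text.
demo_header = bytes(range(64))
demo = b"\x1bE" + b"\x1b)s64W" + demo_header + b"\x1b*p99XHello"
print(font_header_signatures(demo))
```

A plain search like this can be fooled by the same byte pattern occurring inside other binary data (character downloads, raster graphics), so a real implementation would want to walk the command stream with a proper parser.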

Note that the fonts are the old format-0 bitmap fonts, which may, or may not, be supported on modern LaserJet devices.
... and use of a 'unit of measure' of 300 PCL units-per-inch perhaps indicates the age of the generated PCL.
 
I wasn't aware that HP didn't have a concept of EBCDIC. The information provided above was useful.

You mentioned previously that the fonts used in the file are format-0 bitmap. Do you mind elaborating a bit on that? Is it possible the PCL in this file is PCL4?
 
>> fonts used in the file are format-0 bitmap. Do you mind elaborating a bit on that?

There are a number of PCL soft font formats:

0 - original bitmap format; now deprecated; "not recommended for LaserJet 4 and later printers".

10 - Intellifont Bound scalable
11 - Intellifont Unbound scalable

Intellifont format has fallen out of favour, and may not be supported on modern devices.

15 - TrueType scalable (bound and unbound)
16 - Universal: as TrueType scalable (but capable of 'large font' support).

20 - Resolution-specified bitmap; replaced format 0 fonts.
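The format of a downloaded font can be read from its header. A small sketch, assuming the header format is the single byte at offset 2 of the font header data (after the two-byte descriptor size); the table only covers the formats listed above and the function name is illustrative:

```python
FONT_FORMATS = {
    0: "original bitmap",
    10: "Intellifont bound scalable",
    11: "Intellifont unbound scalable",
    15: "TrueType scalable",
    16: "Universal",
    20: "resolution-specified bitmap",
}

def header_format(header: bytes) -> str:
    """Name the soft font format from the font header data."""
    fmt = header[2]  # assumption: header format byte is at offset 2
    return FONT_FORMATS.get(fmt, f"unknown ({fmt})")

# Synthetic headers: descriptor size 64, then the format byte.
print(header_format(bytes([0, 64, 0, 1])))
print(header_format(bytes([0, 68, 20, 0])))
```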


>> Is it possible the PCL in this file is PCL4?

Possibly, although as PCL5 is backwards compatible, it is difficult to say from your small sample.
 
Thanks, I'm interested in knowing more about some of this. Do you know where I could learn more about the changes from PCL4 to PCL5? I'd also be interested in knowing what other PCL is now deprecated. Is this documented somewhere?
 
... I forgot to mention that format 16 fonts can be used to define bitmap fonts, as well as TrueType scalable.
 