Reading MS-Word-created HTML files

Status
Not open for further replies.

esanteva

Programmer
Sep 23, 2003
7
IT
Hi,
I would like to remove some HTML code from an HTML file that was created with MS Word. The file contains some language-specific characters (specifically Italian ones, i.e. accented characters). It seems that if I "simply" read the file (using "open...r"), remove the unwanted HTML code, and re-save the file (using "open...w"), those characters get scrambled in some way.
Is there a way to handle this transformation by specifying the encoding of the file?

Thanks for any help
 
I think you need to set the encoding switch on the channel with "fconfigure":
fconfigure <channel ID> -encoding <your encoding>

Unfortunately, I don't know what the European character encoding name is.
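
A minimal sketch of this suggestion, assuming the file is in Windows code page 1252 (called `cp1252` in Tcl). The file name `demo-cp1252.html` is hypothetical, and the script creates the file first so that the example is self-contained.

```tcl
# Hypothetical file name, used only for this illustration.
set path demo-cp1252.html

# Create a small test file containing the raw byte 0xE0,
# which is "a with grave accent" in cp1252.
set out [open $path w]
fconfigure $out -translation binary
puts -nonewline $out "citt\xe0"
close $out

# Read it back, telling Tcl which encoding the bytes are in.
set in [open $path r]
fconfigure $in -encoding cp1252
set text [read $in]
close $in

# $text now holds Unicode characters; the last one is U+00E0.
puts [format "last character: U+%04X" [scan [string index $text end] %c]]
file delete $path
```

The same `fconfigure ... -encoding` call would also be needed on the channel opened for writing, otherwise the output falls back to the system encoding.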

Bob Rashkin
rrashkin@csc.com
 
Ok. I went through the Tcl manual and I tried the following commands on the file I have to process:

set fileId [open scheda0-wr.asp r]
fconfigure $fileId -encoding

The answer is "cp1252". Then I looked at the scheda0-wr.asp file and I noticed the following line:

<meta http-equiv=Content-Type content="text/html; charset=windows-1252">

So the file encoding is "windows-1252". Is this the same as "cp1252"? If it is, I don't know what other encoding I could choose for my "fileId" channel. This is the list of encodings supported by Tcl:

cp860 cp861 cp862 cp863 tis-620 cp864 cp865 cp866 gb12345 gb2312-raw cp949 cp950 cp869 dingbats ksc5601 macCentEuro cp874 macUkraine jis0201 gb2312 euc-cn euc-jp macThai iso8859-10 jis0208 iso2022-jp macIceland iso2022 iso8859-13 jis0212 iso8859-14 iso8859-15 cp737 iso8859-16 big5 euc-kr macRomania macTurkish gb1988 iso2022-kr macGreek ascii cp437 macRoman iso8859-1 iso8859-2 iso8859-3 macCroatian koi8-r iso8859-4 ebcdic iso8859-5 cp1250 macCyrillic iso8859-6 cp1251 macDingbats koi8-u iso8859-7 cp1252 iso8859-8 cp1253 iso8859-9 cp1254 cp1255 cp850 cp1256 cp932 identity cp1257 cp852 macJapan cp1258 shiftjis utf-8 cp855 cp936 symbol cp775 unicode cp857

Do you think there is a better one to try?
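
For what it is worth, Tcl's `cp1252` is Windows code page 1252, i.e. the encoding that the `windows-1252` charset label refers to. The sketch below checks that the encoding is available and shows where it differs from `iso8859-1`. The variable names are made up for the illustration.

```tcl
# Is cp1252 among the encodings this interpreter knows?
set known [expr {[lsearch -exact [encoding names] cp1252] >= 0}]
puts "cp1252 available: $known"

# Decode the same single byte (0x80) under both encodings.
set byte "\x80"
set asCp1252 [encoding convertfrom cp1252 $byte]
set asLatin1 [encoding convertfrom iso8859-1 $byte]

# cp1252 maps 0x80 to the euro sign (U+20AC);
# iso8859-1 maps it to the control character U+0080.
puts [format "cp1252:    U+%04X" [scan $asCp1252 %c]]
puts [format "iso8859-1: U+%04X" [scan $asLatin1 %c]]
```

The two encodings agree on bytes 0xA0 to 0xFF, which is where the Italian accented letters live, so either should decode those correctly. They differ in the 0x80 to 0x9F range, which MS Word uses for curly quotes and similar characters.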

Thanks anyway for your help
Stefano
 
No idea. I'd give cp1252 a try.

Bob Rashkin
rrashkin@csc.com
 
I verified that the problem is not a Tcl one: it arises when I use Freewrap (version 5.61). Is there a Freewrap expert here?
If so, I have prepared a small "package" to reproduce the problem.

1- I wrote a simplified version of my routine to get rid of special characters embedded in an ASP file. This is the simplified routine:

##### routine STARTS here
set autoweb(filename) {scheda0-orig.asp}
set autoweb(filenameout) {scheda0-out.asp}

set fileId [open $autoweb(filename) "r"]
set htmlcontent [read $fileId]
close $fileId

# build the string map list: characters 161..255 -> numeric HTML entities
set cmap {}
for {set i 161} {$i < 256} {incr i} {
    lappend cmap [format "%c" $i] "&#$i;"
}

# convert special characters to HTML codes
set htmlcontent [string map $cmap $htmlcontent]

# clean ASP code from html document
# (non-greedy match, so each <% ... %> block is removed separately)
regsub -all {<%.*?%>} $htmlcontent {} htmlcontent

# writes document back to local file
set parFileId [open $autoweb(filenameout) w]
puts $parFileId "${htmlcontent}"
close $parFileId

exit
##### routine ENDS here

When I run this routine in Windows by double-clicking the Tcl file, everything works fine. I use Tcl version 8.4.6. If I wrap it with freewrap 5.61 instead, the output file is still readable but contains some obvious garbage characters.
2- I also created a simplified version of my ASP file (scheda0-orig.asp in the Tcl code above). I verified that copying and pasting the following code reproduces the problem exactly:

<html>
<head>
</head>
<body>
<h1>Progetto n°</h1>
</body>
</html>
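
To make the expected result concrete, here is a small sketch that applies the same character-to-entity map to the heading line of this sample file, using an in-memory string instead of a file. It assumes the degree sign has already been decoded correctly to U+00B0. If the file is read with the wrong encoding, the character in memory is a different one and the map produces a different entity or none at all.

```tcl
# Same map as in the routine above: characters 161..255 -> "&#NNN;".
set cmap {}
for {set i 161} {$i < 256} {incr i} {
    lappend cmap [format %c $i] "&#$i;"
}

# "\u00b0" is the degree sign from "Progetto n°".
set line "<h1>Progetto n\u00b0</h1>"
set converted [string map $cmap $line]

puts $converted
# prints: <h1>Progetto n&#176;</h1>
```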

Thanks for any help
 
I got the answer directly from Dennis LaBelle, author of freewrap:

> If the local system encoding of the machine on which your
> program might run is not suitable for your task, simply
> change the system encoding with a line of code such as:
>
> encoding system /tcl/encoding/iso8859-1

It works. Just put this line as the first line of your script.
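
A possible alternative, which is not from the freewrap author and which I have not tried inside a wrapped executable: set the encoding explicitly on both the input and the output channel, so that the result does not depend on the system encoding at all. The sketch assumes the file is cp1252. The file names and the `readBytes` helper are hypothetical.

```tcl
# Hypothetical file names, used only for this illustration.
set inPath  demo-in.asp
set outPath demo-out.asp

# Create an input file with the raw byte 0xB0 (degree sign in cp1252).
set f [open $inPath w]
fconfigure $f -translation binary
puts -nonewline $f "Progetto n\xb0"
close $f

# Read with an explicit encoding ...
set f [open $inPath r]
fconfigure $f -encoding cp1252
set content [read $f]
close $f

# ... and write with the same explicit encoding.
set f [open $outPath w]
fconfigure $f -encoding cp1252 -translation lf
puts -nonewline $f $content
close $f

# Helper (made up for this example): return the raw bytes of a file.
proc readBytes {path} {
    set f [open $path r]
    fconfigure $f -translation binary
    set data [read $f]
    close $f
    return $data
}

# The round trip should leave the bytes unchanged.
set same [string equal [readBytes $inPath] [readBytes $outPath]]
puts "byte-identical: $same"
file delete $inPath $outPath
```

This only helps if the wrapped program can actually find the cp1252 encoding definition, which is the same requirement as for the `encoding system` line above.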

Ciao
Stefano
 