How to disable MSXML output escaping?

Status
Not open for further replies.

Glenn9999

Programmer
Jun 19, 2004
2,312
US
I'm writing a simple XML file using the MSXML client for consumption by another program. The problem is that MSXML escapes certain characters in the text of each of my elements. While this might be okay in itself, the resulting escape sequences are rejected by the other program upon load.

More or less, I need MSXML to leave my text alone, so I can escape the text myself in whatever way suits the program that will read my file. Any ideas on making this happen?

 
Can you provide some examples on the source data and how it's being escaped by MSXML?
 
Hi Glenn9999,

If you cannot set the XML generator to stop producing unwanted characters in XML, you can always add a post-processing step. In this step, you can eliminate the unwanted characters with regular expressions. Then you pass the corrected XML to the consumer program.
To correct the XML file, you can use ready-made tools such as sed, or even write yourself a script, for example in VBScript.
 
I'm using IXMLDOMDocument in a Delphi program if that helps to clarify things.

>Can you provide some examples on the source data and how it's being escaped by MSXML?

A simple example is the text string "Henry & June" (without the quotes). If I accept the MSXML formatting ("Henry &amp;amp; June"), the consumer program comes back with "unknown XML object" where the & appears in the text. Studying the output the consumer program itself makes, I figured out that it only accepts the numeric form (e.g. "Henry &amp;#38; June"). The problem is that if I present that exact string to MSXML, it helpfully comes in and escapes the &, giving me "Henry &amp;amp;#38; June", which produces an even more royal mess than what I already have at this point.
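The double-escaping described here is not specific to MSXML. As a rough illustration only, the sketch below uses Python's standard xml.etree.ElementTree as a stand-in for MSXML (an assumption, the thread itself uses IXMLDOMDocument from Delphi); the helper name serialize_text is made up for the example. Any serializer that treats the string as literal text will escape the leading & of a reference you typed by hand.

```python
import xml.etree.ElementTree as ET


def serialize_text(text: str) -> str:
    """Store `text` as the content of a <bar> element and serialize it."""
    bar = ET.Element("bar")
    bar.text = text  # treated as literal characters, not as markup
    return ET.tostring(bar, encoding="unicode")


# The plain string: the serializer escapes the ampersand once.
plain = serialize_text("Henry & June")

# A hand-escaped string: the "&" of "&#38;" is itself escaped again.
pre_escaped = serialize_text("Henry &#38; June")

# Reading the first form back with a conforming parser restores the text.
round_trip = ET.fromstring(plain).text

print(plain)
print(pre_escaped)
print(round_trip)
```

The point of the sketch is that the DOM stores characters, not escape sequences, so escaping by hand before handing text to the DOM tends to get escaped a second time on save.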

Also, MSXML doesn't escape certain characters I need to have done (namely <, >, ", and ' for what I'm aware of right now, mainly what's going to muck me up in the text presentation), so being able to escape characters myself would be ideal.

>you can always add a post-processing step.

This is what I was afraid of. I don't know whether reading a raw XML file into a TMemo or TRichEdit and doing string searches on it is going to mangle the XML beyond what is expected. Something I'll have to try.


 
Hi Glenn9999,

for example, if you have in your original file glenn9999.txt this text:
Code:
... foo Henry &amp;#38; June bar baz ...

then the command
Code:
$ sed 's/&amp;/\&/g;s/#38;//g' glenn9999.txt > glenn9999_correct.txt

creates corrected file glenn9999_correct.txt which contains this text with unwanted characters/strings removed:
Code:
... foo Henry & June bar baz ...

If you are interested, sed is available for Windows too.
 
I ended up doing what mikrom suggested in the code itself and got proper output now.

But if anyone knows how to shut off escaping within MSXML itself (a more elegant solution), I'm open to know how.
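For readers who want the same post-processing idea outside of sed, here is a minimal sketch in Python. It is not Glenn9999's actual code (that was Delphi and was not posted), and it makes a different choice from the sed command above: it assumes the goal is to turn a double-escaped numeric reference such as &amp;amp;#38; back into &amp;#38; rather than into a bare &. The function name restore_numeric_refs is made up for the example.

```python
import re

# An "&amp;" that sits directly in front of the body of a numeric
# character reference (decimal "#38;" or hexadecimal "#x26;").
_DOUBLE_ESCAPED_REF = re.compile(r"&amp;(#[0-9]+;|#x[0-9A-Fa-f]+;)")


def restore_numeric_refs(xml_text: str) -> str:
    """Undo one level of escaping in front of numeric character references."""
    return _DOUBLE_ESCAPED_REF.sub(r"&\1", xml_text)


fixed = restore_numeric_refs("<bar>Henry &amp;#38; June</bar>")
untouched = restore_numeric_refs("<bar>Henry &amp; June</bar>")

print(fixed)
print(untouched)
```

One caveat with this kind of text-level fix: if the original data ever contained the literal characters "&#38;" on purpose, this step would change their meaning, so it only suits data where that cannot happen.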

 
Yes, it seems a bit unusual, but sometimes when the system does not want to do what we need, we are forced to use other means to achieve our goal.
 
glenn9999 said:
But if anyone knows how to shut off escaping within MSXML itself (a more elegant solution), I'm open to know how.

That would be the 'create incorrect XML' setting? [bigsmile]

I think you are going to find a post-process step, as already suggested, to be the easiest fix.

The only alternative I have been able to devise within the context of an XML processor (MSXML and libxml) is to create an XSLT transform and specify text as the output method (xsl:output), and apply the transform to the XML document as the last step. You will go through a lot of hoops to produce almost correct XML. You can use called templates to create open and close tags in the output stream, and you must take care with output escaping. But I have created such a beast to deal with an XML consumer process that could not ingest certain aspects of well-formed XML; I had to keep it within the XML (and XSLT) realm.

Tom Morrison
Consultant
 
>That would be the 'create incorrect XML' setting?

Nice to see you around! Anyway, MSXML didn't escape 4 of the standard XML characters needed to be escaped, from my understanding (", ', >, <), so I would say it's already doing "create incorrect XML". But I digress. That particular project is done. I'm trying to read back other files now besides the ones I create and found another wrinkle, but hopefully I'll figure out what's going on. If not, I'll probably ask here again.

 
I am glad to see some activity here in the XML forum! Please feel free to come back and ask for help.

Tom Morrison
Consultant
 
Glenn9999

Finally got time to have a look at this.

I created an XML document like this.
XML:
<?xml version="1.0" encoding="utf-8"?>
<foo>
	<bar>Henry &#38; June</bar>
</foo>

After loaded through MSXML, [tt]selectNodes("foo/bar").item(0).text[/tt] returns [tt]Henry & June[/tt], as expected.

When saved using MSXML's [tt]save()[/tt] method, the resulting document is exactly this, again as expected:

XML:
<?xml version="1.0" encoding="utf-8"?>
<foo>
	<bar>Henry &amp; June</bar>
</foo>

Then, I created a new element, using MSXML's [tt]createElement("bar")[/tt], set its [tt]text[/tt] to [tt]"<>'[/tt], inserted it as foo's child with the [tt]appendChild()[/tt] method, and saved the result again.

This was the result:

XML:
<?xml version="1.0" encoding="utf-8"?>
<foo>
	<bar>Henry &amp; June</bar>
	<bar>"&lt;&gt;'</bar></foo>

Finally, I created a new attribute with [tt]createAttribute("attr")[/tt] and set its value to [tt]"<>'[/tt], and set it as a foo's attribute with [tt]attributes.setNamedItem()[/tt] method. When saved, it ended up as:

XML:
<?xml version="1.0" encoding="utf-8"?>
<foo attr="&quot;&lt;&gt;'">
	<bar>Henry &amp; June</bar>
	<bar>"&lt;&gt;'</bar></foo>

This seems ok with me and results in well-formed XML that any XML parser should be able to read. Are you doing things differently?

Note: using MSXML2.DOMDocument.6.0, as I normally do.
 
Like I said, I got what I was wanting to do here completed to satisfaction. But for learning sake (really why I'm doing all of this, this is my first serious project with XML outside of reading certain very small snippets of it for utility sake)...

>Are you doing things differently?

No. I did a similar write test and ended up with the output below (I'll admit the input on the other thing didn't have > and <, but note that " and ' remain unescaped).

<?xml version="1.0" encoding="UTF-8"?>
<base-tag>"&lt;'Henry &amp; June'&gt;"</base-tag>

Note, the presence of " caused me a major problem when it came to reading the data in the consumer program.

For those keeping score, I didn't have to do anything with the output as I had it above to read it back.

 
OK, Glenn9999, the important thing is that you got your problem solved, but for the record, escaping " and ' isn't required except inside an attribute value that is delimited by that same character. That is, attr="&quot;'" and attr='&apos;"' are perfectly fine, as is <element>"'</element>.

Based on the info you gave us, and not having access to the actual data being passed between the two systems, I would say that it is the other side of the processing that fails to comply with XML encoding rules. MSXML is working OK as it is (including reading and processing numeric character references).
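The quoting rule above can be checked with any conforming parser. The sketch below uses Python's standard xml.etree.ElementTree purely as an example parser (an assumption, it is not MSXML), and the helper name is_well_formed is made up for the illustration.

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text: str) -> bool:
    """Return True if a conforming parser accepts the document."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


# Unescaped quote and apostrophe in element content are fine.
in_text = ET.fromstring('<element>"\'</element>').text

# Inside an attribute, only the delimiter character needs escaping.
in_dq_attr = ET.fromstring('<element attr="&quot;\'"/>').get("attr")
in_sq_attr = ET.fromstring("<element attr='&apos;\"'/>").get("attr")

# A raw double quote inside a double-quoted attribute is rejected.
broken = is_well_formed('<element attr="""/>')

print(in_text, in_dq_attr, in_sq_attr, broken)
```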
 