How best to handle illegal characters in XML...

Status
Not open for further replies.
Jan 19, 2000
57
US
Hi there!

I haven't had much experience with XML, but a few months ago, I wrote a program that reads in a customer-submitted XML file. That particular file contained a large number of ampersands (e.g. DiCarlo & Sons Plumbing).

I added a find and replace function to the program that replaced the &'s with &amp;.

Fast forward a few months...a second customer has started submitting XML files to us. Now, I basically took the same program from the first XML file and applied it to the new file. Chaos!

The second XML file contains escaped characters (&amp; instead of &, &apos; instead of apostrophes). My program is replacing the "&" with &amp;, so the &amp; in the file becomes &amp;amp;. Not good.

Now, it is clear to me that I did a cruddy hack job to process the first file. The second XML file is actually exactly as it should be. When I take out my find & replace function, the program runs beautifully...the &amp; and &apos; entities are automagically converted to actual & and ' characters by .NET's wonderful XML methods.
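To illustrate the behaviour described above, here is a minimal sketch in Python rather than .NET (an assumption made purely for the illustration; the customer name is the one from this post plus a made-up apostrophe). A conforming XML parser decodes the predefined entities on its own and rejects a stand-alone ampersand outright.

```python
import xml.etree.ElementTree as ET

# Properly escaped input: the parser decodes the entities for us.
good = "<customer>DiCarlo &amp; Sons&apos; Plumbing</customer>"
decoded = ET.fromstring(good).text
print(decoded)  # DiCarlo & Sons' Plumbing

# A stand-alone ampersand is not well-formed XML, so the parser rejects it.
bad = "<customer>DiCarlo & Sons Plumbing</customer>"
try:
    ET.fromstring(bad)
    rejected = False
except ET.ParseError as err:
    rejected = True
    print("not well-formed:", err)
```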

My question is this: What is the best way to deal with a poorly formatted XML file (i.e. one containing stand-alone "&" characters)? Is there a set method of dealing with these? How do you differentiate between stand-alone "&" characters and those contained in an escaped string like "&apos;"?

Any guidance would be appreciated.

- Mikeymac
 
First, I don't know what it is about this word "automagically," but I absolutely loathe it. It drives me crazy... Ahh well, just a personal annoyance.

As for the problem you are running into, I would say you could loop through the file looking for & and &amp; (or other escaped characters) and replace just those that are not already escaped, OR you could replace &amp;, &apos;, and all other escape codes with their literal characters, then replace all of those characters with the escape codes.

for example:
Code:
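' xmlText is a placeholder for the raw XML text you read in
Dim myString As String = xmlText
' Unescape first (&amp; last, so "&amp;apos;" is not decoded twice).
' Replace returns a new string, so assign the result back.
myString = myString.Replace("&apos;", "'")
myString = myString.Replace("&amp;", "&")

' Then escape everything again (& first, so the new entities are not re-escaped).
myString = myString.Replace("&", "&amp;")
myString = myString.Replace("'", "&apos;")
' Caveat: apply this to text values only; on a whole document it also escapes attribute quotes.

That way, if the XML has already been escaped, you won't wind up with &amp;amp;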

-Rick

VB.Net Forum forum796 forum855 ASP.NET Forum
[monkey]I believe in killer coding ninja monkeys.[monkey]
 
Rick,

I apologize for my use of the neologism "automagically". I also tend to dislike such terms, and am rather surprised that I included it in a post. Such words always reek of effeminacy when uttered by a man. Some of my least favorite neologisms:

anyhoo
bassackwards
adultolescence
blog (people who say the word "blog" tend to draw out the "o"...drives me nuts.)

Also, some commonly heard phrases that make no sense:
"I could care less." = If you COULD "care less", then that means that you currently care to some degree.

"By the by" = What the what?

"Teach his own" = It is "To each his own", you moron!

"fell between the cracks" = If you fall BETWEEN the cracks, then you'll be just fine. Falling THROUGH or INTO the cracks is what should keep you up at night.

I'm doing something similar to your second solution now. I was just wondering if I was missing out on some automa(T)ic feature of .NET's XML methods that handles these issues.

Thank you for your feedback, Rick!

 
No problem Mikey, and no offence meant on the automagical thing. Those are some great examples you posted, though. I have an online teacher who uses "automagically" regularly. If I weren't 2 weeks from graduation I'd totally tear into the lady.

-Rick

 
I had to deal with this problem at the last job.

Ideally, you go to whoever supplied the file and get them to properly escape those reserved characters. Use physical force, if necessary. :)

Failing that, see if they'll put them into a CDATA section.
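As a small illustration of the CDATA option (a sketch in Python, not .NET, and the element name is made up): inside a CDATA section the parser treats "&" as literal text, so the sender does not have to escape anything. The one sequence that cannot appear inside the section is "]]>", since that is what ends it.

```python
import xml.etree.ElementTree as ET

# The ampersand is left unescaped, but it sits inside a CDATA section.
doc = "<customer><![CDATA[DiCarlo & Sons Plumbing]]></customer>"
name = ET.fromstring(doc).text
print(name)  # DiCarlo & Sons Plumbing
```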

And if that can't be done, you've got no choice but to fix it yourself by doing a string-based replace. For large files this can be troublesome, so you'll want to read it from disk in manageable "chunks". In each chunk, search for ampersand characters, then look at the following characters to make sure they aren't already "amp;" (I had a customer that mixed it up - sometimes they escaped it, sometimes they didn't, all in the same file. Joy.)

There's an edge case you have to consider, where an ampersand is the last character in a chunk (or close enough to the end that the rest of the entity spills into the next chunk). For this, set a flag to remind yourself to check the first few characters of the next chunk for the remainder of the entity.
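The chunked approach above could be sketched roughly as follows. This is Python rather than .NET, and everything named here (escape_bare_ampersands, BARE_AMP, MAX_ENTITY_LEN, the chunk size) is hypothetical. Instead of a flag, the sketch holds back a trailing "&..." fragment and prepends it to the next chunk, which covers the same edge case. It assumes only the five predefined entities and numeric character references count as "already escaped", and it does not skip CDATA sections or comments, where a bare "&" is legal.

```python
import io
import re

# An ampersand that does NOT start a predefined entity or numeric reference.
BARE_AMP = re.compile(r"&(?!(?:amp|lt|gt|quot|apos);|#[0-9]+;|#x[0-9A-Fa-f]+;)")
MAX_ENTITY_LEN = 10  # assumption: longest reference expected, e.g. "&#x10FFFF;"

def escape_bare_ampersands(reader, writer, chunk_size=64 * 1024):
    carry = ""  # tail of the previous chunk that might be a split entity
    while True:
        chunk = reader.read(chunk_size)
        if not chunk:
            break
        buf = carry + chunk
        carry = ""
        idx = buf.rfind("&")
        # Hold back a trailing "&..." with no ";" yet: the rest may be in the next chunk.
        if idx != -1 and len(buf) - idx < MAX_ENTITY_LEN and ";" not in buf[idx:]:
            buf, carry = buf[:idx], buf[idx:]
        writer.write(BARE_AMP.sub("&amp;", buf))
    # Anything still held back at end of input cannot be a complete entity.
    writer.write(BARE_AMP.sub("&amp;", carry))

# Usage with in-memory streams and a deliberately tiny chunk size.
source = io.StringIO("A &amp; B & C &lt; D &#38; E&")
target = io.StringIO()
escape_bare_ampersands(source, target, chunk_size=3)
print(target.getvalue())
```

Holding back at most MAX_ENTITY_LEN - 1 characters keeps memory use bounded no matter how the chunks fall.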

Chip H.


____________________________________________________________________
If you want to get the best response to a question, please read FAQ222-2244 first
 
Thanks, Chip!

I'm glad to see that I'm not missing something.

I know that XML is now the preferred format for transferring data from one system to another, but I now appreciate that it actually requires quite a bit of teamwork (between the sender and the recipient) to ensure that it is formatted in a way that is useful to both parties.
 
Just a thought here. Wouldn't this be a perfect place to use a regular expression (regex)? You can do an IsMatch and a Replace. Just a thought.
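A rough sketch of the regex idea, written in Python for brevity (fix_ampersands and BARE_AMP are made-up names). The negative lookahead matches only an "&" that is not already the start of a predefined entity or numeric character reference; .NET's Regex.Replace accepts the same lookahead syntax. It assumes the file uses no custom DTD-defined entities and has no CDATA sections that should be left alone.

```python
import re

# "&" not followed by amp; lt; gt; quot; apos; or a numeric reference.
BARE_AMP = re.compile(r"&(?!(?:amp|lt|gt|quot|apos);|#[0-9]+;|#x[0-9A-Fa-f]+;)")

def fix_ampersands(text):
    """Escape only the ampersands that are not already part of an entity."""
    return BARE_AMP.sub("&amp;", text)

mixed = "DiCarlo & Sons &amp; Daughters&apos; Plumbing"
print(fix_ampersands(mixed))
# DiCarlo &amp; Sons &amp; Daughters&apos; Plumbing
```

Because already-escaped entities are never matched, running the function twice gives the same result as running it once, which is what the original find-and-replace lacked.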
 
The problem is that no one reads the W3C spec. They buy a copy of "Learn XML Overnight in Your Sleep" and think they're an expert.

Wait until they start sending you dates. That's another area where people get it totally wrong.

The XML Schema spec says that dates (xs:date, xs:dateTime) have to be in ISO 8601 format, but people invariably put them into an xs:string element, formatted however their PC happens to be set up. Makes for a lot of fun when someone in Europe has to send date info to someone in the US.
:)
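To make the date problem concrete, here is a tiny sketch in Python (the date itself is made up for the example). The same locale-formatted string parses to two different days depending on which side of the Atlantic you assume, while the ISO 8601 form used by xs:date needs no guessing.

```python
from datetime import date, datetime

# Locale-style strings are ambiguous: is this March 4 or April 3?
ambiguous = "03/04/2006"
as_us = datetime.strptime(ambiguous, "%m/%d/%Y").date()
as_eu = datetime.strptime(ambiguous, "%d/%m/%Y").date()
print(as_us.isoformat())  # 2006-03-04
print(as_eu.isoformat())  # 2006-04-03

# The ISO 8601 form (year-month-day) parses the same way everywhere.
unambiguous = date.fromisoformat("2006-03-04")
print(unambiguous)  # 2006-03-04
```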

Chip H.


 