Converting XML to Cursor Stopped due to invalid unicode Character 1

Status
Not open for further replies.

Sng1

Programmer
Aug 3, 2021
65
IN
I have an XML file. I am using 'MSXML2.DomDocument' to read the document, but if the XML contains an invalid character, MSXML2.DomDocument raises a parsing error. I used FILETOSTR() and then STRTRAN() to replace the string, and then STRTOFILE(), but the character is still not removed.


Code:
 
Sng1, can you provide a sample XML document that raises the error?
 
Sorry, I had attached the code along with the file, but now I see the file is not attached and even the code is blank. As I remove &#4; from the XML file, the document is successfully parsed.
*******************************************************
* ReadXML
*
* Read an XML file of a structured format that would
* be similar to an INI settings file, traverses
* through the entire structure as an example
*******************************************************
Create Cursor tmpRange(Agroup c(50),Pagroup c(50),SORTPOSITION N(6))
bb=ReadXML('C:\Users\om\Downloads\Datagroup.xml')

Function ReadXML
Lparameters pcXMLFile
Local oXML, oRootNode, cParentName, cValue, cName, nType, nNumNodes, ;
cRootTagName, oNodeList, oNode, bHasChild, cTagName, ;
oChildNodeList, nChildLen, nPass, oChildNode, oAttributeList, ;
cTextData, nTextDataLen, cAttrName, cAttrValue, nNumAttr


* Start out by creating the actual xml parser object
oXML = Createobject('MSXML2.DomDocument')

* Wait for the document to be parsed and loaded
oXML.Async = .F.

* Load the document into the object, if this was a stream instead
* of a file name, we would use loadXML(cCharStream)
If oXML.Load(pcXMLFile)
*- Good No Error reported
Else
* The document failed to load.
Local strErrText
*LOCAL xPE As MSXML2.IXMLDOMParseError

* Obtain the ParseError object
xPE = oXML.parseError
With xPE
strErrText = "Your XML Document failed to load " + ;
"due to the following error." + Chr(13) + ;
"Error #: " + Transform(.errorCode) + ": " + xPE.reason + ;
"Line #: " +Transform(.Line) + Chr(13) +;
"Line Position: " + Transform(.linepos )+ Chr(13) + ;
"Position In File: " + Transform(.filepos )+ Chr(13) + ;
"Source Text: " + .srcText + Chr(13) + ;
"Document URL: " + .url
ENDWITH
Messagebox(strErrText)
Retu

Endif

* Get the root element
oRootNode = oXML.documentElement

* What is the root tag name?
cRootTagName = oRootNode.tagName

* Get all the nodes in the document with the special '*'
* parameter, we could just pass in a tag name to get the
* node list for that specific tag
oNodeList = oRootNode.getElementsByTagName("*")

* How many nodes did we retrieve
nNumNodes = oNodeList.Length

* Go through all the nodes in the NodeList.
* Note that Attribute and Character/Text Data is NOT
* counted as part of this list, you must get that data
* separately, this list only contains tag elements
* Note that this uses C like array positioning by
* starting at zero
For nPos = 0 To (nNumNodes-1) Step 1
* Get the next node in the list
oNode = oNodeList.Item(nPos)

* What is the value of this node, if it is an element
* then this value is the tag name
cParentName = oNode.nodeName

* Does this node have any children?
bHasChild = oNode.hasChildNodes()

* What is the node type, element or text?
nType = oNode.nodeType
If cParentName = 'GROUP'
Select tmpRange
Append Blank
Replace Agroup With oNode.Attributes.getNamedItem('NAME').Text
Strtofile('Group : ' +oNode.Attributes.getNamedItem('NAME').Text + Chr(13),'Om.txt',1)
Endif
If nType = 1
* This is an element/tag so it may have
* attributes. We can get those attributes
* by name or in a list.
* Since this example function traverses thru
* the xml tree, it would not be very efficient
* to query every single node for a particular
* attribute, this is just to show how it could
* be done.

* if the attribute does not exist, returns .NULL.
* otherwise we get the attribute value
cAttrValue = oNode.getAttribute("my_attribute")
* cAttrValue = oNode.getAttribute("PARENT")
* We could also get a NamedNodeMap accessing
* the attributes property
oAttributeList = oNode.Attributes

* how many attributes do we have
nNumAttr = oAttributeList.Length
* Get the attribute using the list
cAttrValue = oAttributeList.getNamedItem("my_attribute")
*STRTOFILE(cParentName + '---' + ALLTRIM('') + CHR(13)+ ',','Om.txt',1)


Endif

If bHasChild
* Ok, we know we have children but what type are they?
* Just test the first one to see if it is something
* other than an element and if so, get it
If oNode.firstChild.nodeType != 1
* We know we have something other than an element, get
* the tag name of the element we are parsing
cTagName = oNode.tagName

* Get the node list and determine how many child
* nodes this element has
oChildNodeList = oNode.childNodes
nChildLen = oChildNodeList.Length

* Go through all child nodes and grab the non-element
* data to do with what you like
For nPass = 0 To (nChildLen-1) Step 1
oChildNode = oChildNodeList.Item(nPass)
cValue = oChildNode.nodeValue
cName = oChildNode.nodeName
nType = oChildNode.nodeType
bHasChild = oChildNode.hasChildNodes()

* For now just look for text types, other
* types can be added later if needed
Do Case
Case nType = 3
* Text node
cTextData = oChildNode.Data
nTextDataLen = oChildNode.Length
Case nType = 4
* CData Section node
cTextData = oChildNode.Data
nTextDataLen = oChildNode.Length
Otherwise
* Some other node we don't care about
* right now
Endcase
Strtofile(cTagName + ':' + Alltrim(cTextData) + ',','Om.txt',1)
*?cTextData
Do Case
Case Upper(Allt(cTagName)) = 'PARENT'
Select tmpRange
Replace Pagroup With Alltrim(cTextData)
Case Upper(Allt(cTagName)) = 'SORTPOSITION'
Select tmpRange
Replace &cTagName With Val(cTextData)
Endcase
Endfor
Strtofile(Chr(13),'Om.txt',1)
Endif
Endif
Endfor
Select tmpRange
Brow
Return 0
Endfunc
 
 https://files.engineering.com/getfile.aspx?folder=07dd4330-52c9-4c07-9f8e-629a0fbd6544&file=DataGroup.xml
Sng1 said:
As I remove &#4; from the XML file, the document is successfully parsed.
That's a process you must execute. The XML document that you shared is not well-formed due precisely to the inclusion of that invalid entity. Any XML parser will complain and won't process it.

The <U+0004> character (like the other characters in the range <U+0001> to <U+001F>, excluding the CR, LF, and HTAB characters) is only allowed in XML documents of version "1.1", and even there only as a character reference; <U+0000> is not allowed in any version. That is not the case here (an XML document, by default, is "1.0").

&#4; must be removed at the source. If the creator of the document is not able or willing to correct the document, then you must do it yourself.
 
Yes, I tried to do that but was unable to get it done. I added the lines below, but the new file still contains the invalid character.

Filev = 'C:\Users\om\Downloads\Datagroup1.xml'
lcMyString = FiletoStr(pcXMLFile)
STRTRAN(lcMyString, '&#4;','')
STRTOFILE(lcMyString, Filev)
 
Sng1,

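The file is in UNICODE. If you want to locate and remove a string in the file, you must first encode the search string as UNICODE. Note also that STRTRAN() returns the modified string and does not change lcMyString in place, so its result must be assigned or passed on.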

Code:
LOCAL XML AS MSXML2.DOMDocument60
LOCAL XMLFile AS String
LOCAL XMLFileCorrected AS String
LOCAL XMLSource AS String
LOCAL InvalidXMLChar AS String

m.XML = CREATEOBJECT("MSXML2.DOMDocument.6.0")

m.XML.Async = .F.

m.XMLFile = GETFILE("xml")

IF !m.XML.Load(m.XMLFile)

	m.InvalidXMLChar = STRCONV("&#4;", 5)

	m.XMLFileCorrected = ADDBS(JUSTPATH(m.XMLFile)) + JUSTSTEM(m.XMLFile) + "-corrected." + JUSTEXT(m.XMLFile)
	
	STRTOFILE(STRTRAN(FILETOSTR(m.XMLFile), m.InvalidXMLChar, ""), m.XMLFileCorrected)

	IF !m.XML.Load(m.XMLFileCorrected)

		? "Sorry, failed to load even after removing invalid characters..."

	ELSE

		? "Ok, successfully parsed after removing invalid characters."

	ENDIF

ELSE

	? "Loaded without problems."

ENDIF
 
Thanks a lot atlopes, it worked. Is there any way to get a list of the invalid characters if they occur in an XML file, so that they can be removed afterwards? This file comes from third-party software, and other invalid Unicode characters may turn up depending on the user's data.
 
I know for sure that a bare ampersand can't be used inside an element in an XML document.
Found out the hard way when sending files to our version of IRS. Two of the companies that use my software had ampersands in their company name, like "Johnson & Sons". Had to change it to "Johnson and Sons" :)
 
Is there an additional file with xsd or dtd extension?

I ask because characters or byte combinations can be invalid for reasons other than the codepage (this XML is UCS-2 LE with a BOM).
As Dan said, in some cases even characters that are valid in the codepage, be it Latin-1, Windows-1252, or UTF-8, can be disallowed. And & isn't a character missing from UCS-2; it even exists in ANSI.

It becomes quite clear with the characters < and >: they surely are not unusual, but they are used for the XML tags, so if a value contains them they are expected to be written as &lt; (less than) and &gt; (greater than). Likewise, & can be in XML as &amp; (ampersand); there are many places in your file with the ampersand encoded this way, and parsing those causes no problem.

The &#4; is the numeric character reference (the HTML-entity style notation) for CHR(4), the control character EOT (End of Transmission), and such control characters are usually not allowed in xs:string. EOT in a name field looks like an error in the source data, even before it was turned into XML, as EOT has no business in textual values like a company name. A user might have unintentionally pressed CTRL+D and, as the character is invisible, never corrected it.

There might be a special, perhaps even non-standard user defined XML type (remember the X is for eXtensible), which also allows CHR(4) or any byte in its HTML entity form of &#4; or similar. Therefore the question about xsd or dtd. When the XML parser is pointed to them it might get through as valid.

It smells more like a very loosely programmed XML generation, where all data is put through a conversion of any non-printable character into the &#999; form. Instead, ANSI control characters (bytes 0-31) had better be skipped altogether. TAB, CR, and LF might be allowed as exceptions, even without turning them into their HTML entities.

There is no general list of invalid characters, but for this case you could filter out any &#1; to &#255;

The final conversion to UCS-2 must be considered, too, when you want to find them, as atlopes showed. More generally, you have to search for unwanted characters or HTML entities in the codepage of the document. That codepage is not always defined with a BOM, and it can even be hard to make out when it's written in <?xml version="1.0" encoding="encoding name" standalone="no" ?>, because the encoding name is itself encoded in that codepage: you don't find UCS-2 LE in an XML file using VFP's ANSI string functions, as each letter is 2 bytes in UCS-2. It's one of the things that's weird about the XML specs, as it means you have to know the encoding in advance to be able to verify and confirm it.

We're still often quite lucky in VFP, as UTF-8 is often used and is the same as ANSI for any character below CHR(128), and that is enough for most texts, not only English ones. You don't use the VFP XML functions but MSXML2.DomDocument, which solves the problem of encoding detection; I guess MSXML also takes the BOM of the file into account (the first few bytes define the encoding). Indeed, a BOM is a good idea, as it's binary information before the first actual XML character and solves the recursive determination of the encoding, but a BOM isn't required for every encoding and some parsers would even raise an error on it.

To summarize: The only way to enhance your code would be to look for any case the Load returns .F., take the lineno and offset to look for the invalid character(s) and include them in a preprocessing list. From this case I'd even add any &#999; as forbidden. But you can forget to have a general way of removing invalid data aside from letting the parser tell you in the error. As you can see from this case, it's not just a single byte, not even a single UCS-2 character, but an HTML entity, so even when you'd parse out the error position in the XML it needs your intervention to fix the problem.

To be clear, this XML problem isn't caused by you and doesn't point out MSXML parsing has problems, but you can't have a bulletproof solution to this, if only because XML is extensible and any XML document can come up with a new problem you haven't faced before if people do their own thing.

You might as well ask for a grammar and spell checking module that proofreads and corrects any text. Or, even shorter, for a working autocorrect for natural language.



Chriss
 
Sng1 said:
Is there any way to get a list of the invalid characters if they occur in an XML file, so that they can be removed afterwards?

As I told you before, all codepoints from 0 to 31 excluding 9, 10, and 13, are invalid in XML documents of version "1.0".

But I would suggest that you deal with the different situations as they arise. Even for the codepoint 4, &#4; is just one of the possible notations in an XML document. It is equivalent to &#x4;, or to &#0004;, or to &#x0004;. In fact, the use of arbitrary lengths of leading zeros in the decimal or hexadecimal notations is allowed.
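
A minimal sketch of how those notations could be caught in one pass, assuming the XML text has already been converted to an ANSI string (the file in this thread is UTF-16, so it would need STRCONV(..., 6) first). The function names StripInvalidCharRefs and IsForbiddenCharRef are hypothetical, not part of VFP or MSXML, and the sketch only looks at numeric character references, not at raw control bytes.

Code:
* Usage example; the sample string is made up for illustration.
? StripInvalidCharRefs("A&#4;B&#x0004;C&#0004;D&#65;E")   && prints ABCD&#65;E

* Hypothetical helper: removes numeric character references that resolve
* to code points 0-31, except 9, 10, and 13, from an ANSI string.
FUNCTION StripInvalidCharRefs
LPARAMETERS tcXml
LOCAL lcResult, lnStart, lnEnd

lcResult = ""
DO WHILE .T.
	lnStart = AT("&#", m.tcXml)
	IF m.lnStart = 0
		EXIT
	ENDIF
	* Keep everything in front of the reference candidate
	lcResult = m.lcResult + LEFT(m.tcXml, m.lnStart - 1)
	tcXml = SUBSTR(m.tcXml, m.lnStart)
	lnEnd = AT(";", m.tcXml)
	IF m.lnEnd = 0
		EXIT
	ENDIF
	IF IsForbiddenCharRef(SUBSTR(m.tcXml, 3, m.lnEnd - 3))
		* Drop the whole reference, including "&#" and ";"
		tcXml = SUBSTR(m.tcXml, m.lnEnd + 1)
	ELSE
		* Not a forbidden reference: keep "&#" and continue after it
		lcResult = m.lcResult + "&#"
		tcXml = SUBSTR(m.tcXml, 3)
	ENDIF
ENDDO
RETURN m.lcResult + m.tcXml
ENDFUNC

* Hypothetical helper: tcDigits is the text between "&#" and ";".
FUNCTION IsForbiddenCharRef
LPARAMETERS tcDigits
LOCAL lcDigits, lcValid, lnValue, lnI

IF LEFT(m.tcDigits, 1) == "x"
	* Hexadecimal notation (XML only accepts a lowercase x here)
	lcDigits = UPPER(SUBSTR(m.tcDigits, 2))
	lcValid = "0123456789ABCDEF"
ELSE
	lcDigits = m.tcDigits
	lcValid = "0123456789"
ENDIF
* Reject empty strings and anything that is not purely digits
IF LEN(m.lcDigits) = 0 OR LEN(CHRTRAN(m.lcDigits, m.lcValid, "")) > 0
	RETURN .F.
ENDIF
* Any number of leading zeros is allowed, so strip them first
DO WHILE LEN(m.lcDigits) > 1 AND LEFT(m.lcDigits, 1) == "0"
	lcDigits = SUBSTR(m.lcDigits, 2)
ENDDO
IF LEN(m.lcDigits) > 2
	RETURN .F.   && three or more significant digits: value is at least 100
ENDIF
lnValue = 0
FOR lnI = 1 TO LEN(m.lcDigits)
	lnValue = m.lnValue * LEN(m.lcValid) + AT(SUBSTR(m.lcDigits, m.lnI, 1), m.lcValid) - 1
ENDFOR
RETURN m.lnValue < 32 AND !INLIST(m.lnValue, 9, 10, 13)
ENDFUNC

Valid references such as &#65; are left untouched, which is why the sketch evaluates the number instead of stripping every &#...; it finds.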
 
Thanks Dan, Chris and Atlopes for your valuable suggestions.
 
Still sorry for being longwinded. I'd be rigorous and disallow any &#...; or with * wildcard &#*;

But atlopes still is right about any variation. If someone does not convert all control characters to HTML entities they could be in an XML document still, really just a stray byte 0-31, except 9,10, and 13. On the other hand such bytes can be part of valid 2-byte Unicode characters, so you can't just remove them.

So the first part of an XML grammar correction would need to determine the codepage and parse the file in terms of validly encoded characters as the most detailed level, not byte by byte. And after that, look for invalidly HTML-entity encoded characters, but still in the XML encoding. In short, you need better capabilities than VFP's limited double-byte character string functions.

You may try to do the inverse conversion of the XML file to ANSI only with STRCONV and then can use the usual VFP string functions to remove bytes 0-31 and any HTML entity starting with &# and ending in ;.
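
A minimal sketch of that inverse conversion, assuming the file is UTF-16 LE with a BOM (as in this thread) and that all of its characters exist in the current ANSI codepage; anything else would be lost by STRCONV(..., 6), so this is not a general solution. It only removes raw control bytes and the plain decimal notation such as &#4;, not the leading-zero or hexadecimal variants, and the "-cleaned" file name is just an illustration.

Code:
LOCAL lcFile, lcRaw, lcAnsi, lcCleaned, lnCode

lcFile = GETFILE("xml")
IF EMPTY(m.lcFile)
	RETURN
ENDIF

lcRaw = FILETOSTR(m.lcFile)

* Skip the UTF-16 LE byte order mark (bytes 0xFF 0xFE), if present
IF LEFT(m.lcRaw, 2) == CHR(255) + CHR(254)
	lcRaw = SUBSTR(m.lcRaw, 3)
ENDIF

* Unicode to ANSI, so that the usual VFP string functions can be used
lcAnsi = STRCONV(m.lcRaw, 6)

FOR lnCode = 0 TO 31
	IF !INLIST(m.lnCode, 9, 10, 13)
		* Remove the raw control byte and its plain decimal reference
		lcAnsi = STRTRAN(m.lcAnsi, CHR(m.lnCode), "")
		lcAnsi = STRTRAN(m.lcAnsi, "&#" + TRANSFORM(m.lnCode) + ";", "")
	ENDIF
ENDFOR

* Back to UTF-16 LE with a BOM, so the file keeps its original encoding
lcCleaned = ADDBS(JUSTPATH(m.lcFile)) + JUSTSTEM(m.lcFile) + "-cleaned." + JUSTEXT(m.lcFile)
STRTOFILE(CHR(255) + CHR(254) + STRCONV(m.lcAnsi, 5), m.lcCleaned)

Converting back before saving matters because the parser would otherwise be handed ANSI bytes for a document that was originally UTF-16.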
 