Problem parsing XML files using JAXP

sahil77 · Mar 3, 2009

The XML files we are dealing with were generated from a database, and the entities may not have been handled properly when the files were produced. Here are the first few tags, where we encounter the exception:

Code:

<n-load>
<n-document guid="xxxx" control="xx">
<n-metadata><metadata.block>
<md.legacy.status.code>0</md.legacy.status.code><md.identifiers><md.cites>
<md.first.line.cite>&ensp</md.first.line.cite>
<md.second.line.cite>xx.........

Note that "&ensp" is where the parser throws the exception. The smallest files are 20 MB and contain countless references like this one. The file starts exactly as pasted above: it has no <!DOCTYPE ...> declaration, which is the source of the errors.
We have tried using the filter interface and the LexicalHandler interface, but we still get the same error. As I understand it, the error comes from the parser itself: JAXP uses Apache Xerces as the underlying parser, and neither of these interfaces lets us change the parser configuration. We tried the filter on the assumption that a new layer between the parser and the XML file could remove these entities before the content reaches the parser, but that does not work either. The same goes for LexicalHandler. I hope there is a solution to this problem.
Thanks
Regards
Abhishek Agarwal

k5tm · Mar 3, 2009

A proper entity would end with a semicolon, perhaps.

Tom Morrison

tsuji · Mar 3, 2009

[1] Since &ensp; is a known entity (but not one of XML's built-in entities), you have no choice but to either [1.1] repair the string at the database output level, [1.2] repair the character string during processing, before feeding it to the SAX/DOM parser, or [1.3] repair the file and save it under a different name to preserve the original.

[2] Suppose repairing at the database output level is not viable, for instance because it is out of your control. You can then do something like this for [1.2] or [1.3].
[tt]
[green]//import java.util.regex.*;
//suppose String sin being the string read via FileReader[/green]
String spattern="(&ensp)(?!;)"; //"&ensp" not followed by a semicolon
String sreplacement="<![CDATA[$1]]>"; //either this
//String sreplacement="$1;"; //or this
Pattern rxc=Pattern.compile(spattern);
String sin_repaired=rxc.matcher(sin).replaceAll(sreplacement);
//either persist sin_repaired by saving it to a file
//or feed sin_repaired to the parser afterward...
[/tt]
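For 20 MB files, reading everything into one String may be undesirable. The following is a minimal sketch of option [1.3] that copies the file line by line and applies the semicolon variant of the replacement above. The class name `EnspStreamRepair`, the method names, and the UTF-8 encoding are assumptions for illustration, not something stated in the thread. It also assumes the files contain line breaks; a file that is one giant line would still be read in one piece.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.StringWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.regex.Pattern;

// Hypothetical helper, not part of the original post.
final class EnspStreamRepair {

    // "&ensp" that is not already followed by ';'
    private static final Pattern BARE_ENSP = Pattern.compile("(&ensp)(?!;)");

    // Appends the missing semicolon to every bare "&ensp" in one line.
    static String repairLine(String line) {
        return BARE_ENSP.matcher(line).replaceAll("$1;");
    }

    // Copies in to out line by line, repairing each line on the way.
    // Line terminators are normalized to the platform separator, and the
    // last line always ends with one.
    static void repair(Reader in, Writer out) {
        try {
            BufferedReader reader = new BufferedReader(in);
            BufferedWriter writer = new BufferedWriter(out);
            String line;
            while ((line = reader.readLine()) != null) {
                writer.write(repairLine(line));
                writer.newLine();
            }
            writer.flush();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) throws IOException {
        if (args.length == 2) {
            // args: original file, repaired copy (UTF-8 is an assumption)
            try (Reader in = Files.newBufferedReader(Paths.get(args[0]), StandardCharsets.UTF_8);
                 Writer out = Files.newBufferedWriter(Paths.get(args[1]), StandardCharsets.UTF_8)) {
                repair(in, out);
            }
        } else {
            // No arguments: run on the line from the original question.
            StringWriter out = new StringWriter();
            repair(new StringReader("<md.first.line.cite>&ensp</md.first.line.cite>"), out);
            System.out.print(out);
        }
    }
}
```

Writing to a second file keeps the original untouched, as suggested in [1.3]. The repaired copy still contains &ensp; as an entity reference, so the parser still needs a declaration for it.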
[3] But since the XML file itself has no reference to a DTD, the entity would still fail to resolve later on. In that case, the simplest approach is to repair and resolve it in one go, which just means altering [2] as follows.
[tt]
[green]//import java.util.regex.*;
//suppose String sin being the string read via FileReader[/green]
String spattern="&ensp(?!;)|&ensp;"; //"&ensp" with or without the semicolon
String sreplacement=[blue]" "; //repaired & resolved[/blue]
Pattern rxc=Pattern.compile(spattern);
String sin_repaired=rxc.matcher(sin).replaceAll(sreplacement);
//either persist sin_repaired by saving it to a file
//or feed sin_repaired to the parser afterward...
[/tt]
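As a rough end-to-end check of [3], the sketch below resolves both forms of the reference and then parses the result with a JAXP DocumentBuilder. It assumes the intended replacement character is the en space U+2002; the class name `EnspResolve` and its methods are hypothetical, and the input is a small in-memory string rather than a 20 MB file.

```java
import java.io.StringReader;
import java.util.regex.Pattern;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical helper, not part of the original post.
final class EnspResolve {

    // Matches "&ensp" with or without the trailing semicolon.
    private static final Pattern ENSP = Pattern.compile("&ensp(?!;)|&ensp;");

    // Assumption: the entity stands for the en space, U+2002.
    static final String EN_SPACE = "\u2002";

    // Replaces every reference with the character itself, so the parser
    // never sees an entity it would have to look up.
    static String resolve(String xml) {
        return ENSP.matcher(xml).replaceAll(EN_SPACE);
    }

    // Parses the string and returns the text content of the root element.
    static String textOf(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            return doc.getDocumentElement().getTextContent();
        } catch (Exception e) {
            throw new IllegalStateException("parse failed", e);
        }
    }

    public static void main(String[] args) {
        String broken = "<md.first.line.cite>a&enspb&ensp;c</md.first.line.cite>";
        String text = textOf(resolve(broken));
        // Print code points so the en spaces are visible.
        text.chars().forEach(c -> System.out.printf("U+%04X ", c));
        System.out.println();
    }
}
```

Because the character is written directly into the text, no DTD or entity declaration is needed afterward.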

tsuji · Mar 4, 2009

[2.1] On re-reading what I posted, I said
>[self]
String sreplacement="<![CDATA[$1]]>"; //either this
//String sreplacement="$1;"; //or this

The uncommented version is appropriate only if &ensp is meant literally (that is, equivalent to &amp;ensp). But since it is most likely meant as an en space (half an em space), the second (commented) line should be chosen. The two options do not have the same semantics.
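To make the difference concrete, here is a small sketch that applies both replacements and parses the results with JAXP. The class `EnspSemantics`, its method names, and the inline DOCTYPE that declares the entity as U+2002 are assumptions added for illustration only.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.regex.Pattern;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper, not part of the original post.
final class EnspSemantics {

    private static final Pattern BARE_ENSP = Pattern.compile("(&ensp)(?!;)");

    // Option 1: keep the five characters "&ensp" as literal text.
    static String asLiteral(String xml) {
        return BARE_ENSP.matcher(xml).replaceAll("<![CDATA[$1]]>");
    }

    // Option 2: turn it into a proper entity reference, "&ensp;".
    static String asEntity(String xml) {
        return BARE_ENSP.matcher(xml).replaceAll("$1;");
    }

    // Returns the root element's text, or null if the parser rejects the input.
    static String textOf(String xml) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            builder.setErrorHandler(new DefaultHandler()); // keep parse errors off stderr
            Document doc = builder.parse(new InputSource(new StringReader(xml)));
            return doc.getDocumentElement().getTextContent();
        } catch (SAXException e) {
            return null;
        } catch (ParserConfigurationException | IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String broken = "<c>&ensp</c>";
        // CDATA: parses, but the text is the literal string "&ensp".
        System.out.println(textOf(asLiteral(broken)));
        // Entity reference without a declaration: the parser still rejects it.
        System.out.println(textOf(asEntity(broken)));
        // With a declaration (assumed here to be U+2002) it resolves to one character.
        String declared = "<!DOCTYPE c [<!ENTITY ensp \"&#x2002;\">]>" + asEntity(broken);
        System.out.println(textOf(declared).length());
    }
}
```

The CDATA form only makes the file parse; it does not produce a space. The entity form produces the space, but only once the entity is declared or resolved some other way.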
