
Using JTidy to parse HTML when div and span are present

Status
Not open for further replies.

karyani

Dec 10, 2010
I have the function below. After calling it, the HTML comes out corrupted: notice how the </span> is moved to just after the first line instead of staying at the end. See below. Help, please.

import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import org.w3c.tidy.Tidy;

public ByteArrayOutputStream convertToXHTML(String htmlString) {
    ByteArrayOutputStream xhtmlByteOutStream = new ByteArrayOutputStream();
    if (htmlString != null && !htmlString.equals("")) {
        // Convert HTML to XHTML using the JTidy API
        Tidy tidy = new Tidy();

        tidy.setXHTML(true);
        tidy.setDocType("omit");
        tidy.setQuoteMarks(false);
        tidy.setQuoteAmpersand(false);
        tidy.setQuoteNbsp(false);
        tidy.setFixUri(true);
        tidy.setMakeBare(true);
        tidy.setJoinStyles(true);

        // Parse the input and write the tidied XHTML to the output stream
        tidy.parse(new ByteArrayInputStream(htmlString.getBytes()),
                xhtmlByteOutStream);
    }
    return xhtmlByteOutStream;
}

--------------------
The htmlString that is sent to the above function:
<html><head><title>Message Template</title><meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-1"/><style type="text/css">@page { size: letter; margin-top:1.0in;margin-bottom:1.0in;margin-left:1.0in;margin-right:1.0in;} body { line-height:100%;} </style></head><body><span style="font-size: 12px; font-family: Arial;"><span style="font-size: 12px; font-family: Arial;"><span style="font-size: 12px; font-family: Arial;">First line<br><div style="text-align: center;" align="center">second line<br></div>third line</span></span></span></body></html>


The output after the call to the above function:

<html xmlns="<head>
<meta name="generator"
content="HTML Tidy for Java (vers. 26 Sep 2004), see />
<title>Message Template</title>
<meta http-equiv="Content-Type"
content="text/html; charset=ISO-8859-1" /><style type="text/css">
/*<![CDATA[*/
@page { size: letter; margin-top:1.0in;margin-bottom:1.0in;margin-left:1.0in;margin-right:1.0in;} body { line-height:100%;}
/*]]>*/
</style>
</head>
<body>
<span style="font-size: 12px; font-family: Arial;"><span
style="font-size: 12px; font-family: Arial;"><span
style="font-size: 12px; font-family: Arial;">First line<br />
</span></span></span>
<div style="text-align: center;" align="center">second line<br />
</div>
third line
</body>
</html>
 
I've never worked with JTidy, but I see most people use the parseDOM method, like here

Cheers,
Dian
 
I think JTidy has made the best guess it could. The point is that span is an "inline" element, whereas div is a "block"-level element. An inline element may contain other inline elements but not block-level elements, even in plain HTML. Hence the HTML string passed to it was already invalid. JTidy apparently does what it can to salvage the situation, closing the spans before the div, and produces its best-guess XHTML.
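
To find this kind of nesting before handing a string to JTidy, something along these lines might help. It is only a rough sketch using the standard library: the class name NestingCheck, the method findViolations, and the short element lists are all made up for illustration and are far from a complete model of HTML. It is not part of JTidy and does not parse HTML properly; it just scans tags with a regular expression and reports a block-level element opened while an inline element is still open.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of JTidy: flags block-level elements
// that are opened while an inline element is still open.
class NestingCheck {
    // Deliberately small, incomplete samples of each category (assumption).
    private static final Set<String> INLINE =
            Set.of("span", "a", "b", "i", "em", "strong", "font");
    private static final Set<String> BLOCK =
            Set.of("div", "p", "table", "ul", "ol", "h1", "h2", "h3", "blockquote");
    // Elements that never have a closing tag, so they are never pushed.
    private static final Set<String> VOID =
            Set.of("br", "hr", "img", "meta", "input", "link");

    // Captures an optional leading slash and the tag name.
    private static final Pattern TAG =
            Pattern.compile("<(/?)([a-zA-Z][a-zA-Z0-9]*)[^>]*>");

    static List<String> findViolations(String html) {
        List<String> violations = new ArrayList<>();
        Deque<String> open = new ArrayDeque<>(); // head = innermost open element
        Matcher m = TAG.matcher(html);
        while (m.find()) {
            String name = m.group(2).toLowerCase(Locale.ROOT);
            boolean closing = !m.group(1).isEmpty();
            if (VOID.contains(name)) {
                continue;
            }
            if (closing) {
                // Pop up to and including the matching open element, if there is one.
                if (open.contains(name)) {
                    while (!open.pop().equals(name)) {
                        // discard unclosed inner elements
                    }
                }
                continue;
            }
            if (BLOCK.contains(name)) {
                // Iteration runs from the innermost open element outwards.
                for (String ancestor : open) {
                    if (INLINE.contains(ancestor)) {
                        violations.add("<" + name + "> inside <" + ancestor + ">");
                        break;
                    }
                }
            }
            open.push(name);
        }
        return violations;
    }

    public static void main(String[] args) {
        String body = "<span>First line<br><div>second line<br></div>third line</span>";
        System.out.println(findViolations(body)); // prints [<div> inside <span>]
    }
}
```

If the check reports something, the usual fix is on the input side: change the outer span elements to div, or move the div out of the spans, so that JTidy has nothing to repair.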
 