Parse HTML document with JTidy

Status
Not open for further replies.

chigley

Programmer
Sep 30, 2002
104
GB
I need to parse an HTML document using JTidy. I have found examples on the web and am making good headway; I just need a steer on how to traverse the resulting XML document.

The HTML that I am trying to parse looks like the following:

<html>
<body>
<div class="person">
<H2>Charlie</H2>
</div>

<div class="address">
<H2>1 Acacia Avenue</H2>
</div>
</body>
</html>

The Java I am using creates a URL object, opens an input stream from it, and then uses JTidy to build an XML DOM. So far so good.

The only method on the tidyDOM document that I can find for this is getElementsByTagName. Here the tag name is "div", but what if I just want the div tags with a class of "person"?

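The code so far is:

// Needs java.io.InputStream, java.net.URL, org.w3c.dom.Document,
// org.w3c.dom.NodeList and org.w3c.tidy.Tidy
URL u = new URL(url); // url is a String holding the page address
InputStream is = u.openStream();

// Parse the page quietly into a W3C DOM document
Tidy tidy = new Tidy();
tidy.setQuiet(true);
tidy.setShowWarnings(false);
Document tidyDOM = tidy.parseDOM(is, null);
NodeList divTags = tidyDOM.getElementsByTagName("div");

for (int i = 0; i < divTags.getLength(); i++) {
    // This is where I am having problems
}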


Please note that all the usual exception handling is in place; it is just omitted for clarity.

Thanks in advance.

Charlie
 
Off the top of my head:

Code:
for (int i=0;i<divTags.getLength();i++){
//This is where I am having problems
if (divTags.item(i).getAttributes().getNamedNode("class")!=null)

// do stuff
}

Cheers,
Dian
 
This is good, but the method getNamedNode does not exist. The closest match I could find was getNamedItem, which returns the named attribute as a single Node (or null if the element has no such attribute).

Is there a way to return only the div tags that have class="person", so that I end up with a list of person nodes?
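
One way to approach the question above is to filter the result of getElementsByTagName by the value of each element's class attribute. The sketch below is not from the thread. It assumes that the Document returned by JTidy's parseDOM behaves like any other org.w3c.dom.Document, and it uses the JDK's own XML parser on well-formed markup as a stand-in so that it runs without JTidy on the classpath. The names DivClassFilter, byClass and parse are hypothetical, chosen only for illustration.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

class DivClassFilter {

    // Returns every <tagName> element whose class attribute equals className exactly.
    static List<Element> byClass(Document doc, String tagName, String className) {
        List<Element> matches = new ArrayList<>();
        NodeList candidates = doc.getElementsByTagName(tagName);
        for (int i = 0; i < candidates.getLength(); i++) {
            Node candidate = candidates.item(i);
            // getNamedItem returns the attribute node, or null if it is absent
            Node classAttr = candidate.getAttributes().getNamedItem("class");
            if (classAttr != null && className.equals(classAttr.getNodeValue())) {
                matches.add((Element) candidate);
            }
        }
        return matches;
    }

    // Stand-in for tidy.parseDOM(is, null): parses well-formed markup with the JDK parser.
    static Document parse(String xhtml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xhtml.getBytes(StandardCharsets.UTF_8)));
        } catch (Exception e) {
            throw new IllegalStateException("could not parse sample markup", e);
        }
    }

    public static void main(String[] args) {
        Document doc = parse("<html><body>"
                + "<div class=\"person\"><h2>Charlie</h2></div>"
                + "<div class=\"address\"><h2>1 Acacia Avenue</h2></div>"
                + "</body></html>");
        List<Element> people = byClass(doc, "div", "person");
        System.out.println(people.size() + " person div(s) found");
    }
}
```

With the JTidy code from the first post, the call would presumably be byClass(tidyDOM, "div", "person"). The comparison is an exact match, so an element with class="person vip" would be skipped unless the attribute value is split on whitespace first.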
 
OK, I got further, though I am still up against problems. I have managed to drill down to the level of the HTML document I want, for example:

<div class="parentdiv">
<div class="childdiv">

some text
some more text
</div>
</div>

Now let's say I want to get all the text stored between the inner div tags. This is where I am having problems.

So I have enumerated all the div tags, and checked the class attribute to get the parentdiv tag. Then I use the getFirstChild method to get the childdiv tag. How do I get the text in the middle?

I tried getTextContent() to no avail. Any ideas?
 
I can narrow the problem down still further.

If the HTML is

<div class="somedata">Some Data</div>

the getNodeValue() method returns "Some Data".

What if the html is

<div class="somedata">

Some data

Some more data

</div>

I keep getting null pointer exceptions if I try to use the getNodeValue() method. The getTextContent() method also throws a null pointer exception.

Just how do I get at the data between the div tags?
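
A possible way to get at that data, sketched here rather than taken from the thread: in the W3C DOM, getNodeValue() on an element node is defined to return null, and the text itself lives in child text nodes, so one option is to walk the children and concatenate every text node found. This also avoids getTextContent(), a DOM Level 3 method that the JTidy DOM in use here may not support (an assumption based on the errors reported above). The JDK's XML parser again stands in for tidy.parseDOM, and the names DivText, collectText, textOf and parse are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;

class DivText {

    // Recursively appends the value of every text node at or below n.
    static void collectText(Node n, StringBuilder out) {
        short type = n.getNodeType();
        if (type == Node.TEXT_NODE || type == Node.CDATA_SECTION_NODE) {
            String value = n.getNodeValue();
            if (value != null) {
                out.append(value);
            }
            return;
        }
        // Elements carry no value of their own, so descend into the children.
        for (Node child = n.getFirstChild(); child != null; child = child.getNextSibling()) {
            collectText(child, out);
        }
    }

    // Returns all text below n with leading and trailing whitespace removed.
    static String textOf(Node n) {
        StringBuilder out = new StringBuilder();
        collectText(n, out);
        return out.toString().trim();
    }

    // Stand-in for tidy.parseDOM(is, null): parses well-formed markup with the JDK parser.
    static Document parse(String xhtml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xhtml.getBytes(StandardCharsets.UTF_8)));
        } catch (Exception e) {
            throw new IllegalStateException("could not parse sample markup", e);
        }
    }

    public static void main(String[] args) {
        Document doc = parse("<div class=\"somedata\">\n\nSome data\n\nSome more data\n\n</div>");
        Node div = doc.getDocumentElement();
        System.out.println(div.getNodeValue()); // an element has no node value
        System.out.println(textOf(div));
    }
}
```

Because the walk is recursive, calling textOf on the parentdiv element from the earlier post would also pick up the text inside childdiv. Interior line breaks are kept as they appear in the markup; collapsing them with something like replaceAll("\\s+", " ") is a further option.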
 