Parsing HTML for META tags

fatcodeguy · Dec 13, 2004

Hi, I'm trying to parse an HTML file for the META tags and get their name and content attributes.

Here's the code I use:

Code:

/*
   Class:   Test.java
   Purpose: 
   Author:  
*/

import java.io.*;
import java.lang.*;
import java.util.*;

//html parse imports
import javax.swing.text.html.*;
import javax.swing.text.*;
import javax.swing.text.EditorKit;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.SimpleAttributeSet;

public class Test{
    
    //main function
    public static void main (String args[]) throws IOException{
           
           String filePath = "V:\\sys\\co_e.shtml";
           Vector metatags = getMetatags(filePath);
           System.out.println("Size: "+metatags.size());
           for (int ctr=0;ctr<metatags.size();ctr++)
           {
               Metatag tag = (Metatag)metatags.elementAt(ctr);
               System.out.println("Name: "+tag.getName());
               System.out.println("Content: "+tag.getContent());
               System.out.println("--------------------");
           }       
    }//end main
    
    public static Vector getMetatags(String filePath) {
           Vector metatags = new Vector();
           Metatag tag = null;
           try 
           {
               // Create a reader on the HTML content
               Reader reader = new FileReader(filePath);
       
               // Parse the HTML
               EditorKit kit = new HTMLEditorKit();
               HTMLDocument htmlDoc = (HTMLDocument)kit.createDefaultDocument();
               kit.read(reader, htmlDoc, 0);
       
               // Find all the META elements in the HTML document
               HTMLDocument.Iterator docIterator = htmlDoc.getIterator(HTML.Tag.META);
               System.out.println(docIterator.isValid());
               while (docIterator.isValid()) 
               {
                   System.out.println("got here");
                   AttributeSet sas = docIterator.getAttributes();
       
                   String name = (String)sas.getAttribute(HTML.Attribute.NAME);
                   String content = (String)sas.getAttribute(HTML.Attribute.CONTENT);
                   if (name != null && content != null) 
                   {
                       tag = new Metatag(name,content);
                       metatags.add(tag);
                   }
                   docIterator.next();
               }
           } 
           catch (BadLocationException e) {e.printStackTrace(System.err);}  
           catch (IOException e) {e.printStackTrace(System.err);} 
           
           // Return all META tags found
           return metatags;
    }
 
}//end class

It doesn't work for META or FORM (or, I think, for any empty tags), but it does for <A href= ...> and <b>...</b>.

Any suggestions?

Here's the Metatag class

Code:

import java.io.*;
import java.lang.*;

public class Metatag{

   //class variables
   private String name,content;
   
   //default constructor
   public Metatag(){
          name = new String();
          content = new String();
   }
   //additional constructor
   public Metatag(String name,String content){
          this.name=name;
          this.content=content;
   }
   
   //METHODS
   //accessor methods
   public String getName(){return name;}
   public String getContent(){return content;}

   //modifier methods
   public void setName(String newName){name=newName;}
   public void setContent(String newContent){content=newContent;}

   //other methods
   public String toString(){ return "<meta name=\""+name+"\" content=\""+content+"\">";}
   
}//end class

idarke · Dec 14, 2004

You need to supply your own parser callback to handle the tags. Apparently, the default handling does not expose meta tags through the document iterator.

Here's an example:

Code:

import java.io.*;
import javax.swing.text.html.parser.*;
import javax.swing.text.html.*;
import javax.swing.text.*;
import javax.swing.text.html.HTMLEditorKit;

public class Test extends HTMLEditorKit.ParserCallback
{
   public void handleText(char[] data, int pos)
   {
      System.out.println(new String(data));
   }

   public void handleStartTag(HTML.Tag t, MutableAttributeSet a, int pos)
   {
      System.out.println("start: " + t);
   }

   public void handleEndTag(HTML.Tag t, int pos)
   {
      System.out.println("end: " + t);
   }

   public void handleSimpleTag(HTML.Tag t, MutableAttributeSet a, int pos)
   {
      if (t == HTML.Tag.META)
      {
         String name1 = (String) a.getAttribute(HTML.Attribute.NAME);
         if (name1 != null)
         {
            System.out.println("META name1: " + name1);
         }

         String content1 = (String) a.getAttribute(HTML.Attribute.CONTENT);

         if (content1 != null)
         {
            System.out.println("META content1: " + content1);
         }
      }
   }

   public static void main(String args[])
   {
      String filePath = "c:\\test.html";
      getMetatags(filePath);
   }

   public static void getMetatags(String filePath)
   {
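      // try-with-resources closes the reader even if parsing fails
      try (Reader reader = new FileReader(filePath))
      {
         ParserDelegator parser = new ParserDelegator();
         HTMLEditorKit.ParserCallback callback = new Test();
         // Third argument is ignoreCharSet; false means charset directives are honored
         parser.parse(reader, callback, false);
      }
      catch (Exception e)
      {
         e.printStackTrace(System.err);
      }
   }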

}

Hopefully this gets you going in the right direction. Note that a meta tag that declares a charset (for example, http-equiv="Content-Type" with a charset in its content) will cause a ChangedCharSetException unless you tell the parser to ignore it. See this link:

http://groups.google.com/groups?hl=...elm=c63m2l%24iem%241%40planja.arnes.si&rnum=9

idarke · Dec 14, 2004

Here's a snippet with the property added:

Code:

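Reader reader = new FileReader(filePath);
ParserDelegator parser = new ParserDelegator();
HTMLEditorKit.ParserCallback callback = new Test();
HTMLDocument doc = new HTMLDocument();
// This property is read by HTMLEditorKit.read(), so it matters when parsing through the editor kit
doc.putProperty("IgnoreCharsetDirective", Boolean.TRUE);
doc.setParser(parser);
// When calling the parser directly, the third argument (ignoreCharSet) is what suppresses the exception
parser.parse(reader, callback, true);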
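
To tie this back to the original question (getting the name and content of each META tag as a list), here is a minimal sketch of a callback that collects the pairs instead of printing them. The class name MetaTagCollector and its nested Meta holder are hypothetical names made up for illustration; Meta simply stands in for the Metatag class posted above. It assumes you want to skip charset directives, so it passes true for ignoreCharSet.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical helper: collects name/content pairs from META tags.
public class MetaTagCollector extends HTMLEditorKit.ParserCallback {

    // Hypothetical value holder, standing in for the Metatag class in this thread.
    public static final class Meta {
        public final String name;
        public final String content;

        Meta(String name, String content) {
            this.name = name;
            this.content = content;
        }
    }

    private final List<Meta> tags = new ArrayList<>();

    // Empty elements such as META are reported through handleSimpleTag.
    @Override
    public void handleSimpleTag(HTML.Tag t, MutableAttributeSet a, int pos) {
        if (t != HTML.Tag.META) {
            return;
        }
        Object name = a.getAttribute(HTML.Attribute.NAME);
        Object content = a.getAttribute(HTML.Attribute.CONTENT);
        // Keep only tags that carry both attributes, as in the original code.
        if (name != null && content != null) {
            tags.add(new Meta(name.toString(), content.toString()));
        }
    }

    // Parses the given input and returns the META tags found, in document order.
    public static List<Meta> parse(Reader reader) throws IOException {
        MetaTagCollector collector = new MetaTagCollector();
        // true = ignoreCharSet, so a charset directive does not abort the parse
        new ParserDelegator().parse(reader, collector, true);
        return collector.tags;
    }

    public static void main(String[] args) throws IOException {
        String html = "<html><head>"
                + "<meta name=\"description\" content=\"a test page\">"
                + "<meta name=\"keywords\" content=\"java, html\">"
                + "</head><body><p>hello</p></body></html>";
        for (Meta m : parse(new StringReader(html))) {
            System.out.println(m.name + " = " + m.content);
        }
    }
}
```

To read from a file as in the posts above, pass a FileReader to parse(...) instead of the StringReader and close it when done.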
