HTML::Parser parsing details, example: <OL>. Simple code included

Status
Not open for further replies.

SparceMatrix

Technical User
Mar 9, 2006
62
US
I originally started applying this module to solve a problem I posed here,

"HTML to Text with PERL. Is it possible?"

I got some good code and some good advice, but I am having a new problem right away.

Here is the basic code again,

Code:
use strict;
use warnings;
use HTML::Parser;

my $parser = HTML::Parser->new( text_h => [ \&text_handler, "self,dtext" ],
                                start_document_h => [ \&init, "self" ] );

# added: open the test page
open(my $file, "<", "OrderedListTEST.html") or die "Cannot open OrderedListTEST.html: $!";

$parser->parse_file($file);
close($file);

print @{$parser->{_private}->{text}};

# added: write the same text to a file
my $txt = 'OrderedListTEST.txt';
open(my $out, ">", $txt) or die "Cannot open $txt: $!";
print {$out} @{$parser->{_private}->{text}};
close($out);

# Start each document with an empty list of text chunks.
sub init
{
   my ( $self ) = @_;
   $self->{_private}->{text} = [];
}

# Collect every decoded text chunk the parser reports.
sub text_handler
{
    my ( $self, $text) = @_;
    push @{$self->{_private}->{text}}, $text;
}

The lines marked "# added" are there to pose this new problem. Here is OrderedListTEST.html,

Code:
<body>
<ol start="123">
  <li></li>
  <p>Entry 1</p>
  <p>Subentry 1</p>
    <li></li>
  <p>Entry 2</p>
  <li></li>
  <p>Entry 3</p>
  <li></li>
  <p>Entry 4</p>
  <li></li>
  <p>Entry 5</p>
  <li></li>
  <p>Entry 6</p>
</ol>
</body>
</html>

Now what I would like to get from this, both on screen and in the text file OrderedListTEST.txt, is,

Code:
123.
Entry 1
Subentry 1
124.
Entry 2
125.
Entry 3
126.
Entry 4
127.
Entry 5
128.
Entry 6

What I get instead is something like this,

Code:
  Entry 1
  Subentry 1
    
  Entry 2
  
  Entry 3
  
  Entry 4
  
  Entry 5
  
  Entry 6

In other words, the ordered list is not parsed into text. How do I make the changes to my $parser object in order to do this? Is it possible?

There are some other anomalies having to do with how the editor saves the HTML file that seem to affect line breaks and spacing. I may have to address that too, but for now the Ordered List parsing will do.

As usual, any and all tips and clues would be appreciated.
 
Try dumping the $parser structure to see if there's anything in there that you can read back

Code:
use Data::Dumper;
....
print Dumper $parser;

HTH
--Paul

Paul
------------------------------------
Spend an hour a week on CPAN, helps cure all known programming ailments ;-)
 
OK, I tried that. I just added those two lines to the above code and got,

Code:
$VAR1 = bless( {
                 '_hparser_xs_state' => \25350180,
                 '_private' => {
                                 'text' => [
                                             '
',
                                             '

',
                                             '
',
                                             '
  ',
                                             '
  ',
                                             'Entry 1',
                                             '
  ',
                                             'Subentry 1',
                                             '
    ',
                                             '
  ',
                                             'Entry 2',
                                             '
  ',
                                             '
  ',
                                             'Entry 3',
                                             '
  ',
                                             '
  ',
                                             'Entry 4',
                                             '
  ',
                                             '
  ',
                                             'Entry 5',
                                             '
  ',
                                             '
  ',
                                             'Entry 6',
                                             '
',
                                             '
',
                                             '
',
                                             '
'
                                           ]
                               }
               }, 'HTML::Parser' );

It's all a complete mystery to me what this might mean, but my immediate impression is that there is not much of value to me. If there is anything useful there to my problem, please point it out.
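
One thing the dump does show: the 'text' array holds every text chunk the parser handed to the text handler, including the whitespace-only strings that sit between tags, and nothing at all for the <li> tags, since a text handler only ever receives text. A minimal sketch of dropping the whitespace-only entries, using a hand-copied subset of the dump above rather than live parser output:

```perl
use strict;
use warnings;

# Hand-copied subset of the 'text' array from the dump above (not live parser output).
my @text = ("\n", "\n  ", "\n  ", 'Entry 1', "\n  ", 'Subentry 1',
            "\n    ", "\n  ", 'Entry 2', "\n");

# Keep only the chunks that contain at least one non-whitespace character.
my @kept = grep { /\S/ } @text;

# Print one surviving entry per line.
print "$_\n" for @kept;
```

That tidies the spacing, but it cannot bring the list numbers back, because they were never in the array to begin with.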

Anyone else?

Any and all tips and clues would be appreciated.
 
You just need a start handler to detect the start of tags.

Code:
$parser->handler( start => \&start_handler, "self,tagname,attr");

sub start_handler {
  my ($self, $tagname, $attr) = @_;
  if ($tagname eq 'ol') {
     $self->{_private}->{ol} = $attr->{start} || 1;
  } elsif ($tagname eq 'li') {
     push @{$self->{_private}->{text}}, $self->{_private}->{ol}++;
  }
  return;
}
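
To get the exact layout asked for above (a number, a period, and each entry on its own line), the same idea can be taken a little further. This is only a sketch under a few assumptions: the HTML is inlined as a shortened copy of the test page so the script is self-contained, the state is kept in ordinary lexical variables ($counter and @lines are made-up names) instead of $self->{_private}, and whitespace-only text is skipped.

```perl
use strict;
use warnings;
use HTML::Parser;

# Shortened copy of OrderedListTEST.html, inlined so the sketch is self-contained.
my $html = <<'HTML';
<ol start="123">
  <li></li>
  <p>Entry 1</p>
  <p>Subentry 1</p>
  <li></li>
  <p>Entry 2</p>
</ol>
HTML

my @lines;      # collected output lines
my $counter;    # current <ol> number (hypothetical name, not part of HTML::Parser)

my $parser = HTML::Parser->new(
    api_version => 3,
    # Called for every start tag; the "tagname,attr" string picks the arguments.
    start_h => [ sub {
        my ($tagname, $attr) = @_;
        if ($tagname eq 'ol') {
            # Use start="..." if present (even "0"), otherwise begin at 1.
            $counter = defined $attr->{start} ? $attr->{start} : 1;
        }
        elsif ($tagname eq 'li' && defined $counter) {
            # Emit the current number, then advance it for the next <li>.
            push @lines, $counter++ . '.';
        }
    }, 'tagname,attr' ],
    # Called for every run of text; skip the whitespace-only runs between tags.
    text_h => [ sub {
        my ($text) = @_;
        push @lines, $text if $text =~ /\S/;
    }, 'dtext' ],
);

$parser->parse($html);
$parser->eof;

print "$_\n" for @lines;
```

The defined test is there because "$attr->{start} || 1" would turn start="0" into 1. Nested lists would need a stack of counters, which this sketch does not attempt.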
 
Great! Thank you. That works like a charm. In case anyone missed that, here are the changes to the above code,

Code:
use strict;
use warnings;
use HTML::Parser;

my $parser = HTML::Parser->new( text_h => [ \&text_handler, "self,dtext" ],
                                start_document_h => [ \&init, "self" ] );

# added earlier: open the test page
open(my $file, "<", "OrderedListTEST.html") or die "Cannot open OrderedListTEST.html: $!";

# new: register a handler that is called for every start tag
$parser->handler( start => \&start_handler, "self,tagname,attr" );

$parser->parse_file($file);
close($file);

print @{$parser->{_private}->{text}};

# added earlier: write the same text to a file
my $txt = 'OrderedListTEST.txt';
open(my $out, ">", $txt) or die "Cannot open $txt: $!";
print {$out} @{$parser->{_private}->{text}};
close($out);

# Start each document with an empty list of text chunks.
sub init
{
   my ( $self ) = @_;
   $self->{_private}->{text} = [];
}

# Collect every decoded text chunk the parser reports.
sub text_handler
{
    my ( $self, $text) = @_;
    push @{$self->{_private}->{text}}, $text;
}

# new: <ol> sets the counter from its start attribute (default 1),
# and each <li> pushes the current number and advances the counter.
sub start_handler {
  my ($self, $tagname, $attr) = @_;
  if ($tagname eq 'ol') {
     $self->{_private}->{ol} = $attr->{start} || 1;
  } elsif ($tagname eq 'li') {
     push @{$self->{_private}->{text}}, $self->{_private}->{ol}++;
  }
  return;
}

I have the PERL CD Bookshelf and I don't see anything in there that might be a clue as to how to navigate that module to make those changes myself. The fact is, I have managed to apply at least one module without any assistance, but I haven't the slightest idea what is going on here. If somebody could walk me through the motivations for those additions, I would appreciate it. Even better would be some reference to entries in CD Bookshelf or an online tutorial. By way of tutorials for Object Orientation in PERL, I recommend the following at


And more generally for PERL,

 
The HTML::Parser module is one of the harder modules to use, so don't feel bad about not understanding it or finding the documentation sparse. It confuses the heck out of me too.

- Kevin, perl coder unexceptional!
 
The HTML::Parser CPAN module is weird, convoluted, badly documented, and did I say weird? I learned the basics simply by reading the documentation at:


And by going through the examples that they provide:


I started out with a supposedly simple problem: parsing out the contents of a form tag in an HTML document. It took me an entire weekend of going through the examples and putting together pieces of code before I understood what the heck was going on. I'm not saying you need to do the same, but it would almost certainly help to at least read the docs and a couple of the examples and try to figure out what's up with this module, especially if you are going to be actively using it for a project for the foreseeable future.
 
Thanks, all. The documentation at CPAN looks the same as the stuff I got with ActiveState's distribution. I think I have some basic gaps in understanding the use of reference/dereference variables. The idea of a "handler" seems to be another module based device that I can find no documentation to explain.

If I bump into some insights, I'll be sure to update this post.
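
For anyone with the same gaps, here is a small sketch of the two ideas in plain Perl, with no HTML::Parser involved. It is a toy under stated assumptions: the %handlers table and the shout sub are made up for illustration and are not how the module works internally. It only shows what "@{ ... }" and "\&name" do, and that a "handler" is a subroutine reference stored now and called later.

```perl
use strict;
use warnings;

# A hash reference holding an array reference, shaped like $parser->{_private}->{text}.
my $self = { _private => { text => [] } };

# @{ ... } dereferences the array reference so push can append to the array.
push @{ $self->{_private}->{text} }, 'Entry 1';
push @{ $self->{_private}->{text} }, 'Entry 2';

# \&name takes a reference to a subroutine without calling it.
sub shout { my ($word) = @_; return uc $word; }
my $handler = \&shout;

# A "handler" is just such a reference, stored now and called later by its owner.
# This table is hypothetical, not HTML::Parser's internals.
my %handlers = ( text => $handler );

# The "parser" side: look up the handler for an event and call it with ->( ... ).
my @out = map { $handlers{text}->($_) } @{ $self->{_private}->{text} };

print join(', ', @out), "\n";
```

In the thread's code, HTML::Parser plays the role of the last step: it finds a start tag or a run of text and calls whichever subroutine reference was registered for that event, passing the arguments named in the "self,tagname,attr" string.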
 
The documentation comes with the modules, so it should always be the same regardless of the Perl distribution.

- Kevin, perl coder unexceptional!
 
PaulTEG said:
Try dumping the $parser structure to see if there's anything in there that you can read back

I've tried this before as well, Paul, and it really doesn't illuminate much about HTML::Parser. If anything, it's just one big mess that really brings to light why there should be detailed documentation for any black box that you use. You aren't supposed to try to figure it out by dumping, but I admit that I hoped that would work as well.

There is something about the methodology used by this module that is extremely non-standard. I believe that the documentation could use a little synopsis describing the theory behind their approach. But I get the feeling that it is/was ultimately a hack job, and that may be why the docs aren't that great.

Nevertheless, once you learn the module, you can begin to glean info from the perldoc. But even now I must admit that at least 80% of it is still gibberish to me.
 
MillerH said:
Nevertheless, once you learn the module, you can begin to glean info from the perldoc. But even now I must admit that at least 80% of it is still gibberish to me.

I'm glad to hear somebody besides me muttering to myself say that.
 
It's open source, you can contribute ... ;-)

Paul
------------------------------------
Spend an hour a week on CPAN, helps cure all known programming ailments ;-)
 