SparceMatrix
Technical User
I originally started using the HTML::Parser module to solve a problem I posed here,
"HTML to Text with PERL. Is it possible?"
I got some good code and some good advice, but I am having a new problem right away.
Here is the basic code again,
Code:
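use strict;
use warnings;
use HTML::Parser;

# Collect the decoded text of the document in $parser->{_private}->{text}.
my $parser = HTML::Parser->new( text_h           => [ \&text_handler, "self,dtext" ],
                                start_document_h => [ \&init, "self" ] );
[red][b]open(my $file, "<", "OrderedListTEST.html") || die "Cannot open OrderedListTEST.html: $!";[/b][/red]
$parser->parse_file($file);
close($file);

# Print the collected text to the screen.
print @{$parser->{_private}->{text}};
[red][b]my $txt = 'OrderedListTEST.txt';[/b][/red]
[red][b]open(WITDB, ">", $txt) || die "Cannot open $txt: $!";[/b][/red]
[red][b]print WITDB @{$parser->{_private}->{text}};[/b][/red]
[red][b]close(WITDB);[/b][/red]

# Reset the text buffer at the start of each document.
sub init
{
    my ( $self ) = @_;
    $self->{_private}->{text} = [];
}

# Append each decoded text segment to the buffer.
sub text_handler
{
    my ( $self, $text ) = @_;
    push @{$self->{_private}->{text}}, $text;
}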
The code I've added in red sets up the new problem. Here is OrderedListTEST.html:
Code:
<html>
<body>
<ol start="123">
<li></li>
<p>Entry 1</p>
<p>Subentry 1</p>
<li></li>
<p>Entry 2</p>
<li></li>
<p>Entry 3</p>
<li></li>
<p>Entry 4</p>
<li></li>
<p>Entry 5</p>
<li></li>
<p>Entry 6</p>
</ol>
</body>
</html>
What I would like to get from this, both on screen and in the text file OrderedListTEST.txt, is:
Code:
123.
Entry 1
Subentry 1
124.
Entry 2
125.
Entry 3
126.
Entry 4
127.
Entry 5
128.
Entry 6
What I get instead is something like this,
Code:
Entry 1
Subentry 1
Entry 2
Entry 3
Entry 4
Entry 5
Entry 6
In other words, the ordered list numbers are not carried over into the text. How do I change my $parser object to do this? Is it possible?
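One direction that might work, as a rough sketch under assumptions rather than a confirmed solution: the text handler only ever sees text events, and an empty <li></li> produces none, so the numbers would have to come from a start-tag handler. The sketch below assumes HTML::Parser's start_h handler with the "self,tagname,attr" argspec, and keeps a counter under a hypothetical key, ol_counter, next to the existing text buffer. It does not handle nested lists, and it would also number the items of an unordered list.

```perl
use strict;
use warnings;
use HTML::Parser;

# Sketch only: push "N." into the text buffer for every <li> start tag,
# counting from the start attribute of the most recent <ol>.
# The key ol_counter is an illustrative name, not part of HTML::Parser.
my $parser = HTML::Parser->new(
    start_document_h => [ \&init,          "self" ],
    start_h          => [ \&start_handler, "self,tagname,attr" ],
    text_h           => [ \&text_handler,  "self,dtext" ],
);

# A shortened copy of the test document, parsed from a string.
my $html = <<'HTML';
<ol start="123">
<li></li>
<p>Entry 1</p>
<li></li>
<p>Entry 2</p>
</ol>
HTML

$parser->parse($html);
$parser->eof;

my $output = join '', @{ $parser->{_private}->{text} };
print $output;

sub init
{
    my ( $self ) = @_;
    $self->{_private}->{text}       = [];
    $self->{_private}->{ol_counter} = 1;
}

sub start_handler
{
    my ( $self, $tagname, $attr ) = @_;
    if ( $tagname eq 'ol' ) {
        # Begin at the start attribute if there is one, otherwise at 1.
        $self->{_private}->{ol_counter} = $attr->{start} // 1;
    }
    elsif ( $tagname eq 'li' ) {
        # Emit the current number, then advance the counter.
        push @{ $self->{_private}->{text} },
            $self->{_private}->{ol_counter}++ . '.';
    }
}

sub text_handler
{
    my ( $self, $text ) = @_;
    push @{ $self->{_private}->{text} }, $text;
}
```

The numbers go into the same array as the text, so the existing print statements for the screen and the text file would not need to change. Line breaks around the numbers still depend on the whitespace in the HTML source.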
There are some other anomalies, related to how the editor saves the HTML file, that seem to affect line breaks and spacing. I may have to address those too, but for now the ordered list parsing will do.
As usual, any and all tips and clues would be appreciated.