HTML to Text with PERL. Is it possible?

SparceMatrix · Apr 20, 2006

I need to convert a collection of HTML files to Text files. More precisely, I need to convert .html to .txt files, but I need them parsed first so that what you see in the browser from the .html file you now see in the .txt file.

I'm already doing this with a script using MS Word's scripting interface and the Win32 module, but I'd like something that functions with a little more economy.

Is it possible?

SparceMatrix · May 24, 2006

OK, thanks for that tidbit. I'll be happy to use it to solve my problem. I'm already thinking of other places I can use it too. I still think it is a worthwhile question to ask, "Why does this parser not do it when my browser will?" It's ironic, but there are times when I want to put some extra white space in and I can't do it so that it shows in a browser. Also, my Dreamweaver 8 leaves "&nbsp;" whenever I create an empty <p></p>, like this: <p>&nbsp;</p>

Is there some way I can open the file to get the desired results? Maybe it is a character issue. I tried this code I found accidentally,

Code:

open(my $file, "<:utf8", "MyFile.htm") || die;

But that made no difference. Could it be an I/O layer problem? Is there a module to use here: "VIA(module)"?
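One possible explanation, offered as a sketch rather than a confirmed diagnosis: when HTML::Parser hands over decoded text ("dtext"), an "&nbsp;" entity arrives as the non-breaking space character (code point 160, "\xA0"), which is not an ordinary space. The sample string below is made up to imitate that decoded text, so the snippet runs with core Perl only.

```perl
use strict;
use warnings;

# Hypothetical decoded text: "\xA0" stands for what a decoded &nbsp; looks like.
my $text = "Title\xA0\xA0here\n\xA0\nBody";

# Turn each non-breaking space into an ordinary space.
(my $plain = $text) =~ tr/\xA0/ /;

# Then collapse runs of ordinary spaces, leaving newlines alone.
$plain =~ s/[ ]+/ /g;

print $plain, "\n";
```

The same two substitutions could be applied to each chunk inside a text handler, or once to the final string.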

KevinADC · May 24, 2006

The browser only collapses extra white space; it does not remove white space from the document. So why should the HTML::Parser module remove extra white space from the document?

SparceMatrix · May 25, 2006

"$text =~ s/\s+/ /g;" will also remove newline characters, so it may be better to use "$text =~ s/[ ]+/ /g;".
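A small sketch of the difference between the two substitutions, using a made-up sample string:

```perl
use strict;
use warnings;

# Hypothetical sample: runs of spaces plus a line break we want to keep.
my $sample = "first   line\nsecond    line";

# \s matches spaces, tabs and newlines, so the line break is collapsed too.
(my $all_ws = $sample) =~ s/\s+/ /g;

# [ ]+ matches only literal spaces, so the newline survives.
(my $spaces_only = $sample) =~ s/[ ]+/ /g;

print "$all_ws\n";
print "$spaces_only\n";
```

The first form puts everything on one line; the second keeps the two lines and only squeezes the spaces within each.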

SparceMatrix · Jun 26, 2006

Code:

#!/usr/bin/perl
use strict;
use HTML::Parser;
my $file = 'MyTESTHTMLFile.htm';

#### Or use this $file
# open(my $file, "<:utf8", "020419IT.htm") || die;

my $parser = HTML::Parser->new( text_h => [ \&text_handler,"self,dtext"],
                                start_document_h => [\&init, "self"] );
                                
my $txt = 'PerlToHTMLTEST.txt';

open ( PTHTMLTEST, '>', $txt) or die $!;
 
$parser->parse_file($file);
 
print PTHTMLTEST @{$parser->{_private}->{text}};
 
sub init
{
   my ( $self ) = @_;
   $self->{_private}->{text} = [];
}
 
sub text_handler
{
    my ( $self, $text) = @_;
    push @{$self->{_private}->{text}}, $text;
}

I am still trying to make use of this and I am going through it line by line to try and figure out exactly what is going on and, hopefully, make better use of the HTML::Parser module.

I am stuck on "_private". What does that refer to? None of my references on programming objects or classes explain it.

TrojanWarBlade · Jun 26, 2006

"_private" is merely a key to the hash.
It's simply used as a name for the data and has no greater meaning than that.
It could have been anything but the author obviously chose the name to reflect the fact that he would prefer that data to be considered "private".

Trojan.
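To illustrate the point with plain Perl: the sketch below uses a made-up class name (My::Demo) and a made-up second key (my_notes) as stand-ins. It only shows that an object can be a blessed hash reference and that "_private" behaves like any other key.

```perl
use strict;
use warnings;

# Hypothetical stand-in for a parser object: a blessed hash reference.
my $self = bless {}, 'My::Demo';

# "_private" is just a hash key; its value is another hash reference,
# whose "text" key holds an array reference.
$self->{_private}->{text} = [];
push @{ $self->{_private}->{text} }, 'Hello, ', 'world';

# Any other key name works the same way.
$self->{my_notes}->{text} = [ @{ $self->{_private}->{text} } ];

print join('', @{ $self->{_private}->{text} }), "\n";
print join('', @{ $self->{my_notes}->{text} }), "\n";
```

Reading $self->{_private}->{text} from left to right: take the object's hash, look up the key "_private", and in the hash found there look up the key "text".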

SparceMatrix · Jun 26, 2006

What is it doing in between $self and text? These are defined by Parser. Is there a feature of object programming that I am missing here?

SparceMatrix · Jun 26, 2006

I am having a real problem figuring out these modules. It looks like the O'Reilly book, Learning Perl Objects, References, and Modules might help me out. Anyone have any experience with it?

wardy66 · Jun 26, 2006

Have you had a look at the supplied help?

On *nix systems, type

Code:

perldoc perlboot

SparceMatrix · Nov 16, 2006

As I posted above, I have been able to successfully demonstrate this code, also offered above, using a file:

Code:

use strict;
use HTML::Parser;
my $file = 'index.html';
my $parser = HTML::Parser->new( text_h => [ \&text_handler,"self,dtext"],
                                start_document_h => [\&init, "self"] );
 
                                                                                
$parser->parse_file($file);
 
print @{$parser->{_private}->{text}};
 
sub init
{
   my ( $self ) = @_;
   $self->{_private}->{text} = [];
}
 
sub text_handler
{
    my ( $self, $text) = @_;
    push @{$self->{_private}->{text}}, $text;
}

But now I need to make a change that I am not able to implement with this code. I need to have the parsed text returned as a string rather than an array, which is what I assume is returned by "print @{$parser->{_private}->{text}};".

Does anyone have any idea how I would do that? Any and all tips or clues would be appreciated.

MillerH · Nov 16, 2006

Unless I'm missing something, what you are asking for is rudimentary.

Code:

my $string = join '', @{$parser->{_private}->{text}};

SparceMatrix · Nov 16, 2006

Thank you. Yes, of course. But since I am completely clueless about the rest of "... @{$parser->{_private}->{text}};", I thought there might be some other way of accessing the results as a string native to the module. I should have mentioned the obvious "join" of the array.

KevinADC · Nov 16, 2006

You could use a string instead of an array to store the text in the hash reference, and then you would not need join() to convert the array into a string:

Code:

my $file = 'index.html';
my $parser = HTML::Parser->new( text_h => [ \&text_handler,"self,dtext"],
                                start_document_h => [\&init, "self"] );
 
                                                                                
$parser->parse_file($file);
 
print $parser->{text};
 
sub init
{
   my ( $self ) = @_;
   $self->{text} = '';   # start with an empty string stored on the parser object
}
 
sub text_handler
{
    my ( $self, $text) = @_;
    $self->{text} .= $text;
}

- Kevin, perl coder unexceptional!

SparceMatrix · Nov 18, 2006

Well, after all this, I finally managed to bump into some information in my PERL CD Bookshelf 4.0. Those who have it might try looking in "The PERL Cookbook" Section 20.5.1,

Cookbook said:
If you have an external formatter like lynx, call an external program:

$ascii = `lynx -dump $filename`;

If you want to do it within your program and don't care about the things that the HTML::FormatText formatter doesn't yet handle well (tables and frames):

use HTML::FormatText 3;
$ascii = HTML::FormatText->format_file(
    $filename,
    leftmargin => 0, rightmargin => 50
);

So, FormatText would do the job, I bet. Unfortunately, I've tried searching for it and I can't find the module. After looking at CPAN's information on the module, I conclude there is not a release available for Windows. Since LYNX seems to be dependent on this module, I guess I can assume LYNX is not available for Windows either.

Can anyone confirm this or show me where I might get it?

HTML::Parser is just not working out.

MillerH · Nov 18, 2006

Problem 1 - Finding HTML::FormatText

Yes, HTML::FormatText is available for Windows. It is simply part of the HTML-Format distribution.

Whenever you can't find a module using PPM, do a search for it on CPAN. This is because some modules are part of larger packages, as is the case with this one. We find HTML::FormatText here:

http://search.cpan.org/~sburke/HTML-Format-2.04/lib/HTML/FormatText.pm

You will notice in the link that it is part of HTML-Format-2.04. You therefore simply need to install this larger package to get HTML::FormatText.

Problem 2 - How to use HTML::FormatText

Well, I have never had a need for this particular functionality. Yet, I took a moment to run the quick example that they provided in their source, and it seems to work pretty well.

Code:

use HTML::FormatText;

my $text = HTML::FormatText->format_file(
    "test.html",
    leftmargin => 5, rightmargin => 50
);

print $text;

Just note that this module currently supports only limited HTML. Forms, tables and other more advanced HTML will either be ignored or may throw errors.
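For the original goal of converting a whole collection of files, a loop along these lines might do. This is a minimal sketch under stated assumptions: the helper names txt_name_for and convert_file are invented for illustration, the files are assumed to sit in the current directory, the margins are arbitrary, and writing the output as UTF-8 is a guess about what is wanted. It relies on the same format_file call shown above.

```perl
use strict;
use warnings;

# Map "page.html" or "page.htm" to "page.txt" (hypothetical helper).
sub txt_name_for {
    my ($html_name) = @_;
    (my $txt_name = $html_name) =~ s/\.html?$/.txt/i;
    return $txt_name;
}

# Convert one file (hypothetical helper). The module is loaded with
# require here, so it is only needed once there is a file to convert.
sub convert_file {
    my ($html_name) = @_;
    require HTML::FormatText;
    my $text = HTML::FormatText->format_file(
        $html_name,
        leftmargin => 0, rightmargin => 72,
    );
    die "Could not format $html_name\n" unless defined $text;

    my $txt_name = txt_name_for($html_name);
    # Assumption: UTF-8 output is acceptable for the .txt files.
    open my $out, '>:encoding(UTF-8)', $txt_name
        or die "Cannot write $txt_name: $!";
    print {$out} $text;
    close $out or die "Cannot close $txt_name: $!";
    return $txt_name;
}

# Convert every .htm and .html file in the current directory.
convert_file($_) for glob('*.htm *.html');
```

Each input file gets a .txt file of the same base name next to it. The limits mentioned above (tables, forms) still apply.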

Problem 3 - HTML::Parser

Being new to Tek-tips, I only just now read this thread concerning your original project. It definitely sounds like HTML::Parser was not the module that you were looking for given your needs. I'm still not familiar with exactly what your overall goal is, so I can't say whether it might have worked given more familiarity with its functionality. But I can say that HTML::FormatText probably has a better chance of matching your requirements.

Good Luck.
