HTML to Text with PERL. Is it possible?

Status
Not open for further replies.

SparceMatrix

Technical User
Mar 9, 2006
62
US
I need to convert a collection of HTML files to Text files. More precisely, I need to convert .html to .txt files, but I need them parsed first so that what you see in the browser from the .html file you now see in the .txt file.

I'm already doing this with a script using MS Word's scripting interface and the Win32 module, but I'd like something that functions with a little more economy.

Is it possible?
 
OK, thanks for that tidbit. I'll be happy to use it to solve my problem, and I'm already thinking of other places I can use it too. I still think it is a worthwhile question to ask: "Why does this parser not do it when my browser will?" It's ironic, but there are times when I want to put in some extra white space and I can't do it so that it shows in a browser. Also, my Dreamweaver 8 leaves "&nbsp;" whenever I create an empty <p></p>, like this: <p>&nbsp;</p>

Is there some way I can open the file to get the desired results? Maybe it is a character issue. I tried this code I found accidentally,

Code:
open(my $file, "<:utf8", "MyFile.htm") || die;

But that made no difference. Could it be an I/O layer problem? Is there a module to use here, as in "VIA(module)"?
 
The browser only collapses extra white space; it does not remove white space from the document. So why should the HTML::Parser module remove extra white space from the document?
 
"$text =~ s/\s+/ /g;" will also replace newline characters with spaces, so it may be better to use "$text =~ s/[ ]+/ /g;".
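
To see the difference between the two substitutions, here is a small sketch. The sub names and the sample string are made up for illustration; only the two regexes come from the post above.

```perl
use strict;
use warnings;

# Collapse every run of whitespace, newlines included, to one space.
sub collapse_all_whitespace {
    my ($text) = @_;
    $text =~ s/\s+/ /g;
    return $text;
}

# Collapse runs of spaces only; newlines and tabs are left alone.
sub collapse_spaces_only {
    my ($text) = @_;
    $text =~ s/[ ]+/ /g;
    return $text;
}

# Hypothetical sample: two lines with extra spaces inside each.
my $sample = "First   line\nSecond    line\n";

print collapse_all_whitespace($sample), "\n";   # both lines end up on one line
print collapse_spaces_only($sample);            # the two lines stay separate
```

If tabs should be collapsed too while keeping the line breaks, a character class such as [ \t]+ would be one option.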
 
Code:
#!/usr/bin/perl
use strict;
use HTML::Parser;
my $file = 'MyTESTHTMLFile.htm';

#### Or use this $file
# open(my $file, "<:utf8", "020419IT.htm") || die;

my $parser = HTML::Parser->new( text_h => [ \&text_handler,"self,dtext"],
                                start_document_h => [\&init, "self"] );
                                
my $txt = 'PerlToHTMLTEST.txt';

open ( PTHTMLTEST, '>', $txt) or die $!;
 
$parser->parse_file($file);
 
print PTHTMLTEST @{$parser->{_private}->{text}};
 
sub init
{
   my ( $self ) = @_;
   $self->{_private}->{text} = [];
}
 
sub text_handler
{
    my ( $self, $text) = @_;
    push @{$self->{_private}->{text}}, $text;
}

I am still trying to make use of this, and I am going through it line by line to figure out exactly what is going on and, hopefully, make better use of the HTML::Parser module.

I am stuck on "_private". What does that refer to? None of my references on programming objects or classes explain it.
 
"_private" is merely a key to the hash.
It's simply used as a name for the data and has no greater meaning than that.
It could have been anything but the author obviously chose the name to reflect the fact that he would prefer that data to be considered "private".
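
A minimal sketch of that point, using a plain hash reference as a stand-in for the parser object (the key name "my_stuff" and the sample strings are invented for the example):

```perl
use strict;
use warnings;

# A plain hash reference standing in for the parser object (illustration only).
my $self = {};

# "_private" is just a key; the value stored under it is another hash ref,
# whose "text" key holds an array ref.
$self->{_private}->{text} = [];
push @{ $self->{_private}->{text} }, 'Hello, ', 'world';

# Any other key name behaves the same way.
$self->{my_stuff}->{text} = [];
push @{ $self->{my_stuff}->{text} }, 'Hello, ', 'world';

print join('', @{ $self->{_private}->{text} }), "\n";
print join('', @{ $self->{my_stuff}->{text} }), "\n";
```

Both print statements produce the same line, since the key name changes nothing about how the data is stored.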


Trojan.
 
What is "_private" doing in between $self and text? These are defined by Parser. Is there a feature of object programming that I am missing here?
 
Have you had a look at the supplied help?

From *nix systems, type
Code:
perldoc perlboot
 
As I posted above, I have been able to successfully run the code offered above, using a file:

Code:
use strict;
use HTML::Parser;
my $file = 'index.html';
my $parser = HTML::Parser->new( text_h => [ \&text_handler,"self,dtext"],
                                start_document_h => [\&init, "self"] );
 
                                                                                
$parser->parse_file($file);
 
print @{$parser->{_private}->{text}};
 
sub init
{
   my ( $self ) = @_;
   $self->{_private}->{text} = [];
}
 
sub text_handler
{
    my ( $self, $text) = @_;
    push @{$self->{_private}->{text}}, $text;
}

But now I need to make a change that I am not able to implement with this code. I need to have the parsed text returned as a string rather than an array, which is what I assume is returned by "print @{$parser->{_private}->{text}};".

Does anyone have any idea how I would do that? Any and all tips or clues would be appreciated.
 
Unless I'm missing something, what you are asking for is rudimentary.

Code:
my $string = join '', @{$parser->{_private}->{text}};

 
Thank you. Yes, of course. But since I am completely clueless about the rest of "... @{$parser->{_private}->{text}};", I thought there might be some other way of accessing the results as a string native to the module. I should have mentioned the obvious "join" of the array.
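
For anyone else puzzled by "@{$parser->{_private}->{text}}", here is a rough step-by-step breakdown. It assumes nothing about HTML::Parser itself: $parser below is a hand-built hash reference with hypothetical data, and the intermediate variable names are made up.

```perl
use strict;
use warnings;

# Stand-in for the parser object after parsing (hypothetical data).
my $parser = { _private => { text => [ "Some ", "parsed ", "text\n" ] } };

my $private  = $parser->{_private};   # hash ref stored under the key "_private"
my $text_ref = $private->{text};      # array ref stored under the key "text"
my @chunks   = @{$text_ref};          # @{...} dereferences the array ref into a list

# print given a list prints each element in turn; join builds one string instead.
my $string = join '', @chunks;
print $string;
```

So the long expression is just those three steps written on one line.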
 
You could use a string instead of an array to store the text in the hash reference, and then you would not need join() to convert the array into a string:

Code:
use strict;
use HTML::Parser;

my $file = 'index.html';
my $parser = HTML::Parser->new( text_h => [ \&text_handler,"self,dtext"],
                                start_document_h => [\&init, "self"] );

$parser->parse_file($file);

print $parser->{text};

# Start each document with an empty string
sub init
{
   my ( $self ) = @_;
   $self->{text} = '';
}

# Append each chunk of decoded text to the string
sub text_handler
{
    my ( $self, $text) = @_;
    $self->{text} .= $text;
}

- Kevin, perl coder unexceptional!
 
Well, after all this, I finally managed to bump into some information in my PERL CD Bookshelf 4.0. Those who have it might try looking in "The PERL Cookbook" Section 20.5.1,

Cookbook said:
If you have an external formatter like lynx, call an external program:

$ascii = `lynx -dump $filename`;

If you want to do it within your program and don't care about the things that the HTML::FormatText formatter doesn't yet handle well (tables and frames):

use HTML::FormatText 3;
$ascii = HTML::FormatText->format_file(
    $filename,
    leftmargin => 0, rightmargin => 50
);

So, FormatText would do the job, I bet. Unfortunately, I've tried searching for it and I can't find the module. After looking at CPAN's information on the module, I conclude there is not a release available for Windows. Since LYNX seems to be dependent on this module, I guess I can assume LYNX is not available for Windows either.

Can anyone confirm this or show me where I might get it?

HTML::Parser is just not working out.
 
Problem 1 - Finding HTML::FormatText

Yes, HTML::FormatText is available for Windows; it is simply part of the HTML::Format package.

Whenever you can't find a module using PPM, do a search for it on CPAN. This is because some modules are part of larger packages, as is the case with this one. We find HTML::FormatText here:


You will notice in the link that it is part of HTML-Format-2.04. You therefore simply need to install this larger package to get HTML::FormatText.

Problem 2 - How to use HTML::FormatText

Well, I have never had a need for this particular functionality. Yet, I took a moment to run the quick example that they provided in their source, and it seems to work pretty well.

Code:
use HTML::FormatText;

my $text = HTML::FormatText->format_file(
    "test.html",
    leftmargin => 5, rightmargin => 50
);

print $text;

Just note that this module currently supports only limited HTML, so forms, tables, and other more advanced HTML will either be ignored or may throw errors.

Problem 3 - HTML::Parser

Being new to Tek-Tips, I only just now read this thread concerning your original project. It definitely sounds like HTML::Parser was not the module you were looking for, given your needs. I'm still not familiar with exactly what your overall goal is, so I can't say whether it might have worked given more familiarity with its functionality. But I can say that HTML::FormatText probably has a better chance of matching your requirements.

Good Luck.
 