HTML to Text with PERL. Is it possible?

SparceMatrix · Apr 20, 2006

I need to convert a collection of HTML files to Text files. More precisely, I need to convert .html to .txt files, but I need them parsed first so that what you see in the browser from the .html file you now see in the .txt file.

I'm already doing this with a script using MS Word's scripting interface and the Win32 module, but I'd like something that functions with a little more economy.

Is it possible?

KevinADC · Apr 20, 2006

You can use HTML::Parser and maybe HTML::SimpleParse, although I don't know if they are any more efficient than what you are currently using.

wardy66 · Apr 20, 2006

Can you "cheat" by using the Unix browsers lynx or links, I wonder?

SparceMatrix · Apr 21, 2006

Can you "cheat" by using the Unix browsers lynx or links, I wonder?

Using a browser was my first investigation into accomplishing this task. It doesn't appear possible in IE6. It wouldn't surprise me if other browsers lacked similar functionality. There may be the risk of applying the browser against the client's system by writing volumes of files.

you can use HTML::Parser and maybe HTML::SimpleParse, although I don't know if they are any more efficient than what you are currently using.

I have ActiveState's PERL installed on my Windows XP Pro. I don't see the SimpleParse module in my HTML collection. My documentation for Parser is a little obtuse. How would I use Parser to return a parsed .txt file?

KevinADC · Apr 21, 2006

You can install the module using the ppm program that comes with ActiveState Perl. Go to a DOS prompt and type ppm; after the program starts, type help.

TrojanWarBlade · Apr 21, 2006

SparceMatrix,
You appear to have very little experience of browsers outside of the M$ world.
Lynx is a TEXT ONLY browser normally available for unix and linux environments. Just because IE6 does or does not do something does not mean that everything else is the same.

Obviously a .txt file is never going to have much in the way of layout control and font control available so you need to consider what you want the text files to actually look like.

Trojan.

KevinADC · Apr 21, 2006

Netscape/Mozilla and IE6 can both save as text only. I don't know what the command line options are though for doing that.

Kirsle · Apr 21, 2006

On IE6, click "File -> Save As" and in the "Save as type" listbox, click "Text file"

KevinADC · Apr 21, 2006

Yes, but that's not a command line option.

SparceMatrix · Apr 21, 2006

Yes, thanks all. Right, I know that Lynx is a text only browser and I know that you can save from IE6 using the browser. But rather obviously, I am applying these programs from their internal API in a script and I have not been able to uncover that functionality for IE6 at least and I speculate that such functionality may be vulnerable to corruption, and therefore, unavailable in browsers.

Installing the HTML::Parse module is not what I need to know. I simply don't understand how to apply it in a script to process .html files into .txt files.

KevinADC · Apr 21, 2006

Use the advanced search feature of this forum and search for HTML::Parse. I am sure you will find some examples; if not, post back and I will see what I can do to help if nobody else has answered.

SparceMatrix · Apr 21, 2006

OK ... I used "complete phrase", HTML::Parse.

http://www.tek-tips.com/search.cfm

Here's something interesting ...

http://www.tek-tips.com/viewthread.cfm?qid=826867

mountainbiker talks about something,

Code:

use HTML::Parse;
use HTML::FormatText;
$plain_text = HTML::FormatText->new->format(parse_html($html_text));

Which looks like a good clue, but if I had HTML::FormatText in my ActiveState for Windows PERL collection, I probably wouldn't have had to ask for help. He links to an O'Reilly PERL cookbook page that requires a password.

There is an available module from ActiveState for Windows:

http://aspn.activestate.com/ASPN/CodeDoc/HTML-Format/HTML/FormatText.html

I'll chew on this a bit.

Anybody have any experience with these modules applying them the way I intend?

KevinADC · Apr 21, 2006

Yeah, that link is to pirated material anyway; O'Reilly never allows their material to be uploaded like that. I recently wrote to the O'Reilly legal department and asked them about several websites that have much of their CD reference material posted online, and they informed me that all of them were infringing on O'Reilly's work.

SparceMatrix · Apr 21, 2006

Yes, I've seen downloads available for some of their CD collections. It is easy to forget that music and DVDs are not the only published material that can be pirated.

Any advice on applying HTML::FormatText or HTML::Parse in my particular context? FormatText looks pretty transparent, but I'm not sure how one would make Parse useful.
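For reference, here is a minimal sketch of how HTML::FormatText might be applied to this task. It assumes the HTML-Format and HTML-Tree distributions are installed, and it swaps in HTML::TreeBuilder to build the parse tree instead of HTML::Parse. The helper name html_to_text and the margin values are only illustrative.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;
use HTML::FormatText;

# Hypothetical helper: takes a string of HTML, returns formatted plain text.
sub html_to_text {
    my ($html) = @_;

    # Build a parse tree from the HTML string.
    my $tree = HTML::TreeBuilder->new_from_content($html);

    # Margins are arbitrary choices for this sketch.
    my $formatter = HTML::FormatText->new( leftmargin => 0, rightmargin => 72 );
    my $text = $formatter->format($tree);

    $tree->delete;    # release the parse tree
    return $text;
}

print html_to_text('<p>   Some    <b>bold</b>   text </p><p>Second paragraph</p>');
```

To convert a file rather than a string, HTML::TreeBuilder->new_from_file($path) can be used to build the tree, and the returned text printed to an output file handle.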

wray · Apr 21, 2006

IE only, the JavaScript to extract the text of an HTML document into a string variable is as simple as --

Code:

  var s = document.body.innerText;

This string can be passed through cgi to Perl for further processing. A quick test shows a cr/lf appended to the text of most individual HTML elements.

KevinADC · Apr 21, 2006

I don't take credit for this code, I got it from another forum a while back, but I just tested it and it works OK:

Code:

use strict;
use HTML::Parser;
my $file = 'index.html';
my $parser = HTML::Parser->new( text_h => [ \&text_handler,"self,dtext"],
                                start_document_h => [\&init, "self"] );
 
                                                                                
$parser->parse_file($file);
 
print @{$parser->{_private}->{text}};
 
sub init
{
   my ( $self ) = @_;
   $self->{_private}->{text} = [];
}
 
sub text_handler
{
    my ( $self, $text) = @_;
    push @{$self->{_private}->{text}}, $text;
}

maybe you can modify this to fit your needs.
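A variation on the code above, offered as a rough sketch rather than a definitive version: it collects the text in a closure instead of the parser's internal hash, parses from a string, and asks the parser to skip the contents of script and style elements, which would otherwise show up in the output as text. The helper name extract_text is made up for illustration.

```perl
use strict;
use warnings;
use HTML::Parser;

# Hypothetical helper: returns the text content of an HTML string.
sub extract_text {
    my ($html) = @_;
    my @chunks;

    my $parser = HTML::Parser->new(
        api_version => 3,
        # "dtext" hands each run of text to the handler with entities decoded.
        text_h => [ sub { push @chunks, shift }, 'dtext' ],
    );

    # Do not report anything found inside these elements.
    $parser->ignore_elements(qw(script style));

    $parser->parse($html);
    $parser->eof;    # signal end of input so buffered text is flushed

    return join '', @chunks;
}

print extract_text('<p>Fish &amp; chips</p><script>var x = 1;</script>'), "\n";
```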

SparceMatrix · May 24, 2006

I found that the above code works exactly as posted. All you have to do is replace "index.html" with some file of your own choosing. It will print the file parsed to text in the command line window. If you want to print to a file, simply make the changes I did below:

Code:

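#!/usr/bin/perl
use strict;
use warnings;
use HTML::Parser;

# Input: a file name, or an already opened file handle (parse_file accepts both).
my $file = 'MyTESTHTMLFile.htm';

#### Or use this $file
# open(my $file, "<:utf8", "020419IT.htm") || die;

# text_h is called for each run of text; "dtext" passes it with entities decoded.
# start_document_h is called once, before parsing begins.
my $parser = HTML::Parser->new( text_h => [ \&text_handler,"self,dtext"],
                                start_document_h => [\&init, "self"] );

my $txt = 'PerlToHTMLTEST.txt';

open ( my $out, '>', $txt) or die "Cannot write $txt: $!";

# parse_file returns undef if the input file cannot be opened.
$parser->parse_file($file) or die "Cannot parse $file: $!";

# Write every collected text chunk to the output file.
print {$out} @{$parser->{_private}->{text}};
close($out) or die "Cannot close $txt: $!";

# Start each document with an empty list of text chunks.
sub init
{
   my ( $self ) = @_;
   $self->{_private}->{text} = [];
}

# Append each run of text to the list.
sub text_handler
{
    my ( $self, $text) = @_;
    push @{$self->{_private}->{text}}, $text;
}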

I am having one little difficulty that I cannot fathom from the documentation here:

http://aspn.activestate.com/ASPN/docs/ActivePerl/5.8/site/lib/HTML/Parser.html

The text I am getting back is including spaces that I do not see in the browser parsed HTML. For example, here is some test HTML:

Code:

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1" />
<title>Untitled Document</title>
</head>

<body><CENTER><P><B>THIS IS CENTERED TEXT</B></P></CENTER>
<p>               THERE IS LEADING SPACES IN THIS MARKUP</p>
<p>THERE IS TRAILING SPACES IN THIS MARKUP                       </p>
<p>THERE ARE *                                * SPACES BETWEEN THE ASTERISKS IN THIS MARKUP </p>
</body>
</html>

Here is how the browser parses it:

Code:

THIS IS CENTERED TEXT

THERE IS LEADING SPACES IN THIS MARKUP

THERE IS TRAILING SPACES IN THIS MARKUP 

THERE ARE * * SPACES BETWEEN THE ASTERISKS IN THIS MARKUP

Here is how Perl's HTML::Parser parses it as I find it in the code above:

Code:

Untitled Document


THIS IS CENTERED TEXT
               THERE IS LEADING SPACES IN THIS MARKUP
THERE IS TRAILING SPACES IN THIS MARKUP                       
THERE ARE *                                * SPACES BETWEEN THE ASTERISKS IN THIS MARKUP

What changes can I make in the HTML::Parser objects to get rid of these other spaces?

KevinADC · May 24, 2006

The spaces are where the HTML code was removed. You can collapse extra white space using a regexp:

s/\s+/ /g;

SparceMatrix · May 24, 2006

I'm sure that's not right. The spaces are written in between the tags like content. The parser should be able to reduce those spaces just like any browser would. Isn't that the whole point of the parser? It seems like there should be a setting that changes it.

KevinADC · May 24, 2006

As far as I know there is no option within the module to collapse extra white space in the text that gets returned. That is left up to you to do. You can do that with the regexp I posted. Maybe here:

Code:

sub text_handler
{
    my ( $self, $text) = @_;
    $text =~ s/\s+/ /g;    # collapse each run of whitespace to a single space
    push @{$self->{_private}->{text}}, $text;
}

If the above is not correct, you can always do that later, once the text has been completely parsed out of the HTML document.
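As a rough sketch of the "do it later" approach: the regexp alone still leaves a single leading or trailing space on a chunk, and it turns newline-only chunks into lone spaces. The helper below (collapse_whitespace is a made-up name, not part of HTML::Parser) also trims each chunk and drops the empty ones. The sample chunks are copied from the test output earlier in the thread.

```perl
use strict;
use warnings;

# Hypothetical helper: normalise one chunk of extracted text.
sub collapse_whitespace {
    my ($text) = @_;
    $text =~ s/\s+/ /g;    # any run of spaces, tabs or newlines -> one space
    $text =~ s/^ //;       # drop the single leading space, if any
    $text =~ s/ $//;       # drop the single trailing space, if any
    return $text;
}

# Stand-ins for the chunks collected by text_handler.
my @chunks = (
    "               THERE IS LEADING SPACES IN THIS MARKUP",
    "THERE IS TRAILING SPACES IN THIS MARKUP                       ",
    "THERE ARE *                                * SPACES BETWEEN THE ASTERISKS IN THIS MARKUP ",
    "\n",
);

# Collapse each chunk, discard chunks that were whitespace only,
# and write one chunk per line.
my @lines = grep { length } map { collapse_whitespace($_) } @chunks;
print join("\n", @lines), "\n";
```

Putting one chunk per line is a simplification: text split across inline tags such as <B> arrives as separate chunks, so a fuller version would need to know which tags end a block.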
