
HTML to Text with PERL. Is it possible? 4

Status
Not open for further replies.

SparceMatrix

Technical User
Mar 9, 2006
62
US
I need to convert a collection of HTML files to Text files. More precisely, I need to convert .html to .txt files, but I need them parsed first so that what you see in the browser from the .html file you now see in the .txt file.

I'm already doing this with a script using MS Word's scripting interface and the Win32 module, but I'd like something that functions with a little more economy.

Is it possible?
 
You can use HTML::Parser and maybe HTML::SimpleParse, although I don't know if they are any more efficient than what you are currently using.
 
Can you "cheat" by using the Unix browsers lynx or links, I wonder?
 
Can you "cheat" by using the Unix browsers lynx or links, I wonder?

Using a browser was my first line of investigation for this task. It doesn't appear to be possible in IE6, and it wouldn't surprise me if other browsers lacked the functionality too. There may also be a risk in letting a script drive the browser to write volumes of files on the client's system.

You can use HTML::Parser and maybe HTML::SimpleParse, although I don't know if they are any more efficient than what you are currently using.

I have ActiveState's PERL installed on my Windows XP Pro. I don't see the SimpleParse module in my HTML collection. My documentation for Parser is a little obtuse. How would I use Parser to return a parsed .txt file?

 
You can install the module using the ppm program that comes with ActiveState Perl. Go to a DOS prompt and type ppm; after the program starts, type help.
 
SparceMatrix,
You appear to have very little experience of browsers outside of the M$ world.
Lynx is a TEXT ONLY browser normally available for unix and linux environments. Just because IE6 does or does not do something does not mean that everything else is the same.

Obviously a .txt file is never going to have much in the way of layout control and font control available so you need to consider what you want the text files to actually look like.



Trojan.
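
For reference, a minimal sketch of the lynx route driven from Perl. It assumes lynx is installed and on the PATH, and that the Perl in use supports list-form pipe opens (older Windows builds may not). The sub names lynx_dump_command and dump_with_lynx are made up for illustration. The -dump option writes the rendered page to standard output, and -nolist suppresses the list of links lynx would otherwise append.

```perl
use strict;
use warnings;

# Hypothetical helper: build the lynx command line for one HTML file.
sub lynx_dump_command {
    my ($html_file) = @_;
    return ( 'lynx', '-dump', '-nolist', $html_file );
}

# Hypothetical helper: run lynx and save its rendered text to $txt_file.
sub dump_with_lynx {
    my ( $html_file, $txt_file ) = @_;
    my @cmd = lynx_dump_command($html_file);

    # Read lynx's standard output through a pipe (no shell involved).
    open( my $lynx, '-|', @cmd ) or die "Cannot run lynx: $!";
    open( my $out, '>', $txt_file ) or die "Cannot write $txt_file: $!";
    print {$out} $_ while <$lynx>;
    close($out)  or die "Cannot close $txt_file: $!";
    close($lynx) or die "lynx exited with status $?\n";
}

# Usage: perl lynxdump.pl page.html page.txt
dump_with_lynx(@ARGV) if @ARGV == 2;
```

Whether lynx's layout is close enough to what a graphical browser shows is something to check against a few of the real files.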
 
Netscape/Mozilla and IE6 can both save as text only. I don't know what the command line options are though for doing that.
 
On IE6, click "File -> Save As" and in the "Save as type" listbox, click "Text file" :)
 
Yes, thanks all. Right, I know that Lynx is a text-only browser, and I know that you can save as text from the IE6 menu. But I am driving these programs from a script through their internal API, and I have not been able to uncover that functionality for IE6 at least. I speculate that such functionality may be vulnerable to abuse and is therefore unavailable in browsers.

Installing the HTML::parse module is not what I need to know. I simply don't understand how to apply it in a script to process .html files into .txt files.
 
Use the advanced search feature of this forum and search for HTML::parse. I am sure you will find some examples; if not, post back and I will see what I can do to help if nobody else has answered.
 
OK ... I used "complete phrase", HTML::parse.


Here's something interesting ...


mountainbiker talks about something,

Code:
use HTML::Parse;
use HTML::FormatText;
$plain_text = HTML::FormatText->new->format(parse_html($html_text));

Which looks like a good clue, but if I had HTML::FormatText in my ActiveState for Windows PERL collection, I probably wouldn't have had to ask for help. He links to an O'Reilly PERL cookbook page that requires a password.

There is an available module from ActiveState for Windows:


I'll chew on this a bit.

Anybody have any experience with these modules applying them the way I intend?
 
Yeah, that link is to pirated material anyway; O'Reilly never allows their material to be uploaded like that. I recently wrote to the O'Reilly legal department and asked them about several websites that have much of their CD reference material posted online, and they informed me that all of them were infringing on O'Reilly's work.
 
Yes, I've seen downloads available for some of their CD collections. It is easy to forget that music and DVDs are not the only published material that can be pirated.

Any advice on applying HTML::FormatText or HTML::Parse in my particular context? FormatText looks pretty transparent, but I'm not sure how one would make Parse useful.
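
One way the two pieces could fit together, as a minimal sketch rather than a tested recipe for this setup. It assumes the HTML-Tree and HTML-Format distributions are installed, and it uses HTML::TreeBuilder to build the parse tree in place of the older parse_html function from the snippet above. The sub names html_to_text and convert_file, and the margin values, are made up for illustration.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;
use HTML::FormatText;

# Hypothetical helper: render an HTML string as plain text.
sub html_to_text {
    my ($html) = @_;
    my $tree = HTML::TreeBuilder->new_from_content($html);

    # The margins are arbitrary choices for this sketch.
    my $formatter = HTML::FormatText->new( leftmargin => 0, rightmargin => 72 );
    my $text = $formatter->format($tree);

    $tree->delete;    # release the parse tree
    return $text;
}

# Hypothetical helper: convert one .html file into one .txt file.
sub convert_file {
    my ( $html_file, $txt_file ) = @_;

    open( my $in, '<', $html_file ) or die "Cannot read $html_file: $!";
    my $html = do { local $/; <$in> };    # slurp the whole file
    close($in);

    open( my $out, '>', $txt_file ) or die "Cannot write $txt_file: $!";
    print {$out} html_to_text($html);
    close($out) or die "Cannot close $txt_file: $!";
}

# Usage: perl html2txt.pl page.html page.txt
convert_file(@ARGV) if @ARGV == 2;
```

The parser's job here is only to turn the markup into a tree; the formatter then walks that tree and does the layout, including collapsing runs of whitespace in ordinary text.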
 
IE only, the JavaScript to extract the text of an HTML document into a string variable is as simple as --
Code:
  var s = document.body.innerText;
This string can be passed through CGI to Perl for further processing. A quick test shows a CR/LF appended to the text of most individual HTML elements.
 
I don't take credit for this code, I got it from another forum a while back, but I just tested it and it works OK:

Code:
use strict;
use HTML::Parser;
my $file = 'index.html';
my $parser = HTML::Parser->new( text_h => [ \&text_handler,"self,dtext"],
                                start_document_h => [\&init, "self"] );

$parser->parse_file($file);
 
print @{$parser->{_private}->{text}};
 
sub init
{
   my ( $self ) = @_;
   $self->{_private}->{text} = [];
}
 
sub text_handler
{
    my ( $self, $text) = @_;
    push @{$self->{_private}->{text}}, $text;
}

maybe you can modify this to fit your needs.
 
I found that the above code works exactly as posted. All you have to do is replace "index.html" with some file of your own choosing. It will print the file parsed to text in the command line window. If you want to print to a file, simply make the changes I did below:

Code:
#!/usr/bin/perl
use strict;
use HTML::Parser;
my $file = 'MyTESTHTMLFile.htm';

#### Or use this $file
# open(my $file, "<:utf8", "020419IT.htm") || die;

my $parser = HTML::Parser->new( text_h => [ \&text_handler,"self,dtext"],
                                start_document_h => [\&init, "self"] );
								
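my $txt = 'PerlToHTMLTEST.txt';

# Open the output file with a lexical filehandle
open( my $out, '>', $txt ) or die "Cannot open $txt: $!";

# parse_file returns undef if the input file cannot be opened
$parser->parse_file($file) or die "Cannot parse $file: $!";

# Write the collected text chunks to the output file
print {$out} @{$parser->{_private}->{text}};
close($out) or die "Cannot close $txt: $!";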
 
sub init
{
   my ( $self ) = @_;
   $self->{_private}->{text} = [];
}
 
sub text_handler
{
    my ( $self, $text) = @_;
    push @{$self->{_private}->{text}}, $text;
}

I am having one little difficulty that I cannot fathom from the documentation here:


The text I am getting back is including spaces that I do not see in the browser parsed HTML. For example, here is some test HTML:

Code:
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1" />
<title>Untitled Document</title>
</head>

<body><CENTER><P><B>THIS IS CENTERED TEXT</B></P></CENTER>
<p>               THERE IS LEADING SPACES IN THIS MARKUP</p>
<p>THERE IS TRAILING SPACES IN THIS MARKUP                       </p>
<p>THERE ARE *                                * SPACES BETWEEN THE ASTERISKS IN THIS MARKUP </p>
</body>
</html>

Here is how the browser parses it:

Code:
[b]THIS IS CENTERED TEXT[/b]

THERE IS LEADING SPACES IN THIS MARKUP

THERE IS TRAILING SPACES IN THIS MARKUP 

THERE ARE * * SPACES BETWEEN THE ASTERISKS IN THIS MARKUP

Here is how Perl's HTML::Parser handles it, using the code above:

Code:
Untitled Document


THIS IS CENTERED TEXT
               THERE IS LEADING SPACES IN THIS MARKUP
THERE IS TRAILING SPACES IN THIS MARKUP                       
THERE ARE *                                * SPACES BETWEEN THE ASTERISKS IN THIS MARKUP

What changes can I make to the HTML::Parser object to get rid of these extra spaces?
 
The spaces are where the HTML code was removed. You can collapse extra whitespace using a regexp:

s/\s+/ /g;

 
I'm sure that's not right. The spaces are written in between the tags like content. The parser should be able to reduce those spaces just like any browser would. Isn't that the whole point of the parser? It seems like there should be a setting that changes it.
 
As far as I know, there is no option within the module to collapse extra whitespace in the text that gets returned. That is left up to you to do. You can do that with the regexp I posted. Maybe here:

Code:
sub text_handler
{
    my ( $self, $text) = @_;
    $text =~ s/\s+/ /g;    # collapse each run of whitespace to one space
    push @{$self->{_private}->{text}}, $text;
}


you can always do that later once the text has been completely parsed out of the html document if the above is not correct.
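
A minimal sketch of that "do it later" option, using only core Perl. It assumes the extracted text is already in a string, and that line breaks should survive while runs of spaces, leading and trailing spaces, and blank lines should not. The sub name collapse_whitespace is made up for illustration and is not part of HTML::Parser.

```perl
use strict;
use warnings;

# Hypothetical helper: tidy whitespace in text already extracted from HTML.
sub collapse_whitespace {
    my ($text) = @_;
    my @lines;
    for my $line ( split /\r?\n/, $text ) {
        $line =~ s/\s+/ /g;     # each run of whitespace becomes one space
        $line =~ s/^ | $//g;    # drop the single leading/trailing space left over
        push @lines, $line if length $line;    # skip lines that held only whitespace
    }
    return join( "\n", @lines ) . "\n";
}

# Sample input modelled on the parser output shown earlier in the thread.
my $raw = "Untitled Document\n\n\nTHIS IS CENTERED TEXT\n"
        . "               THERE IS LEADING SPACES IN THIS MARKUP\n"
        . "THERE ARE *                                * SPACES BETWEEN\n";
print collapse_whitespace($raw);
```

Working line by line keeps one line per block of text, whereas applying s/\s+/ /g to the whole string at once would also fold every newline into a space.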
 