How can you collapse non-break spaces using HTML::FormatText?

Status
Not open for further replies.

pkskytektip

Programmer
Apr 3, 2010
21
US
thread219-1220558
thread219-1303188

I am trying to use HTML::FormatText to process HTML in text strings. I have applied the code successfully in the included thread, but I have come up against HTML pages that use non-break spaces in place of regular spaces in the text.

Strangely, the non-break spaces do not show up in the HTML markup unless you view it in the original Word editor. These are files saved to HTML from Word. In the source code you can see the spaces as &nbsp;, but in a text editor or Dreamweaver they do not show up; they just appear as empty spaces.

Here is the testing text:
Code:
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1" />
<title>THIS IS A TEST.</title>
</head>

<body> <p>This is a test. 10 nbsp's follow**&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;**This is a test. 4 nbsp's follow**&nbsp;&nbsp;&nbsp;&nbsp;**

This is another line. 24 nbsp's follow newline
**&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;**
This is another liine. 23 nbsp's follow**&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;**This is a test. 3 nbsp's follow**&nbsp;&nbsp;&nbsp;**This is a test. 16 nbsp's follow
**&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;** newline.
**&nbsp;**newline
**&nbsp;**newline
This is another line.</p>
</body>
</html>

Here is the basic FormatText application that I am experimenting with:

Code:
#!/usr/bin/perl -w
use strict;
use HTML::TreeBuilder;
use HTML::FormatText;

my $tree = HTML::TreeBuilder->new->parse_file(shift @ARGV) || die "Hey, where is my input file? $!\n";

my $formatter = HTML::FormatText->new( leftmargin => 0, rightmargin => 100);
my $LegTxt = $formatter->format($tree);

$tree->delete; 

print $LegTxt;

To run it, save the code above as TestCode.pl and the test HTML as TestCode.html, then use "perl TestCode.pl TestCode.html" so that the HTML file name is supplied to the script through @ARGV.

I do know that it is possible to match the nbsp using \p{Zs} in regular expressions, so the code below, which uses HTML::Parser, will collapse the non-breaking spaces.

Code:
#!/usr/bin/perl -w
use strict;

use HTML::Parser;
my $file = shift @ARGV;

my $parser = HTML::Parser->new( text_h => [ \&text_handler,"self,dtext"],
                                start_document_h => [\&init, "self"] );
 
$parser->parse_file($file);
 
print  @{$parser->{_private}->{text}};

sub init
{
   my ( $self ) = @_;
   $self->{_private}->{text} = [];
}
 
sub text_handler
{
    my ( $self, $text) = @_;
    $text =~ s/\p{Zs}+/ /g; # \p{Zs} is the Unicode "space separator" general category (it includes the no-break space); see page 121 of Mastering Reg Exp.
    push @{$self->{_private}->{text}}, $text;
}

Run this the same way as above: "perl TestCode.pl TestCode.html", with the script saved as TestCode.pl and the test HTML saved as TestCode.html.
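
The \p{Zs} substitution used in text_handler above can be checked on its own. This is only a minimal core-Perl sketch: the helper name collapse_spaces and the sample string are made up for illustration, and it assumes the entity has already been decoded to the character U+00A0 (which is what the "dtext" argspec hands to the handler).

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not part of any module): collapse each run of
# Unicode space separators, including U+00A0, into one ASCII space.
sub collapse_spaces {
    my ($text) = @_;
    $text =~ s/\p{Zs}+/ /g;
    return $text;
}

# Made-up sample: four no-break spaces, then a mixed run of
# space / no-break space / space.
my $sample = "A\x{A0}\x{A0}\x{A0}\x{A0}B \x{A0} C";
my $result = collapse_spaces($sample);

print "$result\n";
```

Note that \p{Zs} covers spaces only; tabs and newlines are not in that category, so they would need \s or an explicit character class.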

So how can I make changes to HTML::FormatText or HTML::TreeBuilder so that I can collapse these non-break spaces?
 
I have been doing some experimentation and have tried the following solution.

Since HTML::TreeBuilder uses HTML::Parser as a base, the methods available in HTML::Parser should also be available to an object created from HTML::TreeBuilder. In fact, parse_file(shift @ARGV) is a method from HTML::Parser.

So, following the working example above that uses HTML::Parser alone, I call the handler method to install a subroutine, the same way the HTML::Parser constructor takes an overriding subroutine. I changed the HTML::TreeBuilder example like this:

Code:
#!/usr/bin/perl -w
use strict;
use HTML::TreeBuilder;
use HTML::FormatText;

my $tree = HTML::TreeBuilder->new->parse_file(shift @ARGV) || die "Hey, where is my input file? $!\n";

$tree->handler( text => \&my_text_handler, "self,text" ) ;

sub my_text_handler
{
    my ( $self, $text) = @_;
    $text =~ s/\p{Zs}+/ /g; # \p{Zs} is the Unicode "space separator" general category (it includes the no-break space); see page 121 of Mastering Reg Exp.
    #print $text;
    #push @{$self->{_private}->{text}}, $text;
}


my $formatter = HTML::FormatText->new( leftmargin => 0, rightmargin => 100);
my $LegTxt = $formatter->format($tree);

$tree->delete; 

print $LegTxt;

But this doesn't work. When debugging it, the subroutine my_text_handler is never called. Why not? Am I missing something in applying this method? Is the method not available under HTML::TreeBuilder?

Again, this code example is meant to take the sample above saved as a file as an argument on the command line: "Some Command Line>perl MyPerlCode.pl MySampleFile"

Any and all tips or clues would be appreciated.
 
I've done another experiment to see whether the subroutine I am applying in the TreeBuilder code can be used by HTML::Parser through the handler method, and it can. This works:

Code:
#!/usr/bin/perl -w
use strict;
use HTML::Parser;

my $file = shift @ARGV;

my $parser = HTML::Parser->new( start_document_h => [\&init, "self"] );

$parser->handler( text => \&my_text_handler, "self, dtext" ) ;
 
$parser->parse_file($file);

print  @{$parser->{_private}->{text}};

sub init
{
   my ( $self ) = @_;
   $self->{_private}->{text} = [];
    #my ( $self ) = @_;
    #$self = '';
}
 
sub my_text_handler
{
    my ( $self, $text) = @_;
    $text =~ s/\p{Zs}+/ /g; # \p{Zs} is the Unicode "space separator" general category (it includes the no-break space); see page 121 of Mastering Reg Exp.
    push @{$self->{_private}->{text}}, $text;
}

So why doesn't the TreeBuilder version work? Here it is again, repeated from the previous post:

Code:
#!/usr/bin/perl -w
use strict;
use HTML::TreeBuilder;
use HTML::FormatText;


my $tree = HTML::TreeBuilder->new->parse_file(shift @ARGV) || die "Hey, where is my input file? $!\n";

$tree->handler( text => \&my_text_handler, "self,text" ) ;

sub my_text_handler
{
    my ( $self, $text) = @_;
    $text =~ s/\p{Zs}+/ /g; # \p{Zs} is the Unicode "space separator" general category (it includes the no-break space); see page 121 of Mastering Reg Exp.
    #print $text;
    #push @{$self->{_private}->{text}}, $text;
}


my $formatter = HTML::FormatText->new( leftmargin => 0, rightmargin => 100);
my $LegTxt = $formatter->format($tree);

$tree->delete; 

print $LegTxt;

That doesn't work. The non-breaking spaces are not collapsed. Why not?
 
There are some errors in the above code.

First, this cannot work, because the file has already been parsed by the time the handler is registered:

Code:
my $tree = HTML::TreeBuilder->new->parse_file(shift @ARGV) || die "Hey, where is my input file? $!\n";

$tree->handler( text => \&my_text_handler, "self,text" ) ;

If it is to work at all, it has to be something like:

Code:
my $tree = HTML::TreeBuilder->new;

$tree->handler( text => \&my_text_handler, "self,text" );

$tree->parse_file(shift @ARGV) || die "Hey, where is my input file? $!\n";
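
The ordering point can be seen with plain HTML::Parser. The following is only a sketch with a made-up input string; it shows that a text handler registered after parsing is never called, and it is not a claim about how TreeBuilder behaves. With HTML::TreeBuilder there may be a further catch (an assumption on my part, not checked against the module source): TreeBuilder needs the text events itself to build the tree, so replacing its text handler could leave the tree without text nodes.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::Parser;

my $html = "<p>one&nbsp;&nbsp;two</p>";    # made-up sample input

# Handler registered BEFORE parsing: it receives the text events.
my @early;
my $p1 = HTML::Parser->new( api_version => 3 );
$p1->handler( text => sub { push @early, shift }, "dtext" );
$p1->parse($html);
$p1->eof;

# Handler registered AFTER parsing: the events have already been
# generated, so this handler is never called.
my @late;
my $p2 = HTML::Parser->new( api_version => 3 );
$p2->parse($html);
$p2->eof;
$p2->handler( text => sub { push @late, shift }, "dtext" );

printf "early handler calls: %d\n", scalar @early;
printf "late handler calls:  %d\n", scalar @late;
```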

But here is another attempt to pose the problem. It does not require a file as an argument:

Code:
#!/usr/bin/perl -w
use strict;
use HTML::TreeBuilder;
use HTML::FormatText;

my $formatter = HTML::FormatText->new;
my $tree = HTML::TreeBuilder->new;

#$tree->ignore_ignorable_whitespace(0);
#$tree->no_space_compacting(0);

my $test = "<p>NON-BREAKING&nbsp;&nbsp;&nbsp;&nbsp;SPACES&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;ARE&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;HERE --- REGULAR      SPACES      ARE      HERE</p>";

$tree->parse($test);
$tree->eof();

my $LegTxt = $formatter->format($tree);

$tree->delete; 

print $LegTxt;

When you run this, it prints out,

Code:
NON-BREAKING    SPACES      ARE      HERE --- REGULAR SPACES ARE HERE

Notice the two commented-out TreeBuilder method calls:

$tree->ignore_ignorable_whitespace(0);
$tree->no_space_compacting(0);

One might imagine they have something to do with my non-breaking spaces, but I've tried different combinations of settings and nothing changes, not even for the regular spaces.

How can I alter that code to collapse these non-breaking spaces?


 
This worked:

Open the source of the HTML::TreeBuilder module and find "sub text { ...". In that subroutine find,

Code:
$text =~ s/[\n\r\f\t ]+/ /g  # canonical space
            unless $no_space_compacting ;

It is line 1103 in my editor. Add "\p{Zs}" so that it looks like this,

Code:
$text =~ s/[\p{Zs}\n\r\f\t ]+/ /g  # canonical space
            unless $no_space_compacting ;

There may be another way of overriding it without breaking in, but it is beyond my abilities at this point.
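
One possible way to avoid editing the installed module is to clean up the text nodes after the tree has been built and before formatting. This is only a sketch under some assumptions: it uses the documented HTML::Element methods descendants and content_refs_list, the test string is a shortened made-up variant of the one above, and it collapses spaces everywhere, including inside elements such as pre where that may not be wanted.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::TreeBuilder;
use HTML::FormatText;

# Made-up sample, shortened from the test string in this thread.
my $html = "<p>NON-BREAKING&nbsp;&nbsp;&nbsp;&nbsp;SPACES&nbsp;&nbsp;ARE&nbsp;HERE</p>";

my $tree = HTML::TreeBuilder->new;
$tree->parse($html);
$tree->eof;

# Visit the root and every descendant element. Text nodes are the
# plain (non-reference) items in each element's content list.
for my $element ( $tree, $tree->descendants ) {
    for my $item_ref ( $element->content_refs_list ) {
        next if ref $$item_ref;         # skip child elements
        $$item_ref =~ s/\p{Zs}+/ /g;    # collapse nbsp and plain spaces
    }
}

my $formatter = HTML::FormatText->new( leftmargin => 0, rightmargin => 100 );
my $text = $formatter->format($tree);
$tree->delete;

print $text;
```

Because the substitution runs on the finished tree, it does not depend on which handlers TreeBuilder uses internally while parsing.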
 