pkskytektip
Programmer
thread219-1220558
thread219-1303188
I am trying to use HTML::FormatText to convert HTML into text strings. I have applied the code from the included thread successfully, but I have come up against HTML pages that use non-break spaces in place of regular spaces in the text.
It's very strange: the non-break spaces do not show up in the HTML markup unless you view it in the original Word editor. These are files saved to HTML from Word. In the source code you can see the spaces as &nbsp;, but in a text editor or Dreamweaver they do not show up; they just appear as empty spaces.
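Since the non-break spaces look like ordinary spaces in an editor, one way to confirm they are really there is to make them visible from Perl. This is only a minimal sketch: it assumes the text is ISO-8859-1 (as the charset in the test page says), so each non-break space is the single byte 0xA0. The helper names show_nbsp and count_nbsp are made up for illustration and do not come from any module.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: replace each non-break space (0xA0 in ISO-8859-1)
# with the literal text "&nbsp;" so it can be seen in the output.
sub show_nbsp {
    my ($line) = @_;
    (my $visible = $line) =~ s/\xA0/&nbsp;/g;
    return $visible;
}

# Hypothetical helper: count the non-break spaces in a string.
sub count_nbsp {
    my ($line) = @_;
    my $count = () = $line =~ /\xA0/g;
    return $count;
}

# Made-up sample line with three non-break spaces in the middle.
my $sample = "This is a test.\xA0\xA0\xA0This is a test.";
print count_nbsp($sample), "\n";
print show_nbsp($sample), "\n";
```

Running the same helpers over each line read from a file should show where the non-break spaces sit, assuming the file is read as raw bytes.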
Here is the testing text:
Code:
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1" />
<title>THIS IS A TEST.</title>
</head>
<body> <p>This is a test. 10 nbsp's follow** **This is a test. 4 nbsp's follow** **
This is another line. 24 nbsp's follow newline
** **
This is another liine. 23 nbsp's follow** **This is a test. 3 nbsp's follow** **This is a test. 16 nbsp's follow
** ** newline.
** **newline
** **newline
This is another line.</p>
</body>
</html>
Here is the basic FormatText application that I am experimenting with:
Code:
#!/usr/bin/perl -w
use strict;
use HTML::TreeBuilder;
use HTML::FormatText;
my $tree = HTML::TreeBuilder->new->parse_file(shift @ARGV) || die "Hey, where is my input file? $!\n";
my $formatter = HTML::FormatText->new( leftmargin => 0, rightmargin => 100);
my $LegTxt = $formatter->format($tree);
$tree->delete;
print $LegTxt;
To run it, save the script above as TestCode.pl and the test HTML as TestCode.html, then use "perl TestCode.pl TestCode.html" so that the HTML file is passed to the script through @ARGV.
I do know that it is possible to match the nbsp using \p{Zs} in regular expressions, so the code below, which uses HTML::Parser, will collapse the non-breaking spaces.
Code:
#!/usr/bin/perl -w
use strict;
use HTML::Parser;
my $file = shift @ARGV;
my $parser = HTML::Parser->new(
    text_h           => [ \&text_handler, "self,dtext" ],
    start_document_h => [ \&init, "self" ],
);
$parser->parse_file($file);
print @{ $parser->{_private}->{text} };
# Start each document with an empty list of text chunks.
sub init
{
    my ($self) = @_;
    $self->{_private}->{text} = [];
}
# Collapse each run of space separators (nbsp included) and store the chunk.
sub text_handler
{
    my ( $self, $text ) = @_;
    # \p{Zs} is a Unicode property (the "space separator" category, which
    # includes nbsp); see page 121 of Mastering Reg Exp.
    $text =~ s/\p{Zs}+/ /g;
    push @{ $self->{_private}->{text} }, $text;
}
Run this the same way as above: save the script as TestCode.pl and the test HTML as TestCode.html, then use "perl TestCode.pl TestCode.html" so that the HTML file is passed to the script through @ARGV.
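To show just the \p{Zs} substitution on its own, without any HTML parsing, here is a minimal sketch. The helper name collapse_spaces and the sample string are made up for illustration; only the regular expression is taken from the text handler above.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: collapse every run of Unicode space separators
# (category Zs, which covers the ordinary space U+0020 and the
# non-break space U+00A0) into a single ordinary space.
sub collapse_spaces {
    my ($text) = @_;
    $text =~ s/\p{Zs}+/ /g;
    return $text;
}

# Made-up sample: a mix of non-break spaces and one ordinary space.
my $sample = "This is a test.\x{A0}\x{A0} \x{A0}This is a test.";
print collapse_spaces($sample), "\n";
```

Note that \p{Zs} does not match tabs or newlines, so line breaks in the text are left alone and only the runs of spaces are collapsed.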
So how can I make changes to HTML::FormatText or HTML::TreeBuilder so that I can collapse these non-break spaces?