Really Clean Up Word HTML

johnsmith98 · Jul 17, 2001

Posted to TekHints Perl forum.

I need a program that will really clean up Word HTML.

I want it to leave only the simple text tags like <html> and <body> (correctly opened and closed).

I need a program that can search and replace using wild cards.
Like Find all instances of and replace with a blank space.
Now THAT would save me a lot of time. And then I would be getting closer to cleaning up the Word HTML in a way where I don't have to Save as Text in Word. I hate doing it that way because that eliminates the img tags.

I do Save as Text in Word before Save for Web because then only one style is left and I can Find Replace that easily.

Does anyone have a program like this, or can you write one for me? It seems to me like it would not be difficult for someone with programming experience. A Find and Replace that can use a wild card: not a big demand, is it?

First Post
Really Clean up Word HTML

I am using Word 2000 and Dreamweaver 4.

How do I take a Word document that has JPGs in it and clean up all the code so that I am left with only the text, correct links to the images, and your most basic tags like and etc.

Even after Dreamweaver's Command / Clean Up Word HTML there is still so much crap code in there.

My most successful method is to work with two versions of the html files in Dreamweaver.
1. A version of the Word files that I did Save As Text and then Save for Web (the pictures are gone) then I use find replace to clean up the code. Find Replace works relatively quickly because by converting to text there is only one Style left.

2. A version of the Word file that I only did Save for Web to (with the pictures).

Then with both windows open in Dreamweaver I combine them. This is still very time consuming.

There MUST be a better way.

I am taking Word Documents (that have pictures in them) doing Save For Web to create HTML. I then open it in Dreamweaver and I do Command / Clean Up Word HTML. This is pretty good but not good enough.

I wouldn't do this by choice. My boss gave me the Word files and said Do It.

Tom
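
For the wildcard Find and Replace asked about above, Perl's regular expressions can do the job. Here is a minimal sketch, not a finished tool. The tag name `span`, the sub name `remove_tag`, and the sample text are assumptions made up for illustration, since the original post does not say which tag should be removed. The `[^>]*` part acts as the wild card: it matches whatever attributes sit between the tag name and the closing `>`.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Replace every opening or closing tag of the given name with one space,
# no matter which attributes the tag carries.
# remove_tag is a hypothetical helper name, not part of any module.
sub remove_tag {
    my ($html, $name) = @_;
    # </?      optional slash, so closing tags match too
    # \Q..\E   treat the tag name literally
    # \b       stop "p" from also matching "pre" or "param"
    # [^>]*    the wild card: anything up to the closing bracket
    $html =~ s{</?\Q$name\E\b[^>]*>}{ }gi;
    return $html;
}

# Made-up sample resembling Word output.
my $sample = q{<p><span style="font-size:10.0pt">Hello</span> <img src="pic.jpg"></p>};
print remove_tag($sample, 'span'), "\n";
```

The img tag is left alone because only tags named `span` are touched. The same substitution could probably be run over a whole file from the command line with something like `perl -0777 -pi.bak -e "s{</?span\b[^>]*>}{ }gi" file.html`, where `-0777` reads the file in one piece so tags that wrap across lines still match and `-i.bak` keeps a backup copy.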

tsdragon · Jul 17, 2001

I've got a perl program that runs from the command line and removes ALL html tags. It also cleans up non-breaking spaces, carriage returns, multiple spaces, and multiple linefeeds. I might be able to modify it to leave in a limited number of tags. It's not a really long or complicated program. BUT, you'll have to have perl installed on your computer to use it.
Tracy Dryden
tracy@bydisn.com

Meddle not in the affairs of dragons,
For you are crunchy, and good with mustard.

tsdragon · Jul 17, 2001

Just for the heck of it, I'll go ahead and post the program here. It takes TWO command line parameters: input file name and output file name.

Code:

#!C:\perl\bin\perl

undef $/; # undefine input record separator
open(INFILE, "<$ARGV[0]") or
	die "$0 Error\nCould not open INFILE $ARGV[0]\nError: $!";
open(OUTFILE, ">$ARGV[1]") or
	die "$0 Error\nCould not open OUTFILE $ARGV[1]\nError: $!";
$doc = <INFILE>; # slurp in entire input file
$doc =~ s|<.*?>||sg; # strip out ALL html tags
$doc =~ s|&nbsp;| |sg; # convert non-breaking space entities to plain spaces
$doc =~ s|\xa0| |sg; # convert non-breaking space characters (hex A0) to plain spaces
$doc =~ s|\r||sg; # remove carriage returns so CRLF line endings count as linefeeds
$doc =~ s|\n{2,}|\x02|sg; # convert multiple linefeeds to single hex 02
$doc =~ s|\n| |sg; # convert single linefeeds to spaces
$doc =~ s| +| |sg; # convert multiple spaces to single spaces
$doc =~ s|\x02|\n|sg; # convert hex 02 back to linefeed
print OUTFILE $doc; # print out the modified file
close(INFILE); # close the input file
close(OUTFILE); # close the output file
exit 0; # quit

Maybe others will want to offer suggested modifications too.
Tracy Dryden
tracy@bydisn.com

Meddle not in the affairs of dragons,
For you are crunchy, and good with mustard.
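
To see what the substitution sequence in the script above does, here is a small sketch that wraps the same steps in a sub and runs them on a short string instead of a file. The sub name `html_to_text` and the sample input are made up for this example, and the carriage-return step assumes the input may have CRLF line endings.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# html_to_text is a hypothetical name; the steps follow the script above.
sub html_to_text {
    my ($doc) = @_;
    $doc =~ s|<.*?>||sg;       # strip every tag, even ones that span lines
    $doc =~ s|&nbsp;| |g;      # non-breaking space entity to plain space
    $doc =~ s|\xa0| |g;        # non-breaking space character to plain space
    $doc =~ s|\r||g;           # assumption: drop CRs from CRLF line endings
    $doc =~ s|\n{2,}|\x02|g;   # mark each paragraph break with a placeholder
    $doc =~ s|\n| |g;          # remaining single linefeeds become spaces
    $doc =~ s| +| |g;          # collapse runs of spaces
    $doc =~ s|\x02|\n|g;       # turn each placeholder back into one linefeed
    return $doc;
}

# Made-up input: a wrapped line, extra spaces, and a blank line between paragraphs.
my $html = "<p>First&nbsp;paragraph,\r\nwrapped   line.</p>\r\n\r\n<p>Second <b>one</b>.</p>";
print html_to_text($html), "\n";
```

The hex 02 placeholder is what lets paragraph breaks survive: they are hidden before single linefeeds are turned into spaces and restored afterwards. One thing to watch for with Word files is that only the tags themselves are removed, so text between `<style>` and `</style>` would likely be left behind in the output.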

goBoating · Jul 17, 2001

If all you want to end up with is plain vanilla tags, then you might be able to extend the regex in this to do what you want. As written, it keeps any tag that starts with '<' or '</' followed by any of the strings in between the '|'s.

Code:

#!perl
open(IPF,"<junk.html") or die "Failed to open junk.html: $!\n";
while(<IPF>) { $buffer .= $_; }
close IPF;

# match each tag, send to sub routine, return good tag or null
# for replacement.
$buffer =~ s/<.*?>/&morph_tag($&)/gse;

# print new file.
open(OPF,">junk2.html") or die "Failed to open junk2.html: $!\n";
print OPF "$buffer\n";
close OPF;

sub morph_tag
{
my $tag = shift;
# The names are grouped so the '^<' anchor applies to all of them, and the
# lookahead stops 'p' or 'a' from matching inside names like <span> or <pre>.
# The tag is returned unchanged so image file names keep their case.
if ($tag =~ /^<\/?(?:img|html|head|title|body|p|table|tr|td|b|a)(?=[\s\/>])/is) { return $tag; }
else { return ''; }
}

HTH

keep the rudder amid ship and beware the odd typo
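
Building on the keep-list idea above, here is a rough sketch of how the kept tags could also be stripped of Word's class and style attributes while src and href survive, so the image links still work. The sub names `clean_tag` and `clean_html`, the attribute keep-list, and the sample markup are assumptions for illustration, and regexes like these are not a full HTML parser.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Tag names to keep (same list as in the post above).
my $KEEP = qr/img|html|head|title|body|p|table|tr|td|b|a/i;
# Attributes to keep on those tags (assumption: enough for links and images).
my $KEEP_ATTR = qr/src|href|alt/i;

# Return a cleaned version of one tag, or '' if the tag is not on the list.
sub clean_tag {
    my ($tag) = @_;
    # The lookahead requires the name to end here, so <span> or <o:p> never
    # passes as <p> or <a>.
    return '' unless $tag =~ m{^<(/?)($KEEP)(?=[\s/>])};
    my ($slash, $name) = ($1, lc $2);
    return "</$name>" if $slash;

    # Copy over only the wanted attributes, leaving their values untouched.
    my $attrs = '';
    while ($tag =~ m{\s($KEEP_ATTR)\s*=\s*("[^"]*"|'[^']*'|[^\s>]+)}g) {
        $attrs .= ' ' . lc($1) . '=' . $2;
    }
    return "<$name$attrs>";
}

sub clean_html {
    my ($html) = @_;
    $html =~ s{<!--.*?-->}{}gs;                          # comments
    $html =~ s{<(style|script|xml)\b.*?</\1\s*>}{}gis;   # blocks that hold no page text
    $html =~ s{(<[^>]*>)}{clean_tag($1)}ge;              # filter the remaining tags
    return $html;
}

# Made-up sample in the style of Word's Save for Web output.
my $word_html = <<'END';
<html><head><style><!-- p.MsoNormal {margin:0in;} --></style></head>
<body><p class=MsoNormal style='margin-left:.5in'><span lang=EN-US>See
<a href="page2.htm">the <b>next</b> page</a><o:p></o:p></span></p>
<p class=MsoNormal><IMG width=200 SRC="./doc_files/image001.jpg"></p>
</body></html>
END
print clean_html($word_html);
```

Comments and style blocks are removed before the tag filter runs because their contents are not tags and would otherwise end up in the page as visible text. Tag and attribute names are lowercased, but attribute values are copied as they are, so image file names keep their original case.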
