
Stripping the styles and xml off WORD HTML? 1

Status
Not open for further replies.
Jul 13, 2001
180
US
Here's my scenario:

The company I work for will not purchase any other applications and does not intend to provide training in basic HTML, so WORD is the best tool I have for the secretaries at work who need to post pages. It is actually quite viable, since they already know WORD well. For quality control, I intend to place their WORD-generated HTML as an include in a templated page where I can dictate all the styles.
However, I have tried everything I can to override the styles that WORD creates, and the best I have managed is to change the text color, bold, or italic. Since WORD puts a style class with an absolute font size on EVERY paragraph, I cannot change it with my style sheets.
All I want is to be able to override any of the fonts, font colors, and font sizes that WORD has created in an HTML page that I will use as an include on a template.

Any help would be greatly appreciated.

Thanks,

Manny
 
*big smile*

I'm pretty sure you can do this using a global.asa file... I don't know exactly how they work, but I know my father was able to do it at the office with our secretaries... god, don't you hate Word? lol... What we're trying to do right now is set up an interface people can use to obtain CAD standards if they need help, all of our expense reports and forms, etc... but certain things, if they needed editing, were to be done by the secretaries... Hmmm... well, they don't know junk about HTML, so he did it that way...
Regards,
Anth:cool:ny
----------------------------------------
"You say insanity like it's a BAD THING!"
 
My company has a perl program I wrote that will strip most of the styles, etc. out of a word-generated html document. It's not a cgi program, it runs from the command line. If that will help, I'd be glad to post it (my boss told me I can offer it free to Tek-Tips members).
Tracy Dryden
tracy@bydisn.com

Meddle not in the affairs of dragons,
For you are crunchy, and good with mustard.
 
That's pretty sweet TSdragon...well worth a look.

I love it when bosses realize that using the internet for help can actually help a company... As of now, I've learned more stuff and gotten paid for it, but they're going to get more out of it than I'm taking out of my paycheck...
Regards,
Anth:cool:ny
----------------------------------------
"You say insanity like it's a BAD THING!"
 
OK, if anyone wants it, I'm posting the code. You need to have perl installed to run it, and you may need to change the top line depending on where your perl is installed. The existing top line is for the default install of ActivePerl. Remember that this is normally commercial software, being given free to Tek-Tips members, so please don't resell it. Email me if you find any problems or things it should do better/differently.
To run the program enter on the command line:
Code:
perl FixWord.pl filespec(s) [-U]
The filespecs can be one or more filenames, with or without wildcards. The optional -U switch also makes the program remove excess whitespace and newlines from the text.
Here's the code itself:
Code:
#!C:\perl\bin\perl

#---------------------------------------------------------------------
# Top Dragon FixWord
#	(c) 2001 Top Dragon Software
#	tracy@bydisn.com or www.bydisn.com/software
#---------------------------------------------------------------------

$| = 1; # flush buffer after every print
%Files = ();
$Unformat = 0;

%Symbols = (
	210 => '®',
	211 => '©',
	212 => '™',
	226 => '®',
	227 => '©',
	228 => '™',
);

# Get all arguments (filenames/filespecs/switches) from command line
foreach my $arg (@ARGV) {
	if ( $arg =~ /\A-(.+)/ ) { # if it's a switch
		if ( $1 eq "U" ) { # Unformat switch
			$Unformat = 1;
			next;
		}
	}
	if ( $arg =~ /\*/ ) { # if it's a filespec
		foreach my $file (glob $arg) { # expand filespecs
			next if ( $file =~ /\.bak\Z/ ); # ignore .bak (backup) files
			$Files{$file} = 1; # save filename as hash key (prevents duplicate filenames)
		}
	} else { # not a filespec
		next if ( $arg =~ /\.bak\Z/ ); # ignore .bak (backup) files
		$Files{$arg} = 1; # save filename as hash key (prevents duplicate filenames)
	}
}
@ARGV = (); # Remove all arguments (so perl won't see them)

# Now process the list of filenames stored as keys in the hash
foreach my $file (sort keys %Files) {
	unless ( -e "$file" ) {
		die "$0 Error\nCould not find file $file\nError: $!";
	}
	print "Processing $file\n";
	# delete backup (.bak) file if it exists
	if ( -e "$file.bak" ) {
		unlink "$file.bak" or
			die "$0 Error\nCould not delete old backup file $file.bak\nError: $!";
	}
	# rename input file to backup (.bak) file name
	rename "$file","$file.bak" or
		die "$0 Error\nCould not rename file $file to $file.bak\nError: $!";
	# fix the file
	FixFile($file);
}

exit 0; # quit
#---------------------------------------------------------------------
sub FixFile {

my($file) = @_;

local $/; # undefine input record separator (slurp mode) for this sub only

# open the input (.bak) file
open(INFILE, "<$file.bak") or
	die "$0 Error\nCould not open INFILE $file.bak\nError: $!";

# open the new output file (original file name)
open(OUTFILE, ">$file") or
	die "$0 Error\nCould not open OUTFILE $file\nError: $!";

my $doc = <INFILE>; # slurp in entire input file

# Fix up the html
$doc =~ s{<html[^>]*?>}{<html>}sgi; # fix <html> tag
if ( $doc =~ m{<title>.*</title>}si ) { # if there's a title (any case, may span lines)
	$doc =~ s{<head>.*(<title>.*</title>).*</head>}{<head>$1</head>}sgi; # remove everything in <head> section but <title>
} else { # there is NO title
	$doc =~ s{<head>.*</head>}{<head></head>}sgi; # remove everything in <head> section
}
$doc =~ s{<body[^>]*?>}{<body>}sgi; # fix <body> tag
$doc =~ s{<div[^>]*?>|</div>}{}sgi; # remove <div> and </div> tags
$doc =~ s{<p[^>]*?>}{<p>}sgi; # fix <p> tags
$doc =~ s{<span[^>]*?mso-spacerun[^>]*?>}{}sgi; # remove <span ... mso-spacerun...> tags
$doc =~ s{<span[^>]*?mso-char-type:symbol[^>]*?>(.*?)</span>}{GetSymbol($+)}sgie; # fix symbol characters
$doc =~ s{<span[^>]*?>|</span>|</p>}{}sgi; # remove <span>, </span>, </p> tags
$doc =~ s{<(ol)[^>]*?>|<(ul)[^>]*?>|<(li)[^>]*?>}{<$+>}sgi; # clean up <ol>, <ul> and <li> tags
$doc =~ s{\Q<![if\E.*?\Q<![endif]>\E}{}sgi; # remove if statements
$doc =~ s{<o:p>|</o:p>}{}sgi; # remove <o:p> and </o:p> tags
$doc =~ s{<!--.*?-->}{}sgi; # remove comments

if ( $Unformat ) { # if unformat switch (-U) specified
	# Fix up formatting (spaces and linefeeds)
	$doc =~ s|&nbsp;| |sgi; # convert non-breaking spaces to plain spaces
	$doc =~ s|\xa0| |sg; # convert non-breaking space characters (hex A0) to plain spaces
	$doc =~ s|\n{2,}|\x02|sg; # convert multiple linefeeds to single hex 02
	$doc =~ s|\n| |sg; # convert single linefeeds to spaces
	$doc =~ s| +| |sg; # convert multiple spaces to single spaces
	$doc =~ s|\x02|\n|sg; # convert hex 02 back to linefeed
}

print OUTFILE $doc; # print out the modified file

close(INFILE); # close the input file
close(OUTFILE); # close the output file

return 1;
} # FixFile
#---------------------------------------------------------------------
sub GetSymbol {

my($chars) = @_;
my $string = "";

foreach my $char ( split(/ */, $chars) ) {
	if ( exists($Symbols{ord($char)}) ) {
		$string .= $Symbols{ord($char)};
	} else {
		$string .= $char;
	}
}

return $string;

} # GetSymbol
#---------------------------------------------------------------------
1;
Tracy Dryden
tracy@bydisn.com

Meddle not in the affairs of dragons,
For you are crunchy, and good with mustard.
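
To see what the -U pass does on its own, here is a minimal sketch that applies the same six substitutions from FixFile to a string in memory instead of a file. The helper name `collapse_whitespace` and the sample text are made up for illustration; they are not part of the posted program.

```perl
use strict;
use warnings;

# Hypothetical helper: the substitutions mirror the -U block in FixFile,
# in the same order.
sub collapse_whitespace {
    my ($text) = @_;
    $text =~ s/&nbsp;/ /gi;     # non-breaking space entities -> plain spaces
    $text =~ s/\xa0/ /g;        # raw hex A0 characters -> plain spaces
    $text =~ s/\n{2,}/\x02/g;   # park runs of 2+ newlines on a placeholder byte
    $text =~ s/\n/ /g;          # remaining single newlines -> spaces
    $text =~ s/ +/ /g;          # squeeze runs of spaces down to one
    $text =~ s/\x02/\n/g;       # restore one newline per parked run
    return $text;
}

# Made-up sample: a wrapped line, padded spacing, and a blank-line gap.
my $sample = "<p>First&nbsp;&nbsp;line\nwraps   here\n\n\n<p>Second paragraph\n";
print collapse_whitespace($sample), "\n";
```

The placeholder byte is what lets blank-line gaps survive as single newlines while ordinary line wraps inside a paragraph are folded into spaces. This assumes the document never contains a literal hex 02 character.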
 
Thank you tsdragon, that is VERY generous of you and your boss. However, I need to find a solution that will let the secretaries be completely independent in terms of having to have their content prepared. Teaching them to use the command line would probably make me "crunchy, and good with mustard." : )

But just the same. I can't extend my gratitude any more in words.

Have a great weekend!


Manny
 
Yah, I didn't think it would be much use in your particular case. Wish I had a Word macro or something that would help, but I haven't done those in years. Maybe someone can take the above program and make a macro out of the important parts.
Tracy Dryden
tracy@bydisn.com

Meddle not in the affairs of dragons,
For you are crunchy, and good with mustard.
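
For anyone who wants to reuse the important parts elsewhere, here is a rough sketch of the paragraph-level cleanup pulled out into a single function that takes and returns a string. The name `strip_word_markup` and the sample markup are hypothetical. The patterns are adapted from FixFile, with one deliberate change: `<p\b` is used instead of `<p`, so tags such as `<pre>` are left alone. This is a starting point under those assumptions, not a finished tool.

```perl
use strict;
use warnings;

# Hypothetical helper: a subset of the FixFile substitutions, applied to a
# string so it can be called from other Perl code.
sub strip_word_markup {
    my ($doc) = @_;
    $doc =~ s{<p\b[^>]*?>}{<p>}sgi;               # drop class/style attributes on <p>
    $doc =~ s{<div[^>]*?>|</div>}{}sgi;           # remove <div> wrappers
    $doc =~ s{<span[^>]*?>|</span>|</p>}{}sgi;    # remove <span>, </span>, </p>
    $doc =~ s{<o:p>|</o:p>}{}sgi;                 # remove Office <o:p> tags
    $doc =~ s{<!--.*?-->}{}sgi;                   # remove comments
    return $doc;
}

# Made-up sample in the style of Word's HTML export.
my $sample = q{<p class=MsoNormal style='font-size:12.0pt'>}
           . q{<span style='font-family:Arial'>Hello<o:p></o:p></span></p>};
print strip_word_markup($sample), "\n";
```

Once the class and inline style attributes are gone, the paragraphs pick up whatever fonts and sizes the template's own style sheet sets, which is what the original question was after.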
 
Ok, tip... don't do macros... lol

At our office, anything that comes in with a macro has the macros deleted from the file, and any macros detected are looked at carefully, etc... Anyone can get into a system through a Word macro... (stupid idiotic programmers who leave holes in their code.)
Regards,
Anth:cool:ny
----------------------------------------
"You say insanity like it's a BAD THING!"
 