Changing or deleting text

Status
Not open for further replies.

teser

Technical User
Mar 6, 2001
194
US
How would I write something in Perl to change sentences or paragraphs in all of my web page documents? All the documents have some similar sentences and paragraphs. On occasion I need to change or delete a specific sentence or paragraph, and I want to use Perl to apply the change to all the web page documents.
 
Why not use SSI (server-side includes)? Put the shared text in one file that is included in each document; then you just edit that file and all of the pages are updated in one go ;)

Andy
 
One example: remove all paragraphs that contain
the text "old material" from all HTML files under
"web_pages"

$ find web_pages -name \*.html -type f |
xargs perl -i -p -e 'BEGIN{undef $/}
s/<p>[^<]*old material[^<]*//sg'

--
pkiller
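
For readers who would rather see the substitution as a script than as a one-liner, here is a minimal sketch of the same idea. The sub name remove_paragraphs and the sample text are made up for illustration. Unlike the one-liner, this version also consumes the closing </p> tag when it directly follows the paragraph text, and it assumes the paragraph contains no nested tags.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: remove every <p> paragraph whose text contains $phrase.
# [^<]* stops at the next "<", so paragraphs with nested tags are not matched.
sub remove_paragraphs {
    my ($html, $phrase) = @_;
    # \Q...\E makes the phrase match literally, even if it contains
    # regex metacharacters such as "." or "(".
    $html =~ s{<p>[^<]*\Q$phrase\E[^<]*(?:</p>)?}{}sgi;
    return $html;
}

my $sample = "<p>keep this</p>\n<p>some old material here</p>\n<p>also kept</p>\n";
print remove_paragraphs($sample, 'old material');
```

The find/xargs part of the one-liner is still what walks the directory tree; this sketch only covers what happens to each file's contents.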
 
What is SSi? I tried looking it up in my book and did not find it.
The documents I will be editing are in Adobe and Word formats, with no HTML to worry about.
 
Replacing a paragraph in an Adobe or Word file could be tricky, because there are likely to be formatting codes, etc. that are embedded. If that's not really a problem, I have a program that currently processes Word-to-HTML documents that could be modified to do what you need.

Tracy Dryden
tracy@bydisn.com

Meddle not in the affairs of dragons,
For you are crunchy, and good with mustard.
 
tsdragon,

Please provide any info on your program. I believe it could work for what I need.

Thank you
 
Also, is there any way I can make it interactive, where I would get a prompt asking what sentence I want changed and then another prompt asking what I want to change the sentence to?
 
teser: I can email you a copy of the program if you want to see what you can do with it. It basically takes one or more filespecs on the command line and expands them into a file list, copies each of the files to a .bak file before it is processed, reads the entire file into a string, then does a lot of search-and-replaces on the file. It's pretty well commented, and could be modified by a reasonably good Perl programmer to do what you need. I haven't done much prompt-response coding, but we ought to be able to come up with something on this forum.

Tracy Dryden
tracy@bydisn.com

Meddle not in the affairs of dragons,
For you are crunchy, and good with mustard.
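
As a starting point for the prompt-and-replace idea, here is a rough sketch, not a finished tool. It assumes plain-text or HTML files (it will not work on binary Word or PDF files), and the sub names replace_sentence and prompt are invented for this example. The file handling mirrors the rename-to-.bak approach described above.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Replace every literal occurrence of $old in $text with $new.
# Returns the new text and the number of replacements made.
sub replace_sentence {
    my ($text, $old, $new) = @_;
    return ($text, 0) if $old eq '';    # an empty pattern has special meaning in s///
    my $count = ($text =~ s/\Q$old\E/$new/g);
    return ($text, $count || 0);
}

# Ask a question on the terminal and return the answer without the newline.
sub prompt {
    my ($question) = @_;
    print "$question ";
    my $answer = <STDIN>;
    die "No input received\n" unless defined $answer;
    chomp $answer;
    return $answer;
}

# Only prompt when file names were given on the command line.
if (@ARGV) {
    my $old = prompt('Sentence to change:');
    my $new = prompt('Change it to (leave empty to delete):');

    foreach my $file (@ARGV) {
        open(my $in, '<', $file) or die "Could not open $file: $!";
        my $text = do { local $/; <$in> };    # slurp the whole file
        close($in);

        my ($changed, $count) = replace_sentence($text, $old, $new);
        next unless $count;                   # leave files without a match alone

        rename($file, "$file.bak") or die "Could not back up $file: $!";
        open(my $out, '>', $file) or die "Could not write $file: $!";
        print {$out} $changed;
        close($out) or die "Could not close $file: $!";
        print "$file: $count replacement(s)\n";
    }
}
```

Saved under a name of your choosing, say change.pl, it would be run as "perl change.pl page1.html page2.html". The \Q...\E quoting matters here because sentences typed at a prompt usually contain periods and parentheses, which would otherwise be treated as regex operators.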
 
Okay, Thanks.

I will see what I can do with the script.
My email is teser3@hotmail.com

I guess my overall goal is to put this on a web page where we can modify paragraphs on multiple documents and email people about changes that were made.

Thanks!
 
teser: You might consider looking into a CVS server. With that service, people are able to download documents from the main server, modify them and reupload them. The CVS server tracks the changes, when they were made, who made them, any comments the modifier had, etc. and creates continuous diff backups of the files so you can revert to a previous modification at any time. I use CVS for nearly everything I do.

tsdragon: If you've got a Perl script that actually does conversion on the MS Word format (especially using just search/replace), I'd be very interested in seeing that.

brendanc@icehouse.net
 
sophisticate,

I am not familiar with a CVS server. What is it?? I am working off a Unix server and don't have a choice of servers.

teser
 
CVS stands for "Concurrent Versions System" and can be found at . CVS does not require a unique platform/independent system; instead, it is simply a set of files that will run on pretty much anything -- win32, linux, unix, etc. (so you should be able to use your current server).

The documentation on the above website does a pretty good job of summing it up.

Good luck,

brendanc@icehouse.net
 
OK, since people seem interested, here is the source for my MSWord-to-HTML cleanup program. It is invoked with one or more filenames and/or filespecs on the command line, and automatically creates .bak files of all files it processes. You can also add a -U switch to "unformat" the files (clean up multiple spaces, remove linefeeds, etc.). Let me know if you find any tags I haven't handled properly, or could handle better.

Note: this is one of my company's software products. It is being offered here free to Tek-Tips members only. Please do not resell it.

Code:
#!C:\perl\bin\perl

#---------------------------------------------------------------------
# Top Dragon FixWord
#	(c) 2001 Top Dragon Software
#	tracy@bydisn.com or www.bydisn.com/software
#---------------------------------------------------------------------

$| = 1; # flush buffer after every print
%Files = ();
$Unformat = 0;

%Symbols = (
	210 => '&reg;',
	211 => '&copy;',
	212 => '&trade;',
	226 => '&reg;',
	227 => '&copy;',
	228 => '&trade;',
);

# Get all arguments (filenames/filespecs/switches) from command line
foreach my $arg (@ARGV) {
	if ( $arg =~ /\A-(.+)/ ) { # if it's a switch
		if ( $1 eq "U" ) { # Unformat switch
			$Unformat = 1;
			next;
		}
	}
	if ( $arg =~ /\*/ ) { # if it's a filespec
		foreach my $file (glob $arg) { # expand filespecs
			next if ( $file =~ /\.bak\Z/ ); # ignore .bak (backup) files
			$Files{$file} = 1; # save filename as hash key (prevents duplicate filenames)
		}
	} else { # not a filespec
		next if ( $file =~ /\.bak\Z/ ); # ignore .bak (backup) files
		$Files{$arg} = 1; # save filename as hash key (prevents duplicate filenames)
	}
}
@ARGV = (); # Remove all arguments (so perl won't see them)

# Now process the list of filenames stored as keys in the hash
foreach my $file (sort keys %Files) {
	unless ( -e "$file" ) {
		die "$0 Error\nCould not find file $file\nError: $!";
	}
	print "Processing $file\n";
	# delete backup (.bak) file if it exists
	if ( -e "$file.bak" ) {
		unlink "$file.bak" or
			die "$0 Error\nCould not delete old backup file $file.bak\nError: $!";
	}
	# rename input file to backup (.bak) file name
	rename "$file","$file.bak" or
		die "$0 Error\nCould not rename file $file to $file.bak\nError: $!";
	# fix the file
	FixFile($file);
}

exit 0; # quit
#---------------------------------------------------------------------
sub FixFile {

my($file) = @_;

undef $/; # undefine input record separator

# open the input (.bak) file
open(INFILE, "<$file.bak") or
	die "$0 Error\nCould not open INFILE $file.bak\nError: $!";

# open the new output file (original file name)
open(OUTFILE, ">$file") or
	die "$0 Error\nCould not open OUTFILE $file\nError: $!";

$doc = <INFILE>; # slurp in entire input file

# Fix up the html
$doc =~ s{<html[^>]*?>}{<html>}sgi; # fix <html> tag
if ( $doc =~ m{<title>.*</title>} ) { # if there's a title
	$doc =~ s{<head>.*(<title>.*</title>).*</head>}{<head>$1</head>}sgi; # remove everything in <head> section but <title>
} else { # there is NO title
	$doc =~ s{<head>.*</head>}{<head></head>}sgi; # remove everything in <head> section
}
$doc =~ s{<body[^>]*?>}{<body>}sgi; # fix <body> tag
$doc =~ s{<div[^>]*?>|</div>}{}sgi; # remove <div> and </div> tags
$doc =~ s{<p[^>]*?>}{<p>}sgi; # fix <p> tags
$doc =~ s{<span[^>]*?mso-spacerun[^>]*?>}{}sgi; # remove <span ... mso-spacerun...> tags
$doc =~ s{<span[^>]*?mso-char-type:symbol[^>]*?>(.*?)</span>}{GetSymbol($+)}sgie; # fix symbol characters
$doc =~ s{<span[^>]*?>|</span>|</p>}{}sgi; # remove <span>, </span>, </p> tags
$doc =~ s{<(ol)[^>]*?>|<(ul)[^>]*?>|<(li)[^>]*?>}{<$+>}sgi; # clean up <ol>, <ul> and <li> tags
$doc =~ s{\Q<![if\E.*?\Q<![endif]>\E}{}sgi; # remove if statements
$doc =~ s{<o:p>|</o:p>}{}sgi; # remove <o:p> and </o:p> tags
$doc =~ s{<!--.*?-->}{}sgi; # remove comments

if ( $Unformat ) { # if unformat switch (-U) specified
	# Fix up formatting (spaces and linefeeds)
	$doc =~ s|&nbsp;| |sgi; # convert non-breaking spaces to plain spaces
	$doc =~ s|\xa0| |sg; # convert non-breaking space characters (0xA0) to spaces
	$doc =~ s|\n{2,}|\x02|sg; # convert multiple linefeeds to single hex 02
	$doc =~ s|\n| |sg; # convert single linefeeds to spaces
	$doc =~ s| +| |sg; # convert multiple spaces to single spaces
	$doc =~ s|\x02|\n|sg; # convert hex 02 back to linefeed
}

print OUTFILE $doc; # print out the modified file

close(INFILE); # close the input file
close(OUTFILE); # close the output file

return 1;
} # FixFile
#---------------------------------------------------------------------
sub GetSymbol {

my($chars) = @_;
my $string = "";

foreach my $char ( split(/ */, $chars) ) {
	if ( exists($Symbols{ord($char)}) ) {
		$string .= $Symbols{ord($char)};
	} else {
		$string .= $char;
	}
}

return $string;

} # GetSymbol
#---------------------------------------------------------------------
1;
Tracy Dryden
tracy@bydisn.com

Meddle not in the affairs of dragons,
For you are crunchy, and good with mustard.
 
Ah. I see. I was under the impression that the code actually read in the proprietary MSWord .doc format and did a conversion. Was wondering how you got your hands on that algorithm. ;-) Ah well. Useful code, nonetheless.

Thanks.

brendanc@icehouse.net
 
Tracy,

Just caught this:
Code:
next if ( $file =~ /\.bak\Z/ );
# ignore .bak (backup) files
$file was defined with my in an earlier foreach loop and has since gone out of scope... so at this point, there is no $file. Easily rectified, though...

Another thing I saw was that in the FixFile() subroutine, you use $doc and $Unformat in the global context, while you're so good about keeping things well-scoped throughout the rest of the program.

Small things, quite a fine job.

brendanc@icehouse.net
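
To make the scoping point concrete, here is a small sketch of the argument loop with "use strict" turned on. Under strict, the out-of-scope $file would stop the script at compile time with a "Global symbol "$file" requires explicit package name" error; without strict, Perl quietly treats it as an unset global, so the test never matches and a .bak file named explicitly on the command line slips through. The sub name collect_files is invented for this example.

```perl
#!/usr/bin/perl
use strict;     # turns an undeclared $file into a compile-time error
use warnings;

# Hypothetical helper: collect the file names to process, skipping backups.
sub collect_files {
    my @args = @_;
    my %files;                                   # hash keys prevent duplicates
    foreach my $arg (@args) {
        if ($arg =~ /\*/) {                      # a filespec: expand it
            foreach my $file (glob $arg) {       # $file exists only inside this loop
                next if $file =~ /\.bak\z/;
                $files{$file} = 1;
            }
        } else {                                 # a plain file name
            # $file is not visible here, so the test must use $arg.
            next if $arg =~ /\.bak\z/;
            $files{$arg} = 1;
        }
    }
    return sort keys %files;
}

print join("\n", collect_files('a.html', 'a.html.bak', 'b.html', 'a.html')), "\n";
```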
 
Brendan,

I knew posting the program here was a good idea. Thanks for catching those. You're exactly right about the $file being out of scope. That's what happens when you cut-n-paste code sometimes. Notice that the same line is used inside the foreach loop where it is in scope. Should have noticed that. Good eye.

You're also correct that $doc should be a "my" variable. Just sloppy on my part. Guess I'm not in the habit of using "my" on file input statements like that.

As for the $Unformat being global - I guess I could have passed it as a parameter to FixFile, but since it applies to all files, and is set in the main routine, but checked in FixFile, I just made it a global. There's got to be some use for globals, and I thought this was a good place to use one. Notice that the %Symbols hash is also global, even though it's only used in the GetSymbol routine and nowhere else. I made it a global in this case because it doesn't change once initialized, and so it would only be initialized once, no matter how many times GetSymbol is called. Another good use for a global IMHO.

Thanks again for catching the errors. I'll fix them.
Tracy Dryden
tracy@bydisn.com

Meddle not in the affairs of dragons,
For you are crunchy, and good with mustard.
 
I did notice the %Symbols thing, but there I agree with you -- a perfect use of a global var. I can see your point with $Unformat as well, so I'll retract that. Glad to be of help with the other items, though.

Take care,

brendanc@icehouse.net
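
For comparison, here is a sketch of the alternative discussed above: the lookup table as a file-scoped "my" variable (still initialized only once) and the unformat flag passed in as a named option. This is an illustration only, not the FixWord code. The sub names get_symbol and fix_text are made up, the defined-or operator (//) needs Perl 5.10 or later, and unlike the original GetSymbol this version keeps spaces in the symbol text.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# File-scoped lexical: initialized once and visible to the subs below,
# but not a package global that other code can reach by name.
my %Symbols = (
    210 => '&reg;',
    211 => '&copy;',
    212 => '&trade;',
);

# Map each character to its HTML entity if the table has one.
sub get_symbol {
    my ($chars) = @_;
    return join '', map { $Symbols{ord $_} // $_ } split //, $chars;
}

# The flag arrives as a named option instead of being read from a global.
sub fix_text {
    my ($doc, %opt) = @_;
    if ($opt{unformat}) {
        $doc =~ s/\n{2,}/\x02/g;   # protect paragraph breaks with a placeholder
        $doc =~ s/\n/ /g;          # single linefeeds become spaces
        $doc =~ s/ +/ /g;          # collapse runs of spaces
        $doc =~ s/\x02/\n/g;       # restore paragraph breaks as one linefeed
    }
    return $doc;
}

print fix_text("one\ntwo   three\n\n\nfour\n", unformat => 1), "\n";
print get_symbol(chr(211) . " 2001"), "\n";
```

Either style works for a script of this size; passing the option mostly pays off once the subs are reused from other programs or tested on their own.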
 
Thanks for all the effort. What will this Perl script do? I need an interactive one on the internet where I can make changes to sentences and/or paragraphs and email people about the changes. Will this do that? If not, can I do this with CGI Perl?? Or do I need to do it in Cold Fusion?

Many thanks again to you guys!
 
Sorry, this script isn't interactive; it runs from the command line. It could be turned into a CGI script, with input from a form, without too much trouble.

Don't know about Cold Fusion - I've never had the chance to use that.
Tracy Dryden
tracy@bydisn.com

Meddle not in the affairs of dragons,
For you are crunchy, and good with mustard.
 
Here's the FixWord script with Brendan's suggested corrections. Again, this is one of my company's software products. It is being offered here free to Tek-Tips members only. Please do not resell it.
Code:
#!C:\perl\bin\perl

#---------------------------------------------------------------------
# Top Dragon FixWord
#	(c) 2001 Top Dragon Software
#	tracy@bydisn.com or www.bydisn.com/software
#---------------------------------------------------------------------

$| = 1; # flush buffer after every print
%Files = ();
$Unformat = 0;

%Symbols = (
	210 => '&reg;',
	211 => '&copy;',
	212 => '&trade;',
	226 => '&reg;',
	227 => '&copy;',
	228 => '&trade;',
);

# Get all arguments (filenames/filespecs/switches) from command line
foreach my $arg (@ARGV) {
	if ( $arg =~ /\A-(.+)/ ) { # if it's a switch
		if ( $1 eq "U" ) { # Unformat switch
			$Unformat = 1;
			next;
		}
	}
	if ( $arg =~ /\*/ ) { # if it's a filespec
		foreach my $file (glob $arg) { # expand filespecs
			next if ( $file =~ /\.bak\Z/ ); # ignore .bak (backup) files
			$Files{$file} = 1; # save filename as hash key (prevents duplicate filenames)
		}
	} else { # not a filespec
		next if ( $arg =~ /\.bak\Z/ ); # ignore .bak (backup) files
		$Files{$arg} = 1; # save filename as hash key (prevents duplicate filenames)
	}
}
@ARGV = (); # Remove all arguments (so perl won't see them)

# Now process the list of filenames stored as keys in the hash
foreach my $file (sort keys %Files) {
	unless ( -e "$file" ) {
		die "$0 Error\nCould not find file $file\nError: $!";
	}
	print "Processing $file\n";
	# delete backup (.bak) file if it exists
	if ( -e "$file.bak" ) {
		unlink "$file.bak" or
			die "$0 Error\nCould not delete old backup file $file.bak\nError: $!";
	}
	# rename input file to backup (.bak) file name
	rename "$file","$file.bak" or
		die "$0 Error\nCould not rename file $file to $file.bak\nError: $!";
	# fix the file
	FixFile($file);
}

exit 0; # quit
#---------------------------------------------------------------------
sub FixFile {

my($file) = @_;

undef $/; # undefine input record separator

# open the input (.bak) file
open(INFILE, "<$file.bak") or
	die "$0 Error\nCould not open INFILE $file.bak\nError: $!";

# open the new output file (original file name)
open(OUTFILE, ">$file") or
	die "$0 Error\nCould not open OUTFILE $file\nError: $!";

my $doc = <INFILE>; # slurp in entire input file

# Fix up the html
$doc =~ s{<html[^>]*?>}{<html>}sgi; # fix <html> tag
if ( $doc =~ m{<title>.*</title>} ) { # if there's a title
	$doc =~ s{<head>.*(<title>.*</title>).*</head>}{<head>$1</head>}sgi; # remove everything in <head> section but <title>
} else { # there is NO title
	$doc =~ s{<head>.*</head>}{<head></head>}sgi; # remove everything in <head> section
}
$doc =~ s{<body[^>]*?>}{<body>}sgi; # fix <body> tag
$doc =~ s{<div[^>]*?>|</div>}{}sgi; # remove <div> and </div> tags
$doc =~ s{<p[^>]*?>}{<p>}sgi; # fix <p> tags
$doc =~ s{<span[^>]*?mso-spacerun[^>]*?>}{}sgi; # remove <span ... mso-spacerun...> tags
$doc =~ s{<span[^>]*?mso-char-type:symbol[^>]*?>(.*?)</span>}{GetSymbol($+)}sgie; # fix symbol characters
$doc =~ s{<span[^>]*?>|</span>|</p>}{}sgi; # remove <span>, </span>, </p> tags
$doc =~ s{<(ol)[^>]*?>|<(ul)[^>]*?>|<(li)[^>]*?>}{<$+>}sgi; # clean up <ol>, <ul> and <li> tags
$doc =~ s{\Q<![if\E.*?\Q<![endif]>\E}{}sgi; # remove if statements
$doc =~ s{<o:p>|</o:p>}{}sgi; # remove <o:p> and </o:p> tags
$doc =~ s{<!--.*?-->}{}sgi; # remove comments

if ( $Unformat ) { # if unformat switch (-U) specified
	# Fix up formatting (spaces and linefeeds)
	$doc =~ s|&nbsp;| |sgi; # convert non-breaking spaces to plain spaces
	$doc =~ s|\xa0| |sg; # convert non-breaking space characters (0xA0) to spaces
	$doc =~ s|\n{2,}|\x02|sg; # convert multiple linefeeds to single hex 02
	$doc =~ s|\n| |sg; # convert single linefeeds to spaces
	$doc =~ s| +| |sg; # convert multiple spaces to single spaces
	$doc =~ s|\x02|\n|sg; # convert hex 02 back to linefeed
}

print OUTFILE $doc; # print out the modified file

close(INFILE); # close the input file
close(OUTFILE); # close the output file

return 1;
} # FixFile
#---------------------------------------------------------------------
sub GetSymbol {

my($chars) = @_;
my $string = "";

foreach my $char ( split(/ */, $chars) ) {
	if ( exists($Symbols{ord($char)}) ) {
		$string .= $Symbols{ord($char)};
	} else {
		$string .= $char;
	}
}

return $string;

} # GetSymbol
#---------------------------------------------------------------------
1;
Tracy Dryden
tracy@bydisn.com

Meddle not in the affairs of dragons,
For you are crunchy, and good with mustard.
 