Removing text between < > characters

Guest_imported · May 9, 2002

I wrote this to get rid of html tags in a bunch of text files that are all named similarly. It seems to access the first file and get stuck. What is wrong with my method of opening the file and reading it? I'm pretty sure there's something wrong with my second "while".

#!/usr/bin/perl -w
use strict;

undef $/; #Makes it read a file at a time instead of the default line at a time.

my $x = 0;
my $filename2 = "exxonURL" . $x;
my $dollar_one = "";

while($x <= 365) {
$filename2 = "exxonURL" . $x;
$filename2 = $filename2 . "\.htm";
open (TagDeletingStream, "> $filename2") || die "Can't open file: $!";
while ( <> ) {
m/(.*)/gis;
$dollar_one = $1;
$dollar_one =~ s/<[^>]+.//gis;
print TagDeletingStream $dollar_one;
}

$x++;
close (TagDeletingStream);
}

PaulBeckett · May 10, 2002

The generally accepted convention is for input/output stream names (filehandles) to be upper-case, i.e. TAGDELETINGSTREAM; this will help other people debugging your code.

Why define $filename2 right at the beginning? You are just placing the exact same value into it on the first loop iteration. Why not declare $filename2 with my inside the loop? This would be safer and more efficient. Also, where you define $filename2 in the loop, why not write

Code:

$filename2 = "exxonURL" . $x . ".htm";

I don't understand why you are looping with $x, surely you only want to loop as many times as filenames are supplied!

I'm not sure what you are trying to achieve with your first reg-ex, /(.*)/gis; this should be the same as simply assigning the whole variable.
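To illustrate that point, here is a minimal sketch comparing the capture-everything match with a plain assignment. The helper names copy_via_regex and copy_via_assignment are made up for this example and do not appear in the original script.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: copy a string via the capture-everything regex.
sub copy_via_regex {
    my ($text) = @_;
    my ($captured) = $text =~ m/(.*)/s;   # /s lets . match newlines too
    return $captured;
}

# Hypothetical helper: copy a string by plain assignment.
sub copy_via_assignment {
    my ($text) = @_;
    my $copy = $text;
    return $copy;
}

my $sample = "<b>line one</b>\nline two\n";
print copy_via_regex($sample) eq copy_via_assignment($sample)
    ? "identical\n"
    : "different\n";
```

Because .* is greedy and /s lets it cross newlines, the capture holds the entire string, so the match adds nothing over the assignment.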

I would use something like the following:

Code:

#!/usr/bin/perl -w
# Script to remove HTML tags. Beware if < > are present in text!
# ie <B>TEXT HERE</B> becomes: TEXT HERE
use strict;
my $x = 0;

foreach (@ARGV) {
  local $/= undef;               # Locally undefines $/, safer if you expand code later
  open (TDS,">exxonURL$x.htm");  # Open output file
  open (INPUT,"<$_");            # Open input file
  my $whole_file = <INPUT>;      # Slurp input file
  $whole_file =~ s/<.*?>//gis;   # strip out HTML tags
  print TDS $whole_file;         # output modified text
  close(INPUT);                  # close input file
  $x++;                          # Increment iterator

Hope this helps,
Paul

PaulBeckett · May 10, 2002

Sorry, I missed the closing curly bracket in my previous post (a cut-and-paste mistake). The post should have read:

Why define $filename2 right at the beginning? You are just placing the exact same value into it on the first loop iteration. Why not declare $filename2 with my inside the loop? This would be safer and more efficient.
Also, where you define $filename2 in the loop, why not write

Code:

$filename2 = "exxonURL" . $x . ".htm";

I don't understand why you are looping with $x and the condition <=365, surely you only want to loop as many times as filenames are supplied!

I'm not sure what you are trying to achieve with your first reg-ex /.*/gis, this should be the same as simply assigning the whole variable.

I would use something like the following:

Code:

#!/usr/bin/perl -w
# Script to remove HTML tags. Beware if < > are present in text!
# ie <B>TEXT HERE</B> becomes: TEXT HERE
use strict;
my $x = 0;  # Output file counter

foreach my $infile (@ARGV) {
  local $/ = undef;              # Locally undefines $/, safer if you expand code later
  open (INPUT, "<$infile") || die "Can't open $infile: $!";   # Open input file
  my $whole_file = <INPUT>;      # Slurp input file
  close (INPUT);                 # Close input file
  $whole_file =~ s/<.*?>//gis;   # Strip out HTML tags
  # Open the output only after reading, in case it has the same name as the input
  open (TDS, ">exxonURL$x.htm") || die "Can't open exxonURL$x.htm: $!";
  print TDS $whole_file;         # Output modified text
  close (TDS);                   # Close output file
  $x++;                          # Increment counter
}

Hope this helps,
Paul
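As a rough illustration of what the non-greedy substitution in the script above does, and of the "beware if < > are present in text" caveat, here is a small sketch. The helper name strip_tags is hypothetical and only wraps the same s/<.*?>// idea.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper wrapping the substitution from the script above.
sub strip_tags {
    my ($html) = @_;
    $html =~ s/<.*?>//gs;    # non-greedy: stop at the first > after each <
    return $html;
}

# An ordinary pair of tags is removed.
print strip_tags("<B>TEXT HERE</B>\n");

# With /s, a tag that is split across two lines is removed as well.
print strip_tags("<a\nhref=\"x.htm\">link</a>\n");

# Caveat: literal < and > in the text are treated as a tag, so the
# words between them are lost.
print strip_tags("if a < b and b > c then stop\n");
```

The last call shows why a regex is only an approximation here: everything from the first < to the next > is deleted, whether or not it was markup.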

MikeLacey · May 10, 2002

Nope --- you're not being lazy enough, nowhere near...

Have a look at the module HTML::Parser:

"Objects of the HTML::Parser class will recognize markup and separate it from plain text (alias data content) in HTML documents. As different kinds of markup and text are recognized, the corresponding event handlers are invoked."

Mike
______________________________________________________________________
"Experience is the comb that Nature gives us after we are bald."

Is that a haiku?
I never could get the hang
of writing those things.
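A minimal sketch of the HTML::Parser approach Mike mentions, assuming the module is installed from CPAN (it is not part of core Perl). The helper name text_only is made up for this example; it registers a text handler and collects whatever the parser reports as plain text.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::Parser;   # CPAN module, assumed to be installed

# Hypothetical helper: return only the text content of an HTML string.
sub text_only {
    my ($html) = @_;
    my $text = '';
    my $parser = HTML::Parser->new(
        api_version => 3,
        # Called for each run of plain text; "dtext" passes it entity-decoded.
        text_h => [ sub { $text .= shift }, 'dtext' ],
    );
    $parser->parse($html);
    $parser->eof;           # signal end of input and flush buffered text
    return $text;
}

print text_only('<p>Fish &amp; <b>chips</b></p>'), "\n";
```

Unlike the regex, the parser knows where tags begin and end, so a > inside a quoted attribute value is not mistaken for the end of a tag. Note that the contents of script and style elements are also reported as text, so a real script would probably want to skip those.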

Guest_imported · May 10, 2002

Thanks, Paul, but when I run it, it finishes without errors and does nothing. And since I don't know what you're doing with that <INPUT> thing, I'm not sure why. Do you have any more helpful ideas?

Guest_imported · May 10, 2002

Never mind. I solved the problem with the following line:
perl -i.bak -p -e 's/<.*?>/ /ig' *.htm

Found it at:

http://www.rice.edu/web/perl-edit.html

I love the internet.
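For anyone reading the one-liner later: -i.bak edits each file in place and keeps the original with a .bak extension, -p wraps the code in a loop that reads and prints each line, and -e supplies the code. Because it works one line at a time, a tag that is split across two lines is left alone; adding -0777 makes Perl read each file whole. The sketch below is roughly what the slurping variant does for one file. The helper name strip_file_in_place is made up for this example.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: roughly what
#   perl -i.bak -0777 -pe 's/<.*?>/ /gs' file.htm
# does for a single file. The original is kept as "$path.bak".
sub strip_file_in_place {
    my ($path) = @_;

    open(my $in, '<', $path) or die "Can't read $path: $!";
    my $html = do { local $/; <$in> };    # slurp the whole file
    close($in);

    rename($path, "$path.bak") or die "Can't back up $path: $!";

    $html =~ s/<.*?>/ /gs;                # /s lets a tag span several lines

    open(my $out, '>', $path) or die "Can't write $path: $!";
    print {$out} $html;
    close($out) or die "Can't close $path: $!";
    return;
}

strip_file_in_place($_) for @ARGV;
```

Run with no arguments it does nothing, just like a foreach over an empty @ARGV; pass the file names on the command line, for example *.htm.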
