RegExp to remove entire HTML tag

SnaveBelac · May 25, 2004

Can someone help me construct a regexp to remove an entire HTML tag (no matter what the tag) if the tag contains a specific word?

For example:

<tag option="value">some text with "word" in it</tag>

Since this tag contains my "word", the whole element should be removed, from <tag ...> through </tag>.

Any suggestions?

Thanks in advance

----------------------------
SnaveBelac - Adventurer
----------------------------

h011ywood · May 25, 2004

$html = qq{
blah blah blah blah < tag sagsdgasdg WORD asdf adgasdgadg /tag > blah
this that this that this that< tag checkechkechke name=blah value=1 type=hidden sagsdgasdg WORD asdf adgasdgadg /tag >
after after after
};

$html =~ s/\<\s*tag.*(WORD).*\/tag\s*\>//g;

print $html;

That should remove the whole tag as many times as it appears in $html.
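
One thing worth checking with this pattern: .* is greedy, so if two tags sit on the same line and only the later one contains WORD, the match can start at the first tag and swallow everything in between. A small sketch of that case follows; strip_greedy is a hypothetical helper name, the sample line is made up, and the substitution is the one above with the redundant backslashes dropped.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): wraps the substitution above.
sub strip_greedy {
    my ($html) = @_;
    $html =~ s/<\s*tag.*(WORD).*\/tag\s*>//g;
    return $html;
}

# Two tags on one line; only the second one contains WORD.
my $line = 'A < tag x=1 /tag > keep me < tag WORD /tag > B';

# The greedy .* starts matching at the first "< tag", so "keep me" is removed too.
print '[', strip_greedy($line), "]\n";
```

If that matters for your data, non-greedy quantifiers (.*?) or a character class that cannot cross a tag boundary are the usual ways to narrow the match.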

domster · May 25, 2004

But the word isn't inside the tag itself; it's in the tag's content. So if $word is the word to find and $html is the HTML string:

# capture only the tag name, so that \1 can match the closing tag
$html =~ s|<(\w+)[^>]*>[^<]*\Q$word\E[^<]*</\1>||g;

should do it.

Domster

dchoulette · May 27, 2004

I do not think it will work with this example:
<tag option="value">some <b>text</b> with "word" in it</tag>

--------------------

Denis

domster · May 27, 2004

True, I was assuming no intervening embedded tags. If there may be any, swap each [^<]* for .*? instead.

Dom
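
A rough sketch of the .*? variant, for anyone who wants to try it. remove_element is a hypothetical helper name; the pattern captures just the tag name with (\w+) so that \1 can match the closing tag, and it adds /s (so . can cross newlines) and \Q...\E (so the word is matched literally). Both of those are additions on my part, not something stated in the thread.

```perl
use strict;
use warnings;

# Hypothetical helper: removes <name ...>...</name> when the body contains $word.
sub remove_element {
    my ($html, $word) = @_;
    # (\w+) captures the tag name; .*? is lazy; /s lets . match newlines.
    $html =~ s|<(\w+)[^>]*>.*?\Q$word\E.*?</\1>||gs;
    return $html;
}

# Denis's example: the embedded <b>...</b> no longer blocks the match.
my $nested = '<tag option="value">some <b>text</b> with "word" in it</tag>';
print '[', remove_element($nested, 'word'), "]\n";

# Caveat: lazy matching can still start at an earlier element of the same
# name, so here the first <p> is removed along with the second.
my $siblings = '<p>keep</p><p>word</p>';
print '[', remove_element($siblings, 'word'), "]\n";
```

The second case is why a regex only goes so far here: it has no notion of where one element ends and the next begins.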

rharsh · May 27, 2004

I don't know if there's a really good way to do this. Denis is right: there are a number of examples that the previous solutions would not work on. I'm sure there are situations this won't work on either, but it seems to work reasonably well.

Code:

my $input = qq{<div><b><p>Beginning text<tag option="value">some <b>text</b> with "word" in it</tag>\nAnd here's some after</p><b>Blah Blah Blah</b><sometag option="value">Here's the \n"word" again</sometag>Here's the last bit</b></div>};
my $target = 'word';

print "Before:\n$input\n\n";
while($input =~ s/(.+)<(\w+)(?!.*?\2.*?$target).*?$target.*?\2>/$1/isg) {}
print "After:\n$input\n";

waiterm · May 27, 2004

SnaveBelac was asking this question on my behalf. I do not need to worry about intervening embedded tags, just so long as it takes out the entire tags surrounding the string, although it would probably be useful to have both approaches in the future. I basically use:

Code:

$string =~ s/<[^>]*>//g;

to remove all the tags, so I need the other tags kept intact for this to work. The actual text I'm regexing is based mainly in tables, and although I've been recommended to use HTML::TableParser, I have no idea how to use this module when a list uses two table types. For example:

http://www.scoot.co.uk/business/res...ers&a=27260&ae=AL1&s=V2C&cp=&sid=&sno=&st=1&.

I simply need to extract Names, Telephone numbers and Addresses and filter out all the other junk.
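
As a sketch of how the two steps described above might fit together: first drop the elements whose text contains the unwanted word, then strip whatever tags are left. clean_text is a hypothetical helper name, and the sample row is invented for illustration (it is not taken from the linked page). Replacing each remaining tag with a space, rather than with nothing, is my own assumption so that neighbouring cells do not run together.

```perl
use strict;
use warnings;

# Hypothetical helper: filter out unwanted elements, then reduce to plain text.
sub clean_text {
    my ($html, $word) = @_;
    # Step 1: remove <x ...>text</x> pairs whose text contains the word
    # (assumes no tags are nested inside the element being removed).
    $html =~ s/<[^>]*>[^<]*\Q$word\E[^<]*<[^>]*>//g;
    # Step 2: replace every remaining tag with a space.
    $html =~ s/<[^>]*>/ /g;
    # Tidy up: collapse runs of whitespace and trim both ends.
    $html =~ s/\s+/ /g;
    $html =~ s/^\s+|\s+$//g;
    return $html;
}

# Invented sample row, for illustration only.
my $row = '<tr><td>Acme Ltd</td><td>Sponsored link</td><td>01234 567890</td></tr>';
print clean_text($row, 'Sponsored'), "\n";
```

The order matters: once the tags have been stripped, there is nothing left to mark where the unwanted text starts and ends.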

duncdude · May 28, 2004

Code:

#!/usr/bin/perl

@strings = ('blah blah blah <tag option="value">some text with "not this line" in it</tag> blah blah blah',
            'blah blah blah <tag option="value">some text with "word" in it</tag> blah blah blah',
            'blah blah blah <tag option="value">some text with "not this line either" in it</tag> blah blah blah',
            'blah blah blah <tag option="value">some text with "another word here" in it</tag> blah blah blah',
            'blah blah blah <tag option="value">some text with "ord finally not me" in it</tag> blah blah blah');

$catch = "word";

foreach $string (@strings) {

  $string =~ s/<[^>]*>[^<]*$catch[^<]*<[^>]*>//g;
  print "$string\n";

}

Kind Regards
Duncan
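
One possible refinement, offered as a sketch rather than a correction: as written, $catch is matched as a substring and interpolated as a regex, so "word" would also hit "swordfish", and a value like "a.b" would be treated as a pattern. Wrapping it in \b...\b and \Q...\E addresses both, assuming $catch begins and ends with a word character. remove_if_whole_word is a hypothetical helper name.

```perl
use strict;
use warnings;

# Hypothetical helper: same shape as the substitution above, but the catch
# word must appear as a whole word and is matched literally.
sub remove_if_whole_word {
    my ($string, $catch) = @_;
    $string =~ s/<[^>]*>[^<]*\b\Q$catch\E\b[^<]*<[^>]*>//g;
    return $string;
}

# Whole word present: the element is removed.
print remove_if_whole_word('a <tag>some "word" here</tag> b', 'word'), "\n";

# Only a substring ("swordfish"): the element is left alone.
print remove_if_whole_word('a <tag>a swordfish here</tag> b', 'word'), "\n";
```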
