Tek-Tips is the largest IT community on the Internet today!

RegExp to remove entire HTML tag

Status
Not open for further replies.

SnaveBelac

Programmer
Oct 21, 2003
89
GB
Can someone help me construct a regexp to remove an entire HTML tag (no matter what the tag) if the tag contains a specific word?

For example

<tag option="value">some text with "word" in it</tag>

So, as this tag contains my "word" - remove the whole tag from <tag.. to ../tag>.

Any suggestions?

Thanks in advance

----------------------------
SnaveBelac - Adventurer
----------------------------
 

$html = qq{
blah blah blah blah < tag sagsdgasdg WORD asdf adgasdgadg /tag > blah
this that this that this that< tag checkechkechke name=blah value=1 type=hidden sagsdgasdg WORD asdf adgasdgadg /tag >
after after after
};

# Non-greedy .*? so that two tags on the same line are removed separately
$html =~ s/<\s*tag\b.*?WORD.*?\/tag\s*>//g;

print $html;

That should remove the whole tag as many times as it appears in the $html variable.
 
But the word isn't inside the tag itself, it's in the content between the opening and closing tags. So if $word is the word to find and $html is the HTML string:

$html =~ s|<(\w+)\b[^>]*>[^<]*\Q$word\E[^<]*</\1>||g;

should do it.
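
A minimal sketch of the backreference idea above, assuming the element has no nested tags. The sub name remove_element_with_word is hypothetical. Only the tag name is captured (\w+), so that </\1> can match the closing tag even when the opening tag carries attributes, and \Q...\E is added on the assumption that the word should be matched literally.

```perl
use strict;
use warnings;

# Hypothetical helper: remove any element whose text content contains $word.
# Assumes no nested tags inside the element (content is matched with [^<]*).
sub remove_element_with_word {
    my ($html, $word) = @_;
    # (\w+) captures the tag name only; [^>]* skips any attributes.
    $html =~ s{<(\w+)\b[^>]*>[^<]*\Q$word\E[^<]*</\1>}{}g;
    return $html;
}

my $html = 'before <tag option="value">some text with "word" in it</tag> after';
print remove_element_with_word($html, 'word'), "\n";
```

Elements that do not contain the word are left untouched, since [^<]* cannot run past the element's own closing tag.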

Domster
 
I do not think it will work with this example:
<tag option="value">some <b>text</b> with "word" in it</tag>


--------------------

Denis
 
True, I was assuming no intervening embedded tags. If there may be any, swap each [^<]* for .*? (adding the /s modifier if the content can span several lines).
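
A rough sketch of that non-greedy variant, run against Denis's example. The sub name remove_element_nongreedy is hypothetical, and capturing only the tag name with (\w+) is an assumption made so the closing tag can be matched when the opening tag has attributes.

```perl
use strict;
use warnings;

# Hypothetical helper: like the [^<]* version, but .*? lets the content
# contain embedded tags, and /s lets . match newlines as well.
sub remove_element_nongreedy {
    my ($html, $word) = @_;
    $html =~ s{<(\w+)\b[^>]*>.*?\Q$word\E.*?</\1>}{}gs;
    return $html;
}

my $html = qq{<tag option="value">some <b>text</b>\nwith "word" in it</tag> kept};
print remove_element_nongreedy($html, 'word'), "\n";
```

One known weakness of this approach: .*? is free to run past an earlier element that does not contain the word. For instance, <p>one</p><p>word</p> is removed in full, because the match starts at the first <p> and only ends at the second </p>.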

Dom
 
I don't know if there's a really good way to do this. Denis is right, there are a number of examples that the previous solutions would not work on. I'm sure there are a number of situations this won't work on either, but it seems to work reasonably well.

Code:
my $input = qq{<div><b><p>Beginning text<tag option="value">some <b>text</b> with "word" in it</tag>\nAnd here's some after</p><b>Blah Blah Blah</b><sometag option="value">Here's the \n"word" again</sometag>Here's the last bit</b></div>};
my $target = 'word';

print "Before:\n$input\n\n";
while($input =~ s/(.+)<(\w+)(?!.*?\2.*?$target).*?$target.*?\2>/$1/isg) {}
print "After:\n$input\n";
 
SnaveBelac was asking this question on my behalf. I do not need to worry about intervening embedded tags, just so long as it takes out the entire tag surrounding the string, although it would probably be useful to handle both cases in the future. I basically use:
Code:
$string =~ s/<[^>]*>//g;
to remove all the tags, and so I need to keep the other tags intact for this to work. The actual text I'm regexing is mainly in tables, and although I've been recommended to use HTML::TableParser, I have no idea how to use this module when a list uses two table types. For example:


I simply need to extract Names, Telephone numbers and Addresses and filter out all the other junk.
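
The two steps described above could be combined roughly like this. This is only a sketch: the sub name strip_junk and the sample row are made up for illustration, and it assumes the unwanted elements contain no nested tags.

```perl
use strict;
use warnings;

# Hypothetical helper: first drop every element whose text contains $word,
# then strip all remaining tags, leaving only the text.
sub strip_junk {
    my ($string, $word) = @_;
    # Step 1: remove whole elements containing the word (no nested tags).
    $string =~ s{<(\w+)\b[^>]*>[^<]*\Q$word\E[^<]*</\1>}{}g;
    # Step 2: remove all remaining tags, as in the one-liner above.
    $string =~ s/<[^>]*>//g;
    return $string;
}

# Made-up sample row; real input would come from the page being processed.
my $row = '<td>John Smith</td> <td>junk word here</td> <td>01234 567890</td>';
print strip_junk($row, 'word'), "\n";
```

The order matters: the tags have to stay in place until the unwanted elements are gone, otherwise there is nothing left to mark where the junk starts and ends.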
 
Code:
#!/usr/bin/perl
use strict;
use warnings;

my @strings = ('blah blah blah <tag option="value">some text with "not this line" in it</tag> blah blah blah',
               'blah blah blah <tag option="value">some text with "word" in it</tag> blah blah blah',
               'blah blah blah <tag option="value">some text with "not this line either" in it</tag> blah blah blah',
               'blah blah blah <tag option="value">some text with "another word here" in it</tag> blah blah blah',
               'blah blah blah <tag option="value">some text with "ord finally not me" in it</tag> blah blah blah');

my $catch = "word";

foreach my $string (@strings) {

  # Remove a tag, tag-free text containing $catch, and the tag that follows it.
  # \Q...\E matches $catch literally and avoids any ambiguity with an array subscript.
  $string =~ s/<[^>]*>[^<]*\Q$catch\E[^<]*<[^>]*>//g;
  print "$string\n";

}


Kind Regards
Duncan
 