Removing HTML tags and the code in between

glimbeek · Sep 9, 2010

Hi,

I have the following HTML code:

Code:

A line of text, which is of any length. Ohw yes it is!<br /><br />
<h2>And a header</h2>
<img src="image.jpg>With an image and a <a href="/">link</a>. With some more text.

After "cleaning" the code, I want to end up with the following:

Code:

A line of text, which is of any length. Ohw yes it is! With an image and a <a href="/">link</a>. With some more text.

In other words, I want to remove all the HTML tags except the link, and the header should go together with its text.

I searched on Google, this forum and others. I tried:

regular expressions (http://en.wikipedia.org/wiki/Regular_expression) combined with the preg_ functions,
Simple HTML DOM (http://simplehtmldom.sourceforge.net/),
strip_tags (http://php.net/manual/en/function.strip-tags.php).

Nothing seemed to work.

In the end I came up with the following:

Code:

$string        = $this->item->fulltext;
$search        = array('/\<(.+)>(.*)>/si'); //Strip out HTML tags
$string        = preg_replace($search, '', $string);

I do believe that a regular expression should be the way to go, but I'm struggling to come up with one that covers everything I need.

Any help would be greatly appreciated.

With kind regards,
George

jpadie · Sep 9, 2010

i have not tested this at all but try the following:

Code:

$pattern = '/(<.*?>)/imsue';
$text = <<<TEXT
A line of text, which is of any length. Ohw yes it is!<br /><br />
<h2>And a header</h2>
<img src="image.jpg>With an image and a <a href="/">link</a>. With some more text.
TEXT;
$text = preg_replace($pattern, "_replace('\\1')", $text);
echo $text;
function _replace($text){
  if (preg_match('/^<\/?a(\s|>)/imsu', $text)):
      return $text;
  else:
      return '';
  endif;
}

glimbeek · Sep 9, 2010

Hi jpadie,

Thank you for the reply.

I will give it a go (without using a function). What do the imsue flags do?

jpadie · Sep 9, 2010

you need the function in there.

flags:

i = case insensitivity
m = multi-line (^ and $ also match at line-breaks)
s = dot-all (causes a dot to match a new-line character)
u = treat pattern and subject as utf-8
e = evaluate the replacement as php code (this is why you need the function)
manual reference
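
A small sketch of what the i, m, s and u modifiers change, using made-up sample strings that are not from the thread. Each pattern is run once without and once with the modifier. The e modifier is left out on purpose: it was deprecated in PHP 5.5 and removed in PHP 7.0.

```php
<?php
// Each pair runs the same pattern without and with one modifier.
// The subjects are illustrative samples, not data from the thread.
$checks = array(
    // i: <H2> only matches <h2> when case is ignored.
    'i' => array(preg_match('/<H2>/', '<h2>'), preg_match('/<H2>/i', '<h2>')),
    // m: ^ and $ only match around the middle line in multi-line mode.
    'm' => array(preg_match('/^b$/', "a\nb\nc"), preg_match('/^b$/m', "a\nb\nc")),
    // s: the dot only crosses the newline in dot-all mode.
    's' => array(preg_match('/<h2>.*<\/h2>/', "<h2>a\nb</h2>"), preg_match('/<h2>.*<\/h2>/s', "<h2>a\nb</h2>")),
    // u: a two-byte UTF-8 character only counts as one "." in UTF-8 mode.
    'u' => array(preg_match('/^.$/', "\xC3\xA9"), preg_match('/^.$/u', "\xC3\xA9")),
);

foreach ($checks as $flag => $pair) {
    echo $flag, ': without=', $pair[0], ' with=', $pair[1], "\n";
}
```

Every line should report without=0 and with=1.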

glimbeek · Sep 9, 2010

Thank you for linking the manual, I could not find that.
Is there a way to do it without a function? The PHP file gets called every time an article is displayed, so the second time around it will try to declare the function again and fail. Or can I use something like declare_once function ()?

jpadie · Sep 9, 2010

it would be better to put the function in a library file. however, this will also work:

Code:

if (!function_exists('_replace')):
function _replace($text){
  if (preg_match('/^<\/?a(\s|>)/imsu', $text)):
      return $text;
  else:
      return '';
  endif;
}
endif;
$pattern = '/(<.*?>)/imsue';
$text = <<<TEXT
A line of text, which is of any length. Ohw yes it is!<br /><br />
<h2>And a header</h2>
<img src="image.jpg>With an image and a <a href="/">link</a>. With some more text.
TEXT;
$text = preg_replace($pattern, "_replace('\\1')", $text);
echo $text;

note that the conditional and function declaration must come before the preg_replace.
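
Another way to sidestep the redeclaration problem, as a sketch only: preg_replace_callback with an anonymous function (PHP 5.3 or later), so no named function is declared at all. It also avoids the e modifier, which newer PHP versions no longer support. The closing quote on the img src is added here on the assumption that the sample markup was meant to be well formed.

```php
<?php
// Sketch: keep <a ...> and </a>, drop every other tag. Text between tags stays.
$text = <<<TEXT
A line of text, which is of any length. Ohw yes it is!<br /><br />
<h2>And a header</h2>
<img src="image.jpg">With an image and a <a href="/">link</a>. With some more text.
TEXT;

$result = preg_replace_callback(
    '/<.*?>/isu',
    function (array $match) {
        // $match[0] holds the complete tag that was matched.
        return preg_match('/^<\/?a(\s|>)/i', $match[0]) ? $match[0] : '';
    },
    $text
);

echo $result;
```

Because the callback is an anonymous function, including the file several times does not trigger a "cannot redeclare" error.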

glimbeek · Sep 9, 2010

Thinking about it...
This would also remove tags like <p>, <strong>, <b>, <ul> etc., right?
Would it be possible to pass the unwanted tags to the function, something like:
function _replace($text, $tags)

with an array holding all the tags I do not want:
$tags = array('<h2>', '<h3>', '<img');

Would I then need to run through the array one element at a time, or is preg_match/preg_replace "smart" enough to check for all the tags in the array?
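
One possible answer, as a sketch: join the tag names into a single alternation so that one preg_replace call covers the whole array. The helper name stripListedTags and the sample string are made up for illustration. Note that this removes only the tags themselves, and the text between them stays.

```php
<?php
// Hypothetical helper: strips the listed tags (opening and closing forms)
// but keeps whatever text sits between them.
function stripListedTags($html, array $tagNames)
{
    // Escape each name and join them into an alternation, e.g. "h2|h3|img".
    $alternation = implode('|', array_map(function ($name) {
        return preg_quote($name, '/');
    }, $tagNames));

    // \b stops "h2" from also matching a longer name that merely starts with "h2".
    $pattern = '/<\/?(?:' . $alternation . ')\b[^>]*>/i';

    return preg_replace($pattern, '', $html);
}

$stripped = stripListedTags(
    '<h2>Header</h2><p>Text <img src="x.png"> here</p>',
    array('h2', 'h3', 'img')
);
echo $stripped;
```

The tag names are passed without angle brackets, since the pattern adds those itself.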

Vragabond · Sep 9, 2010

Sorry if this might sound daft, but isn't strip_tags() designed to do exactly that? I cannot test it right now, but I would expect that:

Code:

$text = <<<TEXT
A line of text, which is of any length. Ohw yes it is!<br /><br />
<h2>And a header</h2>
<img src="image.jpg>With an image and a <a href="/">link</a>. With some more text.
TEXT;
echo strip_tags ($text,'<a>');

would produce the expected result. Can you explain why this would not be the case?

Thanks.


glimbeek · Sep 9, 2010

Hi Vragabond,

If I understand it correctly, strip_tags removes the tags but not the text between the tags.
For instance:
<h2>And a header</h2>
will be returned as "And a header". But I want it removed completely.
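
A quick check of that behaviour, using a shortened sample string that is made up for illustration:

```php
<?php
// strip_tags drops the tags that are not in the allow list, but keeps their text.
$html = '<h2>And a header</h2><p>Some <a href="/">link</a></p>';
$stripped = strip_tags($html, '<a>');

echo $stripped;
// prints: And a headerSome <a href="/">link</a>
```

The header text is still there. Only the h2 and p tags themselves are gone.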

jpadie · Sep 9, 2010

@glimbeek
in fact, the opposite is true. i had understood you wanted to keep the text between the tags. i misread. strip_tags should work

glimbeek · Sep 9, 2010

Now I'm confused on what I should use....

Depending on the tag I do NOT want to keep the text between the tags.

glimbeek · Sep 9, 2010

Not sure this is 100% foolproof, but I managed to come up with the following:

Code:

$string = $this->item->fulltext;
$text = preg_replace('|\<h2.*\>(.*\n*)\</h2\>|isU', '', $string);
$text = strip_tags($text, '<b><i><em><strong><a>');

This removes the h2, including its text, from the string, and then it removes any other tags except the ones I want to keep. The next step would be to also have it remove h3, h4 etc.
Can I put those in an array and pass that array to preg_replace?
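
A sketch of the array idea: preg_replace accepts an array of patterns, so one pattern per tag name can be generated in a loop. The helper name removeTagsWithText and the sample string are assumptions made for illustration, and the patterns are a slightly tightened variant of the h2 one, not the exact original.

```php
<?php
// Hypothetical helper: removes each listed element together with the text inside it.
function removeTagsWithText($html, array $tagNames)
{
    $patterns = array();
    foreach ($tagNames as $tagName) {
        $name = preg_quote($tagName, '~');
        // <tag ...> ... </tag>, non-greedy so each element is matched on its own.
        $patterns[] = '~<' . $name . '\b[^>]*>.*?</' . $name . '\s*>~is';
    }

    // preg_replace applies every pattern in the array, one after the other.
    return preg_replace($patterns, '', $html);
}

$text = removeTagsWithText(
    "<h2>One</h2>Keep <b>this</b><h3 class=\"x\">Two\nlines</h3>!",
    array('h2', 'h3')
);
$text = strip_tags($text, '<b><i><em><strong><a>');

echo $text;
// prints: Keep <b>this</b>!
```

Adding h4 then only means adding one more entry to the array.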

jpadie · Sep 9, 2010

"Depending on the tag I do NOT want to keep the text between the tags."

then strip_tags is designed for you. there is no need to use preg_replace. if you are having trouble then it is probably due to malformed html, which is very difficult to fix dynamically.

glimbeek · Sep 9, 2010

If I just used

Code:

$text = strip_tags($text, '<b><i><em><strong><a>');

It would remove the H2 tags, but it would still display the content between them. In other words:
<h2>The header</h2>
would be displayed as:
The header

Which shouldn't happen?

jpadie · Sep 9, 2010

well. you learn something every day. I thought (or at least I think I thought) that strip_tags removed the text nodes too.

ok. so you want a solution that removes the text nodes as well as the tags. that is a rather different beast.

for example in this code

Code:

<div>I am some text with a <a href="">link</a></div>

If I take you at face value then you are left with an empty string. Is this what you intend?

I see that the only probable solution will be a full parsing of the dom tree.

jpadie · Sep 9, 2010

and even then, this will only work if the x(ht)ml is perfectly formed.

perhaps if you gave us more background as to where the data is coming from and what form it takes, we might be able to assist more accurately?
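
For reference, a rough sketch of what the DOM route could look like with PHP's DOMDocument. The helper name removeElementsWithContent and the sample string are made up for illustration. Character-encoding handling for non-ASCII input is left out, and badly malformed markup may still give surprising results.

```php
<?php
// Hypothetical helper (not from the thread): removes the listed elements
// together with everything inside them. 'div' must not be listed, because
// this sketch uses a wrapper div to read the fragment back.
function removeElementsWithContent($html, array $tagNames)
{
    $doc = new DOMDocument();
    $previous = libxml_use_internal_errors(true); // collect parser warnings quietly
    $doc->loadHTML('<div>' . $html . '</div>');
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $wrapper = $doc->getElementsByTagName('div')->item(0);

    foreach ($tagNames as $tagName) {
        // The node list is live, so copy it before removing anything.
        $nodes = iterator_to_array($doc->getElementsByTagName($tagName));
        foreach ($nodes as $node) {
            if ($node->parentNode !== null) {
                $node->parentNode->removeChild($node);
            }
        }
    }

    // Serialise only the children of the wrapper, not the html/body scaffolding.
    $out = '';
    foreach ($wrapper->childNodes as $child) {
        $out .= $doc->saveHTML($child);
    }
    return $out;
}

$html = 'Intro <h2>And a header</h2>text with a <a href="/">link</a>.';
$clean = strip_tags(removeElementsWithContent($html, array('h2', 'h3')), '<b><i><em><strong><a>');
echo $clean;
```

The headers go together with their text, and strip_tags then takes care of the remaining tags that should lose only their markup.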

glimbeek · Sep 13, 2010

Well, because it's tag specific and I want to keep more than I want to remove, I figured this should do the trick:

Code:

$string = $this->item->fulltext;    
$text = preg_replace('|\<h2.*\>(.*\n*)\</h2\>|isU', '', $string);
$text = strip_tags($text, '<b><i><em><strong><a>');

I "just" need to enhance this so it can easily remove more tags and the text nodes in those tags. For instance h3, h4 and <img.

glimbeek · Sep 14, 2010

**EDIT** (where's the edit button?)

The <img is removed by strip_tags...
and for the h3 regex I can just copy the h2 one, so I end up with:

Code:

$string = $this->item->fulltext;  
$text = preg_replace('|\<h2.*\>(.*\n*)\</h2\>|isU', '', $string);
$text = preg_replace('|\<h3.*\>(.*\n*)\</h3\>|isU', '', $text);
$text = strip_tags($text, '<b><i><em><strong><a>');

This should be ready to go by the looks of it.

jpadie · Sep 14, 2010

I suspect those regexes will not work the way that you wish them to.
However I may be wrong so good luck anyway !

glimbeek · Sep 14, 2010

What might not work then?
I just want them to remove the h2 and the h3 tag and the text nodes within them.
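
One way to find out is to run the two steps against a couple of inputs. The sketch below wraps them in a hypothetical helper, cleanArticle, and tries the sample from the first post (with the closing quote on the img src added, assuming it was a typo) plus a made-up header that is never closed. It does not claim to cover every case where the patterns could misbehave.

```php
<?php
// The same steps as in the post above, wrapped in a hypothetical helper.
function cleanArticle($string)
{
    $text = preg_replace('|\<h2.*\>(.*\n*)\</h2\>|isU', '', $string);
    $text = preg_replace('|\<h3.*\>(.*\n*)\</h3\>|isU', '', $text);
    return strip_tags($text, '<b><i><em><strong><a>');
}

// Well-formed input: the header and its text are removed, the link is kept.
$sample = <<<TEXT
A line of text, which is of any length. Ohw yes it is!<br /><br />
<h2>And a header</h2>
<img src="image.jpg">With an image and a <a href="/">link</a>. With some more text.
TEXT;
$cleanSample = cleanArticle($sample);

// Header that is never closed: the regex finds no match, so the header
// text survives and only the tags are stripped.
$cleanBroken = cleanArticle('<h2>Broken header<p>Body text</p>');

var_dump($cleanSample, $cleanBroken);
```

With well-formed markup the header disappears as intended. With the unclosed header the second result is "Broken headerBody text", which is the kind of malformed-HTML case mentioned earlier in the thread.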
