Tek-Tips is the largest IT community on the Internet today!

Removing HTML tags and the code in between

Status
Not open for further replies.

glimbeek

Programmer
Nov 30, 2009
83
NL
Hi,

I have the following HTML code:
Code:
A line of text, which is of any length. Ohw yes it is!<br /><br />
<h2>And a header</h2>
<img src="image.jpg>With an image and a <a href="/">link</a>. With some more text.

After "cleaning" the code, I want to end up with the following:
Code:
A line of text, which is of any length. Ohw yes it is! With an image and a <a href="/">link</a>. With some more text.

In other words, I want to remove all the HTML tags except the link (and, as the expected output shows, the header text as well).

I searched on Google, this forum and others.
I tried regular expressions combined with preg_
I tried other things as well. Nothing seemed to work.

In the end I ended up with the following:
Code:
$string        = $this->item->fulltext;
$search        = array('/\<(.+)>(.*)>/si'); //Strip out HTML tags
$string        = preg_replace($search, '', $string);

I do believe that using a regular expression should be the way to go, but I'm struggling to come up with one that covers everything I need.

Any help would be greatly appreciated.

With kind regards,
George
 
i have not tested this at all but try the following:

Code:
$pattern = '/(<.*?>)/imsue';
$text = <<<TEXT
A line of text, which is of any length. Ohw yes it is!<br /><br />
<h2>And a header</h2>
<img src="image.jpg>With an image and a <a href="/">link</a>. With some more text.
TEXT;
$text = preg_replace($pattern, "_replace('\\1')", $text);
echo $text;
function _replace($text){
  if (preg_match('/^<\/?a(\s|>)/imsu', $text)):
      return $text;
  else:
      return '';
  endif;
}
 
Hi jpadie,

Thank you for the reply.

I will give it a go (without using a function). What do the imsue flags do?
 
you need the function in there.

flags:

i = case insensitivity
m = multi-line (^ and $ also match at line-breaks inside the string)
s = dot-all (causes a dot to match a new-line character)
u = treat pattern and subject as utf-8
e = evaluate the replacement string as php code after substitution (this is why you need the function)
manual reference
 
Thank you for linking the manual, I could not find that.
Is there a way to do it without a function? The PHP file gets called every time an article is displayed, so the second time around it will try to declare the function again and fail. Or can I use something like a declare_once function()?

 
it would be better to put the function in a library file. however this will also work
Code:
if (!function_exists('_replace')):
function _replace($text){
  if (preg_match('/^<\/?a(\s|>)/imsu', $text)):
      return $text;
  else:
      return '';
  endif;
}
endif;
$pattern = '/(<.*?>)/imsue';
$text = <<<TEXT
A line of text, which is of any length. Ohw yes it is!<br /><br />
<h2>And a header</h2>
<img src="image.jpg>With an image and a <a href="/">link</a>. With some more text.
TEXT;
$text = preg_replace($pattern, "_replace('\\1')", $text);
echo $text;

note that the conditional and function declaration must come before the preg_replace.
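
For readers on newer PHP versions: the e modifier was deprecated in PHP 5.5 and removed in PHP 7.0, so the code above will not run there. A minimal sketch of the same idea using preg_replace_callback with an anonymous function (PHP 5.3+) follows. Because nothing is declared by name, the redeclaration problem does not arise either. The variable name $keepLinks and the sample string are made up for illustration.

```php
<?php
// Hypothetical helper: keep <a ...> and </a>, drop every other tag.
// $m[0] holds the full text of the tag that was matched.
$keepLinks = function (array $m) {
    return preg_match('/^<\/?a(\s|>)/i', $m[0]) ? $m[0] : '';
};

// Illustrative input, not the original article text.
$text = 'Some <b>bold</b> text with a <a href="/">link</a>.';

// Same tag pattern as above, minus the e flag.
$text = preg_replace_callback('/<.*?>/s', $keepLinks, $text);

echo $text;
```

Like the original, this only removes the tags themselves; the text between them is left in place.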
 
Thinking about it...
This would also remove tags like <p> and <strong> or <b>, <ul> etc.. right?
Would it be possible to create something like:
function _replace($text,$tags)

Create an array with all the tags I do not want:
$tags = array('<h2>', '<h3>', '<img');
And then pass it to the function:
function _replace($text,$tags)

Would I then need to run through the array one element at a time, or is preg_match/preg_replace "smart" enough to check for all the tags in the array?
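
One way to approach the question above, as a rough sketch: the regex functions do not inspect an array of tag names by themselves, but the array can be joined into a single alternation so one pattern covers all of them. The function name strip_listed_tags is hypothetical, and the tag names are given without angle brackets, which is an assumption of this sketch.

```php
<?php
// Hypothetical helper: remove the opening and closing tags named in $tags.
// Only the tags are removed here; the text between them stays.
function strip_listed_tags($html, array $tags)
{
    // Escape each name and join them: array('h2', 'p') becomes 'h2|p'.
    $names = implode('|', array_map(function ($t) {
        return preg_quote($t, '#');
    }, $tags));

    // \b stops 'p' from also matching '<pre>'.
    return preg_replace('#</?(?:' . $names . ')\b[^>]*>#i', '', $html);
}

echo strip_listed_tags(
    '<h2>Head</h2><p>Text <a href="/">link</a></p>',
    array('h2', 'p')
);
```

preg_replace also accepts an array of patterns as its first argument, so building one pattern per tag is another option.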
 
Sorry if this might sound daft, but isn't strip_tags() designed to do exactly that? I cannot test it right now, but I would expect that:
Code:
$text = <<<TEXT
A line of text, which is of any length. Ohw yes it is!<br /><br />
<h2>And a header</h2>
<img src="image.jpg>With an image and a <a href="/">link</a>. With some more text.
TEXT;
echo strip_tags ($text,'<a>');
would produce the expected result. Can you explain why this would not be the case?

Thanks.


 
Hi Vragabond,

If I understand it correctly, strip_tags removes the tags but not the text between the tags.
For instance:
<h2>And a header</h2>
Will be returned as And a header. But I want it removed completely.
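
A small check of that behaviour, using a well-formed sample string made up for illustration:

```php
<?php
// strip_tags() drops the tags it is not told to keep,
// but the text that was between those tags stays in the result.
$html = '<h2>And a header</h2><p>With a <a href="/">link</a>.</p>';

$result = strip_tags($html, '<a>');

echo $result; // And a headerWith a <a href="/">link</a>.
```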
 
@glimbeek
in fact, the opposite is true. i had understood you wanted to keep the text between the tags. i misread. strip_tags should work
 
Now I'm confused on what I should use....

Depending on the tag I do NOT want to keep the text between the tags.
 
Not sure this is 100% foolproof, but I managed to come up with the following:

Code:
$string = $this->item->fulltext;
$text = preg_replace('|\<h2.*\>(.*\n*)\</h2\>|isU', '', $string);
$text = strip_tags($text, '<b><i><em><strong><a>');

This removes the H2 and its contents from the string, and then it removes any other tags except the ones I want to keep. The next step would be to also have it remove h3, h4, etc.
Can I put those in an array and pass that array to preg_replace?
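
On the array question, one possible sketch: build a single pattern from an array of tag names and use a backreference so the closing tag has to match the opening one. The function name remove_elements and the sample input are hypothetical. It assumes the listed elements are not nested inside elements of the same name, which a regex cannot handle reliably.

```php
<?php
// Hypothetical helper: remove each listed element together with its contents.
function remove_elements($html, array $tags)
{
    $names = implode('|', array_map(function ($t) {
        return preg_quote($t, '#');
    }, $tags));

    // (h2|h3|h4) is captured as \1 and reused in the closing tag.
    // .*? is lazy and, with the s flag, also spans line breaks.
    return preg_replace('#<(' . $names . ')\b[^>]*>.*?</\1\s*>#is', '', $html);
}

// Illustrative input, not the original article text.
$string = "Intro<br /><h2 class=\"x\">And a header</h2>\n"
        . "<h3>Sub</h3>Some <a href=\"/\">link</a> and <strong>bold</strong>.";

$text = remove_elements($string, array('h2', 'h3', 'h4'));
$text = strip_tags($text, '<b><i><em><strong><a>');

echo $text;
```

Adding another element to remove is then a matter of adding its name to the array.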
 
Depending on the tag I do NOT want to keep the text between the tags.

then strip_tags is designed for you. there is no need to use preg_replace. if you are having trouble then it is probably due to malformed html. which is very difficult to fix dynamically.

 
If I just used
Code:
$text = strip_tags($text, '<b><i><em><strong><a>');

It would remove the H2 tags, but it would still display the content between them, i.e.:
<h2>The header</h2>
Would be displayed as:
The header

Which shouldn't happen?
 
well. you learn something every day. I thought (or at least I think i thought) that strip_tags removed the text nodes too.

ok. so you want a solution that removes the text nodes as well as the tags. that is a rather different beast.

for example in this code
Code:
<div>I am some text with a <a href="">link</a></div>

If I take you at face value then you are left with an empty string. is this what you intend?

I see that the only probable solution will be a full parsing of the dom tree.
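
A rough sketch of what such a DOM-based approach could look like with PHP's DOMDocument (requires the dom extension). The function name remove_elements_dom is hypothetical. Two assumptions: the input is a fragment rather than a full page, so it is wrapped in a div to keep loose text from being regrouped by the parser, and the input is plain ASCII, since loadHTML assumes ISO-8859-1 unless a charset is declared. libxml does try to recover from malformed markup, but the recovered tree may not be what the author of the markup intended.

```php
<?php
// Hypothetical helper: remove the listed elements and everything inside them.
function remove_elements_dom($html, array $tags)
{
    $doc = new DOMDocument();

    // Collect parser warnings quietly instead of emitting them.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML('<div>' . $html . '</div>');
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    foreach ($tags as $tag) {
        // The node list is live, so copy it before removing anything.
        $nodes = iterator_to_array($doc->getElementsByTagName($tag));
        foreach ($nodes as $node) {
            if ($node->parentNode !== null) {
                $node->parentNode->removeChild($node);
            }
        }
    }

    // Serialise only the children of the wrapper div.
    $wrapper = $doc->getElementsByTagName('div')->item(0);
    if ($wrapper === null) {
        return '';
    }
    $out = '';
    foreach ($wrapper->childNodes as $child) {
        $out .= $doc->saveHTML($child);
    }
    return $out;
}

$out = remove_elements_dom(
    '<h2>And a header</h2><p>Text with a <a href="/">link</a>.</p>',
    array('h2', 'h3')
);
echo strip_tags($out, '<a>');
```

The final strip_tags call then takes care of the tags whose text should stay.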
 
and even then, this will only work if the x(ht)ml is perfectly formed.

perhaps if you gave us more background as to where the data is coming from and what form it takes, we might be able to assist more accurately?
 
Well, because it's tag specific and I want to keep more than I want to remove, I figured this should do the trick:
Code:
$string = $this->item->fulltext;    
$text = preg_replace('|\<h2.*\>(.*\n*)\</h2\>|isU', '', $string);
$text = strip_tags($text, '<b><i><em><strong><a>');

I "just" need to enhance this so it can easily remove more tags and the text nodes in those tags. For instance h3, h4 and <img.
 
**EDIT** (where's the edit button?)

The <img is removed by the strip_tags...
and for the h3 regex I can just copy the h2 one, so I end up with:
Code:
$string = $this->item->fulltext;  
$text = preg_replace('|\<h2.*\>(.*\n*)\</h2\>|isU', '', $string);
$text = preg_replace('|\<h3.*\>(.*\n*)\</h3\>|isU', '', $text);
$text = strip_tags($text, '<b><i><em><strong><a>');

This should be ready to go by the looks of it.
 
I suspect those regexes will not work the way that you wish them to.
However I may be wrong so good luck anyway !
 
What might not work then?
I just want them to remove the h2 and the h3 tag and the text nodes within them.
 