Another regular expression issue 1

Sleidia · Jun 25, 2010

Hi,

I would like to remove any occurrence of HTML comments and their surrounding HTML tags if they exist.

For example :

Code:

<!-- some text here -->

Everything should be removed, from <!-- to --> inclusive.

But at the same time, in

Code:

<div id="any-id-here">
<!-- some text here -->
</div>

or in

Code:

<span id="any-id-here">
<!-- some text here -->
</span>

... everything should be removed, including the surrounding tags (spans and divs here) whether or not they have an ID.

Thanks a lot to anyone who will help!

jpadie · Jun 25, 2010

this should work

Code:

$pattern = '/(<(span|div).*?>[\n ])?<!--.*?-->([\n ]<\/\2>)?/';

Sleidia · Jun 25, 2010

Thanks jpadie

I will try your idea asap but is it possible to do the same without specifying the html tag (div|span)? I mean, is it possible to make it work for any html tag?

Thanks again

jpadie · Jun 26, 2010

it probably would work but you risk false positives.
it would be better to specify the type of tag in the regex

Code:

$tags = array(   //comment these out line by line if you don't want to capture them
'div',
'span',
'td',
'li',
'h1',
'h2',
'h3',
'h4',
'h5',
'h6',
'legend',
'blockquote',
'a',
'caption',
'cite',
'code',
'sub',
'sup',
'small',
'strong',
'em'
);
$pattern = '/(<(' . implode ('|', $tags) .').*?>[\n ]?)?<!--.*?-->([\n ]?<\/\2>)?/';

note that this pattern does not remove the surrounding tag when the comment shares that tag with anything else. so

Code:

<div id="something"><span>some text</span><!-- a comment --></div>

will not match the whole bit, but just the comment (and delete it).

likewise this

Code:

<div id="something"><span id="something2"><!--comment--></span></div>

will only match the inside <span> and not the outer <div>
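For what it's worth, the tag-agnostic version asked about above could be sketched roughly as follows. This is an illustration rather than a tested drop-in: the helper name strip_wrapped_comments() is made up for the example, and it assumes attribute values never contain a > character. It uses two patterns, one for a comment that is the only content of an element and one for any comment left over.

```php
<?php
// Hypothetical helper (not from the thread): removes a comment together with
// ANY element that wraps it and nothing else, then removes leftover comments.
function strip_wrapped_comments($html)
{
    $patterns = array(
        // <tag ...>, optional whitespace, ONE comment, optional whitespace, </tag>.
        // (?:(?!-->).)* stops at the first "-->" so the match cannot run on
        // into a later comment; \1 must repeat the captured tag name.
        '/<([a-z][a-z0-9]*)\b[^>]*>\s*<!--(?:(?!-->).)*-->\s*<\/\1>/is',
        // Any comment that is still left is removed on its own.
        '/<!--.*?-->/s',
    );
    // preg_replace applies the patterns in array order.
    return preg_replace($patterns, '', $html);
}

var_dump(strip_wrapped_comments("<div id=\"any-id-here\">\n<!-- some text here -->\n</div>"));
var_dump(strip_wrapped_comments('<div id="something"><span>some text</span><!-- a comment --></div>'));
```

The false-positive risk is real: with no tag list, a <script> or <style> element whose body is wrapped in a comment (an old habit for hiding code from ancient browsers) would be deleted as well, and so would an empty <td> whose removal shifts the rest of a table row.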

tsuji · Jun 26, 2010

The problem could have a robust solution along the lines of using DOM with the loading methods loadXML and/or loadHTML, provided the string is at least well-formed XML or HTML.

Using a single (one-liner) regex may not be possible in some cases. It becomes possible when the regex is combined with some additional algorithm, which effectively means "parsing" the string; in that case the parser is a custom one built with a specific purpose in mind. Another consideration when using regex is that one has to take into account ignorable whitespace that carries no semantic meaning. Take for instance this fragment of the pattern:
.*?>
A line break may be inserted in front of the > at the caprice of the author without adding or losing any "meaning" in the message (XML-wise). Freedoms of that kind have to be taken into account for a thorough use of regex to do the job. That can be taxing.

If the string is most probably well-formed, then to conceive a solution you have to define more precisely what kind of deletion you want to see happen, such as the ones described by jpadie. I can conceive of more, such as the comment being immediately under the root: would you want to delete it together with the root, resulting in an empty string?
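A rough sketch of the DOM route, using PHP's DOMDocument and DOMXPath, might look like the following. Several things here are assumptions rather than anything stated in the thread: the function names remove_comment_wrappers() and is_blank_node(), the temporary wrapper id, the decision to keep climbing and delete every ancestor that ends up blank, and the premise that the input is a fragment of body content in an encoding loadHTML handles by default. The wrapper element is one way to answer the "comment directly under the root" question: the wrapper itself is never deleted, so the result can legitimately be an empty string.

```php
<?php
// Sketch only. remove_comment_wrappers() and is_blank_node() are hypothetical
// names, and the temporary wrapper id is an arbitrary choice.

// True when a node contains nothing except whitespace-only text nodes.
function is_blank_node(DOMNode $node)
{
    foreach ($node->childNodes as $child) {
        $isWhitespaceText = $child->nodeType === XML_TEXT_NODE
            && trim($child->nodeValue) === '';
        if (!$isWhitespaceText) {
            return false;
        }
    }
    return true;
}

function remove_comment_wrappers($html)
{
    $previous = libxml_use_internal_errors(true);   // keep parser warnings quiet
    $doc = new DOMDocument();
    // A wrapper gives the fragment one known root that is never deleted.
    $doc->loadHTML('<div id="tmp-root">' . $html . '</div>');
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    $root  = $xpath->query('//div[@id="tmp-root"]')->item(0);

    // Copy the node list first: removing nodes while walking a live list is unsafe.
    $parents = array();
    foreach (iterator_to_array($xpath->query('.//comment()', $root)) as $comment) {
        $parents[] = $comment->parentNode;
        $comment->parentNode->removeChild($comment);
    }

    // Climb from each former parent, deleting elements that are now blank.
    foreach ($parents as $node) {
        while ($node->parentNode !== null
            && !$node->isSameNode($root)
            && is_blank_node($node)) {
            $up = $node->parentNode;
            $up->removeChild($node);
            $node = $up;
        }
    }

    // Serialise the wrapper's children, i.e. the cleaned fragment.
    $out = '';
    foreach ($root->childNodes as $child) {
        $out .= $doc->saveHTML($child);
    }
    return $out;
}
```

Because the parser has already dealt with line breaks inside tags and with attribute syntax, the whitespace freedoms mentioned above stop being a concern. The price is that the markup is re-serialised, so the output may not be byte-for-byte identical to the untouched parts of the input.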

Sleidia · Jun 27, 2010

Hi again,

jpadie, that's weird but the pattern you kindly offered doesn't seem to work on my side.

Here is the PHP code (I've split the pattern because it breaks the code highlighter in PSPad) :

Code:

$pattern  = "/(<(span|div|h1|h2|h3|p).*?";
$pattern .= ">[\n ])?<!--.*?-->([\n ]<\/\2>)?/";
	
$code_html = preg_replace($pattern, "", $code_html);

HTML code :

Code:

<h2 class="zone-page-title"><!--{block_page:1:zone-page-title}--></h2>

I get this as a result :

Code:

<h2 class="zone-page-title"></h2>

... instead of having an empty string as expected.

Also, I didn't understand what tsuji wrote (sorry tsuji), but I would like to know if the pattern is supposed to work when linebreaks are used in such a way :

Code:

<h2 class="zone-page-title">
<!--{block_page:1:zone-page-title}-->
</h2>

Thanks again for your great help.

jpadie · Jun 27, 2010

yup. because in the original sample you had a linebreak before the comment. in this example you did not.

this might work for either scenario

Code:

$pattern  = '/(<(span|div|h1|h2|h3|p).*?>\s*)?<!--.*?-->(\s*<\/\2>)?/ims';
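One more thing that may be worth checking, separate from the linebreak point: the quoting. In the split pattern posted above the string is double-quoted, and inside double quotes PHP reads \2 as an octal escape (a single byte with value 2) before the regex engine ever sees it, so the backreference to the tag name is lost and the closing tag cannot match. The small demonstration below uses a cut-down pattern made up for the example; only the quoting differs between the two variables.

```php
<?php
// Same characters typed twice; only the quote style differs.
$double = "/<(h2)>x<\/\2>/";   // PHP turns \2 into the byte 0x02 here
$single = '/<(h2)>x<\/\2>/';   // \2 reaches PCRE intact... but note it needs group 2

// Use group 1 so the single-quoted form is a valid backreference.
$double = "/<(h2)>x<\/\1>/";   // \1 becomes the byte 0x01
$single = '/<(h2)>x<\/\1>/';   // \1 is a backreference to "h2"

var_dump(preg_match($double, '<h2>x</h2>'));   // closing tag never matches
var_dump(preg_match($single, '<h2>x</h2>'));   // matches

// Doubling the backslash is the other way to keep it in double quotes.
var_dump("/<(h2)>x<\\/\\1>/" === $single);
```

Sequences such as \s and \/ are not escapes that PHP recognises, so they survive double quotes unchanged; it is only the digit backreferences (and things like \n or \t) that get rewritten. Single quotes avoid the question altogether.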

Sleidia · Jun 27, 2010

Thanks again jpadie, but the last pattern you gave deletes tags that should be left untouched. It seems to delete only the closing tags in that case.

As for the previous pattern, I didn't manage to make it work even with the original code sample, the one with the linebreaks.

The reason why I need this feature is because my custom cms can produce empty html elements when there is no content for them. But an empty element can take some vertical space if defined by CSS. That's why I need to get rid of them in the final HTML code.

jpadie · Jun 28, 2010

Sleidia

with this code

Code:

<?php
$text[]=<<<HTML
<div id="something"><!-- I am a comment --></div>
HTML;
$text[]=<<<HTML
<div id="something">  <!-- I am a comment -->   </div>
HTML;
$text[]=<<<HTML
<div id="something">
<!-- I am a comment -->
</div>
HTML;
$text[]=<<<HTML
<div id="something">
	<!-- I am a comment -->
	and I am some inner text
</div>
HTML;
$text[] = <<<HTML
<h2 class="zone-page-title"><!--{block_page:1:zone-page-title}--></h2>
HTML;

$tags = array(   //comment these out line by line if you don't want to capture them
'div',
'span',
'td',
'li',
'h1',
'h2',
'h3',
'h4',
'h5',
'h6',
'legend',
'blockquote',
'a',
'caption',
'cite',
'code',
'sub',
'sup',
'small',
'strong',
'em'
);
$pattern  = '/(<(' . implode ('|', $tags) . ').*?>\s*)?<!--.*?-->(\s*<\/\2>)?/ims';
echo '<pre>';
foreach ($text as $t){
	$_t = preg_replace($pattern, '', $t);
	echo htmlspecialchars($t) . '<br/>becomes' ;
	echo htmlspecialchars($_t) . '<br/>';
	echo '-----------------------------<br/>';
}
?>

I get the following results. there is an error with the fourth sample that i will try to fix (actually more difficult than it looks)

is this not what you were after?

Code:

<div id="something"><!-- I am a comment --></div>
becomes
-----------------------------
<div id="something">  <!-- I am a comment -->   </div>
becomes
-----------------------------
<div id="something">
<!-- I am a comment -->
</div>
becomes
-----------------------------
<div id="something">
	<!-- I am a comment -->
	and I am some inner text
</div>
becomes
	and I am some inner text
</div>
-----------------------------
<h2 class="zone-page-title"><!--{block_page:1:zone-page-title}--></h2>
becomes
-----------------------------

Sleidia · Jun 28, 2010

Hi jpadie

Thanks again for your kind efforts but it looks like I will have to find another method because a regular expression could easily break the html layout. Way too unpredictable for what I'm trying to achieve.

I'm saying this because, when I test your pattern on a whole html page instead of short code samples securely isolated in an array, it breaks the whole layout.

jpadie · Jun 28, 2010

that will be because of the issue with sample 4 above. this should be soluble in a single regex with a lookahead.

but here is a two regex solution that should work.
it will remove _all_ comments and then remove all empty tags.

Code:

<?php
$text[]=<<<HTML
<div id="something"><!-- I am a comment --></div>
HTML;
$text[]=<<<HTML
<div id="something">  <!-- I am a comment -->   </div>
HTML;
$text[]=<<<HTML
<div id="something">
<!-- I am a comment -->
</div>
HTML;
$text[]=<<<HTML
<div id="something">
	<!-- I am a comment -->
	and I am some inner text
</div>
HTML;
$text[] = <<<HTML
<h2 class="zone-page-title"><!--{block_page:1:zone-page-title}--></h2>
HTML;

$tags = array(   //comment these out line by line if you don't want to capture them
'div',
'span',
'td',
'li',
'h1',
'h2',
'h3',
'h4',
'h5',
'h6',
'legend',
'blockquote',
'a',
'caption',
'cite',
'code',
'sub',
'sup',
'small',
'strong',
'em'
);
$pattern[]  = '/\s*<!--.*?-->\s*/s';   // s flag so comments spanning several lines are removed too
$pattern[] =  '/<(' . implode ('|', $tags) . ').*?><\/\1>/ims';
echo '<pre>';
foreach ($text as $key=>$t){
	foreach ($pattern as $p):
		$t = preg_replace($p, '', $t);   // apply each pattern in turn
	endforeach;
	echo htmlspecialchars($text[$key]) . '<br/>becomes' ;
	echo htmlspecialchars($t) . '<br/>';
	echo '-----------------------------<br/>';
}
?>

Sleidia · Jun 28, 2010

Hi jpadie

Unfortunately, maybe I did something wrong but your code doesn't seem to work with nested tags.

So, I started a different approach, with header tags first (because non nested), and from now I will try to find a way to add the deletion of empty divs and spans as well :

Code:

// ---------------------------------------------------------------------------
//  remove unused code
// ---------------------------------------------------------------------------

// - ! - remove unused headers
$arr["temp"] = array("h1", "h2", "h3", "h4", "h5", "h6");

foreach ($arr["temp"] as $str["tag"]) {

	preg_match_all("#<" . $str["tag"] . "(.+?)</" . $str["tag"] . ">#is", $code_html, $arr["matches"]);

	foreach ($arr["matches"][0] as $str["found"]) {

		// - ! - if no content inside tag
		if (!ereg("[a-zA-Z0-9]", strip_tags($str["found"]))) {
			$code_html = str_replace($str["found"], "", $code_html);
		}

	}

}

// - ! - remove unused scripts
$code_html = preg_replace("#<!--{(.+?)}-->#is", "", $code_html);

jpadie · Jun 28, 2010

not sure what you mean by nested tags. my code first gets rid of all comments and then gets rid of all empty elements. you could keep running the code until no matches were found and that would iteratively remove all empty elements, if that were your wish.

If you provide a use case in which the code snip doesn't work it will be easier to understand what you are aiming at.

your code contains references to ereg functions, which are deprecated in favour of the preg functions. your code is also quite heavy on the processing load, as you are grabbing all h elements, stripping each element's tags sequentially, testing to see whether there is any content left after stripping and then applying a find/replace.
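The "keep running until no matches are found" idea can be sketched as below. The function name strip_empty_elements() is invented for the example, and the pattern is a tightened variant rather than the one posted earlier: it uses [^>]* so a match cannot run across several tags, and \b so that a short name like a does not also match <abbr. It assumes attribute values never contain a > character.

```php
<?php
// Hypothetical helper, not from the thread: strips comments, then repeatedly
// strips listed elements that contain nothing but whitespace.
function strip_empty_elements($html, array $tags)
{
    // Pass 1: remove every comment (the s flag lets . span line breaks).
    $html = preg_replace('/<!--.*?-->/s', '', $html);

    // [^>]* keeps the match inside one opening tag; \1 repeats the tag name.
    $empty = '/<(' . implode('|', $tags) . ')\b[^>]*>\s*<\/\1>/i';

    // Pass 2: repeat until a pass removes nothing, so that parents which
    // become empty after their children are removed are deleted as well.
    do {
        $html = preg_replace($empty, '', $html, -1, $count);
    } while ($count > 0);

    return $html;
}

$tags = array('div', 'span', 'h1', 'h2', 'h3', 'a');
var_dump(strip_empty_elements('<div id="something"><span id="something2"><!--comment--></span></div>', $tags));
var_dump(strip_empty_elements('<div id="something"><span>some text</span><!-- a comment --></div>', $tags));
```

The fifth argument of preg_replace receives the number of replacements made, which is what ends the loop. Note that this removes every empty listed element on the page, including ones that never held a comment, so the tag list should only name elements that are safe to drop.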
