regular expression help

Status
Not open for further replies.

anorakgirl

Programmer
Jun 5, 2001
103
GB
I have written a PHP 'search engine' which trawls through my site, pulls out the words (plus keywords, title etc.) and puts them in a database.
I want to flag certain parts of the page not to be included in the search, i.e. menus which are repeated on each page. I have identified these in the HTML using comments:

<!--nosearch-->
Some Html here
<!--/nosearch-->

So I have a long string which contains all the text on the page, and I want to remove the bits between those comments. I've tried using a regular expression like this:

Code:
$text = eregi_replace("(<!--nosearch-->)(????)(<!--/nosearch-->)", " ",$text);

But I don't know what to put in the middle bit where the ???s are - it has to pick up everything (including html tags) except for the close <!--/nosearch-->.

Any tips?
Thanks!

~ ~
 
Hi,
Thanks for the suggestion. I had tried that, but the problem is that there is more than one no-index section in my page:

<!--nosearch-->
Some Html here
<!--/nosearch-->
Some Html I want to index
<!--nosearch-->
Some Html here
<!--/nosearch-->

The problem is, using .*, the regular expression seems to get rid of everything from the first <!--nosearch--> to the last <!--/nosearch--> rather than between each pair.

Not sure what I can do about that...
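
To make the symptom concrete, here is a minimal sketch of what the greedy .* does. It uses preg_match rather than eregi (the ereg family was removed in PHP 7), and the sample string and variable names are made up for illustration:

```php
<?php
// Hypothetical sample page with two nosearch sections.
$text = "<!--nosearch-->menu<!--/nosearch-->"
      . "Some Html I want to index"
      . "<!--nosearch-->footer<!--/nosearch-->";

// Greedy: .* runs on to the LAST closing comment, so one match spans everything.
// s lets . match newlines, i ignores case.
preg_match('/<!--nosearch-->.*<!--\/nosearch-->/is', $text, $match);

// The single match is the entire string, wanted text included.
$swallowed_everything = ($match[0] === $text);
var_dump($swallowed_everything); // bool(true)
```

So replacing that match with a space wipes out the text between the two sections as well.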

~ ~
 
Not sure if eregi handles minimal matching, but I think PHP has something called "preg_replace", which is the Perl-compatible regex replace function, and I would hope that it does support minimal matching.
To do a minimal match, simply swap your ".*" for ".*?" and see how you go.
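
Something along these lines might do it. This is only a sketch: the function name strip_nosearch and the sample string are invented for the example, and the s and i modifiers are my assumption about what the page needs (sections spanning several lines, case-insensitive comments as with eregi):

```php
<?php
// Hypothetical helper name; not from the original post.
function strip_nosearch(string $text): string
{
    // .*? is the minimal (lazy) match: it stops at the FIRST closing comment.
    // s lets . match newlines, i makes the match case-insensitive.
    // preg_replace returns null on error, in which case the text is left as it was.
    return preg_replace('/<!--nosearch-->.*?<!--\/nosearch-->/is', ' ', $text) ?? $text;
}

$text = "<!--nosearch-->menu<!--/nosearch-->"
      . "Some Html I want to index"
      . "<!--nosearch-->footer<!--/nosearch-->";

echo strip_nosearch($text); // prints " Some Html I want to index "
```

Each pair of comments is replaced separately, so the text between two sections survives.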



Trojan.
 
Hi,
Thanks for the answer - I solved my problem a different way instead as it was hurting my head!

The code I used is:

Code:
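$loop = true;

while ($loop) {
	$n = strpos($text_content, "<!--nosearch-->");
	// strpos returns 0 for a match at the very start, so compare with ===
	if ($n === false) {
		$loop = false;
	} else {
		// look for the closing marker only after the opening one
		$m = strpos($text_content, "<!--/nosearch-->", $n);
		// no closing marker: drop everything to the end of the text
		// (16 is the length of "<!--/nosearch-->")
		$end = ($m === false) ? strlen($text_content) : $m + 16;
		$before = substr($text_content, 0, $n);
		$after = (string) substr($text_content, $end);
		$text_content = $before." ".$after;
	}
}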

Probably not as efficient, but it seems to work.
Thanks for the suggestion!

~ ~
 
I was wondering about trying "split" on '/<!--.?nosearch-->/'.
Just a thought.
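
Roughly what I had in mind, as an untested-in-anger sketch using preg_split. The function name keep_searchable is made up, I have tightened ".?" to "\/?" so only an optional slash matches, and it assumes the markers always come in well-formed, non-nested pairs:

```php
<?php
// Hypothetical helper name; assumes markers are well-formed and never nested.
function keep_searchable(string $text): string
{
    // Split on either marker; \/? matches the optional slash of the closing comment.
    $pieces = preg_split('/<!--\/?nosearch-->/i', $text);

    $kept = [];
    foreach ($pieces as $i => $piece) {
        // Pieces alternate outside, inside, outside, ... so even indexes lie outside.
        if ($i % 2 === 0) {
            $kept[] = $piece;
        }
    }
    return implode(' ', $kept);
}

$text = "intro<!--nosearch-->menu<!--/nosearch-->body<!--nosearch-->footer<!--/nosearch-->end";
echo keep_searchable($text); // prints "intro body end"
```

If a closing comment is ever missing, the even/odd bookkeeping goes wrong, so this is only safe on pages where every section is properly closed.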


Trojan.
 