Stripping Most HTML Tags But Not All

Status
Not open for further replies.

perlfan

Technical User
Apr 18, 2002
17
US
I'm writing a Perl CGI application similar to a bulletin board. Users should be able to enter simple HTML tags for bold text, italics, font formatting, etc. Rather than blindly accepting all HTML, it would be best to accept only a subset of innocuous tags. What is a good way to do this? I looked at HTML::Parser but got mired in the details and didn't see a straightforward solution. Any ideas or code snippets, especially code snippets, would be greatly appreciated.

I bet this topic has been kicked around ad nauseam but I couldn't find it in the FAQ or elsewhere. Forgive me if it's a tired subject.
 
I have not run this, so watch for typos. I think it illustrates the idea: keep a hash of the tags you want to allow, use a regex substitution to find each tag, check whether it is a keeper, and either keep it or replace it with an empty string.

Code:
#!/usr/local/bin/perl -w
use strict;
use CGI;
my $cgi = new CGI;
my $user_input = $cgi->param('user_input');

# a list of tags you want to keep
my $keep_this = ( 
    '<strong>'           => '1',
    '</strong>'          => '1',
    '<i>'                => '1',
    '</i>'               => '1',
    '<some other tags>'  => '1',
    '</some other tags>' => '1');
    
# the 'e' switch makes the replace evaluate
# the right side before doing the replacement.
# match things between <> and possibly remove them
$user_input =~ s/<.*?>/&clean($&)/gise;

sub clean
{
    my $stuff = shift;
    if ($keep_this{$stuff}) { return($stuff); }
    else { return(''); } # return an empty string so the tag is removed
}

'hope this helps

If you are new to Tek-Tips, please use descriptive titles, check the FAQs, and beware the evil typo.
 
goBoating

Thank you very, very much. That is a huge help. I have one question. You use the i modifier on the substitution. Does it actually do anything? By that I mean the matched text ($&) is checked for inclusion in the hash $keep_this (shouldn't that be %keep_this?). The hash keys are case sensitive, so <i> is not the same as <I>. It seems like I need to convert $stuff to lowercase and then check the hash.

I hope my question makes sense, and thanks again for the help. I'm on my way now.
 
You are correct that the 'i' is useless in this regex. Since the pattern on the left side contains no literal letters, the 'i' has no effect.

$user_input =~ s/<.*?>/&clean($&)/gse;


And, yes you are correct again. $keep_this should be %keep_this.

my %keep_this = (
    '<strong>'  => '1',
    '</strong>' => '1',
    etc...
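
Pulling those two fixes together with the lowercase lookup suggested above, a minimal sketch might look like the following. The `strip_tags` wrapper name and the extra `<b>` entries are illustrative assumptions, not part of the original snippet. It also returns an empty string rather than an empty list, and captures the tag in `$1` instead of using `$&`.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Tags to keep, stored in lowercase so the lookup can ignore case.
# The <b> entries are an assumed addition for illustration.
my %keep_this = map { $_ => 1 } qw(<strong> </strong> <i> </i> <b> </b>);

# Return the tag unchanged if its lowercased form is in the hash,
# otherwise return an empty string so the tag is removed.
sub clean {
    my ($tag) = @_;
    return $keep_this{ lc($tag) } ? $tag : '';
}

# strip_tags is a hypothetical helper name used only in this sketch.
sub strip_tags {
    my ($text) = @_;
    # /g repeats for every tag, /s lets . match a newline inside a tag,
    # /e evaluates the right side as Perl code. $1 is the matched tag.
    $text =~ s/(<.*?>)/clean($1)/gse;
    return $text;
}

print strip_tags('<I>hi</I> <script>alert(1)</script> <STRONG>there</STRONG>'), "\n";
# prints: <I>hi</I> alert(1) <STRONG>there</STRONG>
```

Because the hash keys are exact strings, a tag with attributes such as `<font color="red">` will not match any key and gets stripped. Allowing those would mean matching on the tag name rather than the whole tag.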


Sorry for the uh-ohs. :~/

'hope this helps

 
goBoating

No need to apologize. Thanks for the help, I've got a path to go down now.
 