Removing HTML from a string

mountainbiker · Apr 22, 2004

Are there any good modules or nice ways to strip HTML tags, entities, multiple whitespace, and leading/trailing space from a string?

I want to take the first 1024 characters of the first paragraph of text, strip the unwanted data and stuff it into a meta element

<meta name="description" content="...." />

and an RSS element

<description>....</description>

For example, my unfiltered string is 'here is tom & jerry'. If I placed this into the RSS description element above, the unescaped ampersand would make the XML invalid.

I could wrap the string in a CDATA section, but this robs me of characters I could be using in the description.

I could do a bunch of regular expressions, but I would think that this has already been tackled--so there must be a refined, robust, elegant solution.

PaulTEG · Apr 22, 2004

There's a plethora of modules on

http://search.cpan.org

HTML::Parser could be one worth looking at

HTH
--Paul

mountainbiker · Apr 28, 2004

Do you know which method?

PaulTEG · Apr 29, 2004

Have a look for HTML and XML on search.cpan.org

--Paul

youthman · Apr 29, 2004

I believe this article does basically the same thing you want!

Check it out!

http://www.perl.com/pub/a/2001/11/15/creatingrss.html

mountainbiker · Apr 29, 2004

Youthman, yes, that was a very good article on the use of XML::RSS. I use HTML::TokeParser on other things like scrapes. I currently produce my RSS channels with the latter, but was thinking of using XML::RSS::SimpleGen because

"It transparently handles all the unpleasant details of RSS, like proper XML escaping, and also has a good number of Do-What-I-Mean features, like not changing the modtime on a written-out RSS file if the file content hasn't changed, and like automatically removing any HTML tags from content you might pass in."

mountainbiker · Apr 30, 2004

use HTML::Parse;        # provides parse_html()
use HTML::FormatText;
$plain_text = HTML::FormatText->new->format(parse_html($html_text));

http://iis1.cps.unizar.es/Oreilly/perl/cookbook/ch20_07.htm
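
For the remaining steps from my original question (collapsing whitespace, trimming, cutting to 1024 characters, and escaping for XML), here is a minimal sketch using only core Perl. It assumes the tags have already been removed, for example by the HTML::FormatText call above. The name clean_description and its optional length argument are my own invention for illustration, not part of any module.

```perl
use strict;
use warnings;

# Hypothetical helper: tidy already de-tagged text for a description field.
# $max defaults to 1024 characters, the limit mentioned in the question.
sub clean_description {
    my ($text, $max) = @_;
    $max = 1024 unless defined $max;

    $text =~ s/\s+/ /g;               # collapse whitespace runs, newlines included
    $text =~ s/^ //;                  # trim leading space
    $text =~ s/ $//;                  # trim trailing space
    $text = substr($text, 0, $max);   # truncate before escaping so no entity is cut in half
    $text =~ s/ $//;                  # truncation may leave a trailing space

    # Escape the characters that are special in XML; & must be handled first.
    $text =~ s/&/&amp;/g;
    $text =~ s/</&lt;/g;
    $text =~ s/>/&gt;/g;
    $text =~ s/"/&quot;/g;
    return $text;
}

print clean_description("  here is   tom & jerry \n"), "\n";
# prints: here is tom &amp; jerry
```

Note that the limit applies to the unescaped text, so the escaped result can be a little longer than 1024 characters. If XML::RSS::SimpleGen does the escaping for you, as its documentation says, the last four substitutions should be dropped to avoid escaping twice.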
