Removing HTML from a string

Status
Not open for further replies.

mountainbiker

Programmer
Aug 21, 2002
122
GB
Are there any good modules or nice ways to strip HTML tags, entities, multiple whitespace, and leading/trailing space from a string?

I want to take the first 1024 characters of the first paragraph of text, strip the unwanted data and stuff it into a meta element

<meta name="description" content="...." />

and an RSS element

<description>....</description>

For example, my unfiltered string is 'here is tom & jerry'. If I placed this into the RSS description element above, the unescaped ampersand would produce invalid XML.
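As a rough illustration of the escaping problem described here, the sketch below is written in Python rather than Perl and uses only the standard library. The function names `rss_description` and `meta_description` are hypothetical, chosen for this example; they are not part of any module mentioned in the thread.

```python
from xml.sax.saxutils import escape


def rss_description(text):
    """Wrap plain text in an RSS <description> element (hypothetical helper)."""
    # escape() replaces &, < and > with their XML entities.
    return "<description>{}</description>".format(escape(text))


def meta_description(text):
    """Build a meta description tag (hypothetical helper)."""
    # Inside a double-quoted attribute the double quote must be escaped too.
    return '<meta name="description" content="{}" />'.format(
        escape(text, {'"': "&quot;"})
    )


print(rss_description("here is tom & jerry"))
print(meta_description('say "hi" & bye'))
```

The input is assumed to be plain text already; running it on text that still contains entities such as `&amp;` would escape them a second time.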

I could wrap the string in a CDATA section, but the wrapper uses up characters I could be using in the description.

I could do a bunch of regular expressions, but I would think that this has already been tackled, so there must be a refined, robust, elegant solution.
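The clean-up steps asked for in this post (drop tags, decode entities, collapse whitespace, trim, cap at 1024 characters) can be sketched with a parser instead of hand-written regular expressions. This is a minimal sketch in Python rather than Perl, using only the standard library. The names `plain_summary` and `_TextExtractor` are hypothetical, and the 1024 limit is assumed to apply to the cleaned text, not to the raw HTML.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects text nodes and discards tags (hypothetical helper)."""

    def __init__(self):
        # convert_charrefs=True decodes entities such as &amp; into characters.
        super().__init__(convert_charrefs=True)
        self._chunks = []

    def handle_data(self, data):
        self._chunks.append(data)

    def text(self):
        return "".join(self._chunks)


def plain_summary(html_fragment, limit=1024):
    """Return tag-free, whitespace-normalised text, at most `limit` characters."""
    parser = _TextExtractor()
    parser.feed(html_fragment)
    parser.close()  # flush any text still buffered by the parser
    # split() with no arguments drops leading/trailing whitespace and
    # treats any run of whitespace as a single separator.
    collapsed = " ".join(parser.text().split())
    return collapsed[:limit].rstrip()


print(plain_summary("  <p>here is <b>tom</b> &amp; jerry</p>\n\n "))
```

The result is plain text with entities decoded, so it still needs XML escaping before it goes into a meta or description element.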




 
Have a look for the HTML and XML modules on search.cpan.org

--Paul
 
Youthman, yes, that was a very good article on the use of XML::RSS. I use HTML::TokeParser on other things like scrapes. I currently produce my RSS channels with the latter, but was thinking of using XML::RSS::SimpleGen because

"It transparently handles all the unpleasant details of RSS, like proper XML escaping, and also has a good number of Do-What-I-Mean features, like not changing the modtime on a written-out RSS file if the file content hasn't changed, and like automatically removing any HTML tags from content you might pass in."
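The "not changing the modtime" behaviour in the quote can be approximated by comparing the new content with what is already on disk before writing. The sketch below is in Python and only illustrates the idea; it is not how the quoted module is implemented, and `write_if_changed` is a hypothetical name.

```python
import os


def write_if_changed(path, content):
    """Write content to path only when it differs; return True if written."""
    if os.path.exists(path):
        with open(path, encoding="utf-8") as handle:
            if handle.read() == content:
                # Identical content: leave the file and its modtime untouched.
                return False
    with open(path, "w", encoding="utf-8") as handle:
        handle.write(content)
    return True
```

Feed readers and caches that rely on the file's modification time then see a change only when the feed content really changed.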
 