Tek-Tips is the largest IT community on the Internet today!

Members share and learn making Tek-Tips Forums the best source of peer-reviewed technical information on the Internet!

  • Congratulations SkipVought on being selected by the Tek-Tips community for having the most helpful posts in the forums last week. Way to Go!

Strip HTML from field in table - SQL Server 2005

Status
Not open for further replies.

pbsadmin

Technical User
Aug 19, 2009
18
0
0
US
Hello,

we are running SQL 2005 and I was wondering what regular expressions could be used to encapsulate everything between the "<" and ">" in HTML tags. I have been trying to use the following code, but it only strips the "<" opening tag leaving "FONT face=Arial size=2>" and "/FONT>" for instance.

the code is below.


USE [VISION_TEST]

UPDATE CL_Backup set Memo = REPLACE(Memo,substring(Memo,PATINDEX('%<[A-Za-z0-9]%>%',Memo), 1),'') WHERE PATINDEX('%<[A-Za-z0-9]%>%', Memo) < > 0


thanks in advance,

 
SQL Server doesn't directly support regular expressions. With SQL2005, you could enable CLR and write an HTML stripper using a .net language for the CLR routine. If never done this so I won't be much help.

Removing HTML from a string is probably more difficult than it seems, especially since most (if not all) browsers all for sloppy HTML. IF you have well formed html, then the following function may actually work ok for you. You'll want to test this a lot to see if it does.

Code:
Create Function dbo.StripHTML(@HTML VarChar(max))
Returns VarChar(Max)
As
Begin

  While @HTML Like '%[<]%'
    Set @HTML = Stuff(@HTML, PatIndex('%[<]%', @HTML), CharIndex('>', @HTML, PatIndex('%[<]%', @HTML)), '')

  Return @HTML
End

You'll definitely want to test this before you update any data. Test it like this:

[tt][blue]
Select Memo, dbo.StripHTML(Memo) As CleanedUpMemo
From CL_Backup
[/blue][/tt]

Do NOT be surprised if this doesn't catch all of the HTML. I mean... I hope it does, but I don't make any guarantees either.

The reason I decided to make this a function is to handle multiple HTML tags within a single string. I tested it with this...

Code:
Select dbo.StripHTML('<FONT Face=Arial size=2>This is text</font>')
Select dbo.StripHTML('')
Select dbo.StripHTML('Plain Text')
Select dbo.StripHTML(NULL)

It seems to pass all of my tests. Good luck.

-George

"The great things about standards is that there are so many to choose from." - Fortune Cookie Wisdom
 
Thanks for the quick reply George, unfortunately we have "sloppy" HTML imported from an older program no longer in use.

your function works well, however, the "&nbsp"s still remain,



here is one example before running the function:

<FONT face=Tahoma> <P><STRONG>Quick Reference:</STRONG><BR>Accounting Services&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;


and after:

TRONG>Quick Reference:nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;

(there was a space included before the "TRONG")


Also, do you know the syntax to update the data in the Memo column of table CL_Backup using your function? I'm new to SOL and didn't quite know the syntax

thanks,

-Andy
 
I made a mistake in the original function that I posted. Sorry. Here is a corrected version.

Code:
Create Function dbo.StripHTML(@HTML VarChar(max))
Returns VarChar(Max)
As
Begin

  While @HTML Like '%[<]%'
    Set @HTML = Stuff(@HTML, PatIndex('%[<]%', @HTML), CharIndex('>', @HTML, PatIndex('%[<]%', @HTML))-PatIndex('%[<]%', @HTML)+1, '')

  Set @HTML = Replace(@HTML, ' ', ' ')
  Set @HTML = Replace(@HTML, '<', '<')
  Set @HTML = Replace(@HTML, '>', '>')
  Set @HTML = Replace(@HTML, '"', '"')
  Set @HTML = Replace(@HTML, '&', '&')

  Return LTrim(RTrim(@HTML))
End

Notice the part near the end: Set @HTML = Replace(@HTML, '&nbsp', ' '). This will replace all occurrences of   with a space character. HTML has codes for other characters too, like <, >, ", © etc... you'll probably want to find a list of this and modify the code to include them.

Before you update your table, you'll want to verify that it is working properly be selecting the data.

[tt][blue]
Select Memo, dbo.StripHTML(Memo) As CleanedUpMemo
From CL_Backup
[/blue][/tt]

If this appears to work properly, then you can update the data in your table with this:

Code:
Update CL_Backup 
Set    Memo = dbo.StripHTML(Memo)


-George

"The great things about standards is that there are so many to choose from." - Fortune Cookie Wisdom
 
The replaces near the end of the code got messed up. This should be right.

Code:
Create Function dbo.StripHTML(@HTML VarChar(max))
Returns VarChar(Max)
As
Begin

  While @HTML Like '%[<]%'
    Set @HTML = Stuff(@HTML, PatIndex('%[<]%', @HTML), CharIndex('>', @HTML, PatIndex('%[<]%', @HTML))-PatIndex('%[<]%', @HTML)+1, '')

  Set @HTML = Replace(@HTML, '&nbsp;', ' ')
  Set @HTML = Replace(@HTML, '&lt;', '<')
  Set @HTML = Replace(@HTML, '&gt;', '>')
  Set @HTML = Replace(@HTML, '&quot;', '"')
  Set @HTML = Replace(@HTML, '&amp;', '&')

  Return LTrim(RTrim(@HTML))
End

-George

"The great things about standards is that there are so many to choose from." - Fortune Cookie Wisdom
 
thanks a lot George, your revised function works perfectly
 
Status
Not open for further replies.

Part and Inventory Search

Sponsor

Back
Top