REGEX Help Needed 2

Status
Not open for further replies.

MrTrue

Technical User
Jul 28, 2008
46
US
First I'll provide a description of what I'm trying to do.
When I send an outbound email I want to screen the body of the message and remove any SS# information. So far I think I have this under control, with the exception of some of the REGEX patterns I'm trying to find... I've been researching and trying different combinations for the last couple of days, and I'm getting to the point where I think I need someone else's input.

Basically I'm looking for patterns in a String that I've pulled from the HTMLBody of the email. The patterns would include standard digit formatting for SS# such as:
xxx-xx-xxxx or xxx xx xxxx. I'm also looking for 9 consecutive digits as long as the first digit is not a "9", so 887654321 would qualify, but 987654321 would not. I'm also looking for any sequence of 8 consecutive digits, such as 98765432...

The problem I encounter is that the string is pulling in HTML tags (which I want), so the sequence may look like "<tag>87654321</tag>" or there could be other non-numeric values next to the digit sequence...

I can't figure out the REGEX that I need to use to find 8 digits regardless of what leads or trails them (as long as it's not another digit; I don't want to pull an 8 digit number out of a 10 digit number, etc.)

I've tried [0-9]{8} but this will extract 8 digit numbers from larger number sequences, and I've tried ^\d{5}$ but this will not locate the sequences due to leading and trailing characters...
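The two failed attempts can be reproduced in a small test routine. This is only a sketch: the sample strings and the Sub name are made up for illustration, and \d{8} is used in place of the \d{5} quoted above so that it lines up with the 8-digit goal.

```vba
' Sketch: why the two attempted patterns misbehave (sample strings are invented).
Sub DemoFailedPatterns()
    Dim re As Object
    Set re = CreateObject("VBScript.RegExp")
    re.Global = True

    ' [0-9]{8} has no boundary check, so it also matches inside a longer run of digits.
    re.Pattern = "[0-9]{8}"
    Debug.Print re.Test("<td>1234567890</td>")   ' True: 12345678 is taken out of a 10-digit number

    ' ^ and $ anchor the start and end of the string (or line), so the tags block any match.
    re.Pattern = "^\d{8}$"
    Debug.Print re.Test("<td>87654321</td>")     ' False
    Debug.Print re.Test("87654321")              ' True: only a bare 8-digit string matches
End Sub
```

Run it from the VBA editor and check the Immediate window for the three results.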

I've placed the code below for those who are interested, if anyone can offer any assistance on building my ".pattern" line with REGEX, please feel free to help! Thanks everyone for reading this!

Code:
Private Sub Application_ItemSend(ByVal Item As Object, Cancel As Boolean)
    Dim Itm As Outlook.MailItem
    ' Only mail items have an HTMLBody to screen
    If TypeName(Item) <> "MailItem" Then
        Exit Sub
    Else
        Set Itm = Item
        Itm.HTMLBody = StripPHIFromText(Itm.HTMLBody)
    End If
End Sub

Function StripPHIFromText(ByVal RTFString As String) As String
    Dim RegEx As Object
    Set RegEx = CreateObject("vbscript.regexp")
    With RegEx
        .Global = True
        .IgnoreCase = True
        .MultiLine = True
        ' xxx xx xxxx | xxx-xx-xxxx | 9 digits that do not start with 9
        .Pattern = "[0-9]{3}[ ][0-9]{2}[ ][0-9]{4}|[0-9]{3}[-][0-9]{2}[-][0-9]{4}|[0-8]{1}[0-9]{8}"
    End With
    '--------------------------------------
    ' Debug aid: dump the incoming HTML body to a file for inspection
    Dim NameFile As String
    Dim CENumber As String
    NameFile = "C:\Documents and Settings\t329323\Desktop\emailinfo.doc"
    Open NameFile For Output As #1
    Write #1, RTFString
    Close #1
    '---------------------------------------
    ' Replace every match with the placeholder text
    StripPHIFromText = RegEx.Replace(RTFString, "(PHI DELETED)")
    Set RegEx = Nothing
End Function
 
Will the digits always be between HTML tags like "<tag>87654321</tag>"? If so, could you use:

>[0-9]{8}<
 
Or maybe this will work for you if there will be other characters besides digits:

>([^0-9])*[0-9]{8}([^0-9])*<
 
Thanks Mark, those patterns do find the numbers I'm looking for, but now I have a slightly different issue... :)

How can I make everything up to the 8 digits apply only for the pattern recognition, but exempt the rest of it when it comes to the replace?

I don't want to replace the >< or any text that leads or trails the 8 digit number...

Is there something that can be added to portions of those expressions that identifies certain pieces as "for matching purposes only", while the 8 digit number can still be replaced?
 
Put them in a passive group with (?: )

>(?:[^0-9]*)([0-9]{8})(?:[^0-9]*)<


This alternate will find an 8 digit sequence with or without characters, including digits, on either side (but no digits directly before or after the 8 digits):

>(?:.*[^0-9]+)([0-9]{8})(?:[^0-9]+.*)<

 
Make that

>(?:.*[^0-9]+)*([0-9]{8})(?:[^0-9]+.*)*<

for the alternate
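One thing worth checking with these patterns: a (?: ) group stops capturing, but whatever it matches is still part of the overall match, so Replace still overwrites it. The sketch below (sample string and Sub name are hypothetical) shows that behaviour and one possible workaround, which is to capture the surrounding text and write it back with $1 and $2.

```vba
' Sketch: non-capturing groups are still replaced; capturing groups can be put back.
Sub DemoNonCapturingGroup()
    Dim re As Object
    Dim s As String
    Set re = CreateObject("VBScript.RegExp")
    re.Global = True
    s = "<td>ID 87654321 x</td>"   ' invented sample input

    ' The whole match, including > and <, is replaced.
    re.Pattern = ">(?:[^0-9]*)([0-9]{8})(?:[^0-9]*)<"
    Debug.Print re.Replace(s, "(PHI DELETED)")
    ' prints: <td(PHI DELETED)/td>

    ' Capture the leading and trailing text and re-insert it in the replacement.
    re.Pattern = "(>[^0-9]*)[0-9]{8}([^0-9]*<)"
    Debug.Print re.Replace(s, "$1(PHI DELETED)$2")
    ' prints: <td>ID (PHI DELETED) x</td>
End Sub
```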
 
[1] Look-arounds are not fully supported in VB/VBA/VBScript (the VBScript RegExp object handles lookahead but not lookbehind), so I would suggest the following, which should be valid both in concept and in practice.

[1.1] the pattern (including the treatment of the 8-digit question)

.Pattern = "(\D|^)([0-9]{3}[ ][0-9]{2}[ ][0-9]{4}|[0-9]{3}[-][0-9]{2}[-][0-9]{4}|[0-8]{1}[0-9]{8}|[0-9]{8})(\D|$)"

[1.2] the replace

StripPHIFromText = RegEx.Replace(RTFString, "$1(PHI DELETED)$3")

[2] The only remaining problem is a string of digits that spans more than one line. Hopefully you can exclude those difficult cases.
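For anyone who wants to try the pattern outside Outlook, here is a minimal sketch with invented sample strings. The Sub name and the NUM constant are hypothetical. The second half is a variation that is not part of the suggestion above: it swaps the trailing (\D|$) group for a lookahead, which VBScript's RegExp should accept, so the character after the number is not consumed and two numbers separated by a single character are both caught.

```vba
' Sketch: boundary groups around the number, then a lookahead variation.
Sub DemoBoundaryGroups()
    ' The four number formats from the thread, kept in one constant for readability.
    Const NUM As String = "[0-9]{3}[ ][0-9]{2}[ ][0-9]{4}|[0-9]{3}[-][0-9]{2}[-][0-9]{4}|[0-8]{1}[0-9]{8}|[0-9]{8}"
    Dim re As Object
    Set re = CreateObject("VBScript.RegExp")
    re.Global = True

    ' Suggested form: capture the neighbours and write them back with $1 and $3.
    re.Pattern = "(\D|^)(" & NUM & ")(\D|$)"
    Debug.Print re.Replace("<td>87654321</td> and 1234567890 and 123-45-6789", "$1(PHI DELETED)$3")
    ' prints: <td>(PHI DELETED)</td> and 1234567890 and (PHI DELETED)

    ' The trailing \D is consumed, so a number right after it has no leading \D left.
    Debug.Print re.Replace("111-22-3333 444-55-6666", "$1(PHI DELETED)$3")
    ' prints: (PHI DELETED) 444-55-6666

    ' Variation (assumption, not from the thread): lookahead leaves the next character unconsumed.
    re.Pattern = "(\D|^)(" & NUM & ")(?=\D|$)"
    Debug.Print re.Replace("111-22-3333 444-55-6666", "$1(PHI DELETED)")
    ' prints: (PHI DELETED) (PHI DELETED)
End Sub
```

The 10-digit number is left alone in the first call because neither the 9-digit nor the 8-digit alternative is followed by a non-digit or the end of the string.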
 
Thank you both! :)

tsuji - Your solution worked perfectly in VBA! Thanks so much!

Mark - Thanks for the REGEX examples. For some reason they were acting strange in VBA, but it was helpful for me to see some properly constructed REGEX examples. The constructs are definitely making more sense to me now!
 