Regular expression 1

Status
Not open for further replies.

bslintx

Technical User
Apr 19, 2004
425
US
Fellow members,

I have to parse data, line by line, from a .txt file generated by an application.
Unfortunately, it does not have a strict structure like XML, so I have to check
whether each line satisfies a particular pattern. That is why I chose a regular expression.

The pattern is (pseudo):

(The characters IP:) (a space) (An ip address) (a space)
(The characters MAC:) (a space) (a mac address *** either a mac address OR the word 'unspecified') (a space)
(The characters Host Name:) (a space) (computer name *** follows Active Directory naming rules, i.e. min 2 characters, max 24, which is the minimum I need to check)
(a space)
(Line break)


Examples that satisfy the pattern (IPs/MACs are masked for obvious reasons; those parts of the pattern seem to work, and the problem seems to be with the host name):

IP: 127.0.0.57 MAC: 00:00:00:00:0C:D1 Host Name: Foobar-12345
IP: 127.0.0.58 MAC: unspecified Host Name: Foobar-12345678
IP: 127.0.0.59 MAC: unspecified Host Name: unspecified


Here is a pattern I tried. It works for the most part; however, it does not pick out all of the badly generated output:

IP: \d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3} MAC: (((?:(\d{1,2}|[a-fA-F]{1,2}){2})(?::|-*)){6})|(unspecified) Host Name: \w


Examples picked out by the above regex pattern:

1. (Bad IP address) IP: 1277.0.0.139 MAC: 00:00:00:a0:0B:50 Host Name: ABCDEFGHIJKL
2. (Bad IP address) IP: 127.0.0..52 MAC: 00:24:21:84:1E:B4 Host Name: MNBVCXZASDDR
3. (Bad IP chars) IP:: 127.0.0.23 MAC: 00:24:21:7B:AA:99 Host Name: HGFDRTHUIKJHG
4. (Bad MAC address) IP: 127.0.0.51 MAC: 00:00::00:0F:0B:AD Host Name: GHIOPTREDCGM
5. (Bad MAC chars) IP: 127.0.0.12 MMAC: 00:1F:0C:14:24:02 Host Name: JTRESWQRAFSS

Examples NOT picked out by the expression:
1. (Bad host chars) IP: 127.0.0.139 MAC: 00:00:00:a0:0B:50 Host Namee: ABCDEFGHIJKL
2. (Bad host chars) IP: 127.0.0.139 MAC: 00:00:00:a0:0B:50 Hosst Name: ABUUUHIJKL


Note that I did not add a \n to the expression, as it did not work, even though the lines end with a vbCrLf. The same goes for
trying to enforce the min/max character count with something like \w{1,24} \n. This may let more bad lines get by.


If someone could help me get ONLY the (pseudo) pattern to pass, it would be greatly appreciated.



Thanks!
 
[0]
[tt] ^\s*(IP: \d{1,3}(\.\d{1,3}){3} MAC: (unspecified|[A-Fa-f0-9]{2}(:[A-Fa-f0-9]{2}){5}) Host Name: (\w|-){2,24})\s*$[/tt]

[1] The \d{1,3} in the IP address part can/should be expanded, if needed, to allow finer-grained matching per the RFC for IP addresses.

[2] Similarly, the (\w|-) in the host name part can/should be expanded to allow finer-grained matching in accordance with the AD naming convention.
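
As a rough illustration of note [1], here is a minimal sketch of a stricter IP check that limits each octet to 0-255. The OCTET variable name and the sample addresses are my own assumptions, not part of the original reply, and the pattern rejects leading zeros (such as 007), which may or may not suit the data. It uses WScript.Echo, so it is meant to be run with cscript.

```vbscript
Option Explicit

Dim OCTET, rx, ip

' One octet in the range 0-255, written out as alternatives (hypothetical helper string).
OCTET = "(?:25[0-5]|2[0-4]\d|1\d\d|[1-9]?\d)"

Set rx = New RegExp
' Four octets separated by single dots, anchored at both ends.
rx.Pattern = "^" & OCTET & "(?:\." & OCTET & "){3}$"

' Only the first address should report a match.
For Each ip In Array("127.0.0.57", "1277.0.0.139", "127.0.0..52", "256.1.1.1")
    WScript.Echo ip & " -> " & rx.Test(ip)
Next
```

The same OCTET fragment could be spliced into the longer pattern above in place of each \d{1,3}, at the cost of a noticeably longer expression.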
 
tsuji,

Firstly, thanks for taking the time to help out.

I applied your pattern and it had the same results as my original string. Because of this I figured it had to be a logical error, so I created a small subroutine to test the pattern and, sure enough, I found a logical error. I ended up using your pattern for all three parts (IP/MAC/host), as it is cleaner and more precise, in particular for the host. This was my request and it has been completely satisfied. Thank you!

I have always been intrigued by regular expressions; however, I have to admit I have shied away from them. But
I have a couple of questions regarding the following snippet (the test script to validate the pattern you gave me)...

* IP:, MAC: and Host Name: were replaced with comma delimiters so the data can eventually be imported into a SQL database, hence the slight change to the pattern.

Code:
 Function IsValidRegEx(p,s)
  Set regEx = New RegExp
  regEx.Pattern = p  
  regEx.IgnoreCase = False
  IsValidRegEx = regEx.Test(s)
 End Function

 Sub ParseReport
  Set oFSO = CreateObject("Scripting.FileSystemObject")
  Set oFolder = oFSO.GetFolder(".")

  Set oFile = oFSO.OpenTextFile("report.txt", 1)
  ReportFile = oFile.ReadAll
  oFile.Close

  aReportFile = Split(ReportFile,vbcrlf)

  ReportSaveAs = "TEST Report-" & MonthName(Month(date),true) & "-" &  Day(date) & "-" & Year(date) & ".txt"
  
  Set Parse_DETAILS = oFSO.CreateTextFile(".\Reports\" & ReportSaveAs, True)

  CONST SHORTCIRCUIT = 5
  CONST DETAILSPATTERN = "^\s*(\d{1,3}(\.\d{1,3}){3},(unspecified|[A-Fa-f0-9]{2}(:[A-Fa-f0-9]{2}){5}),(\w|-){2,16})\s*$"

  For i = 0 to UBound(aReportFile) 
   
   If INStr(aReportFile(i),"IP:") > 0 Then

    DETAILS = LTrim(aReportFile(i))

    k = 1  

    DO 

     DETAILS = DETAILS & LTrim(aReportFile(i+k))  
     k = k + 1       

    Loop until IsValidRegEx("((N|NN)(a|aa)(m|mm)(e|ee): (\w|-){2,16})\s*$",DETAILS) OR SHORTCIRCUIT

    DETAILS = Replace((Replace(Replace(Replace(DETAILS,"IP: ","")," MAC: ",",")," Host Name: ",",")), " ","")

    Parse_DETAILS.WriteLine "Details: " & IsValidRegEx(pattern,DETAILS) & vbTab & DETAILS
   End If
  Next

 End Sub

 Call ParseReport

 msgbox("Done")

Here's the logic I had intended to use:

The 411 -
1. The report consists of thousands of lines of code (multiple files).
2. I used FSO to 'snapshot' the file(s) in a loop (not shown in the demo), one ReadAll per file found.
3. I iterate through the ReadAll array(s) and look for key words. For simplicity's sake, it's IP: as represented in the demo.
4. I use a Do loop to build up a string until it matches a pattern. In this case it's Name:

The output sometimes repeats characters in the Name pattern, so I used a regular expression to pick up abnormalities: Naame, Namme, etc.

I use a Do loop because line-by-line will NOT pick up Name: reliably. Sometimes (almost always, in fact) Name: is on a different line.

E.g.:

IP: 127.0.0.1 MAC: 00:00:00:00:00:01 Host
Name: ABCDEFGHI
IP: 127.0.0.2 MAC: 00:00:00:00:00:02 Host Name:
ABVCDFGS
IP: 127.0.0.3 MAC: 00:00:00:00:00:03 Ho
st Name: GHTSEDFGSS


' etc

So, am I going about it the right way, or is there better logic? Can a regular expression by itself (no loop) take care of this situation? RegExp.Multiline?

I leave the question to you and others. Again, thanks.
 
[3] If the data file is that bad, with "Host Name" able to break onto a new line at any point with a space or tab, you have to deal with it accordingly. But the way you deal with it in the post looks... no less "bad". The index i+k can run out of range at any time, and the loop condition OR SHORTCIRCUIT is always true because SHORTCIRCUIT is a non-zero constant, so the loop exits after a single pass (I suppose it is an incomplete form of something like k<SHORTCIRCUIT).
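
To make the two points in [3] concrete, here is a minimal sketch of a bounded version of the joining loop. The function name JoinUntilMatch, its parameters and the MAXJOIN cap are hypothetical names introduced for illustration; the sample lines are taken from the post above. Unlike the original Do...Loop Until, this version tests the string before appending anything, which is an assumption about the intended behaviour.

```vbscript
Option Explicit

Const MAXJOIN = 5 ' hypothetical cap on how many continuation lines to append

' Appends following lines to aLines(iStart) until sPattern matches,
' the cap is reached, or the array runs out (so i+k never goes out of range).
Function JoinUntilMatch(aLines, iStart, sPattern, nMaxJoin)
    Dim rx, s, k
    Set rx = New RegExp
    rx.Pattern = sPattern

    s = LTrim(aLines(iStart))
    k = 1
    Do Until rx.Test(s) Or k > nMaxJoin Or iStart + k > UBound(aLines)
        ' As in the original code, lines are joined without an extra space.
        s = s & LTrim(aLines(iStart + k))
        k = k + 1
    Loop

    JoinUntilMatch = s
End Function

Dim aSample
aSample = Array("IP: 127.0.0.1 MAC: 00:00:00:00:00:01 Host", "Name: ABCDEFGHI")
WScript.Echo JoinUntilMatch(aSample, 0, "Name: (\w|-){2,16}\s*$", MAXJOIN)
```

The caller would still need to decide what to do when the cap is hit without a match, for example by writing the partial string to the "bad" report.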

[4] I would say do it like this: no split, no nonsense.
[tt]
Sub ParseReport
[blue]
dim a(6), rx, soutput
a(0)="((\s|\r\n)*)"
a(1)="((IP: )(\d{1,3}(\.\d{1,3}){3}))"
a(2)="( )"
a(3)="((MAC: )(unspecified|[A-Fa-f0-9]{2}(:[A-Fa-f0-9]{2}){5}))"
a(4)="((\s|\r\n)+)"
a(5)="((H|HH)(\s|\r\n)*(o|oo)(\s|\r\n)*(s|ss)(\s|\r\n)*(t|tt)(\s|\r\n)+(N|NN)(\s|\r\n)*(a|aa)(\s|\r\n)*(m|mm)(\s|\r\n)*(e|ee)(\s|\r\n)*:(\s|\r\n)*((\w|-){2,24}))"
a(6)="((\s|\r\n)*)"
set rx=new regexp
with rx
.ignorecase=true 'more relax or false if you're sure
.global=true
.pattern=join(a,"")
end with
[/blue]
Set oFSO = CreateObject("Scripting.FileSystemObject")
[red]'[/red]Set oFolder = oFSO.GetFolder(".")

[blue]'suppose existence and non-empty[/blue]
Set oFile = oFSO.OpenTextFile("report.txt", 1)
ReportFile = oFile.ReadAll
oFile.Close

ReportSaveAs = "TEST Report-" & MonthName(Month(date),true) & "-" & Day(date) & "-" & Year(date) & ".txt"

[blue]
if rx.test(ReportFile) then[/blue]

Set Parse_DETAILS = oFSO.CreateTextFile(".\Reports\" & ReportSaveAs, True)
[red]'[/red]Parse_DETAILS.WriteLine "Details: " & IsValidRegEx(pattern,DETAILS) & vbTab & DETAILS
[blue]
set cm=rx.execute(ReportFile)
for each m in cm
soutput=m.submatches(4) & "," & m.submatches(9) & "," & m.submatches(31)
Parse_DETAILS.WriteLine soutput
next
'and I can assure you that you still have time to do some cleanup and a lot of mannerism for nothing that I do
Parse_DETAILS.close
Set Parse_DETAILS=nothing
end if
Set oFile=nothing
Set oFSO=nothing
[/blue]
End Sub

Call ParseReport

msgbox("Done")
[/tt]
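
The submatch indices 4, 9 and 31 in the listing above follow from counting every capturing parenthesis in the joined pattern. As a possible simplification, here is a minimal sketch that uses non-capturing groups (?: ) so that only the three wanted fields are numbered. It is not the author's pattern: it drops the tolerance for doubled letters (HH, oo, and so on) and for breaks inside the words "Host" and "Name", and the sample text is assumed for illustration. It is meant to be run with cscript.

```vbscript
Option Explicit

Dim rx, m, sample

Set rx = New RegExp
rx.Global = True
rx.IgnoreCase = True
' Only the three plain (...) groups capture: IP, MAC, host name.
' \s already covers spaces, tabs, CR and LF.
rx.Pattern = "IP: (\d{1,3}(?:\.\d{1,3}){3}) " & _
             "MAC: (unspecified|[A-Fa-f0-9]{2}(?::[A-Fa-f0-9]{2}){5})" & _
             "\s+Host\s+Name:\s*([\w-]{2,24})"

sample = "IP: 127.0.0.1 MAC: 00:00:00:00:00:01 Host" & vbCrLf & _
         "Name: ABCDEFGHI" & vbCrLf & _
         "IP: 127.0.0.58 MAC: unspecified Host Name: Foobar-12345678" & vbCrLf

For Each m In rx.Execute(sample)
    WScript.Echo m.SubMatches(0) & "," & m.SubMatches(1) & "," & m.SubMatches(2)
Next
' Prints:
' 127.0.0.1,00:00:00:00:00:01,ABCDEFGHI
' 127.0.0.58,unspecified,Foobar-12345678
```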
 
Thanks for the invaluable input, tsuji...

I have not applied your logic yet. It seems to have a nice precision to it and I will certainly apply it accordingly.

1. I took out the SHORTCIRCUIT as I was getting erroneous results. It was only there because I feared an infinite loop, but it looks like it is not needed, assuming the regex condition will always be met in this scenario.

2. I have supplied (posted below) the main sub with its functions. Although I am very satisfied with the results thus far, it is extremely slow...

Perhaps I have a bottleneck somewhere? If you have the time, I'd appreciate it if you could spot check. It's not that big of a deal because it's going to be fired off by a scheduled event on a daily basis. I was more worried that I had overlooked the obvious. Thanks.

3. I have 150 or so that did not make it through the details filter (the regex pattern you supplied). Not bad considering that is out of 600,000+. I feel confident the remaining 150 can be remedied by yet another regular expression, as they have one thing in common: repetitive chars.

The Do loop does its job as predicted; however, there are what I call hiccups (unknown bad output from the app) that come along with the concatenated string.

for example:

#1 IP Address (1277.0.0.0)
IP: 127
7.0.0.0 MAC: 00:00:00:00:00:00 Host
Name: ABCDE

Loop Details value = IP: 1277.0.0.0 MAC: 00:00:00:00:00:00 Host Name ABCDE
-------------------------
#1 MAC Address (00:0AA:00:00:00:00)
IP: 127.0.0.0 MAC: 00:0A
A:00:00:00:00 Host Name: ABCDE

Loop Details value = IP: 127.0.0.0 MAC: 00:0AA:00:00:00:00 Host Name ABCDE

There are other variances, such as ".." and "::", but these can be remedied by applying a Replace(s,"::",":") etc.

I looked around at possibilities ($1, for example), but I can only find the duplicate chars, not replace them.

Once I can apply a function that uses a regular expression to return the correct chars (keeping only the 1st of each repeat) in the IP and MAC, it would damn near eliminate the non-'DETAILS' pattern matches.
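
On the $1 idea, here is a minimal sketch of a replace that pairs a backreference in the pattern (\1) with $1 in the replacement text, so that a run of the same separator collapses to one. The function name CollapseSeparators is hypothetical. It deliberately touches only "." and ":", on the assumption that doubled digits cannot be collapsed safely: values like 11 in an IP or 00 in a MAC are legitimate, so a case such as 1277 would still need separate handling.

```vbscript
Option Explicit

' Collapses runs of the same separator ("." or ":") into a single one.
Function CollapseSeparators(s)
    Dim rx
    Set rx = New RegExp
    rx.Global = True
    ' ([.:]) captures one separator; \1+ matches one or more repeats of that same character.
    rx.Pattern = "([.:])\1+"
    ' $1 puts back a single copy of the captured separator.
    CollapseSeparators = rx.Replace(s, "$1")
End Function

WScript.Echo CollapseSeparators("IP: 127.0.0..52 MAC: 00:00::00:0F:0B:AD")
' Prints: IP: 127.0.0.52 MAC: 00:00:00:0F:0B:AD
```

A side effect is that a doubled colon in a label such as "IP::" is reduced to "IP:" as well.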

Code:
 Function DetailsFilter(s)

  f = Trim(s)

  Set regEx = New RegExp
  regEx.Global = true
  regEx.Pattern = "(I|II)(P|PP):|\s(M|MM)(A|AA)(C|CC):\s|\s(H|HH)(o|oo)(s|ss)(t|tt) (N|NN)(a|aa)(m|mm)(e|ee):\s"

  f = Replace(Trim(regEx.Replace(f," "))," ",",")

  ' Catch 'non standard' host names - some have spaces in the name (not on domain/or not a computer; ie)
  ' Seems silly to have to reconstruct a replace of " " with ","
  ' However, it is needed to ensure the host is intact if host has a space
  
  If INstr(f," ") > 0 Then
   a = Split(f,",")
   u = UBound(a)

   If u > 2 Then
    r = 2

    For c = 0 to u
     If c < r Then
      f = f & a(c) & ","
     Else
      f = f & a(c) & " "
     End If
    Next
   End If
  End If

  DetailsFilter = f
 End Function

 Function IsValidRegEx(p,s)
  Set regEx = New RegExp
  regEx.Global = true
  regEx.Pattern = p  
  regEx.IgnoreCase = False
  IsValidRegEx = regEx.Test(s)
 End Function

 Function PadIP(ip)

 ' Used to pad leading zeroes for easier 
 ' sorting once imported to sql database

  If Len(ip) > 0 Then
   If INstr(ip,".") Then
    aIP = Split(ip, ".") 
    ip = ""
    
    If UBound(aIP) = 3 Then
     For each item in aIP
      If Len(item) >=1 AND Len(item) <=3 Then ip = ip & String(3-Len(item),"0") & item & "."   
     Next

      ip = Trim(Left(ip,Len(ip)-1))
    End If

    PadIP = ip
   End If
  End If
 End Function

 Sub VMS_ConvertHosts
  Dim oHostDictionary: Set oHostDictionary = CreateObject("Scripting.Dictionary")

  Set oFSO = CreateObject("Scripting.FileSystemObject")
  Set oFolder = oFSO.GetFolder(".")

  ReportSaveAs = "Report-" & MonthName(Month(date),true) & "-" &  Day(date) & "-" & Year(date) & ".txt"

  Set VMS_Convert_HOSTS = oFSO.CreateTextFile(".\VMS Report Converts\" & ReportSaveAs, True)
  Set VMS_Convert_BAD_HOSTS = oFSO.CreateTextFile(".\VMS Report Converts BAD\" & ReportSaveAs, True)

  VMS_Convert_HOSTS.WriteLine "REPORT NAME,ENTRY DATE,ID,IP ADDRESS,MAC ADDRESS,HOST NAME"

  CONST DETAILSPATTERN = "^\s*(\d{1,3}(\.\d{1,3}){3},(unspecified|[A-Fa-f0-9]{2}(:[A-Fa-f0-9]{2}){5}),(\w|-|\W){2,16})\s*$"
 
  For Each file in oFolder.Files 
   If Right(file.name,4) = ".txt" Then

    Set oFile = oFSO.OpenTextFile(file.name, 1)

    ReportFile = oFile.ReadAll
    oFile.Close

    aReportFile = Split(ReportFile,vbcrlf)    

    EntryDate = Date

    For i = 0 to UBound(aReportFile) 
       ' msgbox(DETAILS) ' debug aid only; left enabled, it pops up a dialog for every line

     If Trim(aReportFile(i))= "for:" Then ReportName = Trim(aReportFile(i+1))

     ' Each file contains an ID...with each ID contains enumerated hosts under it
     
     If Trim(aReportFile(i)) = "ID:" Then ID = Trim(aReportFile(i+1))

     ' Used to catch IP: variances in raw data - works better than
     ' If INStr(aReportFile(i),"IP:") > 0
     
     If IsValidRegEx("(I|II)(P|PP):",aReportFile(i)) Then 

      DETAILS = LTrim(aReportFile(i))

      k = 1  

      ' Use do loop until string matches: Name: variances and 
      ' characters afterwards
      ' used \w for words...\W to catch non standard chars
      ' like ABCDE~1234, ABCDE #2
      ' not sure if this is the right approach - but so far ok

      DO 

       DETAILS = DETAILS & LTrim(aReportFile(i+k))
       k = k + 1       

      Loop until IsValidRegEx("(N|NN)(a|aa)(m|mm)(e|ee): (\w|-|\W){2,16}",DETAILS) 

      ' a 'key' is used to ensure no duplicate rows get added to writeline,
      ' and ultimately away from the sql database

      KEY = ID & DETAILS 

      If not oHostDictionary.Exists(KEY) Then 
       oHostDictionary.Add  KEY, "DETAILS"

       ' Used a simple replace to get rid of repetitive chars when the loop constructed
       ' the DETAILS string - this is where I need to create a function to
       ' return a cleaner version of the DETAILS string and feed
       ' IsValidRegEx accordingly
          
       DETAILS = DetailsFilter(Replace(Replace(DETAILS,"..","."),"::",":"))

       If IsValidRegEx(DETAILSPATTERN,DETAILS) Then
        
        aDETAILS = Split(DETAILS,",")
        REPORTNAME = ReportName           
        ENTRYDATE = EntryDate
        ID = ID
        IP = PadIP(aDetails(0))
        MAC = aDetails(1)
        HOST = aDetails(2)
        DETAILS = REPORTNAME & "," & ENTRYDATE & "," & ID & "," & IP & "," & MAC & "," & HOST

        VMS_Convert_HOSTS.WriteLine DETAILS
       Else 
        VMS_Convert_BAD_HOSTS.WriteLine (i+k) & vbtab & ReportName & vbtab & DETAILS
       End If
      End If
     End If 
    Next
   End If  
  Next
 End Sub
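
On the speed question, one possible contributor (not measured here) is that IsValidRegEx builds a new RegExp object on every call, and it is called several times per line inside the loops. A minimal sketch of reusing one object per pattern follows; the GetRegEx helper and the oRegExCache dictionary are hypothetical names, and whether this helps noticeably on 600,000+ lines is an assumption that would need timing.

```vbscript
Option Explicit

Dim oRegExCache
Set oRegExCache = CreateObject("Scripting.Dictionary")

' Returns a RegExp for pattern p, creating it only the first time p is seen.
Function GetRegEx(p)
    Dim rx
    If Not oRegExCache.Exists(p) Then
        Set rx = New RegExp
        rx.Pattern = p
        rx.IgnoreCase = False
        oRegExCache.Add p, rx
    End If
    Set GetRegEx = oRegExCache(p)
End Function

' Same signature as the original helper, so existing calls keep working.
Function IsValidRegEx(p, s)
    IsValidRegEx = GetRegEx(p).Test(s)
End Function

WScript.Echo IsValidRegEx("(I|II)(P|PP):", "IP: 127.0.0.1")
WScript.Echo IsValidRegEx("(I|II)(P|PP):", "MAC: unspecified")
WScript.Echo oRegExCache.Count ' one cached pattern after both calls
```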

 
update:

Splitting DETAILS is not needed, because it has already passed the pattern match and is already comma-delimited.

Code:
       If IsValidRegEx(DETAILSPATTERN,DETAILS) Then
        
        VMS_Convert_HOSTS.WriteLine ReportName & "," & EntryDate & "," & DETAILS
       Else 
   ' ...
      End If
 