Tek-Tips is the largest IT community on the Internet today!

Members share and learn making Tek-Tips Forums the best source of peer-reviewed technical information on the Internet!

  • Congratulations SkipVought on being selected by the Tek-Tips community for having the most helpful posts in the forums last week. Way to Go!

Difficult parsing of text 2

Status
Not open for further replies.

SBelyea

Technical User
May 29, 2008
46
0
0
US
Hi tek-tippers -

I've got an interesting parsing situation here, and I'm hoping that this group may be able to help me out.

Within Access, I have a text field whose contents look something like this:

Code:
05/13/2001 21:41:00 GMT - John Doe - REJECTED: Something is not right. 05/05/2002 20:50:29 GMT-Roger Dodger-Looks good, all smiles here.

I'm trying to (somehow) parse through this text field and:
1) count the number of times the word REJECTED appears, and;
2) when the word REJECTED appears, save the date it is rejected as a string that I can put into a new field. Because something may be rejected several times, but I want to combine all the rejection dates into a single field, I was trying to think of a way where I generate a number of variables = to the number of times REJECTED is found, and then storing the date of each rejection in the variable and concatenating the variables into a single string.

I've spent the better part of several hours trying to figure out a way to accomplish this (looking at code, understanding the Len() function, etc.) but have hit a wall.

Any suggestions?

Thanks!
 
It is easy enough to get a count of rejected using len and replace, but getting the date might be a little more difficult. Can you rely on phrases being separated by colon<space>, as illustrated above?

If so Split may suit:
Code:
s="05/13/2001 21:41:00 GMT - John Doe - REJECTED: Something is not right. 05/05/2002 20:50:29 GMT-Roger Dodger-Looks good, all smiles here."
x=split(s,": ")


 
Thanks for the quick response Remou!

The phrase is only seperated by a colon<space> when it is REJECTED
Code:
05/13/2001 21:41:00 GMT - John Doe - REJECTED:

Otherwise, there are hyphens before and after the name.

Code:
05/05/2002 20:50:29 GMT-Roger Dodger-Looks good, all smiles here.

Does this answer your question? Thanks again for the quick response!
 
A rough idea.

Code:
Dim st, ast, ln
st = "05/13/2001 21:41:00 GMT - John Doe - REJECTED: Something is not right. 05/05/2002 20:50:29 GMT-Roger Dodger-Looks good, all smiles here."
ast = Split(st, ": ")

For Each ln In ast
If InStr(ln, "rejected") > 0 Then
    'Rejected lines are delimited by ": " so rejected will
    'be the last word in the split line
    Debug.Print Mid(ln, InStrRev(ln, "GMT") - 20)
    'Date only
    Debug.Print Mid(ln, InStrRev(ln, "GMT") - 20, 23)
End If
Next

 
Here's another way using a UDF and a Regular Expression:
Code:
Public Function GetRejectedDates(strField As String) As String
    Dim objRegExp   As Object
    Dim mc          As Object
    Dim m           As Object
    
    Set objRegExp = CreateObject("VBScript.RegExp")
    
    With objRegExp
        .Global = True
        .MultiLine = True
        .IgnoreCase = True
        .Pattern = "\d{2}/\d{2}/\d{4}(?=\s\d{2}:\d{2}:\d{2}\D+?REJECTED:)"
        Set mc = .execute(strField)
    End With
    
    Debug.Print "Field Contains " & mc.Count & " rejected records."
    
    For Each m In mc
        GetRejectedDates = GetRejectedDates & m.Value & "; "
    Next m
    
End Function
The above function will spit out a semi-colon delimited string of rejected dates (it will also output a count in the immediate window if you test it in the VBE, or check after you've run the function).

To use it in you query, just use the name of the function with the field name in parentheses. Something like:
Code:
RejectedDates: GetRejectedDates(YourField)
Hope this helps

HarleyQuinn
---------------------------------
Carter, hand me my thinking grenades!

You can hang outside in the sun all day tossing a ball around, or you can sit at your computer and do something that matters. - Eric Cartman

Get the most out of Tek-Tips, read FAQ222-2244: How to get the best answers before posting.

 
For completeness, if you want the times as well as the dates (as per Remou's example) that the previous RegExp ignored you could use:
Code:
Function GetRejectedDatesandTimes(strField As String) As String
    Dim objRegExp   As Object
    Dim mc          As Object
    Dim m           As Object
    
    Set objRegExp = CreateObject("VBScript.RegExp")
    
    With objRegExp
        .Global = True
        .MultiLine = True
        .IgnoreCase = True
        .Pattern = "\d{2}/\d{2}/\d{4}\s\d{2}:\d{2}:\d{2} GMT(?=\D+?REJECTED:)"
        Set mc = .Execute(strField)
    End With
    
    Debug.Print "Field Contains " & mc.Count & " rejected records."
    
    For Each m In mc
        GetRejectedDatesandTimes = GetRejectedDatesandTimes & m.Value & "; "
    Next m
    
    Set m = Nothing
    Set mc = Nothing
    Set objRegExp = Nothing
    
End Function
Hope this helps

HarleyQuinn
---------------------------------
Carter, hand me my thinking grenades!

You can hang outside in the sun all day tossing a ball around, or you can sit at your computer and do something that matters. - Eric Cartman

Get the most out of Tek-Tips, read FAQ222-2244: How to get the best answers before posting.

 
Thank you both for the assistance! I'll be testing them out this morning to see if they work.

I'm relatively inexperienced at VB and VBA, but had actually read about Regular Expressions last night on the wikibook for VB6 - what a coincidence (although I'm sure I'm still years away from properly implementing it)!

Thanks again, and I'll report back with my results later today.
 
Once you've tested the code examples, if you'd like any explanations as to how the regex's in the posts are working post back and I'll be happy to explain them in a bit more detail.

Regards

HarleyQuinn
---------------------------------
Carter, hand me my thinking grenades!

You can hang outside in the sun all day tossing a ball around, or you can sit at your computer and do something that matters. - Eric Cartman

Get the most out of Tek-Tips, read FAQ222-2244: How to get the best answers before posting.

 
Hopefully a simple question (which makes rather clear my ineptness)! When I attempt to run my query with HarleyQuinn's code, with a field of the query called Rejection Date: GetRejectedDates(), I receive an error stating:
Code:
Undefined function "GetRejectedDates" in expressin

I created a module called GetRejectedDates in VBA, and pasted the code there. What step am I missing in getting this function to run?

Thanks again for the help, as always!
 
Just create a new module (leave the name as the default name for now) and paste ALL the code for the function in either of my examples in the module, then run the query (remember to put a field name into the () at the end of the function call in the query).

Hope this helps

HarleyQuinn
---------------------------------
Carter, hand me my thinking grenades!

You can hang outside in the sun all day tossing a ball around, or you can sit at your computer and do something that matters. - Eric Cartman

Get the most out of Tek-Tips, read FAQ222-2244: How to get the best answers before posting.

 
Absolutely AMAZING. :-D

Remou and HarleyQuinn, thank you both for your help on this. Harley, the code worked like a charm - I'm not sure why it didn't work the first time I tried putting it into a module, but it certainly works now!


HarleyQuinn - within the code you wrote, my biggest question resides around
Code:
Pattern = "\d{2}/\d{2}/\d{4}\s\d{2}:\d{2}:\d{2} GMT(?=\D+?REJECTED:)"

More specifically, how does that line examine the string, and what do the \d{2}/ and \s\d{2} characters mean? I'm pretty intrigued by the capability of this code, and would love to learn how it works!

Thanks again to both parties!
 
I'm glad the code worked for you, thanks for the
star.gif


PHV has posted a very useful (and comprehensive) link on Regular Expressions which if you're wanting to learn about them should be essential reading.

Just to get you started with the string I used, here's another useful reference page that I use when writing expressions. For your specific questions have a look at the Character Classes section along with the Quantifiers section. That covers almost the whole expression apart from the Positive Lookahead (which you'll notice straight away once you've read about it in the links).

Hope this helps

HarleyQuinn
---------------------------------
Carter, hand me my thinking grenades!

You can hang outside in the sun all day tossing a ball around, or you can sit at your computer and do something that matters. - Eric Cartman

Get the most out of Tek-Tips, read FAQ222-2244: How to get the best answers before posting.

 
Thanks for the information everyone! My only point of confusion with the code is the "lazy plus", or +?.

To make sure I'm understanding this correctly in the context of the code, the \D+? starts after the pattern search finds the time (20:15:08), and the pattern skips ahead until it finds REJECTED:. This is because a lazy plus can be repeated one or many times. Is this the proper way to look at the (?=pattern) section of the regular expression in the above example?
 
Pretty much exactly right. Here's a breakdown:
Code:
\d{2}/\d{2}/\d{4}
Find 2 digits (0-9), then 2 digits then 4 digits seperated by /
Code:
\s\d{2}:\d{2}:\d{2} GMT
Find a space (\s) followed by 2 digits, 2 digits and 2 digits each seperated by a : then a space and GMT
Code:
(?=\D+?REJECTED:)
Once it's matched all of the conditions above it uses positive lookahead to make sure the match is followed by anything not a digit (\D is the negated digit class), one or more times (+?) and then REJECTED:

As we've used lookahead the last bit of the match is not returned (it's just checked for).

Hope that helps clear things up.

HarleyQuinn
---------------------------------
Carter, hand me my thinking grenades!

You can hang outside in the sun all day tossing a ball around, or you can sit at your computer and do something that matters. - Eric Cartman

Get the most out of Tek-Tips, read FAQ222-2244: How to get the best answers before posting.

 
HarleyQuinn, thanks for the clarification!

You'll be proud - for some reason in the Comment string, one of the lines had the date and time without the preceding 0s for the month and hour (not sure how this happened, this is the only occurence in all of the comment strings I've checked). To make sure that it still picked up the rejection date, I changed the pattern to:

Code:
.Pattern = "\d\d?/\d{2}/\d{4}(?=\s\d\d?:\d{2}:\d{2}\D+REJECTED:)"

.. making only 1 digit required for the beginning of the date and the time.

Thanks again for your help on this! [smarty]
 
HarleyQuinn, a final question (my apologies for this growing post)!

After reading more about RegExp and its properties, I am curious as to if ".MultiLine = True" is necessary - doesn't ".Global = True" make RegEx search the entire string?

To double check my assumption, I tested the code with the ".MultiLine = True" statement commented out, and I still returned the same (correct) results.

Is ".MultiLine = True" needed?
 
Glad You got the digit problem sorted, another way to write this (not using optional characters) is:
Code:
.Pattern = "\d{1,2}/\d{2}/\d{4}(?=\s\d\d?:\d{2}:\d{2}\D+REJECTED:)"
This basically means that it's looking for a digit a minimum of once and a maximum of twice.

While Multiline isn't strictly necessary in this case, I generally always use Multiline as it can affect the way the anchors (^ and $) are matched within a string. If you had a string that consisted of multiple lines and you wanted to check that a particular pattern matched the start of each line (rather than the start of the whole string) you would use Multiline, for example:

Using this as the base data (all lines are seperated by a vbNewLine):
Code:
05/13/2001 21:41:00 GMT - John Doe - REJECTED: Something is not right. 05/05/2002 20:50:29 GMT-Roger Dodger-Looks good, all smiles here.
05/13/2001 21:41:00 GMT - John Doe - REJECTED: Something is not right. 05/05/2002 20:50:29 GMT-Roger Dodger-Looks good, all smiles here.
05/13/2001 21:41:00 GMT - John Doe - REJECTED: Something is not right. 05/05/2002 20:50:29 GMT-Roger Dodger-Looks good, all smiles here.
05/13/2001 21:41:00 GMT - John Doe - REJECTED: Something is not right. 05/05/2002 20:50:29 GMT-Roger Dodger-Looks good, all smiles here.
05/13/2001 21:41:00 GMT - John Doe - REJECTED: Something is not right. 05/05/2002 20:50:29 GMT-Roger Dodger-Looks good, all smiles here.
05/13/2001 21:41:00 GMT - John Doe - REJECTED: Something is not right. 05/05/2002 20:50:29 GMT-Roger Dodger-Looks good, all smiles here.
If we apply the following pattern we will get wildly different results depending on whether Multiline is explicitly set to True or False.

The pattern is:
Code:
"^\d{1,2}/\d{2}/\d{4}(?=\s\d{2}:\d{2}:\d{2}\D+?REJECTED:)"
With Multiline turned off, you'd return 1 match, with it on, you'd return 6.

So, basically, Multiline is very useful when dealing with lines of data.

Hope this helps

HarleyQuinn
---------------------------------
Carter, hand me my thinking grenades!

You can hang outside in the sun all day tossing a ball around, or you can sit at your computer and do something that matters. - Eric Cartman

Get the most out of Tek-Tips, read FAQ222-2244: How to get the best answers before posting.

 
Thanks again for the clarification, HarleyQuinn. You're a terrific resource for this, and I really appreciate all of the assistance you've provided! :)
 
Thanks, glad I could help [smile]

HarleyQuinn
---------------------------------
Carter, hand me my thinking grenades!

You can hang outside in the sun all day tossing a ball around, or you can sit at your computer and do something that matters. - Eric Cartman

Get the most out of Tek-Tips, read FAQ222-2244: How to get the best answers before posting.

 
Status
Not open for further replies.

Part and Inventory Search

Sponsor

Back
Top