Splitting City, State, and Zip


alr0

Programmer
May 9, 2001
211
US

Hi All,

I am trying to write a city, state, zip splitter. I have a lot of the logic worked out once the field is split, but sometimes there are many spaces internally, plus I need to look for commas as separators as well.

The code could repeatedly replace double spaces with single spaces, or examine each character in the string with a pile of Ifs, but neither seems like an efficient solution.

Has anyone found a good way to accomplish this using Access 03?

Thanks,

alr

_____________________________________
There is no rule more invariable than that we are paid for our suspicions by finding out what we expected.

Henry David Thoreau
 
If I understand correctly, you want to parse a string containing all three of the data items you mention into three separate fields?

Can you post some example strings for us to have a look at?

Ed Metcalfe.

Please do not feed the trolls.....
 



Hi,

By carefully examining the data, you have to identify consistent logic like...
[tt]
a comma ALWAYS follows the city and appears nowhere else

the State is ALWAYS a 2 character State Code

[/tt]
If your structure agrees with this logic, it should be no problem. The string to the left of the comma is the city.

Use the Split function...
Code:
Dim sCity As String, sState As String, sZIP As String

sCity = Split([TheString], ",")(0)          'everything left of the comma is the city
sState = Trim(Split([TheString], ",")(1))   'everything right of the comma: "ST 12345"
sZIP = Split(sState, " ")(1)                'second token after the comma is the ZIP
sState = Split(sState, " ")(0)              'first token after the comma is the state code


Skip,
 



oops...
Code:
sZIP = Split(sState, " ")(UBound(Split(sState, " ")))   'take the last token, so extra spaces don't matter

Skip,
 
How are ya alr0 . . .

Here's an Idea:
Code:
[blue]Public Sub UnPack(Dat As String)
   Dim Ary As Variant, x As Integer, Nam As String
   
   Ary = Split(Dat, ",")   'zero-based array of the comma-separated pieces
   
   For x = LBound(Ary) To UBound(Ary)
      Nam = Choose(x + 1, "City", "State", "Zip")   'Choose is 1-based, so offset the array index
      Me(Nam) = Trim(Ary(x))                        'write each piece to the matching control
   Next
   
End Sub[/blue]
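For illustration, a minimal sketch of how this might be wired up, assuming the form has City, State, and Zip controls and a text box named CSZ holding the raw string (those names are just placeholders):
Code:
Private Sub cmdSplit_Click()
    'Hand the raw "City, State, Zip" string to UnPack,
    'which fills the City, State, and Zip controls on this form.
    If Not IsNull(Me!CSZ) Then UnPack CStr(Me!CSZ)
End Sub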
[blue]Your Thoughts? . . .[/blue]

See Ya! . . . . . .
 
try

Code:
SELECT Location.CityStateZip,
       Left([CityStateZip], InStr([CityStateZip], ",") - 1) AS City,
       Trim(Mid([CityStateZip], InStr([CityStateZip], ",") + 1, 3)) AS State,
       Right(Trim([CityStateZip]), 5) AS Zip
FROM Location;
 
Hi,

The data is from multiple sources with little or no consistency, e.g.:

Bend OR 97701
Bend OR 97701-
Bend OR 97701-0001
Bend,OR 97701
Bend OREGON 97701
Bend, OR 97701
Bend 97701
Bend OR97701
97701

Just about every layout is possible but I need to make certain that I capture all of the legitimate USPS addresses correctly. There will be foreign addresses as well but I thought a good starting point would be to parse the field accurately and evaluate the components as a separate process.

There is already a lookup table of the 62 state codes, with the state names spelled out in case they appear that way in the data.

I did get a type mismatch error with the Split command. I tried assigning the result to both String and Variant variable types.
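For reference, Split returns a zero-based String array, so the receiving variable needs to be a Variant or a dynamic String array; a minimal sketch (the variable names are just placeholders):
Code:
Dim vParts As Variant
vParts = Split("Bend, OR 97701", ",")   'a Variant can hold the String array Split returns
Debug.Print vParts(0)                   'prints "Bend"

Dim sParts() As String                  'or a typed dynamic string array
sParts = Split("Bend, OR 97701", ",")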

It just seems awful to look at each character in a loop to parse this field properly and eliminate the extra spaces.

Thanks for looking at this.

alr

_____________________________________
There is no rule more invariable than that we are paid for our suspicions by finding out what we expected.

Henry David Thoreau
 



The only way to cure BAD DATA is to FIX THE DATA.

You have a HUGE data cleanup task.

Skip,
 

Skip said:
The only way to cure BAD DATA is to FIX THE DATA.

You have a HUGE data cleanup task.



I knew we would get to this pretty quickly! This is why they need me!!!

So, I believe parsing is the first step. Any suggestions, or is evaluating each character in a loop the best (or only) way?

Thanks,

alr

_____________________________________
There is no rule more invariable than that we are paid for our suspicions by finding out what we expected.

Henry David Thoreau
 
alr0 said:
[blue]The data is from multiple sources [red]with little or no consistency[/red].[/blue]
If you had only said this in the beginning! [surprise]





See Ya! . . . . . .
 



The first step is NOT to write code to fix it.

The first step is to ANALYZE and CATEGORIZE the data. Data in the various categories MAY PERHAPS be fixable via code or some global change.

For instance, two consecutive spaces can be replaced by ONE space. This can be done in a loop.

States spelled out rather than given as the TWO-character standard Post Office abbreviation can be changed to the two-character abbreviation.

Addresses without ANY state will have to be looked up in a ZIP Code reference to find the correct state abbreviation.

It's going to take a lot of individual analysis & corrections to hammer the data into a consistent format of ...
[tt]
city, state zip
[/tt]
then the parsing will be SIMPLE.
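For illustration, a minimal sketch of that first global-change pass (collapse runs of spaces, standardize the comma spacing); the function name is just a placeholder and this is only the clean-up step, not the parse:
Code:
Public Function NormalizeCSZ(ByVal sIn As String) As String
    Dim s As String
    s = Trim(sIn)
    
    'collapse runs of spaces down to a single space
    Do While InStr(s, "  ") > 0
        s = Replace(s, "  ", " ")
    Loop
    
    'standardize the comma: no space before it, one space after it
    s = Replace(s, " ,", ",")
    s = Replace(s, ",", ", ")
    
    'the comma fix can reintroduce a doubled space, so collapse once more
    Do While InStr(s, "  ") > 0
        s = Replace(s, "  ", " ")
    Loop
    
    NormalizeCSZ = s
End Function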

Skip,
 
Sorry AceMan

I was trying to be specific about the need to parse on multiple spaces as well as commas, as opposed to the usual CSZ situation, without being too verbose.

alr

_____________________________________
There is no rule more invariable than that we are paid for our suspicions by finding out what we expected.

Henry David Thoreau
 
Zip code info is not my strength so I could be wrong but

One suggestion to parse this: pull the zip code out by using Int() or IsNumeric() and getting it down to 5 characters.
The zip code should give you the state info, so parsing that is redundant (I think zips don't cover more than one state).
A lot of zip codes only cover one city, so that only leaves the zips that cover more than one city needing further parsing.
If the data is as above, use InStr to get the city, reading until you hit an ASCII character that is not between 65 and 122.
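A minimal sketch along those lines, treating the last digit-leading token as the ZIP and a two-letter token as the state; the procedure name and the exact rules are just placeholders, and it only handles the simpler layouts:
Code:
'Rough pass: the last token that starts with digits is the ZIP,
'a 2-letter token is the state, everything else is the city.
Public Sub RoughSplit(ByVal s As String, ByRef sCity As String, _
                      ByRef sState As String, ByRef sZip As String)
    Dim vTok As Variant, i As Integer, t As String
    
    sCity = "": sState = "": sZip = ""
    vTok = Split(Trim(Replace(s, ",", " ")), " ")
    
    For i = UBound(vTok) To LBound(vTok) Step -1
        t = Trim(vTok(i))
        If Len(t) = 0 Then
            'skip the empty tokens left behind by doubled spaces
        ElseIf sZip = "" And IsNumeric(Left(t, 5)) Then
            sZip = Left(t, 5)
        ElseIf sState = "" And Len(t) = 2 And Not IsNumeric(t) Then
            sState = UCase(t)
        Else
            sCity = Trim(t & " " & sCity)
        End If
    Next i
End Sub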
 
Skip,

I like the way you are looking at this, except that it is not a one-shot deal. This will be part of an application that processes 50+ files per week. We use a commercial CSZ splitter that is not very good, then correct things with queries and as little hand work as possible.

Once this part is finished the application calls a DLL that standardizes addresses that are codable (CASS), completing missing city, state, or zip values.

It is a pretty slick system; the biggest weakness is the CSZ area, which is what I am trying to improve. If I can convert the CSZ field to an array of components, then I can test for a 5- or 9-digit zip, a state code/name, and a city. My thought was to clean things up first so it will be easier to evaluate what remains.

Does that seem reasonable?

alr

_____________________________________
There is no rule more invariable than that we are paid for our suspicions by finding out what we expected.

Henry David Thoreau
 



Then you have to analyze what the "slick" program misses. You may be able to figure out the logic to clean up 90% or so of the recurring problems. But I would guess that there will be a certain percentage of the total that you will have to do by hand.

Good luck!

Skip,
 
The "slick" program doesn't do csz at the moment but it sounds like a loop to look at each csz character is going to be required.

Any better ideas?

Thanks,

alr

_____________________________________
There is no rule more invariable than that we are paid for our suspicions by finding out what we expected.

Henry David Thoreau
 



IMHO, looking at each character does not get you much as the next step.

You need to take a 30,000 ft view, do some analysis and develop some logic.

This will result in groups of data. These groups can be examined at 20,000 ft, and so on, until you get to each final grouping. THEN you may be able to do some microscopic analysis/logic.

For instance, in the awful example that you posted...
[tt]
Bend OR 97701
Bend OR 97701-
Bend OR 97701-0001
Bend,OR 97701
Bend OREGON 97701
Bend, OR 97701
Bend 97701
Bend OR97701
[/tt]
First step, replace OREGON with OR. But that's assuming there's no city like OREGONVILLE or E. OREGON. This is 30,000 ft analysis, because you have to look at ALL the data rather than a subset. Determining stuff like that takes DETAILED ANALYSIS of the data. It is actually LOOKING at the data and all its differences, developing logic, and then testing it to see how inclusive & complete it really is.
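For illustration, a minimal sketch of that kind of targeted substitution driven by a state lookup table like the one alr0 mentioned; the table and field names (StateLookup, StateName, StateCode) are just placeholders, and the whole-word match is only a first defense against names like OREGONVILLE:
Code:
Public Function AbbrevState(ByVal sIn As String) As String
    Dim rs As DAO.Recordset, s As String
    s = " " & sIn & " "   'pad the ends so whole-word matches are easy to test
    
    Set rs = CurrentDb.OpenRecordset( _
        "SELECT StateName, StateCode FROM StateLookup", dbOpenSnapshot)
    Do While Not rs.EOF
        'swap the spelled-out name for the code only when it appears as a separate word
        s = Replace(s, " " & rs!StateName & " ", " " & rs!StateCode & " ", , , vbTextCompare)
        rs.MoveNext
    Loop
    rs.Close
    
    AbbrevState = Trim(s)
End Function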

Skip,
 
My first take on this would be something like the following algorithm:
Code:
   For each record
      Remove all extra spaces

      Look for most likely pattern first: "CITY, ST xxxxx"
      Is this the pattern?
         Yes: parse the values, then jump to next record
         No: continue

      Look for second most likely pattern: "CITY, ST xxxxx-xxxx"
      Is this the pattern?
         Yes: parse the values, then jump to next record
         No: continue

      .
      .
      .

   Next Record

This method should be efficient because most of the time you should be able to resolve the pattern early, and be able to move on to the next record.
You should be able to figure out the rankings of the patterns by analyzing your current data.
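A minimal sketch of testing those patterns with the Like operator, assuming the extra spaces have already been collapsed; only the two patterns named above are shown, and the function name is just a placeholder:
Code:
Public Function MatchPattern(ByVal s As String) As String
    'Return which known layout the string matches, testing the most likely pattern first.
    Select Case True
        Case s Like "*, [A-Za-z][A-Za-z] #####"        '"CITY, ST xxxxx"
            MatchPattern = "City, ST Zip5"
        Case s Like "*, [A-Za-z][A-Za-z] #####-####"   '"CITY, ST xxxxx-xxxx"
            MatchPattern = "City, ST Zip9"
        Case Else
            MatchPattern = "Unknown"                   'fall through to the later, rarer patterns
    End Select
End Function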


 
I don't know if this will help, but I think it should work for you.

Allen Browne has a function he wrote here:

I also found an article on Microsoft's site called "enhanced split function", located here:

I used Allen's function and added a part of the "enhanced split function" from the Microsoft site and came up with this:

Code:
Function ParseWord(varPhrase As Variant, ByVal iWordNum As Integer, Optional strDelimiter As String = " ", _
    Optional bRemoveLeadingDelimiters As Boolean, Optional bIgnoreDoubleDelimiters As Boolean) As Variant
On Error GoTo Err_Handler
    'Purpose: Return the iWordNum-th word from a phrase.
    'Return:    The word, or Null if not found.
    'Arguments: varPhrase = the phrase to search.
    '         iWordNum = 1 for first word, 2 for second, ...
    '             Negative values for words form the right: -1 = last word; -2 = second last word, ...
    '             (Entire phrase returned if iWordNum is zero.)
    '         strDelimiter = the separator between words. Defaults to a space.
    '         bRemoveLeadingDelimiters: If True, leading delimiters are stripped.
    '             Otherwise the first word is returned as null.
    '         bIgnoreDoubleDelimiters: If true, double-spaces are treated as one space.
    '             Otherwise the word between spaces is returned as null.
    'Author:    Allen Browne. [URL unfurl="true"]http://allenbrowne.com.[/URL] June 2006.
    Dim varArray As Variant     'The phrase is parsed into a variant array.
    Dim strPhrase As String     'varPhrase converted to a string.
    Dim strResult As String     'The result to be returned.
    Dim lngLen As Long         'Length of the string.
    Dim lngLenDelimiter As Long 'Length of the delimiter.
    Dim bCancel As Boolean     'Flag to cancel this operation.
    Const CHARS = ".!?,;:""'()[]{}#"
    Dim intIndex As Integer
    
    '*************************************
    'Validate the arguments
    '*************************************
    'Cancel if the phrase (a variant) is error, null, or a zero-length string.
    If IsError(varPhrase) Then
        bCancel = True
    Else
        strPhrase = Nz(varPhrase, vbNullString)
        If strPhrase = vbNullString Then
            bCancel = True
        End If
    End If
    'If word number is zero, return the whole thing and quit processing.
    If iWordNum = 0 And Not bCancel Then
        strResult = strPhrase
        bCancel = True
    End If
    'Delimiter cannot be zero-length.
    If Not bCancel Then
        lngLenDelimiter = Len(strDelimiter)
        If lngLenDelimiter = 0& Then
            bCancel = True
        End If
    End If
                
    '*************************************
    'Process the string
    '*************************************
    If Not bCancel Then
        strPhrase = varPhrase
        'Remove leading delimiters?
        If bRemoveLeadingDelimiters Then
            strPhrase = Nz(varPhrase, vbNullString)
            Do While Left$(strPhrase, lngLenDelimiter) = strDelimiter
                strPhrase = Mid(strPhrase, lngLenDelimiter + 1&)
            Loop
        End If
        
         'Remove Chars
        For intIndex = 1 To Len(CHARS)
        strPhrase = Trim(Replace(strPhrase, _
            Mid(CHARS, intIndex, 1), " "))
        Next intIndex
        
        'Ignore doubled-up delimiters?
        If bIgnoreDoubleDelimiters Then
            Do
                lngLen = Len(strPhrase)
                strPhrase = Replace(strPhrase, strDelimiter & strDelimiter, strDelimiter)
            Loop Until Len(strPhrase) = lngLen
        End If
        
       'Cancel if there's no phrase left to work with
        If Len(strPhrase) = 0& Then
            bCancel = True
        End If
    End If

    '*************************************
    'Parse the word from the string.
    '*************************************
    If Not bCancel Then
        varArray = Split(strPhrase, strDelimiter)
        If UBound(varArray) >= 0 Then
            If iWordNum > 0 Then        'Positive: count words from the left.
                iWordNum = iWordNum - 1         'Adjust for zero-based array.
                If iWordNum <= UBound(varArray) Then
                    strResult = varArray(iWordNum)
                End If
            Else                        'Negative: count words from the right.
                iWordNum = UBound(varArray) + iWordNum + 1
                If iWordNum >= 0 Then
                    strResult = varArray(iWordNum)
                End If
            End If
        End If
    End If

    '*************************************
    'Return the result, or a null if it is a zero-length string.
    '*************************************
    If strResult <> vbNullString Then
        ParseWord = strResult
    Else
        ParseWord = Null
    End If

Exit_Handler:
    Exit Function

Err_Handler:
    Call LogError(Err.Number, Err.Description, "ParseWord()")
    Resume Exit_Handler
End Function

(I hope that formats right)

The code from the Microsoft article serves to pull out any characters (like commas, exclamation points, etc.) that you specifically list from the string parts as you separate them out. You can change (near the top) which chars should be removed.

This works very well for my purpose, which was to take a "firstname, lastname MI" string and break it up into individual names that are input into my 'contacts' form, which pops up in a NotInList event procedure.

I believe you could use the same thing and just adjust the list of "chars" at the top so that you don't remove anything you want to keep (or keep anything you want to remove).
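For illustration, a minimal sketch of how ParseWord might be pointed at a city/state/zip string; this usage is an assumption rather than something from the posts above, and it relies on the state and ZIP being the last two words:
Code:
Dim sCSZ As String, vCity As Variant, vState As Variant, vZip As Variant

sCSZ = "Bend,  OR  97701"
vZip = ParseWord(sCSZ, -1, " ", True, True)     'last word
vState = ParseWord(sCSZ, -2, " ", True, True)   'second-to-last word
vCity = ParseWord(sCSZ, 1, " ", True, True)     'first word (single-word cities only)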

Thanks to Allen and MS (mostly Allen) for the OVERWHELMING majority of the code work. I just pieced the two together.

Hope this helps.
T

PS: I think the code formatted OK, but there are a couple of lines that wrapped that should be all on one line. I think they are pretty obvious.
 
If you're not opposed to purchasing a third-party tool, we use the PeopleSmith tools. You could use either Personator or RightFielder. This would require putting your data into a "DBF" file first. Both products have tools to assist with this if needed.

Personator ($395) will split and genderize names and split CSZ information.

RightFielder ($495) will field all address elements and split CSZ information.

Check them out at
I've used the products for years in the direct mail business and they are great.
 