Tek-Tips is the largest IT community on the Internet today!

Members share and learn making Tek-Tips Forums the best source of peer-reviewed technical information on the Internet!

  • Congratulations strongm on being selected by the Tek-Tips community for having the most helpful posts in the forums last week. Way to Go!

Want to parse a Address Field that has all components in it 1

Status
Not open for further replies.

hkrawitz

MIS
Apr 14, 2012
5
US
I have a file where the field has the following example:

123 tent st Chicago IL 12345

As you can see there are no delimiters. I need to split this so that I can put the street address in one field, city in one field, state in one field and zip in one field.

I could use some assistance with my problem.
 
If the data is written to you as a 'data file', there must be some form of standardization to the file data.

With no delimiters, can you count on fixed field lengths, or a fixed number of spaces in each 'field'? Or what?

If it is a totally random (not fixed length, no fixed number of spaces, etc.), you are facing a major challenge.

I guess that you could back-count spaces from the end of the text line.

cZipcode = RIGHT(ALLTRIM(cTextString),RAT(" ",ALLTRIM(cTextString)+1)
cState = SUBSTR(ALLTRIM(cTextString),RAT(" ",ALLTRIM(cTextString),2)+1,2)
etc.

Good Luck,
JRB-Bldr
 
I second jrbbldr.

And I fear there is no fixed length fields.

Think alone of the problem a street name can have any number of words in it. Just to seperate the street from the city you'd already need knowledge (a list of valid cities eg) and apply some simple artificial intelligence to extract the parts of the address.

If you can avoid that by getting better source data, then do that. It's cumbersome to parse an address string written this way into it's parts. You'll end up with a percentage of addresses you will parse wrong. And that could be very high.

You might use a service for that parsing, see here, what I googled via "parse address usa":

In the detail page you'll see that even these services require a format with comma seperation of address parts.

Bye, Olaf.
 
One thing you can reliably do is to parse out the state and zipcode:

Code:
lcZip = GETWORDNUM(lcAddress, GETWORDCOUNT(lcAddress))
lcState = GETWORDNUM(lcAddress, GETWORDCOUNT(lcAddress) - 1)

And, if you can rely on the city always being one word, you can do this:

Code:
lcCity = GETWORDNUM(lcAddress, GETWORDCOUNT(lcAddress) - 2)

But that would fail for the likes of Des Moines and Las Vegas.

As the others have said, you really need a standardised format, with delimiters between the various elements.

Mike


__________________________________
Mike Lewis (Edinburgh, Scotland)

Visual FoxPro articles, tips, training, consultancy
 
Extracting zip and state surely is a begin. This also shows how easy it is for poastal services to handle the mail. The zip code is enough to find the next destination of a mail, and the state actually is already optional.

Normalising the data is harder to do.

To give some more clues: if you have a state and zip with the appropriate data you can already limit valid cities, and make a matching of all the valid cities with the address line, to see what would match. And the remaining text can be matched against a list of valid street names of that zip.

It can be done, you'd still have some uncertainty or in problemati situations would find more thanone or no match. But in the first place to parse any address, you would need a whole set of data about cities of each state, streets of each city. All related to zips. And then you would fail on pobox or other special addresses again. This is a job for a web service provided by som postal service.

Bye, Olaf.
 
There are commercially available databases for Zipcodes which contain the City, State, and other information.

The only problem is that for a given Zipcode, there are often multiple cities listed.

But if you had the Zipcode, you could do a Search for the City Name which closely matched that found in your Text String and could therefore 'extract' the part that matched - leaving you with the remaining Left part of the string to represent the Street Address.

It would be yet another 'Rube Goldberg' approach, but it might be workable.

Good Luck,
JRB-Bldr
 
...for a given Zipcode, there are often multiple cities listed.[/qoute]

My knowledge about US addresses is limited. In germany what compares to a zip code is a so called postal guide number and simply denotes a number representing some postal office. So it's sufficient for the long distance transfer of a mail. From there on, the rest of the route is sometimes even done by bike.

But the same rule applies, sometimes such a number is shared among several small cities, but the opposite case is perhaps more often, bigger cities have more postal guide numbers or zips (as I understand them).


Googled Rube Goldberg and my first finding was, he was an engineer known for nonsense machines (what heppens then? machine), but wikipedia also says he's a synonym for unnecessarily comlicated machines, doing something simple.

I wouldn't judge matching a list of cities to the address string as complicated. Parsing addresses is just something you judge as simple from the everyday life and thus don't think you need so much code and ado about this. But you just apply general knowledge about city names for example, and this actually is done quite the same way, you detect the words and detect a city matching your list of know cities in Illinois. You do much more complex things in your brain automatic, without breking it down to the single steps you do, it rather happens to you.

This is something you can only replace with a database to a computer, but then it can do much faster than a human, an more reliable. And know all cities, if your data is complete.

Bye, Olaf.
 
Use Mike's suggestion to get Zip/State. Leave the rest as a lump. The only thing I'd add is verifying the part that you extract as a zip code is an actual zip code (i.e. all digits *or* nine digits with an embedded dash).

It takes special black magic to properly parse addresses into their component parts.

I used to work for a company that managed truly huge amounts of data, including addresses. We had 700 TB of data on line at any given time (and nobody really knew how much on tape) and even with all of that to compare with and all of those examples to draw from we still sometimes got addresses wrong.

 
First of all I would like to thank everyone for there help. I finally used Mike's suggestions. Please read on as I need now to remove the information from the original field that i split off.

Thanks All

Howard


Mike,

Thanks I was able to split off the city, st and zip. My problem is to remove the characters from the address field that I know longer need. I have tried a number of commands without luck.

Possibly you may have an answer.

Thanks

Howard
 
Hi Howard,

Glad to hear my solution was of some help.

So, you now know the city, state and zip, and want to remove those items from the original address. Is that right?

If so, you can do it like this:

Code:
lcResult = STRTRAN(lcAddress, lcCity, "")
lcResult = STRTRAN(lcResult, lcState, "")
lcResult = ALLTRIM(STRTRAN(lcResult, lcZip, ""))

But you still have the problem of cities that have more than one word in their name.

Mike



__________________________________
Mike Lewis (Edinburgh, Scotland)

Visual FoxPro articles, tips, training, consultancy
 
..plus you also have a problem, if the city name is part of the street name, eg Hamburger Strasse in Hamburg (Germany).

So you would need to check if strtran() does only one replace (or remove in this case, replacing with "").

Let me ask some basic questions:
How many addresses do you have?
Do you really program against one sample address only?
If you have more addresses, take a bit of your time to search for special cases, eg PO Box adresses or foreign country addresses.
What is the source of these addresses? OCR (scanners)? If so you also would need to be prepared for misspellings, OCR errors. Even if not, addresses still could have typos in them.

Bye, Olaf.
 
Status
Not open for further replies.

Part and Inventory Search

Sponsor

Back
Top