Tek-Tips is the largest IT community on the Internet today!

Members share and learn making Tek-Tips Forums the best source of peer-reviewed technical information on the Internet!

  • Congratulations IamaSherpa on being selected by the Tek-Tips community for having the most helpful posts in the forums last week. Way to Go!

Address Cleansing Problem

Status
Not open for further replies.

DSJ1967

MIS
Aug 24, 1999
29
CA
This problem is not really a VB question but more a programming question in general.<br>
<br>
Has anyone ever had to establish links between two seperate mailing databases? I have a database that requires updates to be applied to it from various electronic sources but I need to determine whether an update needs to be applied or not based on me having the same address as is supplied. <br>
<br>
The problem lies in the fact that not all addresses are formatted in the same fashion. i.e. Suite numbers might be in the second line of address or at the beginning of the first or some people write Rd. sometimes it is Road or Rd with no punctuation etc. The permutations are endless. Obviuosly using string comparisons is not going to work. <br>
<br>
Is there something else that might work (i.e Soundex) that someone has used in the past (Mike or Alt255). Any information or tips in this exercise would really help.<br>
<br>
Thanks
 
This is a very &quot;general programming&quot; answer.<br>
With no prior knowledge of your situation I assume you live in a perfect world and can do as you please with your databases. So, try this....<br>
Look at the addresses as packets of information, rather than collections of fields, ie. First/Last Name, Suite/Apt., Street/POB, City, State/Province, ZIP/etc. Look at the the addresses as identifiers that allow real people (postal employees, et al) to deliver items to real people living at a real, physical address. The Post Office cares not that your database has broken the discrete &quot;physical address&quot; into multiple, meaningless fields. It only wants to deliver the mail, as quickly and efficiently as possible.<br>
Back to assuming you live in a perfect world....<br>
Create a mailing address field that contains all of the address elements (concatenate the Name/street/Suite/etc. fields into one discrete packet &quot;field&quot;). Print it on an envelope (don't forget the carriage returns) and mail it.<br>
Use the various electronic sources to obtain the data, to concatenate the fields, to update the databases.<br>
Someone will have to verify the data. If you use Soundex, someone will have to verify the data. If you use string comparisons, someone will have to verify the data. If you derive a totally unique method of associating Road with RD or Roed or Rowd or R, someone will have to verify the data.<br>
<br>
Simple is better. That's what I like about the Post Office. Tell them where to go and they deliver.<br>

 
My approach to this would be as follows:<br>
1) Consider a record as a possible duplication only if<br>
the zip code, last name, and at least three characters<br>
of the first name are identical.<br>
2) Separate the address fields into numeric and nonnumeric<br>
strings. Compare the numeric strings from the 1st record <br>
to each of the numeric strings of the 2nd. You must get<br>
100% matching to proceed to the next test. (By the way,<br>
create the numeric strings from either numbers separated <br>
by spaces or numbers surrounded by other characters).<br>
3) Throw away any nonnumeric strings less than 4 characters.<br>
4) If a majority of the remaining nonnumeric strings from<br>
the 1st record match nonnumeric strings from the 2nd,<br>
consider this a matched record.<br>
This is obviously not perfect, but fairly easy to code. I<br>
hope this helps.
 
mrregan/Alt255<br>
<br>
Thanks for you input. I was doing some more thinking on this and am thinking about proceeding using bits of both suggestions. I think I need to first sanitize the fields I am going to compare based on a set of rules, thereby getting rid of any potential noise, depending on the type of field ie. dispose of Ave,St. Street etc for address fields and Mr. Ms. Miss for Name fields. Once this is done, would it not be possible to incorporate a type of 'Fuzzy Logic'? If you have 3 elements (tokens) in an address i.e.Address Line1: 123 AnyStreet Address Line2:Suite #1 ... put it together to get a sanitized address of 1 123 Anystreet and you can match 2 out of the 3 tokens then you would have 2/3 or 66% fuzzy match? Do the same for Name and Postal/Zip Code etc increasing or decreasing the degree of match as required? I think the hardest part of this program would be developing the rulebase for all of the field type sanitation and of course that it could be potentially really slow. <br>
<br>
Any thoughts?<br>
<br>
Thanks again DSJ<br>

 
Status
Not open for further replies.

Part and Inventory Search

Sponsor

Back
Top