
De-duplicating records


Ferdalizer

Programmer
Jun 29, 2004
I have read many of the posts here regarding SQL select statements for de-duplication of tables and I wanted to get everyone's opinion of the method that I use.

The table TMP, has the fields:

name_first, name_last, address1, address2, city, state, zipcode

All addresses in the table are standardized to postal specifications.

I create a KEY field of C(80) and fill it with:

REPLACE ALL KEY WITH UPPER(STRTRAN( ;
    ALLTRIM(name_first) + ALLTRIM(name_last) + ALLTRIM(address1) + ;
    ALLTRIM(address2) + ALLTRIM(city) + ALLTRIM(state) + ;
    SUBSTR(zipcode, 1, 5), ' ', ''))

Then:

INDEX ON KEY TAG KEY

then I use:

SELECT DISTINCT *, COUNT(*) FROM TMP GROUP BY KEY

This returns a cursor of the de-duped records with a duplicate count appended.

I have been told that using the "GROUP BY KEY" could be returning too many records, though all my checks of the de-duped tables turn out fine.

What do you think?

Thanks in advance.

Fred
 
Hi Fred,

Not quite sure what your question is. Are you asking if your SELECT works (your testing should tell you that)? Or are you saying that it doesn't work, and you want to know why?

I don't know what version of VFP you are using, but the syntax you have posted is not valid in 8.0 and above, unless you have SET ENGINEBEHAVIOR 70, which is not recommended. As the VFP Help explains:

The GROUP BY clause must list every field in the SELECT list except for fields contained in an aggregate function, such as the COUNT( ) function

Your own SELECT violates that rule.

With that in mind, I think this is what you want:

[tt]SELECT KEY, COUNT(KEY) FROM TMP GROUP BY KEY HAVING COUNT(KEY) > 1[/tt]

That will give you all the keys belonging to the duplicated records, which I think is what you want.
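If you need the full duplicated records rather than just the keys, an IN subquery can pull them. A sketch (the cursor name is just for illustration):

[tt]SELECT * FROM TMP WHERE KEY IN ;
    (SELECT KEY FROM TMP GROUP BY KEY HAVING COUNT(KEY) > 1) ;
    INTO CURSOR Dupes[/tt]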

Mike



__________________________________
Mike Lewis (Edinburgh, Scotland)

Visual FoxPro articles, tips and downloads
 
>I have been told that using the "GROUP BY KEY" could be returning too many records

Well, aside from the invalid GROUP BY clause: if you have data like f1 C(3), f2 C(3) with the values "1 ","23 " in one record and "12 ","3 " in another, a key defined as ALLTRIM(f1)+ALLTRIM(f2) comes out as "123" for both of those differing records.
Take a look at SYS(2017). It'll help you find potential dupes. If the CRC32 checksums of two records are different, you can be sure the records are different. If they match, it is very rare that the records nevertheless differ, as the CRC algorithm is designed to produce a completely different checksum even when just a typo swaps two letters.
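A minimal sketch of how that could look, assuming you add a character field CHKSUM to TMP to hold the checksum (the field and cursor names are only examples):

* Fill CHKSUM while it is still empty in every record, so the blank
* column contributes the same value to every record's checksum.
SELECT TMP
REPLACE ALL CHKSUM WITH SYS(2017)   && checksum of the current record
INDEX ON CHKSUM TAG CHKSUM

* Records sharing a checksum are duplicate candidates; verify those
* field by field, since different records can (rarely) collide.
SELECT CHKSUM, COUNT(*) FROM TMP ;
    GROUP BY CHKSUM HAVING COUNT(*) > 1 ;
    INTO CURSOR Candidates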

Bye, Olaf.
 
Mike,

I am using VFP 9.0 and I do indeed issue SET ENGINEBEHAVIOR 70, and in testing it works very well. I have been using this process for quite a while now, so there is no PROBLEM so to speak.

The "SELECT KEY, COUNT(KEY) FROM TMP GROUP BY KEY HAVING COUNT(KEY) > 1" statement only returns the KEY field and I need all of the fields, which I copy to a de-duped table for presorting.

Olaf,

In address tables such as the ones I work with, there is very seldom only a 1-to-5-character field being compared. I appreciate what you are saying, though. The checksum (SYS(2017)) is very interesting.
One of the things I have been searching for is a way to obtain a slightly looser match (comparing and selecting keys requires an exact match), so that addresses like "123 MAIN ST APT 1, LAS VEGAS, NV, 89032" and "123 MAIN ST UNIT 1, LAS VEGAS, NV 89032" could possibly return a partial (maybe) match.

Thanks to both of you for the feedback.

Fred
 
I was just using a few characters to keep the example short; gluing together ALLTRIMed values can always make differing records look the same, no matter what you concatenate. The nature of addresses makes it very unlikely that you get a match of non-matching addresses, though.

As far as I remember, I proposed sorting the address data character by character as one more indicator of same addresses; that, for example, removes transposition typos as a difference.
More conventional would be using SOUNDEX() and/or computing the Levenshtein distance. And you can use STRTRAN(address, "UNIT", "APT") to normalize some terms. Put several of these ideas together and you get more matches, partial matches, or near matches. You could also try feeding the addresses into a maps service and seeing whether you get the same latitude/longitude. Many ideas can find similar or same addresses; some take more time, some less.
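For the term normalization, a little sketch (the replacement pairs are just examples, not a complete standardization table, and NormalizeAddr is an invented name):

FUNCTION NormalizeAddr(tcAddr)
    LOCAL lcAddr
    lcAddr = UPPER(ALLTRIM(tcAddr))
    lcAddr = STRTRAN(lcAddr, "UNIT ", "APT ")    && treat UNIT like APT
    lcAddr = STRTRAN(lcAddr, "BOULEVARD", "BLVD")
    lcAddr = STRTRAN(lcAddr, "STREET", "ST")
    RETURN STRTRAN(lcAddr, " ", "")              && drop spaces, as in your key
ENDFUNC

SOUNDEX() is built in and gives a phonetic code for looser comparisons:

? SOUNDEX("SMITH") == SOUNDEX("SMYTHE")   && .T., both give "S530"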

Bye, Olaf.
 
Fred,

What you are asking for is called fuzzy matching. De-duping addresses in this way is a common requirement in many industries and businesses.

Before you can implement it, you have to define your criteria. It is up to you to establish the rules, for example, that "Unit 1" is the same as "Apt 1", or that "Blvd" is the same as "Boulevard", or that two people with different surnames living at the same address are, or are not, to be considered the same.

This is not a SQL problem. It's not even a programming problem. It's a business issue. There's no point in thinking about keys or writing SELECT statements until you know the business rules.

And I'm still not clear what your original question was.

Mike

__________________________________
Mike Lewis (Edinburgh, Scotland)

Visual FoxPro articles, tips and downloads
 
Mike said:
And I'm still not clear what your original question was.
Fred said:
I wanted to get everyone's opinion of the method that I use.

To sum up my opinion: I'd rather use checksums for faster exact matching, and then, to make sure a match is real, finally compare records with the same checksum field by field.
I have given some ideas for near matches and for matches that take typos into account.
Mostly I'd use those to bundle likely matches for review. The review can then be partly programmatic: where all the individual fields compare equal, you can be sure it's a real match. What remains is then subject to review by users.
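The programmatic part of such a review could look roughly like this, comparing two candidate records of TMP field by field (the function name and parameters are invented; assumes no General fields):

FUNCTION RecordsMatch(tnRec1, tnRec2)
    LOCAL lnField, luVal1, luVal2
    FOR lnField = 1 TO FCOUNT("TMP")
        GO tnRec1 IN TMP
        luVal1 = EVALUATE("TMP." + FIELD(lnField, "TMP"))
        GO tnRec2 IN TMP
        luVal2 = EVALUATE("TMP." + FIELD(lnField, "TMP"))
        IF !(luVal1 == luVal2)
            RETURN .F.    && first differing field disproves the match
        ENDIF
    ENDFOR
    RETURN .T.
ENDFUNC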

Fred said:
I have been told that using the "GROUP BY KEY" could be returning too many records
In regard to that, I explained what can go wrong with such an ALLTRIMed concatenation key. However unlikely, it would mean putting differing addresses into the same group, therefore getting too few addresses rather than too many.

SELECT DISTINCT * FROM table would already suffice for an exact-match deduplication. By adding grouping on the key you create from the concatenation, you gain a count, but then DISTINCT is no longer the main factor: the groups define how many records your result will have.

Obvious extreme cases:
If you group by a constant, you get one address only.
If you group by city, you get one address per city only.
If you group by a unique value (e.g. a primary key ID, if your records have one, or the record number), you get every record.

All this would be the case despite combining the grouping with DISTINCT.

Both too many and too few results can be the case, depending on what's defining the groups.

Your scenario creates a key which you group by, and that key is very unlikely to be the same for differing records, so your grouping is in sync with DISTINCT. In other words, if you are confident your key generation never produces the same key for different records, you can even drop the DISTINCT keyword. ENGINEBEHAVIOR 70 will pick the first record of each group of records sharing a key value, and COUNT(*) adds the count of records.
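If you ever want to avoid the legacy engine setting, a two-pass sketch like this keeps all fields plus the count under the strict VFP 9 engine (cursor names are only examples, untested):

* Pass 1: duplicate counts per key - valid, since only KEY and an
* aggregate appear in the SELECT list.
SELECT KEY, COUNT(*) AS nDupes FROM TMP ;
    GROUP BY KEY INTO CURSOR Counts

* Pass 2: keep the first record of each key group the xBase way,
* walking the KEY tag, then attach the counts.
SELECT * FROM TMP WHERE .F. INTO CURSOR DeDuped READWRITE
SELECT TMP
SET ORDER TO KEY
LOCAL lcLastKey
lcLastKey = CHR(0)
SCAN
    IF !(TMP.KEY == lcLastKey)
        lcLastKey = TMP.KEY
        SCATTER MEMVAR
        INSERT INTO DeDuped FROM MEMVAR
    ENDIF
ENDSCAN

SELECT DeDuped.*, Counts.nDupes FROM DeDuped ;
    JOIN Counts ON DeDuped.KEY == Counts.KEY ;
    INTO CURSOR Result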

Bye, Olaf.
 
Just to add a couple of thoughts. We've also got to know:

- How diffuse are the addresses? For example, are most of them likely to be in the same one or two zipcodes? Or spread all over the world? Or somewhere between those extremes?

- How much manual review are you prepared to do?

For example, if you expect the addresses to be very widespread (the entire US or the world), you might consider looking for dupes on zipcode only, and then manually reviewing them for actual duplicates. That wouldn't be practical if the addresses were more clustered; in that case, you might take the same approach by doing an initial de-dupe on, say, the first word of the address.
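A coarse first pass for the widespread case might look like this (a sketch using Fred's field names; the cursor name is just illustrative):

SELECT SUBSTR(zipcode, 1, 5) AS zip5, COUNT(*) AS nRecs ;
    FROM TMP ;
    GROUP BY 1 HAVING COUNT(*) > 1 ;
    INTO CURSOR ZipClusters

Each cluster would then go to manual review.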

Hence the need for more information about your circumstances.

Mike

__________________________________
Mike Lewis (Edinburgh, Scotland)

Visual FoxPro articles, tips and downloads
 
Thoughts on de-duping addresses in general:

It's a black art. Accept that you'll have a low success rate.

I used to work for a BIG DATA company that could get up in the 90-ish percent rate of address matching for de-duping. How? Boatloads of data. Our main computer center had FOURTEEN ACRES of raised floor (with two mirror facilities flung around the country). It was row after row of DEC Alphas constantly churning through 700+ terabytes of live/recent data.

We were actually more accurate than the USPS de-duping service, which could reach an 80-ish percent success rate. They didn't have our data or our data centers or our experience. They had a LOT of resources, but not what we had.

No matter what algorithm you use, I suspect your success rate will be under 30%. It takes more resources to get higher than that.

Will you catch that "8th & I, SE" is the same as "8th & I S.E."? Strtran() will let you catch that one, but will you also match "Eighth and I Streets, Southeast"? Strtran() won't get that one. (By the way, that's ONE address for ONE place in the United States, but it's a real address and the general public actually uses those and dozens of other ways to represent it.)

What that leads me to is: why ask us? YOU know your requirements and whether or not your efforts so far will suffice. Not every organization needs a complete de-dupe. Not every organization can AFFORD a complete de-dupe.

If what you have so far works for you, well HIGH FIVE fella! Good on you! Let's celebrate!
 
Dan said:
Will you catch that "8th & I, SE" is the same as "8th & I S.E."? Strtran() will let you catch that one, but will you also match "Eighth and I Streets, Southeast"?

Great choice of example, Dan. Lately, I've seen an increasing number of people refer to that street as "Eye" rather than "I," which makes the problem even harder.

Just to add to the trouble with de-duping, as I said to a prospective client this morning, how do you know whether "John Smith," "Jack Smith" and "Jackie Smith" all at the same address is one person, two people or three people?

Fred, you may find the article I wrote on this topic useful. You'll have to scroll down past the first question & answer to find this one.

Tamar
 