Tek-Tips is the largest IT community on the Internet today!

Members share and learn making Tek-Tips Forums the best source of peer-reviewed technical information on the Internet!

  • Congratulations IamaSherpa on being selected by the Tek-Tips community for having the most helpful posts in the forums last week. Way to Go!

Query to remove 'almost' duplicates?

Status
Not open for further replies.
Sep 17, 2008
2
US
Hi. I am creating a report from a really messy database. There has been a lot of entry error resulting in duplicate records. I have been able to remove the exact duplicate records, but now I'm not sure how to get rid of the records that are off by a character or two due to mistyping (example: Mellisa Smith v. Meillisa Smith v. Melissa Smith). I don't need to remove them from the database - I just need them to not show up in the report.

Is there a way to do this?

Thank you in advance for your help,
Staci
 
Hi,
Sorry to say this but you probably can't - even Soundex would fail with that kind of problem.

The Database admin ( hopefully not you) might be able to help but it would be a difficult task unless some other field could be used to eliminate multiple entries ( even then, how to know which is corect?)



[profile]

To Paraphrase:"The Help you get is proportional to the Help you give.."
 
You could use Left and Right to find identical start and end text, say four characters of both start and end but the length is different, say. But it will be imperfect - I'd suggest using Crystal to find probably duplicates and then have a human sort which are real. (E.g. Robert McTown and Roberta Town.)

[yinyang] Madawc Williams (East Anglia, UK). Using Windows XP & Crystal 10 [yinyang]
 
PS, if you have date of birth, that would be a useful cross-check to identify different people with different but similar names. (Or even the same name - how many cases of John Smith?)

[yinyang] Madawc Williams (East Anglia, UK). Using Windows XP & Crystal 10 [yinyang]
 
Can you give a couple of examples of duplicate data (As with the 'Mellisa Smith' example).

Also as mentioned can you let us know if there are any other fields related to these name fields that may serve as a useful way to identify duplicates.

There are ways of performing proability matches but to enable these to be accurate we would need asmuch info as possible.

'J

CR8.5 / CRXI - Discovering the impossible
 
The data is pretty much just contact info - name, title, company, address, city, state, zip, phone, etc. An example I was talking about before could be:

Melissa Smith
123 Main St.
Columbus, OH 12345

Meilsa Smith
123 Main St.
Columbus, OH 12345

I can't weed them out by address because I may have 20 different contacts at the same address.

Then it gets more complicated because I could have something like this where not only is the name entered in different twice, but the address or city may be spelled/entered differently in each record:

Sam Jones
4567 Springside Dr.
Bldg. 2
Phoenix, AZ 56789

Samantha Jones
4567 Springside Dr.
Building #2
Pheonix, AZ 56789
 
Is the address field a single long field with multiple lines or 1 field per line of the address?


If one line per part of address then add a couple of formulas:

//{@Initials}
({table.name}[1]) + mid({table.name},instr({table.name},' ')+1,1)

//{@Site#}
val(table.address1}

Add groups as follows:

Intital thoughts would be to group the results in the following order:

GH1 - Zip
GH2 - {@Site#}
GH3 - {@Initials}
Details - Display info here
GF3
GH2
GH1

This will display all results with potential matches in order.

To limit this to only showing possible close matches perhaps add a conditional suppression on the details section to limit to only those similar based on a match of zip + address + initials.

(Initials are less likely to be mis-spelt however I would also recommend going through the listing again afterwards for matches based on zip + addr and then zip only.

'J

CR8.5 / CRXI - Discovering the impossible
 
Status
Not open for further replies.

Part and Inventory Search

Sponsor

Back
Top