Eliminate Duplicate Entries (fairly complex)

ssampier · Sep 1, 2005

I have a fairly complex problem relating to duplicated customer data. Here is a brief overview of the type of information and how that relates to the problem at hand. I work at a small Internet service provider. We have several disparate systems that are not integrated or combined. When our salespeople take an order for Internet service, the orders goes to a “Wiring” system that provides customer name, address, username, etc to the technician that either connects DSL or setup dialup Internet. The billing system is also separate.

Our Access system is an attempt to consolidate all systems together. I perform monthly audits comparing each system to the information I have in Access and visa versa. Since the information I have is not uniquely identified, I have assigned the phone number as primary key.

I have inherited the system from my predecessor that did not use standard database normalization techniques. I have made an attempt to normalize in 5N, but stopped in frustration in 1N.

In any case, my billing information states that x customers are being billed for Internet. My access shows y customers are active and should be billed. X is smaller than y by several hundred. I am positive there is duplicate information in my system.

We use one main table that is related to several smaller tables containing all additional information such as email addresses, spam filters, etc. I would like to search this main table for the duplicate information. I have attempted to use the duplicate query wizard in the past without much success.

If the process is too complicated to post, point me in the right direction (book, website, etc). I will take it from there. Thank you for your time.

PCLewis · Sep 1, 2005

If you use the wizard on the one main table, you end up with something looking like:

Code:

SELECT MainTable.phoneNum, MainTable.OtherData1, MainTable.OtherData2, MainTable.OtherData3
FROM Assignment
WHERE (((MainTable.phoneNum) In (SELECT [phoneNum] FROM [MainTable] As Tmp GROUP BY [phoneNum] HAVING Count(*)>1 )))
ORDER BY MainTable.phoneNum;

Whihc works just fine. If you first go about identifying the duplicates in the main data, then take it from there. I would add another column called "Duplicate" as a yes/no, and in a form, allow someone to manually mark the duplicates. Alternatively, move them to another table in case you need to merge data later or something.

I can see the Find duplicates wizard getting tricky if you are trying to add joins in after the wizard gives you a result. To avoid that, just design a query that combines all fields of interest to you from all tables of interest and run the wizard on that query.

ssampier · Sep 6, 2005

I used the wizard. I specified first name and last name. I received 251 results. We have businesses that do not have a first name, but we list their business name under last name.

Is there anyway to include similar names as well? For instance, John Smith and John R. Smith.

Thanks for your help.

PCLewis · Sep 6, 2005

Rule of thumb:
Work out how you would do it "by hand" using a printout of both lists , a ruler and a pencil.

IF you come up with a rule like "Take the first word of the name and the last word of the name and compare those"
Then you may be OK until you strike two similar names that are in fact different:

"Fran Susan Smith" and "Fran James Smith" for example.

Names are terrible things to go on in my experience but like I said, if you come up with a rule that works then you can code it or add it to the SQL. The hard part is never the coding in my opinion

RustyAfro · Sep 7, 2005

My opinion is that you have to fix the problem instead of patching it.

You must create a unique ID for each entry. This may mean a lot of manual work, but once it is done, you can use the unique ID throughout the application.

ssampier · Sep 8, 2005

RustyAfro, I agree. We definitely need a unique primary key. I do not see how a unique (but arbitrary) number would assist me in this process, however.

In the future, I need to push the need for the customer number to appear on documentation. Ideally, our systems should be integrated so one system, whether billing or wiring, is updated, all other systems are updated, as well.

I work with great people, but they can be very technologically lazy. They'd rather do things manually than code a much quicker solution.

PClewis, you are probably correct that I could not code an easy fix. I may have to create two (or three or four) duplicate queries, compile the duplicates into one query, and then print out the extraneous entries.

Thank you for your help. Tek Tips has been invaluable.

RustyAfro · Sep 9, 2005

I now understand (I think) more of your problem. Would a Soundex solution help find some duplicates? I don't think it would but it may be worth looking into.

When I have had to match people in different datasets with no unique key, I have tried:

Full First Name
First (3) of the Last Name
First (3) of Address 1
Zip Code

It's by no means perfect, but I have found that it matches *most* people.

Now, to eliminate duplicates, I use a trick in access I learned a long time ago.

Make a new table and make the First Name, Zip Code, and the two columns that hold the first 3 of the last name and address 1 all primary keys. Include all your other fields that must be combined into one data set.

Then, append your records from different datasets into that table (with the first (3) query columns appending to the custom first (3) columns in the table). Access will warn that "x" amount of rows were not appended due to PK violations. Click "Yes" and it will append the ones that do not violate the rules (even better, set warning to false in a macro or VBA and no error pops up at all). This will leave you with a clean (hopefully) dataset with no duplicates. This works perfectly when you do have a unique key(s), but should work 99-100% if you have to use a setup like I am using above.

If you need to see which ones will trigger the error, the duplicate query wizard on those same fields will show you.

The two columns I chose for the first (3) digits may not be good for your databases, so choose two other columns that would better for your needs. Sometimes adjusting the (3) to (2) or (4) is better, test it out.

Hope some of this helps

ssampier · Sep 12, 2005

I think you have a better understanding. This is my first professional "database" job. I have taken classes in Access and SQL server, but this is different from class exercises. My supervisor is a nice guy, but is Access phobic, and does not quite understand data theory (I only understand the basics). Needless to say, I can feel overwhelmed at times.

What I ended up doing is create two duplicate queries. One query compares First and Last Name. The second query compares last name and street. For some reason our table features separate fields for North, South, East and West. I think I may create a third query that compares last name and the cardinal directions.

Thanks for all your help.

ssampier · Sep 24, 2005

I found part of the problem. Some of our customers have multiple "circuits", such as DSL or dialup, usually on a different phone number or physical location (street address); a one to many relationship.

We place the service information as a field in the main table. That probably should be broken off into a separate table. That will probably introduce other issues, but I move that information into its own table for now.

Tek-Tips is the largest IT community on the Internet today!

Members share and learn making Tek-Tips Forums the best source of peer-reviewed technical information on the Internet!

Eliminate Duplicate Entries (fairly complex)

ssampier

ISP

PCLewis

Programmer

ssampier

ISP

PCLewis

Programmer

RustyAfro

Programmer

ssampier

ISP

RustyAfro

Programmer

ssampier

ISP

ssampier

ISP

Similar threads

Part and Inventory Search

Sponsor