Tek-Tips is the largest IT community on the Internet today!

Members share and learn making Tek-Tips Forums the best source of peer-reviewed technical information on the Internet!

  • Congratulations Mike Lewis on being selected by the Tek-Tips community for having the most helpful posts in the forums last week. Way to Go!

Merge duplicate records based on certain criteria 1

Status
Not open for further replies.

RobHat

Technical User
Nov 30, 2007
91
GB
Hi, not posted in a while. Could be seen as a good thing I suppose. :)

I am hoping someone can help me with an SQL query I am trying to write to be able to merge duplicate records. Firstly I would just like to add that the duplicates are not of my doing and that I have only recently took control of the database.

I have a Table called person which holds all the demographics of clients. However I have multiple clients that have 2,3 or in some cases 4 duplicates. All have different info and very few have an ID unique to them. I have managed to pull a list of people with duplicate records based on Forename, Surname and DOB. An example of a duplicated record is:

ClientID, Forename, Surname, DOB, Address1, Address2, Postcode, etc
1, John, Smith, 01/01/1990, 1 Test Street, Test Area, AB24CD
2, John, Smith, 01/01/1990, 2 Test Street, Test Area, EF56GH
3, John, Smith, 01/01/1990, 1 Test Street, , AB24CD

So the ID is different as it auto increments on insert, Names are the same as are DOB, but address's could be incomplete, duplicate or different.
I also have an appointments table which has the appointment info (ClientID, Date, etc) The ClientID is the Key between the 2 tables. In my head i think the query needs to work in the following:

Find Duplicates, Look for the record with most appointments associated with it and use it as the primary record, Merge all other data into the primary record (where address is the same fine, where address is missing fill the gap, where address is different check which has the most recent appointment and use that one) and then update the ClientID field in the appointment table, from any appointments with the old ID change to the primary ID.

Does that make sense. Sorry for the long winded post but I cant explain it any simpler really. I know in my head what I need it to do but have no idea about writing a query this complex. Any help or pointers would be appreciated.

Thanks Rob.
 
If there is much variance in the address data, I might consider that the newest address would be the most accurate. I assume the id is entered chronologically and there fore might be the more accurate.

You could also combine the appointments so that you get all appointments for each client.

Just a thought.

Simi


 
Thanks for the reply.

Unfortunatly the most recently entered record is not necessarily the most upto date. The duplicates have not come from a front end app but rather an import from other databases. meaning that the client record with the latest appointment would therefore be the most upto date as the address would be checked each time.

What do you mean combine the appointments?

Rob
 
In you example John Smith would have appointments for ClientID 1,2,3.

Once you determine which client Id you are going to go with. I would update the other 2 appointment Id's to the new one.

So if you had appointment id's for 1,2,3 and you choose 2 to be the correct client id then I would update the appointment ID's for 1 and 3 to be 2. So you would have all the appointment.
You might want to add a new column and put it there. So you have the original and new id.

Simi

 
I see what you mean yes, thanks. Any ideas on the actual query itself to do this?
 
Post some more test data from each... Might be able to help.

Simi
 
Thanks here is some more test data:

Person Table:

ClientID Forename Surname DOB CustomerNo Address1 Address2 Postcode
------------------------------------------------------------------------------------------------------------------------------------------
677, John, Smith, 1967-05-05, NULL, 1 Test1 Road, Test1, AB12 3CD
893, Jo, Bloggs, 1984-05-14, NULL, 2 Test2 Road, Test2, MN4 5OP
1565, Minnie, Mouse, 1975-05-08, 123456, 3 Test3 Road, Test3, UV8 9WX
4011, John, Smith, 1967-05-05, NULL, 1 Test1 Road, NULL, AB12 3CD
7085, Minnie, Mouse, 1975-05-08, 123456, 3 Test3 Road, Test3, UV8 9WX
12679, John, Smith, 1967-05-05, NULL, 1a Test1a Drive, Test1a, EF12 3GH
17459, John, Smith, 1967-05-05, 789101, 1b Test1b Close, Test1b, IJ12 3KL
18309, Minnie, Mouse, 1975-05-08, 123456, NULL, NULL, NULL
19400, Jo, Bloggs, 1984-05-14, 456789, 2a Test2a Road, Test2a, QR6 7ST


Appointment Table

Appointment_ID ClientID Appointment_Date
-------------------- -------------------- -----------------------
459, 677, 2007-06-19
2846, 1565, 2005-04-11
3118, 677, 2005-07-20
3214, 893, 2005-08-22
8401, 1565, 2003-04-01
15029, 18309, 1997-01-01
15165, 19400, 1996-10-30
16367, 12679, 1999-04-09
19718, 12679, 2005-07-29
20720, 17459, 2007-01-15
 
Hi guys, this is still an issue any ideas.

Rob
 
Rob,

Let's try building this in stages. The way I see it, the first thing to do is to determine which rows to keep and which rows to delete. I gotta warn you, this is gonna get ugly.

Based on the sample data you provided, we should probably end up with 3 clients, Jo Bloggs, John Smith, and Minnie Mouse. My thoughts are to join this data to the Appointments table and get the latest appointment. This will help us determine what to keep and what to delete.

So, Jo Bloggs has 2 appointments, 2005-08-22 and 1996-10-30. Since 2005-08-22 is newer, this is the one we should keep, right? In the Person table, Jo Bloggs has a clientid of 893 and another with 19400. Since the 893 (in the appointments table) has a newer date, this is the one we should keep, and 19400 is the one we should delete.

So, step 1 is to determine what to keep and what to delete.

Can you run the following query and let me know if it works?

Code:
Select Forename, 
       Surname, 
       DOB, 
       Appointment.Appointment_Date, 
       Appointment.ClientID, 
       ROW_NUMBER() Over (Partition By Forename, Surname, DOb                
            Order BY Appointment_Date DESC) As RowId
From   @Person As Person
       Inner Join @Appointment As Appointment
         On Person.ClientID = Appointment.ClientID

When you run this query, the last column will be RowId. If I am correct, any person with RowId = 1 should be kept and and person with a rowid > 1 should be deleted.

This code doesn't do any actual deleting or data modifications, it exists solely to determine if the first part of the logic is correct.

Let me know what you find.

-George
Microsoft SQL Server MVP
My Blogs
SQLCop
"The great things about standards is that there are so many to choose from." - Fortune Cookie Wisdom
 
Thanks for the quick response. I have been trying this out but I dont think I am getting the desired result. My original post with the table details in was a basic mock up to explain the structure. However it was a little too basic. The actual structure is as follows (I have created a replica and scripted it):

CREATE TABLE [dbo].[Person](
[Person_ID] [bigint] IDENTITY(1,1) NOT NULL,
[Forename] [varchar](50) NULL,
[Surname] [varchar](50) NULL,
[DOB] [datetime] NULL,
[Address_Line_1] [varchar](50) NULL,
[Address_Line_2] [varchar](50) NULL,
[Address_Line_3] [varchar](50) NULL,
[Address_Line_4] [varchar](50) NULL,
[Postcode] [varchar](50) NULL,
[Tel] [varchar](50) NULL,
[varchar](50) NULL,

CREATE TABLE [dbo].[Contract](
[Contract_ID] [bigint] IDENTITY(1,1) NOT NULL,
[Person_ID] [bigint] NULL,
[DateStarted] [datetime] NULL,

CREATE TABLE [dbo].[Appointments](
[Appointment_ID] [bigint] IDENTITY(1,1) NOT NULL,
[Contract_ID] [bigint] NULL,
[Activity_ID] [bigint] NULL,
[AppointmentType] [varchar](50) NULL,

CREATE TABLE [dbo].[Activity](
[Activity_ID] [bigint] IDENTITY(1,1) NOT NULL,
[ActivityType_ID] [bigint] NULL,
[Date] [datetime] NULL,

As you can see the person table is the one with duplicates, a person can have multiple contracts, each with various appointments, and each appointment can have various activities.

I dont think this will cause a massive problem with clearing the duplicates just needs joining in the query properly. Based on your previous post I tried running the following:

SELECT Person.Forename, Person.Surname, Person.DOB, Activity.Date, Contract.Person_ID,
ROW_NUMBER() OVER (Partition BY Person.Forename, Person.Surname, Person.DOB ORDER BY Activity.Date DESC) AS RowID
FROM Person
INNER JOIN Contract ON Person.Person_ID = Contract.Person_ID
INNER JOIN Appointments ON Contract.Contract_ID = Appointments.Contract_ID
INNER JOIN Activity ON Appointments.Activity_ID = Activity.Activity_ID
ORDER BY RowID ASC

After running this I got a lot of results. I think it must have been showing a result for every activity. So although a client that was not a duplicate but we have had a lot of activity with showed up many times.

Maybe I have done something wrong in my translation from your sample and my test.

Sorry to be a pain. I do appreciate you spending time on this.
 
Just to be clear... the activity with the most recent date is the one you want to keep, right?

Based on your table structure, it looks like each person can have multiple contracts. Each contract can have multiple appointments, and each appointment can have multiple activities. Am I understanding this correctly?

-George
Microsoft SQL Server MVP
My Blogs
SQLCop
"The great things about standards is that there are so many to choose from." - Fortune Cookie Wisdom
 
Yep perfectly, I obviously did'nt explain it as badly as I thought :)
 
Please try this:

Code:
;With RecentActivity As
(
Select Contract.PersonId, MAX(Activity.Date) As MostRecentActivity
FROM   Contract
       INNER JOIN Appointments ON Contract.Contract_ID = Appointments.Contract_ID
       INNER JOIN Activity ON Appointments.Activity_ID = Activity.Activity_ID
)
Select Forename, Surname, DOB, RecentActivity.MostRecentActivity, RecentActivity.PersonId,
       ROW_NUMBER() Over  (Partition BY Person.Forename, Person.Surname, Person.DOB ORDER BY RecentActivity.MostRecentActivity DESC) AS RowID
From   Person
       Inner Join RecentActivity
         On Person.PersonId = RecentActivity.PersonId


-George
Microsoft SQL Server MVP
My Blogs
SQLCop
"The great things about standards is that there are so many to choose from." - Fortune Cookie Wisdom
 
Hi again,
I ran this but got the error:

Column 'Contract.PersonId' is invalid in the select list because it is not contained in either an aggregate function or the GROUP BY clause.

I added a group by clause so it looked like:

;With RecentActivity As
(Select Contract.PersonId, MAX(Activity.Date) As MostRecentActivity
FROM Contract
INNER JOIN Appointments ON Contract.Contract_ID = Appointments.Contract_ID
INNER JOIN Activity ON Appointments.Activity_ID = Activity.Activity_ID)
GROUP BY Contract.PersonId

After running this query it seemed to run through fine. I have list of people, still with duplicates but the first of the 2 duplicates will have the most recent appointment associated with it and a number 1 at the end. The duplicate following it will have the older appointment and the number 2 in the RowID column like this:

FN SN DOB Date ID ROWID
John Smith 1990-01-01 2010-01-01 12345 1
John Smith 1990-01-01 2009-01-01 1234 2

If there is a 3rd duplicate then it does a 3rd row as follows:

FN SN DOB Date ID ROWID
John Smith 1990-01-01 2010-01-01 12345 1
John Smith 1990-01-01 2009-01-01 1234 2
John Smith 1990-01-01 2008-01-01 123 3

So I believe this is what we were expecting to see?
 
Sorry about the missing group by. It's kinda hard to write queries that work when you don't have tables to test against.

So I believe this is what we were expecting to see?

Yeah. The idea is, first identify what to keep and what to delete. So... for each FN, SN, DOB combination, there should be a ROWID with a one. If there are dups, then there will be other rows with 2, 3, 4, etc... in the RowId column.

Now try running this query:

Code:
;With RecentActivity As
(
Select Contract.PersonId, MAX(Activity.Date) As MostRecentActivity
FROM   Contract
       INNER JOIN Appointments ON Contract.Contract_ID = Appointments.Contract_ID
       INNER JOIN Activity ON Appointments.Activity_ID = Activity.Activity_ID
 GROUP BY Contract.PersonId
), Dupes As
(
Select Forename, Surname, DOB, RecentActivity.MostRecentActivity, RecentActivity.PersonId,
       ROW_NUMBER() Over  (Partition BY Person.Forename, Person.Surname, Person.DOB ORDER BY RecentActivity.MostRecentActivity DESC) AS RowID
From   Person
       Inner Join RecentActivity
         On Person.PersonId = RecentActivity.PersonId
)
Select A.ForeName, A.SurName, A.DOB, A.PersonId As PersonIdToKeep, B.PersonId As PersonIdToDelete
From   Dupes As A
       Inner Join Dupes As B
          On A.ForeName = B.ForeName
          And A.SurName = B.SurName
          And A.DOB = B.DOB
          And A.RowId = 1
          And B.RowId > 1

I think it will show you the ID's to Keep and the ID's to delete. When you run this query, you should spot check the data to make sure it's returning the info you expect it to.



-George
Microsoft SQL Server MVP
My Blogs
SQLCop
"The great things about standards is that there are so many to choose from." - Fortune Cookie Wisdom
 
No worries, I am impressed so far. I have been staring at these tables trying to figure out the best method of doing this for ages. My next step was to go manual he he. Now my challenge is understanding what each piece is doing lol :)

Looks fine to me. Has returned all duplicates and put the correct id in the keep column etc. Where people have more than 2 entries there is an additional row entered to show the extra id which needs deleting.

Thanks again for your time with this.
 
Sorry to be a pain, any thoughts on the merge and delete.

Thanks again for your help with this GMMastros.
 
Hi
Sorry I'm back on this again. I have been looking into where to go from here and have a few ideas. Just wanted to run them by you for feedback?

So from the previous result could I:

Copy the previous resultset in a temp table then:

loop through the results using CURSOR and

if for example postcode in the record to keep is Null:
and the telephone number in the record to delete is not Null:
Then insert todelete.TelNo into the tokeep.TelNo

I know this is a really badly written example but before I start spending ages writing the query I wondered if I was at least barking up the right tree??

Any advice or ideas welcome. Thanks again
 
After looking a bit more. Do I need to use the CURSOR function or would a WHILE loop be ok?
 
Do I need to use the CURSOR function or would a WHILE loop be ok?

Six of one, half dozen of the other. I usually use while loops instead of cursors because I can never seem to remember the syntax for a cursor. Some people will say that cursors are horrible for performance (and they are right). What most people don't realize is that WHILE loops are no better than cursors for performance. In your situation, it doesn't really matter because the intent is to clean the data once, and then throw away the code. It doesn't matter if this process takes 1 second to run or one hour. Once it's done, it's done.

if for example postcode in the record to keep is Null:
and the telephone number in the record to delete is not Null:
Then insert todelete.TelNo into the tokeep.TelNo

I still wouldn't use a cursor for this because (to me), a set based method is easier to understand. Looking at the previous query, we have the ID to keep and the ID to delete on the same row, so we can join back to the table two more times and use the coalesce and NULLIF functions to return the correct data.

NULLIF is a function that returns NULL if the first parameter matches the second parameter. We can use this function to return NULL for is in case of empty data.

Coalesce is a function that will return the first parameter that is not null.

So...

Suppose we have a table like this.

[tt]People
PersonId Telephone PostCode
-------- --------- --------
1 12345 abc123
2 xyz789
3 456789 NULL
[/tt]

Note that the telephone data for PersonId = 2 is empty string, not NULL.

Code:
Select  Coalesce(NullIf(ToKeep.Telephone, '), NullIf(ToDelete.Telephone, ')) As Telephone,
        Coalesce(NullIf(ToKeep.PostCode, '), NullIf(ToDelete.PostCode, ')) As PostCode
From    People As ToKeep
        Inner Join People As ToDelete
          On ToKeep.PersonId = 3
          And ToDelete.PersonId = 3

You need to be careful about "related data". For example, an address. There is likely an Address1, Address2, City, State, and PostCode. If you treat each of this separately, you could end up getting a mish-mash of data between multiple rows.



-George
Microsoft SQL Server MVP
My Blogs
SQLCop
twitter
"The great things about standards is that there are so many to choose from." - Fortune Cookie Wisdom
 
Status
Not open for further replies.

Part and Inventory Search

Sponsor

Back
Top