Deduping SQL Table while protecting client data

EBOUGHEY · May 4, 2005

Below is the code I am currently using to dedupe my data. It works, but I cannot protect my client data and flag the purchased data instead, because the query decides which record to keep by the uniqueid field. My data comes in at different times: purchased lists arrive days before the client data, and because they are imported first, they have the MIN(uniqueid) instead of the client data.

I have a field called 'filecode' which separates client from purchased data.

Is there any way to achieve this other than creating a new uniqueid field based on an index that sorts by filecode? My only reason for not wanting another field is space: most files have more than 200,000 records.

UPDATE [ClientData_temp]
SET Query_ID = 'DUPES'
FROM [ClientData_temp] a
WHERE uniqueid NOT IN (
    SELECT MIN(uniqueid)
    FROM [ClientData_temp] B
    WHERE UPPER(B.LASTNAME) = UPPER(a.LASTNAME)
      AND UPPER(B.ADDRESS1) = UPPER(a.ADDRESS1)
      AND B.ZIP = a.ZIP
      AND B.NTRNLKYCD = a.NTRNLKYCD
      AND QUERY_ID IS NULL
)

Thanks in advance,

Elena

EBOUGHEY · May 4, 2005

Does anybody have any info on this? I have asked several 'guru' types in my corporation, and none have been able to come up with a solution.

Elena

JamesLean · May 5, 2005

I think you might need to give an example of your data here. Are you saying that you have duplicate query_ids and you just want to keep the one where filecode = 'client'?

--James

EBOUGHEY · May 5, 2005

No duplicate query_ids. Duplicate Names in the file. We have to keep the original data intact so I just code the query_id field with the word "DUPES" when it encounters a duplicate record. As you can see below, it is finding the Purchased data first due to the uniqueid field that is used within the update query and coding the Client Data as a duplicate record.

Unique ID: 58888888
Name: John Smith
Address: 123 Main St
CSZ: Anywhere MA 55555
Filecode: Client Data

Unique ID: 18888888
Name: John Smith
Address: 123 Main St
CSZ: Anywhere MA 55555
Filecode: Purchased Data

JamesLean · May 5, 2005

Couple more questions:

1) Is there always a "client data" record for a particular client? Could there be more than one?
2) Is there always a "purchased data" record for a particular client? Could there be more than one?
3) Will the uniqueid ever be duplicated?

--James

maswien · May 5, 2005

I can't figure out what you want to do, and I guess that's the issue for the other people who are trying to help you. The best way to describe your question is to post the sample data and the output you want.

EBOUGHEY · May 5, 2005

Answers:

1) There could be duplicates within the client data that would have to be coded

2) Sometimes we only have client data and nothing purchased

3) Uniqueid will never be duplicated...

JamesLean · May 5, 2005

So there's always at least one "client data" record (possibly more). That makes it easier. Try this:

Code:

UPDATE clientdata_temp
SET query_id = 'DUPES'
FROM clientdata_temp t1
WHERE uniqueid <> (
		SELECT MIN(uniqueid)
		FROM clientdata_temp
		WHERE filecode = 'client data'
			AND lastname = t1.lastname
			AND address1 = t1.address1
			AND zip = t1.zip
	)

--James

EBOUGHEY · May 5, 2005

BEFORE DEDUPE:

UNIQID  FCODE  QUERYID  LAST    ADDRESS    ZIP
83378   P      NULL     ABBOTT  XXX CABOT  05647
365237  DMS    NULL     Abbott  XXX CABOT  05647

AFTER DEDUPE: (HOW IT LOOKS NOW):

UNIQID  FCODE  QUERYID  LAST    ADDRESS    ZIP
83378   P      NULL     ABBOTT  XXX CABOT  05647
365237  DMS    DUPES    Abbott  XXX CABOT  05647

AFTER DEDUPE: (HOW IT SHOULD LOOK: NOTICE QUERYID FIELD)

UNIQID  FCODE  QUERYID  LAST    ADDRESS    ZIP
83378   P      DUPES    ABBOTT  XXX CABOT  05647
365237  DMS    NULL     Abbott  XXX CABOT  05647

EBOUGHEY · May 5, 2005

UPDATE clientdata_temp
SET query_id = 'DUPES'
FROM clientdata_temp t1
WHERE uniqueid <> (
SELECT MIN(uniqueid)
FROM clientdata_temp
WHERE filecode = 'client data'
AND lastname = t1.lastname
AND address1 = t1.address1
AND zip = t1.zip
)

This code does not address the purchased data, though. It would only pull duplicates from within the client data. It also still looks at the uniqueid field to decide which record is classified as the duplicate, which means that if the purchased data has a lower uniqueid than the client data, it would protect the purchased data.

Elena

JamesLean · May 5, 2005

Can you just add the following line to the end of the above query?

Code:

...
  OR filecode = 'purchased'

According to your description you will always want to mark purchased data as duplicates.

--James

maswien · May 5, 2005

Try the following code:

Code:

update clientdata set queryid = 'dupes' 
from clientdata t0 inner join  
(
select LAST, ADDRESS, ZIP 
from clientdata
group by LAST, ADDRESS, ZIP
having count(distinct FCODE) > 1
) t1
on  t0.LAST=t1.LAST 
 and t0.ADDRESS=t1.ADDRESS
 and t0.ZIP=t1.ZIP
 and t0.FCODE = 'P'
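
For reference, here is a minimal sketch of one more way to get the "how it should look" result without adding a column: instead of MIN(uniqueid), pick the record to keep in each duplicate group by ordering on filecode first and uniqueid second. It assumes that filecode = 'P' marks purchased rows (as in the sample above) and that every other filecode is client data; both are assumptions, not something confirmed in the thread. Table and column names are taken from the original query.

```sql
-- Sketch only: keeps one record per duplicate group, preferring client data.
-- Assumption: filecode = 'P' means purchased; any other value means client.
UPDATE a
SET Query_ID = 'DUPES'
FROM [ClientData_temp] a
WHERE a.Query_ID IS NULL
  AND a.uniqueid <> (
      -- The "keeper" of the group: client rows sort ahead of purchased rows,
      -- and ties within the same kind are broken by the lowest uniqueid.
      SELECT TOP 1 b.uniqueid
      FROM [ClientData_temp] b
      WHERE UPPER(b.LASTNAME) = UPPER(a.LASTNAME)
        AND UPPER(b.ADDRESS1) = UPPER(a.ADDRESS1)
        AND b.ZIP = a.ZIP
        AND b.NTRNLKYCD = a.NTRNLKYCD
        AND b.Query_ID IS NULL
      ORDER BY CASE WHEN b.filecode = 'P' THEN 1 ELSE 0 END,
               b.uniqueid
  )
```

With the sample rows above, the keeper for the ABBOTT group would be 365237 (DMS), so 83378 (P) is the one coded as DUPES. A group that contains only purchased rows still keeps its lowest uniqueid. It is worth running the subquery as a plain SELECT against a copy of the table before applying the update.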
