Tek-Tips is the largest IT community on the Internet today!

Members share and learn making Tek-Tips Forums the best source of peer-reviewed technical information on the Internet!

  • Congratulations SkipVought on being selected by the Tek-Tips community for having the most helpful posts in the forums last week. Way to Go!

Deduping Sql Table while protecting client data

Status
Not open for further replies.

EBOUGHEY

Programmer
Aug 20, 2002
143
0
0
US
Below is the code I am currently using to dedupe my data. While this works fine, I am not able to protect my client data and update the query_id on the data I want (purchased data) because SQL has to use the uniqueid field. My data comes in at different times. Purchased lists come in days before the client data, but since they have been imported first, they have the min(uniqueid) instead of the client data.

I have a field called 'filecode' which separates client from purchased data.

Is there any way other than creating a new uniqueid field based on an index that sorts by filecode to achieve the goal I need? My only reason for not wanting another field is for space reasons. Most files are over 200,000+ records



Update [ClientData_temp]
set Query_ID = 'DUPES'
from [ClientData_temp]a
where uniqueid not in(select MIN(uniqueid) from [ClientData_temp] B WHERE UPPER(B.LASTNAME) = UPPER(a.LASTNAME)
AND UPPER(B.ADDRESS1) = UPPER(a.ADDRESS1)
AND B.ZIP = a.ZIP
AND B.NTRNLKYCD = A.NTRNLKYCD
AND QUERY_ID IS NULL)


Thanks in advance,

Elena
 
Anybody have any info on this? I have asked several 'guru' types in my corporation that have not been able to come with a solution.

Elena
 
I think you might need to give an example of your data here. Are you saying that you have duplicate query_ids and you just want to keep the one where filecode = 'client'?

--James
 
No duplicate query_ids. Duplicate Names in the file. We have to keep the original data intact so I just code the query_id field with the word "DUPES" when it encounters a duplicate record. As you can see below, it is finding the Purchased data first due to the uniqueid field that is used within the update query and coding the Client Data as a duplicate record.


Unique ID: 58888888
Name: John Smith
Address: 123 Main St
CSZ: Anywhere MA 55555
Filecode: Client Data

Unique ID: 18888888
Name: John Smith
Address: 123 Main St
CSZ: Anywhere MA 55555
Filecode: Purchased Data


 
Couple more questions:

1) Is there always a "client data" record for a particular client? Could there be more than one?
2) Is there always a "purchased data" record for a particular client? Could there be more than one?
3) Will the uniqueid ever be duplicated?

--James
 

I can't figure out what you want to do, I guess that's the issue for other people who is trying to help you, the best way you describe your question is post the sample data and the output you want.
 
Answers:

1) There could be duplicates within the client data that would have to be coded

2) Sometimes we only have client data and nothing purchased

3) Uniqueid will never be duplicated...
 
So there's always at least one "client data" record (possibly more). That makes it easier, try this:

Code:
UPDATE clientdata_temp
SET query_id = 'DUPES'
FROM clientdata_temp t1
WHERE uniqueid <> (
		SELECT MIN(uniqueid)
		FROM clientdata_temp
		WHERE filecode = 'client data'
			AND lastname = t1.lastname
			AND address1 = t1.address1
			AND zip = t1.zip
	)

--James
 


BEFORE DEDUPE:



UNIQID FCODE QUERYID LAST ADDRESS ZIP

83378 P NULL ABBOTT XXX CABOT 05647
365237 DMS NULL Abbott XXX CABOT 05647


AFTER DEDUPE: (HOW IT LOOKS NOW):


UNIQID FCODE QUERYID LAST ADDRESS ZIP

83378 P NULL ABBOTT XXX CABOT 05647
365237 DMS DUPES Abbott XXX CABOT 05647



AFTER DEDUPE: (HOW IT SHOULD LOOK: NOTICE QUERYID FIELD)



UNIQID FCODE QUERYID LAST ADDRESS ZIP

83378 P DUPES ABBOTT XXX CABOT 05647
365237 DMS NULL Abbott XXX CABOT 05647





 
UPDATE clientdata_temp
SET query_id = 'DUPES'
FROM clientdata_temp t1
WHERE uniqueid <> (
SELECT MIN(uniqueid)
FROM clientdata_temp
WHERE filecode = 'client data'
AND lastname = t1.lastname
AND address1 = t1.address1
AND zip = t1.zip
)

This code does not address the Purchased data though. It would only pull duplicates from the client data alone. It also still looks at the uniqueid field to make the decision of which data is being classified as a duplicate record which means if purchased data has a lower uniqueid that the client data, it would protect the purchased data.

Elena
 
Can you just add the following linke to the end of the above query?

Code:
...
  OR filecode = 'purchased'

According to your description you will always want to mark purchased data as duplicates.

--James
 


try following code,

Code:
update clientdata set queryid = 'dupes' 
from clientdata t0 inner join  
(
select LAST, ADDRESS, ZIP 
from clientdata
group by LAST, ADDRESS, ZIP
having count(distinct FCODE) > 1
) t1
on  t0.LAST=t1.LAST 
 and t0.ADDRESS=t1.ADDRESS
 and t0.ZIP=t1.ZIP
 and t0.FCODE = 'P'
 
Status
Not open for further replies.

Part and Inventory Search

Sponsor

Back
Top