Tek-Tips is the largest IT community on the Internet today!

Members share and learn making Tek-Tips Forums the best source of peer-reviewed technical information on the Internet!

  • Congratulations SkipVought on being selected by the Tek-Tips community for having the most helpful posts in the forums last week. Way to Go!

Very challenging delete SQL statement 2

Status
Not open for further replies.

JCAD1

Programmer
Apr 17, 2003
18
0
0
GB
We have a table into which data is loaded from an external source. The problem is that it contains duplicate, triplicate, etc primary key values, of which we want to delete invalid records and leave just one valid record, i.e with a unique primary key (id). We cannot do this while the data is being loaded because we may end up removing valid records.

Now this is the trick - we determine the records we want to retain by looking at four other fields (disdate, nhsnum, nhsstat and diag), i.e. where there are records with the same id and:

1) disdate is null or disdate is not null, keep record where disdate is not null.
2) nhsnum is null or nhsnum is not null, keep record where nhsnum is not null.
3) nhsstat = 02 or nhsstat = 03 or nhsstat = 06, keep record where nhsstat = 06.
4) diag codes, keep record where the length of diag is the greatest.

How do we write a query to delete duplicate, triplicate, etc records and leave just one valid record?

NB: We are using MS SQL 2000.

Thank you in advance for your help.

JCAD1


 
Something like:

Code:
delete from blah
where disdate is null
and exists(select * from blah X where X.id=blah.id and disdate is not null)

delete from blah
where nhsnum is null
and exists(select * from blah X where X.id=blah.id and nhsnum is not null)

delete from blah
where nhsstat in ('02', '03' )
and exists(select( * from blah X where X.id=blah.id and nhsstat = '06' ))

delete from blah
where len(diag) <> (select max(len(diag)) from blah X where X.id=blah.id )

Notes:
- 3rd query will leave duplicates for which 02 and 03 exist but 06 doesn't
- in 4th query: if diag can be NULL use isnull(len(diag), 0) instead of len(diag)

And yes, backup data :)
 
here's the code I came up with (a second way of doing the same thing as above) except in the 3rd query it works with all values, not just '02','03', and '06'. Also, it adds one more query (#5?) on the end which will resolve any additional duplicates which may have still existed as a result of all other criteria being non-unique.

Code:
We have a table into which data is loaded from an external source. The problem is that it contains duplicate, triplicate, etc primary key values, 
of which we want to delete invalid records and leave just one valid record, i.e with a unique primary key (id). We cannot do this while the data 
is being loaded because we may end up removing valid records.

Now this is the trick - we determine the records we want to retain by looking at four other fields (disdate, nhsnum, nhsstat and diag), i.e. where there are records with the same id and:

1) disdate is null or disdate is not null, keep record where disdate is not null.
2) nhsnum is null or nhsnum is not null, keep record where nhsnum is not null.
3) nhsstat = 02 or nhsstat = 03 or nhsstat = 06, keep record where nhsstat = 06.
4) diag codes, keep record where the length of diag is the greatest.

How do we write a query to delete duplicate, triplicate, etc records and leave just one valid record?

NB: We are using MS SQL 2000.

Thank you in advance for your help.

JCAD1
 
 
 
 -- REMOVE NULL disdate IF NON-NULL disdate EXISTS
select distinct primkeyfield
	into #Temp1
	from MainTable
	where disdate is not null
	group by primkeyfield
delete MainTable
	from MainTable join #Temp1 on MainTable.primkeyfield = #Temp1.primkeyfield
	where disdate is null
drop table #Temp1
 -- REMOVE NULL nhsnum IF NON-NULL nhsnum EXISTS
select distinct primkeyfield
	into #Temp2
	from MainTable
	where nhsnum is not null
	group by primkeyfield
delete MainTable
	from MainTable join #Temp2 on MainTable.primkeyfield = #Temp2.primkeyfield
	where nhsnum is null
drop table #Temp2
-- REMOVE NON-MAX nhsstat
select primkeyfield, max(isnull(nhsstat,'')) as maxnhsstat
	into #Temp3
	from MainTable
	group by primkeyfield
delete MainTable
	from MainTable join #Temp3 on MainTable.primkeyfield = #Temp3.primkeyfield
	where isnull(MainTable.nhsstat,'') <> #Temp3.maxnhsstat
drop table #Temp3
-- REMOVE NON-LONGEST diag
select primkeyfield, max(len(rtrim(isnull(diag,'')))) as diaglength
	into #Temp4
	from MainTable
	group by primkeyfield
delete MainTable
	from MainTable join #Temp4 on MainTable.primkeyfield = #Temp4.primkeyfield
	where len(rtrim(isnull(diag,''))) < #Temp4.diaglength
drop table #Temp4
-- REMOVE ANY OTHER DUPLICATES IF ALL OTHER CRITERIA HAS TIED
select primkeyfield, max(identitycolumn) as uniquekey
	into #Temp5
	from MainTable
	group by primkeyfield
delete MainTable
	from MainTable join #Temp5 on MainTable.primkeyfield = #Temp5.primkeyfield
	where MainTable.identitycolumn <> #Temp5.uniquekey
drop table #Temp5
 
Vongrunt & Skuhlman

I have tested both solutions you have given and they work brilliantly. Thank you so much for your help.

JCAD1
 
Having applied the suggested solution by vongrunt to remove duplicate primary key values (which has worked fine), we still have some more duplicate records in the table. This time, we have two types of duplicate records:

(i) In addition to the primary key, most of the other fields have duplicates except two or three which are non-duplicates. There is nothing we can do about these duplicates because they are data quality issues which need to be sorted out at the source.

(ii) In addition to the primary key, all the other fields have duplicates as well. We want to keep one record and delete the duplicates.

We have tried using the following code but to no avail:
Code:
 -- First select only unique records
select distinct *
into #acOPDup
from mytable 

-- Empty the original table
truncate table mytable

-- Repopulate it
insert into mytable
select *
from #acOPDup

-- Delete the temporary table
drop table #acOPDup

How do we remove the second category [see (ii) above] of duplicates and retain just one record?

Thanks for your help

JCAD1

 
Status
Not open for further replies.

Part and Inventory Search

Sponsor

Back
Top