Delete dupes with a twist! 2

TJRTech · Apr 30, 2004

We are generating data mining lookup table that has (redacted) the following integer columns:

COL_A COL_B
----- ------
100 5
100 6
200 10
200 11

The above data shows the "dupes", in that we really want COL_A values to appear only once (unique) and we would like the MAX associated COL_B value, resulting in:

COL_A COL_B
----- ------
100 6
200 11

Now, the right way to do this is of course, to make COL_A a unique PK, but this is a lookup table that is generated by a bulk insert/select from and our operational systems and the operational systems are generating dupe pairs (we found 6 records or 3 pairs in 1.3M records mined records).

Given the large number of records (could be tens of millions), we need a quick way to avoid putting in the dupes, or to delete ONE of the dupe records once the insert is done.

I tried lots of things, and the BEST thing I could come up with is to populate a temp table with the "lower" of the dup pairs (select from two instances of the same table where COL_A are identical and COL_B<>COL_B, group by COL_A, MIN(COL_B) and insert result into temp table). Then, I use the temp table in the DELETE clause and lastly remove the temp table.

It works, but is there a better way?

Extra points for a solution that works with tripples, quads (three or more records with same COL_A value and different COL_B values). Again, in this scenario, we only want the LAST (greatest) COL_B value.

Oh, I also thought it would be great if you could do a bulk insert into a table with a PK, and not fail the whole thing just because a handful of PK violations are generated...maybe a first-in-wins, or last-in-wins declaritive. I am dreaming, of course, right???

Thanks,
TR

donutman · Apr 30, 2004

Delete MyTable
from MyTable M1 inner join
(select Col_A, Max(Col_B) as MaxB from MyTable group by Col_A having count(*)>1) as M2 on M1.Col_A=M2.Col_A where M1.Col_B<M2.MaxB
Then there's the problem of multiple Col_B's that = Max(Col_B) or isn't that possible in you dataset? You need an identity column to make that easy too.
-Karl

donutman · Apr 30, 2004

Since I aliased MyTable as M1, the Delete MyTable might have to be Delete M1.
-Karl

MDXer · Apr 30, 2004

Try this I ran it as a test so it creates the table and records to test the delete.

Code:

Create Table deldupeTest (
	[ColumnA] int,
	[ColumnB] int
)

Insert Into deldupeTest (ColumnA,Columnb) Values (100,5)
Insert Into deldupeTest (ColumnA,Columnb) Values (100,6)
Insert Into deldupeTest (ColumnA,Columnb) Values (200,10)
Insert Into deldupeTest (ColumnA,Columnb) Values (200,11)
Insert Into deldupeTest (ColumnA,Columnb) Values (300,12)
Insert Into deldupeTest (ColumnA,Columnb) Values (300,1)
Insert Into deldupeTest (ColumnA,Columnb) Values (300,5)
Insert Into deldupeTest (ColumnA,Columnb) Values (300,30)

--The Piece your intrested in
Delete
From delDupeTest 
Where Columnb != (Select MAX(ColumnB) From delDupeTest d2 where d2.columnA = delDupeTest.ColumnA)

Select * from DelDupeTest

Drop Table deldupeTest

The return set is

Code:

ColumnA     ColumnB     
----------- ----------- 
100         6
200         11
300         30

As you can see it works with any number of records. The only thing it will not handle is a true duplicate of both A and B values.

If your extracting from a source system why not load these to a stage table and then clean them up prior to inserting into your main table. You could use distincts on your insert to filter out A B dupes. Infact from a stage table you could use insert updates that would filter this our while inserting.

"Shoot Me! Shoot Me NOW!!!"
- Daffy Duck

Calen · Apr 30, 2004

Thanks MDXer,

I was searching for the answer to a similar problem, and your solution worked for me.

So a thank you star to you.

Calen

TJRTech · Apr 30, 2004

Thanks...duh...a subselect that returns the max, and a delete where not the max. Makes perfect sense now.

I wonder which will be faster. The method with the temp table or the subselect? I like the subselect method as it removes dupes, trips, etc.

Thanks all.

TR

TJRTech · Apr 30, 2004

Oh, MDXer...as for your questions.

I am loading these into a stage table and cleaning (via the dedupe method) prior to using the stage as part of a lookup and cross-reference that joins a couple of other tables. I am inserting into this table with an INSERT/SELECT. From a real-world standpoint, Column A is a unique PK, but I think we have a bug in one of our operational apps because we get "dupes" when mining the data.

Thanks for all the help.

TR

Tek-Tips is the largest IT community on the Internet today!

Members share and learn making Tek-Tips Forums the best source of peer-reviewed technical information on the Internet!

Delete dupes with a twist! 2

TJRTech

Programmer

donutman

Programmer

donutman

Programmer

MDXer

Technical User

Calen

Programmer

TJRTech

Programmer

TJRTech

Programmer

Similar threads

Part and Inventory Search

Sponsor