Tek-Tips is the largest IT community on the Internet today!

Members share and learn making Tek-Tips Forums the best source of peer-reviewed technical information on the Internet!

  • Congratulations SkipVought on being selected by the Tek-Tips community for having the most helpful posts in the forums last week. Way to Go!

Delete dupes with a twist! 2

Status
Not open for further replies.

TJRTech

Programmer
Apr 8, 2002
411
0
0
US
We are generating data mining lookup table that has (redacted) the following integer columns:

COL_A COL_B
----- ------
100 5
100 6
200 10
200 11

The above data shows the "dupes", in that we really want COL_A values to appear only once (unique) and we would like the MAX associated COL_B value, resulting in:

COL_A COL_B
----- ------
100 6
200 11

Now, the right way to do this is of course, to make COL_A a unique PK, but this is a lookup table that is generated by a bulk insert/select from and our operational systems and the operational systems are generating dupe pairs (we found 6 records or 3 pairs in 1.3M records mined records).

Given the large number of records (could be tens of millions), we need a quick way to avoid putting in the dupes, or to delete ONE of the dupe records once the insert is done.

I tried lots of things, and the BEST thing I could come up with is to populate a temp table with the "lower" of the dup pairs (select from two instances of the same table where COL_A are identical and COL_B<>COL_B, group by COL_A, MIN(COL_B) and insert result into temp table). Then, I use the temp table in the DELETE clause and lastly remove the temp table.

It works, but is there a better way?

Extra points for a solution that works with tripples, quads (three or more records with same COL_A value and different COL_B values). Again, in this scenario, we only want the LAST (greatest) COL_B value.

Oh, I also thought it would be great if you could do a bulk insert into a table with a PK, and not fail the whole thing just because a handful of PK violations are generated...maybe a first-in-wins, or last-in-wins declaritive. I am dreaming, of course, right???

Thanks,
TR
 
Delete MyTable
from MyTable M1 inner join
(select Col_A, Max(Col_B) as MaxB from MyTable group by Col_A having count(*)>1) as M2 on M1.Col_A=M2.Col_A where M1.Col_B<M2.MaxB
Then there's the problem of multiple Col_B's that = Max(Col_B) or isn't that possible in you dataset? You need an identity column to make that easy too.
-Karl
 
Since I aliased MyTable as M1, the Delete MyTable might have to be Delete M1.
-Karl
 
Try this I ran it as a test so it creates the table and records to test the delete.

Code:
Create Table deldupeTest (
	[ColumnA] int,
	[ColumnB] int
)

Insert Into deldupeTest (ColumnA,Columnb) Values (100,5)
Insert Into deldupeTest (ColumnA,Columnb) Values (100,6)
Insert Into deldupeTest (ColumnA,Columnb) Values (200,10)
Insert Into deldupeTest (ColumnA,Columnb) Values (200,11)
Insert Into deldupeTest (ColumnA,Columnb) Values (300,12)
Insert Into deldupeTest (ColumnA,Columnb) Values (300,1)
Insert Into deldupeTest (ColumnA,Columnb) Values (300,5)
Insert Into deldupeTest (ColumnA,Columnb) Values (300,30)

--The Piece your intrested in
Delete
From delDupeTest 
Where Columnb != (Select MAX(ColumnB) From delDupeTest d2 where d2.columnA = delDupeTest.ColumnA)

Select * from DelDupeTest

Drop Table deldupeTest

The return set is
Code:
ColumnA     ColumnB     
----------- ----------- 
100         6
200         11
300         30

As you can see it works with any number of records. The only thing it will not handle is a true duplicate of both A and B values.

If your extracting from a source system why not load these to a stage table and then clean them up prior to inserting into your main table. You could use distincts on your insert to filter out A B dupes. Infact from a stage table you could use insert updates that would filter this our while inserting.

&quot;Shoot Me! Shoot Me NOW!!!&quot;
- Daffy Duck
 
Thanks MDXer,

I was searching for the answer to a similar problem, and your solution worked for me.

So a thank you star to you.


Calen
 
Thanks...duh...a subselect that returns the max, and a delete where not the max. Makes perfect sense now.

I wonder which will be faster. The method with the temp table or the subselect? I like the subselect method as it removes dupes, trips, etc.

Thanks all.

TR
 
Oh, MDXer...as for your questions.

I am loading these into a stage table and cleaning (via the dedupe method) prior to using the stage as part of a lookup and cross-reference that joins a couple of other tables. I am inserting into this table with an INSERT/SELECT. From a real-world standpoint, Column A is a unique PK, but I think we have a bug in one of our operational apps because we get "dupes" when mining the data.

Thanks for all the help.

TR
 
Status
Not open for further replies.

Part and Inventory Search

Sponsor

Back
Top