We are generating data mining lookup table that has (redacted) the following integer columns:
COL_A COL_B
----- ------
100 5
100 6
200 10
200 11
The above data shows the "dupes", in that we really want COL_A values to appear only once (unique) and we would like the MAX associated COL_B value, resulting in:
COL_A COL_B
----- ------
100 6
200 11
Now, the right way to do this is of course, to make COL_A a unique PK, but this is a lookup table that is generated by a bulk insert/select from and our operational systems and the operational systems are generating dupe pairs (we found 6 records or 3 pairs in 1.3M records mined records).
Given the large number of records (could be tens of millions), we need a quick way to avoid putting in the dupes, or to delete ONE of the dupe records once the insert is done.
I tried lots of things, and the BEST thing I could come up with is to populate a temp table with the "lower" of the dup pairs (select from two instances of the same table where COL_A are identical and COL_B<>COL_B, group by COL_A, MIN(COL_B) and insert result into temp table). Then, I use the temp table in the DELETE clause and lastly remove the temp table.
It works, but is there a better way?
Extra points for a solution that works with tripples, quads (three or more records with same COL_A value and different COL_B values). Again, in this scenario, we only want the LAST (greatest) COL_B value.
Oh, I also thought it would be great if you could do a bulk insert into a table with a PK, and not fail the whole thing just because a handful of PK violations are generated...maybe a first-in-wins, or last-in-wins declaritive. I am dreaming, of course, right???
Thanks,
TR
COL_A COL_B
----- ------
100 5
100 6
200 10
200 11
The above data shows the "dupes", in that we really want COL_A values to appear only once (unique) and we would like the MAX associated COL_B value, resulting in:
COL_A COL_B
----- ------
100 6
200 11
Now, the right way to do this is of course, to make COL_A a unique PK, but this is a lookup table that is generated by a bulk insert/select from and our operational systems and the operational systems are generating dupe pairs (we found 6 records or 3 pairs in 1.3M records mined records).
Given the large number of records (could be tens of millions), we need a quick way to avoid putting in the dupes, or to delete ONE of the dupe records once the insert is done.
I tried lots of things, and the BEST thing I could come up with is to populate a temp table with the "lower" of the dup pairs (select from two instances of the same table where COL_A are identical and COL_B<>COL_B, group by COL_A, MIN(COL_B) and insert result into temp table). Then, I use the temp table in the DELETE clause and lastly remove the temp table.
It works, but is there a better way?
Extra points for a solution that works with tripples, quads (three or more records with same COL_A value and different COL_B values). Again, in this scenario, we only want the LAST (greatest) COL_B value.
Oh, I also thought it would be great if you could do a bulk insert into a table with a PK, and not fail the whole thing just because a handful of PK violations are generated...maybe a first-in-wins, or last-in-wins declaritive. I am dreaming, of course, right???
Thanks,
TR