Something like Select Distinct on multiple keys

dean12 · Nov 19, 2007

I've got a mental block on this one - maybe you can help.

I need to populate Table ABC from Table DEF. There are multiple columns but I need to ensure that I don't get duplicates based on a smaller group of those columns.

Assume both tables have columns A, B, C, D, E, F
and both tables are identical in their design.

If I want a list in Table ABC where there are no duplicate values in column C, then something like this would be close:

insert into ABC
select * from DEF where C not in (select c from ABC)

But that doesn't prohibit the possibility of more than one record from table DEF with the same value of column C. I can't do a SELECT DISTINCT because I can't populate all the columns when I do that.

It appears to get worse when its a multiple key thing like no duplicates on columns C & D. Then this would again be close:

insert into ABC
select * from DEF where C+D not in (select c+D from ABC)

Same problem as before plus this works only on certain data types for column C & D. If my C & D columns are mixed data types or say INT, then I guess a bunch of casts are needed.

Anyone have any good suggestions on how to do this efficently? Appreciate the help.

TheSQL · Nov 19, 2007

Let's say that A, B, and C are the key columns....

Code:

INSERT INTO ABC
SELECT * FROM DEF
LEFT OUTER JOIN ABC ON DEF.A = ABC.A AND DEF.B = ABC.B AND DEF.C = ABC.C
WHERE ABC.A IS NULL

ABC.A will only be null where the keys don't match (ex. where the record isn't already inserted).

Hope this helps.

mrdenny · Nov 19, 2007

Use the NOT EXISTS operator.

Code:

insert into ABC
SELECT * FROM DEF
WHERE NOT EXISTS (SELECT * FROM ABC WHERE ABC.A = DEF.A
     and ABC.B = DEF.B
     ...)

Denny
MCSA (2003) / MCDBA (SQL 2000)
MCTS (SQL 2005 / Microsoft Windows SharePoint Services 3.0: Configuration / Microsoft Office SharePoint Server 2007: Configuration)
MCITP Database Administrator (SQL 2005) / Database Developer (SQL 2005)

My Blog

ESquared · Nov 19, 2007

... which creates a left anti-semi-join, in case anyone was wondering ...

the anti-semi-join can be superior to other joins, it depends on the data and how many reads it has to perform to determine that there isn't an existing row. Sometimes a left join is better as TheSQL showed. For a semi-join, an inner join can sometimes be better, sometimes not.

dean12 · Nov 20, 2007

Pardon if I'm wrong but I believe these solutions still allow duplicates. I believe the WHERE clause applies one time not for each record. Thus if table ABC is empty and we have the following in DEF:

Column: A B C
Row 1 XXX 1 2
Row 2 YYY 1 2

Then the insert into table ABC will receive 2 rows. What I need is just 1 row. In this example, I don't really care which one it is but I need just 1 row.

SQLSister · Nov 20, 2007

Do you have an identity field (or natural PK) in your table? If so
then this might do it. If you don't you could alawys throw the records into a table variable or temp table that has an identity field

Code:

select t1.* from table1 t1 join
(select min(idfield) from table1 where colc is not null group by colc) t2 on t1.idfiled = t2.idfield

If you need to select uniqueness among several columns you would simply add them to the derived table in the the group by

"NOTHING is more important in a database than integrity." ESquared

dean12 · Nov 20, 2007

I think I'm just gonna use cursors.

ESquared · Nov 20, 2007

Don't go the cursors route just because you haven't found a solution yet.

People here CAN help you. But the problem is that you haven't described what you are trying to accomplish well enough for anyone to figure it out.

Give us some sample input data. Give us the desired resultset based on that sample data. I promise you that it will only take seconds for one of us here to come up with how to get that resultset from that data.

And please provide primary key information and column data types.

You need to invest some time really explaining the problem. Then we will be motivated to spend time helping you solve it.

dean12 · Nov 20, 2007

I wasn't taking a shot. Its just that I think this is a tough one. Let's try again.

I have:

A table with Contacts (persons)
A table of Opportunities (chances to sell things)
A table of Activities (things we have done)

All tables have a Marketing Project ID. The tables have the following columns:

Contacts: Contact ID Mkt ID
Opportunities: Opp ID Contact ID Mkt ID
Activities: Act ID Opp ID Contact ID Mkt ID

I want to build an Encounter table that draws data from all three tables listed above. The Encounter table looks like this:

Encounters: Act ID Opp ID Contact ID Mkt ID

There are some other fields on the Encounters table like an IDENTITY field for a record key and such but they are not important to this discussion.

I do not want any duplications of Contact ID + Mkt ID. Here's some source data:

Act ID Opp ID Contact ID Mkt ID
------ ------ ---------- ------
Activities: 1 1 1 1
2 1 1 1
3 2 5 1
Opportunities: 1 1 1
7 2 8
Contacts: 5 1
5 6

Here's what I want to end up with in the Encounters table:

Came From Act ID Opp ID Contact ID Mkt ID
--------- ------ ------ ---------- ------
Activities: 1 1 1 1
3 2 5 1
Opportunities: Null 7 2 8
Contacts: Null Null 5 6

We did not carry over one Activity because it's Contact+Mkt was duplicated.

We did not carry over one Opportunity because it's Contact+Mkt duplicated an Activity.

We did not carry over one Contact because it duplicated an Activity.

Hope that's clear and thanks for taking time to look at this issue. I know you guys have more important things to accomplish and I appreciate your time and effort.

SQLSister · Nov 20, 2007

It seems to me that what I posted should get you started in the right direction. However, I would probably do this in three steps. First grab the records from activities and put them in the table using a query simliar to what I posted.

Then grab the records from opportunities using a simliar query but with a left join to the encounters table where encounterid (or whatever you call the idfield)is null.

Then the third query would pull any remaining records from the last table still using that left join so that you don't get any records you already have.

The reason why I would do it in three steps is that the fields you are interested in joining on are differnt for each table and I think it will be easier to understand this way than to try to get an overly complex query that accounts for everything.

Here's a rough cut (very much untested)on how I see the first two steps, I think you can figure the third step from the second step.

Code:

insert into Encounters (ActID, OppId, ContactID, MktID)
select T1.ActID,t1.OppId, t1.ContactID, t1.MktID from activities t1 join
(select min(ActID) from activities where  group by OppID, ContactID, MktID) t2 on t1.ActID = t2.ActID

Insert into Encounters (t1.OppId, t1.ContactID, t1.MktID)
select t1.OppId, t1.ContactID, t1.MktID from opportunities t1 join
(select min(OppID) from opportuninties where  group by  ContactID, MktID) t2 on t1.oppID = t2.OppID
left join Encounters  e on e.ContactID = t1.ContactID and e.MktID = t1.MktId
where e.EncounterID is null

"NOTHING is more important in a database than integrity." ESquared

Tek-Tips is the largest IT community on the Internet today!

Members share and learn making Tek-Tips Forums the best source of peer-reviewed technical information on the Internet!

Something like Select Distinct on multiple keys

dean12

MIS

TheSQL

Programmer

mrdenny

Programmer

ESquared

Programmer

dean12

MIS

SQLSister

Programmer

dean12

MIS

ESquared

Programmer

dean12

MIS

SQLSister

Programmer

Similar threads

Part and Inventory Search

Sponsor