Tek-Tips is the largest IT community on the Internet today!

Members share and learn making Tek-Tips Forums the best source of peer-reviewed technical information on the Internet!

  • Congratulations Mike Lewis on being selected by the Tek-Tips community for having the most helpful posts in the forums last week. Way to Go!

Removing Cross duplicate from a data set

Status
Not open for further replies.

jrsansom

Programmer
Apr 16, 2003
19
GB
Afternoon,

I am struggling with what I thought should be an easy problem and would appreciate some assistance.

I have a data set, an example is outlined below, that I want to contain only distinct relationships, so to speak, between two columns (source,duplicate).

Code:
source | duplicate
1	2
1	3
1	4
2	1
2	3
2	4
3	1
3	2
3	4
4	1
4	2
4	3

So using the above example, I would only want the column pair 1,2 to be included and for 2,1 to be omitted. The same principal shopuld be applied to all other potential pairs.

I hope this makes sense, as you can tell I am not the best in the world at explaining things :)

I would be gratefult for any guidance you could provide or if you require clarification please do ask.

Many Thanks
 
First of all, what do you want to do this for?

second, this might get you started:

Code:
--setting up test data, you don't need this
declare @test table (source varchar(2), duplicate varchar(2))

insert into @test 
select '1', '2'
union all select '2', '1'
union all select '3', '1'
union all select '5', '3'
union all select '3', '5'
union all select '15', '3'
union all select '8', '10'
union all select '3', '15'
union all select '10', '8'


--this part you need
--just replace @test with your table name
--note that it is set to take value with minimum source
--when there is duplication

select source, duplicate
from @test 
where (source + duplicate) not in
(
select (duplicate + source) 
from @test
)
or
source in
(
select min(source) from @test
group by (convert(int, source) + convert(int, duplicate))
)

If you look at where this fails (i'm sure there are conditions where it will), you will be able to add to the where clause to narrow it down better.

Also, this assumes varchar columns, if you have an integer field you will need to switch around the converts (convert to varchar in first part of where clause).

Good Luck,

Alex

Ignorance of certain subjects is a great part of wisdom
 
Hi Alex,

Thank you for your reply.

The problem submitted is a very small part of much larger project, the aim of which is to basically consolidate a variety of data sources and perform de-duplication between them.

The value, pairs detailed in the problem represent what are known as duplicate group keys (a duplicate group is an integer value that identifies a set of records that duplicate on some already defined key).

So the table presented in the example shows where existing duplicate groups match to other duplicate groups, on some other key :)

The aim at this point then is for duplicate groups that match on some key, to be combined into one group, thereby deduping the entire data set.

Hope I have not put anyone to sleep.

Thanks again.
 
Ah, that makes (some) sense. Sounds like a miserable tremendously exciting project ;-)

let me know if that query works for you (or what you end up adding to make it work)

Good luck,

Alex

Ignorance of certain subjects is a great part of wisdom
 
I agree with Alex. This does sound like an interesting project. I have come up with a potential solution to this problem. It may not work for you, but it does appear to work on the test data you provided (assuming I understand the problem properly).

Code:
Select A.Source, A.Duplicate
From   [!]TableName[/!] A
       left Join [!]TableName[/!] B
         On  A.Source <> B.Source
         And A.Duplicate <> B.Duplicate
         And A.Source = B.Duplicate
         And A.Duplicate = B.Source
         And A.Source > B.Source
Where  B.Source Is NULL

This may not be the most efficient solution to the problem because the join conditions use <> and a > for the joins. I hope this helps.

-George

Strong and bitter words indicate a weak cause. - Fortune cookie wisdom
 
Excellent gmmastros!

I knew a left outer join and testing for null was a solution as this was the first avenue I explored but I ended up joining every record to every other record becuase I did not excluded matches where ID's where the same, as in the code you provided.

I made a few changes to cater for exact duplicate records also.

Thanks for the help guys.

Code:
Select distinct A.sourceDuplicateGroup, A.matchDuplicateGroup
From   companyKeyGroupUpdateTable A
       left Join companyKeyGroupUpdateTable B
         On  A.sourceDuplicateGroup <> B.sourceDuplicateGroup
         And A.matchDuplicateGroup <> B.matchDuplicateGroup
         And A.sourceDuplicateGroup = B.matchDuplicateGroup
         And A.matchDuplicateGroup = B.sourceDuplicateGroup
         And A.sourceDuplicateGroup > B.sourceDuplicateGroup
Where  B.sourceDuplicateGroup Is NULL
order by A.sourceDuplicateGroup, A.matchDuplicateGroup
 
I noticed this from my previous post, but thought I would let it go until after you posted, indicating success or failure. Technically, you can leave this in, but it doesn't do anything for you, either.

Code:
Select distinct A.sourceDuplicateGroup, A.matchDuplicateGroup
From   companyKeyGroupUpdateTable A
       left Join companyKeyGroupUpdateTable B
         On  [s][!]A.sourceDuplicateGroup <> B.sourceDuplicateGroup[/!][/s]
         And A.matchDuplicateGroup <> B.matchDuplicateGroup
         And A.sourceDuplicateGroup = B.matchDuplicateGroup
         And A.matchDuplicateGroup = B.sourceDuplicateGroup
         And [!]A.sourceDuplicateGroup > B.sourceDuplicateGroup[/!]
Where  B.sourceDuplicateGroup Is NULL
order by A.sourceDuplicateGroup, A.matchDuplicateGroup

Basically, we have the same condition, so the first is unnecessary. I'm glad this has helped.

-George

Strong and bitter words indicate a weak cause. - Fortune cookie wisdom
 
George - That is what I was trying to do, but I left out this:

Code:
And A.Source > B.Source

Glad I was just overlooking something that should've been obvious, and not barking up the wrong tree entirely. As always, you the man!

Alex

Ignorance of certain subjects is a great part of wisdom
 
Status
Not open for further replies.

Part and Inventory Search

Sponsor

Back
Top