Tek-Tips is the largest IT community on the Internet today!

Members share and learn making Tek-Tips Forums the best source of peer-reviewed technical information on the Internet!

  • Congratulations strongm on being selected by the Tek-Tips community for having the most helpful posts in the forums last week. Way to Go!

Delete duplicates, how does this code work? 2

Status
Not open for further replies.

micang

Technical User
Aug 9, 2006
626
US
Hi All,

I came across this code to delete duplicate rows from a table in SQL 2000. I have tested it and it seems to do the trick, but I don't understand how it does it.

Basically there is a table with a unique field, in this case [EmployeeID], then we have numerous other columns that we want to check for duplicates and then delete the entire rows/s.

Code:
Delete table_name
where [EmployeeID] not in
     (select min([EmployeeID]) from table_name
       group by column1, column2, etc);

Can anyone explain how it does it? As I mentioned I have tested it on small sample data, but now I would like to implement it on live data (circa 30 million rows)

Thanks in advance.
 
Hi there,

if you would run the inner select statement
Code:
SELECT MIN([employeeID]) FROM table_name
       GROUP BY column1, column2

on let's say the folowing table

Code:
[b]employeeID [tab]column1 [tab]column2[/b]
1          [tab]John    [tab]Smith
2          [tab]Marc    [tab]Jones
3          [tab]John    [tab]Smith

The GROUP BY section would cause the following grouping to occur (internally):

Code:
[b]group 1[/b]
[b]employeeID [tab]column1 [tab]column2[/b]
1          [tab]John    [tab]Smith
3          [tab]John    [tab]Smith

[b]group 2[/b]
[b]employeeID [tab]column1 [tab]column2[/b]
2          [tab]Marc    [tab]Jones

then the MIN(EmployeeID) would return (per group):

Code:
[b]group 1[/b]
[b]employeeID [tab]column1 [tab]column2[/b]
1          [tab]John    [tab]Smith
3          [tab]John    [tab]Smith
[b]=> MIN(employeeID) = 1[/b]

[b]group 2[/b]
[b]employeeID [tab]column1 [tab]column2[/b]
2          [tab]Marc    [tab]Jones
[b]=> MIN(employeeID) = 2[/b]

so the result of the statement would be
Code:
employeeID
1
2

Now:

DELETE table_name
WHERE [EmployeeID] NOT IN
     (SELECT MIN([EmployeeID]) FROM table_name
       GROUP BY column1, column2);

is actually:

DELETE table_name
WHERE [EmployeeID] NOT IN (1,2);

which logically result in deleting the row with employeeID 3.

Regards,
Johpje
 
Thanks so much johpje for the thorough explanation, I now see exactly what it does.

I am really suprised how such a simple command gets rid of duplicates. Having looked around I thought it would be way more involved.

Thanks again

micanguk
 
Status
Not open for further replies.

Part and Inventory Search

Sponsor

Back
Top