Tek-Tips is the largest IT community on the Internet today!

Members share and learn making Tek-Tips Forums the best source of peer-reviewed technical information on the Internet!

  • Congratulations gkittelson on being selected by the Tek-Tips community for having the most helpful posts in the forums last week. Way to Go!

Deleting duplicates sits there "sending data" 1

Status
Not open for further replies.

SQLWilts

Programmer
Feb 12, 2001
651
GB
Hi,
My experience is all SQL server, so please excuse my complete ignorance.
I have "adapted" a SQL Server query to delete duplicates from a table. I have had to use a temp table as I find that I cannot use a self join (I can't read from the same table I am trying to delete from).
Anyway, I am running the query, and it just sits there indefinately (I killed it after 15 minutes - it took less than a second in SQL server). I am expecting it to delete approximately 400 rows from a table containing approx 17,000.
Anyway, your opinions as to what could be causing this to just run and run, any code adjustments and definately books on MySQL admin and SQL programming would be appreciated.
Code:
create temporary table tempActivity
  select * from Activity;

 

update Activity a

set a.Deleted = 1, a.UpdatedOn = CURRENT_TIMESTAMP

where a.Id = a.Id

  and a.CurrentVersion = 1

  and a.Deleted = 0

  and a.Id < (select max(ta.Id)

              from tempActivity AS ta

              where ifnull(ta.CustomerNumber,0) = ifnull(a.CustomerNumber,0) and ifnull(ta.CustomerTelephoneNumber,'') = ifnull(a.CustomerTelephoneNumber,'') and ta.OccuredOn = a.OccuredOn and ta.EditorId = a.EditorId and ta.CurrentVersion = 1 and ta.Deleted = 0);
 
I stared ten full minutes at this query and I really cannot see what it should do. What should it do?

As an extra, there are no indexes on Activity, so that makes the WHERE clauses really slow. If there is more than a little amount of data, add the indexes.

Why is there an "always true" in the WHERE clause? (a.Id=a.Id)


+++ Despite being wrong in every important aspect, that is a very good analogy +++
Hex (in Darwin's Watch)
 
Hi Don, and thanks for answering.
You are spot on about the indexes - I have added them - made a small difference, but not a vast amount.
The query sets a deleted flag to true - it is "removing" duplicates. We don't delete data from this table - we simply set a deleted flag.
As for the always true statement - I removed it.

**Update** I have created a couple of indexes and am now doing the update in around 9 seconds.
 
I stared for another 5 minutes and it seems that you are trying to "remove" duplicates, and want to keep only the last one. If you really want the last one, your query would be the way to go. If you just want to remove duplicates, you could do some extra tricks on your temp table. Your query has a problem of NULL values. This forces you to use functions instead of fields, which makes the query slow. But you can do this on the temp table:

Code:
UPDATE tempActivity SET CustomerNumber=0 WHERE CustomerNumber IS NULL;
UPDATE tempActivity SET CustomerTelephoneNumber='' WHERE CustomerTelephoneNumber IS NULL;
DELETE FROM tempActivity WHERE Deleted=1;

This should "fix" the null values and delete the rows that were actually marked as deleted from the temp table. Assuming you have set indexes on the temp table, this should perform quite fast. Even faster would be to build up the temp table from a query that would exclude the "Deleted=1" anyway. So instead of the above updates and delete, you might do:
Code:
create temporary table tempActivity
  select ifnull(CustomerNumber,0), ifnull(CustomerTelephoneNumber,''), OccuredOn, ... from Activity where Deleted=0;

Now here's a trick:
Code:
ALTER IGNORE TABLE tempActivity ADD UNIQUE INDEX(CustomerNumber, CustomerTelephoneNumber, OccuredOn, EditorId);

This needs some explanation. I add a uniqueness constraint here to physically remove duplicates from the temp table. The IGNORE keyword takes care of the duplicates and removes them instead of failing.

Now you should be able to mark anything in the live table deleted that is not deleted already and also not in the temp table. You probably should lock the live table for read or take race conditions into account in another way.

+++ Despite being wrong in every important aspect, that is a very good analogy +++
Hex (in Darwin's Watch)
 
Don,

Although what you have said will not achieve what I am trying to achieve (I think our posts crossed in cyberspace), it is still extremely useful and something that I know I will use to de-dupe another table in one of our MySQL DBs.

thanks loads
 
Status
Not open for further replies.

Part and Inventory Search

Sponsor

Back
Top