Tek-Tips is the largest IT community on the Internet today!

Members share and learn making Tek-Tips Forums the best source of peer-reviewed technical information on the Internet!

  • Congratulations TouchToneTommy on being selected by the Tek-Tips community for having the most helpful posts in the forums last week. Way to Go!

Removing duplicates from large amount of records 2

Status
Not open for further replies.

Wullie

Programmer
Mar 17, 2001
3,674
GB
Hi all,

Been quite a few years since I've had to deal with duplicates in MySQL and have a large database that I need to remove duplicate records from, so looking for the least resource intensive/easiest way to do it.

I've got the following setup:

id, postcode, street_address, postal_town

Now, I'm looking to remove any duplicates where all 3 of the above (other than id) are duplicated in more than one row, so for example:

1,EH4 2BS,example terrace,edinburgh
2,EH4 2BS,example terrace,edinburgh
3,EH52 3SD,example street,broxburn
4,EH52 3SW,example road,broxburn
5,EH4 2BS,example terrace,edinburgh

In the above, rows 1 and 2 and 5 are duplicates and I want to remove all but one of those.

Things I need to take into account:

1) Street names can exist in different towns so I cannot just use something as simple as pulling out unique street names or unique postcodes, it needs to check if all 3 fields are identical before removing anything.
2) This table has just under 30 million records.
3) Any number of records could be duplicated and there could be any number of duplicates on each.

Thanks in advance,

Wullie

The pessimist complains about the wind. The optimist expects it to change. The leader adjusts the sails. - John Maxwell

Story Choices - Your Story, Your Way
 
The postcode IS THE unique item in your criteria though, EH4 2BS only applies to ONE street (or part of a street) in ONE town. Two postcodes may point to one street in one town, but they cannot point to one street name in different towns.

So all your delete routine has to copy ONE row (GROUP BY postcode) to a temporary table then delete the original and rename the temporary table.
So:
create a temp table, duplicate free with:

SQL:
CREATE TABLE temptable AS SELECT * FROM currtable WHERE 1 GROUP BY postcode;


Delete the original (with duplicates)
SQL:
DROP TABLE currtable;

rename the temp table to the same name as the former duplicated one

SQL:
RENAME TABLE temptable TO currtable;





Chris.

Indifference will be the downfall of mankind, but who cares?
Time flies like an arrow, however, fruit flies like a banana.
Webmaster Forum
 
Hi Chris,

Thanks for your response. Good to see some of the usual members are still here and posting on a regular basis. :)

I originally went the same route as you but then was told that a postcode can technically within the rules cover more than one street in some unusual instances, so cannot use just the postcode as I need to also take into account whether or not the other details are also the same before removing any duplicates.

Obviously with a very small dataset I could do this easily without worrying about the query, however in this case I'm looking for the least resource consuming way to do it with 30 million records.

Thanks in advance,

Wullie

The pessimist complains about the wind. The optimist expects it to change. The leader adjusts the sails. - John Maxwell

Story Choices - Your Story, Your Way
 
Try with

SQL:
CREATE TABLE temptable AS SELECT DISTINCT postcode, street, town FROM currtable WHERE 1;

that should remove duplicates where postcode, street and town are identical

To go further than that will need a recursive procedure to test the row criteria against the already selected records

Chris.

Indifference will be the downfall of mankind, but who cares?
Time flies like an arrow, however, fruit flies like a banana.
Webmaster Forum
 
surely you can just create a compound unique index? mysql will automatically remove duplicates

Code:
alter ignore table mytable
add unique index myCompoundIndex (
 postCode, address, city
)
 
Status
Not open for further replies.

Part and Inventory Search

Sponsor

Back
Top