FASTER METHOD FOR FINDING DUPLICATE RECORDS?

inteleserve · May 15, 2003

SEE PROGRAM BELOW...

This program works great for finding duplicate records using state and phone fields out of the database. But when doing over 5 million records it is slow..... anybody have any ideas??? I know there is a distict command but not sure how i can use it because i need to know which records are duplicates.

close all
clear
use temp.dbf
index on phone tag phone
use temp.dbf order phone exclusive
set status off
set scoreboard off
set console off
set device to screen
STORE RECCOUNT() to Xtotal
STORE 0 TO nHowMany
DO WHILE !EOF()
nHowMany = nHowMany + 1
IF nHowMany % 1000 = 0 &&... every 1000 records
@12,20 say str((nHowMany / Xtotal)*100) + "% done deduping..."
ENDIF
store phone to p1
skip 1
store phone to p2
if p1=p2
replace state WITH "X"
skip -1
replace state WITH "X"
skip 1
endif
enddo

Mike Lewis · May 15, 2003

Inteleserve,

This is off the top of my head. I really don't know if it would be more efficient, but it might give you something to think about:

SELECT Phone, COUNT(Phone) FROM Temp GROUP BY Phone HAVING Count(Phone) > 1

That would give you a relatively small result set with one record for each duplicate phone number. You could then join it to the original table to find the actual duplicate records.

Mike

Mike Lewis
Edinburgh, Scotland

http://www.ml-consult.demon.co.uk

inteleserve · May 15, 2003

this way seems to pull up the bad records much quicker....what would be the best way to join these duplicate records that I just pulled up

wgcs · May 15, 2003

How about:

Code:

SELECT * FROM Temp RIGHT INNER JOIN (SELECT Phone as KeyPhone, COUNT(Phone) FROM Temp GROUP BY Phone HAVING Count(Phone) > 1) ON Phone=KeyPhone

I don't have a table to test this on to see if this is valid SQL in VFP; If the two SELECT's can't be in the same statement, they could be done in succession, instead.

Mike Lewis · May 16, 2003

Inteleserve,

this way seems to pull up the bad records much quicker

That's good.

....what would be the best way to join these duplicate records that I just pulled up

Something like:

SELECT Phone, COUNT(Phone) FROM Temp GROUP BY Phone HAVING Count(Phone) > 1 INTO CURSOR Dupes

SELECT * FROM Temp JOIN Dupes ON Temp.Phone = Dupes.Phone

Or, you can combine those into one SELECT, as Wgsc's example, although it would make sense to try it both ways and compare the timings.

Also, be sure that you have an index on the Phone field.

Mike

Mike Lewis
Edinburgh, Scotland

http://www.ml-consult.demon.co.uk

hodge2 · May 21, 2003

Add a logical column called dup.

Replace all dup with .t.
Index on phone tag phone uniq
Replace all dup with .f.

When set to natural order all the of the dups are marked with true in the dup field.

Mike Lewis · May 21, 2003

Hodge2,

That's a good solution. However, to be exact, it is not all of the duplicates that will be .T. -- one instance of each dupe will still be .F.

But it is still an effective method of locating the dupes.

Mike

Mike Lewis
Edinburgh, Scotland

http://www.ml-consult.demon.co.uk

Tek-Tips is the largest IT community on the Internet today!

Members share and learn making Tek-Tips Forums the best source of peer-reviewed technical information on the Internet!

FASTER METHOD FOR FINDING DUPLICATE RECORDS?

inteleserve

IS-IT--Management

Mike Lewis

Programmer

inteleserve

IS-IT--Management

wgcs

Programmer

Mike Lewis

Programmer

hodge2

IS-IT--Management

Mike Lewis

Programmer

Similar threads

Part and Inventory Search

Sponsor