Tek-Tips is the largest IT community on the Internet today!

Members share and learn making Tek-Tips Forums the best source of peer-reviewed technical information on the Internet!

  • Congratulations Chriss Miller on being selected by the Tek-Tips community for having the most helpful posts in the forums last week. Way to Go!

FASTER METHOD FOR FINDING DUPLICATE RECORDS?

Status
Not open for further replies.

inteleserve

IS-IT--Management
Dec 13, 2002
75
US
SEE PROGRAM BELOW...

This program works great for finding duplicate records using state and phone fields out of the database. But when doing over 5 million records it is slow..... anybody have any ideas??? I know there is a distict command but not sure how i can use it because i need to know which records are duplicates.



close all
clear
use temp.dbf
index on phone tag phone
use temp.dbf order phone exclusive
set status off
set scoreboard off
set console off
set device to screen
STORE RECCOUNT() to Xtotal
STORE 0 TO nHowMany
DO WHILE !EOF()
nHowMany = nHowMany + 1
IF nHowMany % 1000 = 0 &&... every 1000 records
@12,20 say str((nHowMany / Xtotal)*100) + "% done deduping..."
ENDIF
store phone to p1
skip 1
store phone to p2
if p1=p2
replace state WITH "X"
skip -1
replace state WITH "X"
skip 1
endif
enddo
 
Inteleserve,

This is off the top of my head. I really don't know if it would be more efficient, but it might give you something to think about:

SELECT Phone, COUNT(Phone) FROM Temp GROUP BY Phone HAVING Count(Phone) > 1

That would give you a relatively small result set with one record for each duplicate phone number. You could then join it to the original table to find the actual duplicate records.

Mike


Mike Lewis
Edinburgh, Scotland
 
this way seems to pull up the bad records much quicker....what would be the best way to join these duplicate records that I just pulled up
 
How about:
Code:
SELECT * FROM Temp RIGHT INNER JOIN (SELECT Phone as KeyPhone, COUNT(Phone) FROM Temp GROUP BY Phone HAVING Count(Phone) > 1) ON Phone=KeyPhone

I don't have a table to test this on to see if this is valid SQL in VFP; If the two SELECT's can't be in the same statement, they could be done in succession, instead.
 
Inteleserve,

this way seems to pull up the bad records much quicker

That's good.

....what would be the best way to join these duplicate records that I just pulled up

Something like:

SELECT Phone, COUNT(Phone) FROM Temp GROUP BY Phone HAVING Count(Phone) > 1 INTO CURSOR Dupes

SELECT * FROM Temp JOIN Dupes ON Temp.Phone = Dupes.Phone

Or, you can combine those into one SELECT, as Wgsc's example, although it would make sense to try it both ways and compare the timings.

Also, be sure that you have an index on the Phone field.

Mike



Mike Lewis
Edinburgh, Scotland
 
Add a logical column called dup.

Replace all dup with .t.
Index on phone tag phone uniq
Replace all dup with .f.

When set to natural order all the of the dups are marked with true in the dup field.
 
Hodge2,

That's a good solution. However, to be exact, it is not all of the duplicates that will be .T. -- one instance of each dupe will still be .F.

But it is still an effective method of locating the dupes.

Mike


Mike Lewis
Edinburgh, Scotland
 
Status
Not open for further replies.

Part and Inventory Search

Sponsor

Back
Top