Combining 17 tables to check for duplicates

trulyblessed · Aug 6, 2003

I have 17 tables that are fairly large (8 - 12 million records each) for each month of data. Each table contains the same exact field names and data types. I need to find where there are duplicates across the 17 tables (and even within each table) based on three fields: Nabp, RxNbr and RefillSeq for 3 particular months. I need to produce a list of these duplicates.

How can I do this? I would normally merge the tables and then do a duplicate check using the having count(*) >1 syntax, but these tables are so big, there is no way to merge them into one table and have the space to store it. Any suggestions would be greatly appreciated.

Kim

welshone · Aug 7, 2003

you could try joining the tables using the UNION query . ?

select nabp, rxnbr, refillseq from table 1
union
nabp, rxnbr, refillseq from table 2
union
nabp, rxnbr, refillseq from table 3
etc
then do the duplicate check in the where clause.

just an idea..

Jamfool · Aug 7, 2003

I think you need to use UNION ALL so that u dont remove any records. where do you put the duplicate check in a union query welshone? You can do a duplicate check for each table, but u want to check the result set surely? As there maybe a record in table1 that is the same as table4

The only other way i can think of doing this is via a temp table perhaps, and only inserting records from the 3 months you were looking at, and then doing a duplicate check on the records obtained.

trulyblessed · Aug 7, 2003

The only problem with the temp table is there would be 128,000,000 records for each month because each of the 17 tables has anywhere from 1 million to possibly 20 million records.

Kim

lienyi · Aug 7, 2003

I'll create a view for each table using the count(*) >1 to filter out the unduplicated rows. Then 'UNION ALL' all these 17 views again using the count(*) >1 to get the final result.

Jamfool · Aug 7, 2003

CREATE VIEW tview
AS
SELECT var1,var2,var3 FROM t1
WHERE (created >= '2003-08-04 00:00:00.000') and (created <= '2003-11-04 00:00:00.000')
GROUP BY var1, var2, var3
HAVING Count(var1)>1 And Count(var2)>1 And Count(var3)>1
UNION ALL
SELECT var1,var2,var3 FROM t2
WHERE (created >= '2003-08-04 00:00:00.000') and (created <= '2003-11-04 00:00:00.000')
GROUP BY var1, var2, var3
HAVING Count(var1)>1 And Count(var2)>1 And Count(var3)>1
UNION ALL
SELECT var1,var2,var3 FROM t3
WHERE (created >= '2003-08-04 00:00:00.000') and (created <= '2003-11-04 00:00:00.000')
GROUP BY var1, var2, var3
HAVING Count(var1)>1 And Count(var2)>1 And Count(var3)>1
go
SELECT * FROM tview
GROUP BY var1, var2, var3
HAVING Count(var1)>1 And Count(var2)>1 And Count(var3)>1

pascalsql · Aug 7, 2003

Another suggestion, ugly but acceptable if this is a one shot.

Extract the distinct values of Nabp, RxNbr and RefillSeq
Find out how many occurences each and which is less discriminant (ex Nabp has 100 distinct values)
create a temp table with only two other columns ie RxNbr and RefillSeq and write code like this:

loop
truncate #temp
copy all the data for the first value of Nabp
insert #temp select RxNbr, RefillSeq from table1 where Nabp = 'value1' ... 17 times
check for duplicates

loop with next Nabp value (cursor)

Add a checkpoint in the loop and truncate the log.

jmcquilling · Aug 7, 2003

If you can't combine the tables try

select Nabp,RxNbr,RefillSeq from table1 a
where exists (select * from table2 b
where a.Nabp = b.Nabp and
a.RxNbr = b.RxNbr and
a.RefillSeq = b.RefillSeq
)
or exists (select * from table3 c
where a.Nabp = c.Nabp and
a.RxNbr = c.RxNbr and
a.RefillSeq = c.RefillSeq
)
.
.
.
or exists (select * from table17 z
where a.Nabp = z.Nabp and
a.RxNbr = z.RxNbr and
a.RefillSeq = z.RefillSeq
)

This would find all the dups in table1. This approach does not take as much space but it does a lot of I/O and you had better have indexes! Once table1 is done you still need to run 16 more queries for each of the remaining tables.

Tek-Tips is the largest IT community on the Internet today!

Members share and learn making Tek-Tips Forums the best source of peer-reviewed technical information on the Internet!

Combining 17 tables to check for duplicates

trulyblessed

Programmer

welshone

Programmer

Jamfool

IS-IT--Management

trulyblessed

Programmer

lienyi

Programmer

Jamfool

IS-IT--Management

pascalsql

Programmer

jmcquilling

Programmer

Similar threads

Part and Inventory Search

Sponsor