SQL distinct values 4

CHUMPY · Feb 18, 2004

Hi,

I am trying to idenitfy duplicate records in two tables, the tables share exactly the same structure but there is no primary key. Is there a more efficient way to do this than below?

select * from T1 where (cola,colb,colc.........,colx) not in
(select cola,colb,colc,..............colx) from t2);

Thanks for any help

Gareth

carp · Feb 18, 2004

Try

SELECT * FROM t1
INTERSECT
SELECT * FROM t2;

This will give you a list of rows that are identical in both tables.

Elbert, CO
1043 MST

gunjan14 · Feb 18, 2004

Hi,

Isn't is better if we have column clause in the set operator. Means
(SELECT col1, col2, col3......
FROM t1
)
INTERSECT
(SELECT col1, col2, col3......
FROM t2
)

Because if the columns in the 2 tables are not matching positionally during table creation then
SELECT * FROM t1
INTERSECT
SELECT * FROM t2
may give erreneous result.

BR,
Gunjan

carp · Feb 18, 2004

According to the post,
"the tables share exactly the same structure".

If you want to type out all of the columns, go for it! But if the structures are "exactly the same", then I don't see the need.

CHUMPY · Feb 18, 2004

Thanks for the help guys, in this case I can guarentee the columns are in the same order.

Just as a matter of interest why is the intersect method more efficient than the in method?

Thanks

Gareth

gunjan14 · Feb 18, 2004

Yeah. I want to clarify my doubt regarding this
If i define the table like this
create table t1
(
col1 number,
col2 number
)
And during creation of second table
create table t2 (colx,coly) as
select col2,col1 from t1;

Cannot we say that structure in both the tables are same. Or the structure will be same if and only if each corresponding columns are also matching?

BR,
Gunjan

carp · Feb 18, 2004

Gareth -
The INTERSECT method may or may not be more efficient than an IN, depending on factors such as the number of rows in your tables, the indexing on the columns you are comparing. As I understand the inner workings of IN, Oracle will take a record from the outer table, and traverse the inner table until it either finds a match or comes to the last row of the inner table. An intersect will get the unique values from each table and return the ones that match.

As a simple test, I created a table with only four columns and only about 2500 rows, then ran both queries. The results are:

15:34:29 SQL> select * from test_table
15:34:32 2 intersect
15:34:33 3 select * from test_table;

1248 rows selected.

Elapsed: 00:00:07.06

Execution Plan
----------------------------------------------------------
0 SELECT STATEMENT Optimizer=CHOOSE
1 0 INTERSECTION
2 1 SORT (UNIQUE)
3 2 TABLE ACCESS (FULL) OF 'TEST_TABLE'
4 1 SORT (UNIQUE)
5 4 TABLE ACCESS (FULL) OF 'TEST_TABLE'

Statistics
----------------------------------------------------------
0 recursive calls
0 db block gets
34 consistent gets
0 physical reads
0 redo size
35745 bytes sent via SQL*Net to client
857 bytes received via SQL*Net from client
85 SQL*Net roundtrips to/from client
2 sorts (memory)
0 sorts (disk)
1248 rows processed

15:34:42 SQL> select * from test_table
15:35:28 2 where (user_name, user_id, created_by, creation_date)
15:35:29 3 in (select user_name, user_id, created_by, creation_date from fnd_user);

2496 rows selected.

Elapsed: 00:00:07.07

Execution Plan
----------------------------------------------------------
0 SELECT STATEMENT Optimizer=CHOOSE (Cost=16 Card=1 Bytes=112)
1 0 HASH JOIN (Cost=16 Card=1 Bytes=112)
2 1 TABLE ACCESS (FULL) OF 'FND_USER' (Cost=12 Card=1248 Byt
es=31200)

3 1 TABLE ACCESS (FULL) OF 'TEST_TABLE' (Cost=3 Card=491 Byt
es=42717)
Statistics
----------------------------------------------------------
0 recursive calls
0 db block gets
252 consistent gets
0 physical reads
0 redo size
67825 bytes sent via SQL*Net to client
1438 bytes received via SQL*Net from client
168 SQL*Net roundtrips to/from client
0 sorts (memory)
0 sorts (disk)
2496 rows processed

While the time differentials aren't dramatically different, compare the consistent gets required! Also, notice the difference in rows returned. This is because INTERSECT is going to return only the unique rows common to both tables - not every row that's common. This raises two other considerations: (1)the INTERSECT is probably going to return fewer values, yielding less network traffic (again, compare the above statistics). (2) If you are using IN and only want the distinct values, then you have to run

1 select distinct * from test_table
2 where (user_name, user_id, created_by, creation_date)
3* in (select user_name, user_id, created_by, creation_date from fnd_user)

1248 rows selected.

Elapsed: 00:00:07.06

Execution Plan
----------------------------------------------------------
0 SELECT STATEMENT Optimizer=CHOOSE (Cost=41 Card=1 Bytes=46)
1 0 SORT (UNIQUE) (Cost=41 Card=1 Bytes=46)
2 1 HASH JOIN (Cost=17 Card=1 Bytes=46)
3 2 TABLE ACCESS (FULL) OF 'FND_USER' (Cost=12 Card=1248 B
ytes=31200)

4 2 TABLE ACCESS (FULL) OF 'TEST_TABLE' (Cost=4 Card=2496
Bytes=52416)

Statistics
----------------------------------------------------------
0 recursive calls
0 db block gets
85 consistent gets
0 physical reads
0 redo size
35744 bytes sent via SQL*Net to client
857 bytes received via SQL*Net from client
85 SQL*Net roundtrips to/from client
1 sorts (memory)
0 sorts (disk)
1248 rows processed
which induces a sort in your execution plan.

Of course, the final consideration (which I alluded to above) is how long is it going to take you to type out all of those columns? For a wide table (say 100 columns) I would suggest the time required to get all of the columns typed in is going to offset a LOT of CPU cycles saved by using one method over another!!

Gnjan14 -
I would say both tables have identical physical structures - and now I understand your point a little better. So...the query would run without error (which is what I thought you were getting at, but now I see I misinterpreted your post) but the results might not be what you were looking for. Your approach (enumerate the columns) is the more conservative, safer approach, and if practical and necessary, I would strongly endorse it. With that said, we often have tables with 200 - 300 columns in them; enumerating each column would not be desirable, particularly if we know the LOGICAL as well as physical structures are identical (e.g., a mirrored table from another database).

I hope we have now both cleared up any misunderstandings.

Elbert, CO
1612 MST

gunjan14 · Feb 19, 2004

Hi Carp,

Thanks for giving insight into IN and INTERSECTION. This definitely helps in understanding and hence writing better codes.

Gunjan

CHUMPY · Feb 19, 2004

Hi,

Thanks for all the feedback and good insight into the differences between IN and INTERSECT.

I have another question problem that I seem to of come across kind of related. Basically I have created a tool that combines schemas of identical structure as consolidation of regional databases into one database. I am basically inserting across schemas, hence the original question of checking for duplicates.

However I have an insert statement similar to

insert into t1
(select * from t2 where PK1 not in
(select PK1 from t1));

where pk1 is the primary key of the tables.

What I am finding is that if I run statement directly from SQLworksheet I get reasonable performance. However it is also contained within a package and exectued as dynamic SQL so differing schema names can be used amongst other paraemters.

The perfomance is terrible and the first time it ran I got an error along the lines of end of communication channel. Apologies for not having the correct error code but I haven't managed to reproduce it yet.

The statement it seems to spend a lot of time involves inserting about 250k records. I thought about the indexing etc but that would not explain why the statement executed directly was so much quicker.

If you have any ideas on this it would be much appreciated.

Thanks

Gareth

carp · Feb 19, 2004

OK, a couple of things before we leave the original thread:
1. My prior description of how IN works is actually how EXISTS works. I believe (again, corrections solicited) that IN actually continues traversing the entire list, even if it has found a match. This is why Oracle recommends you not use IN if you are going to be testing against a relatively long list. EXISTS stops checking as soon as it finds a match, so in effect your searches on average will be half as long.
So -
one other approach to the problem would be:
SELECT *
FROM table1 t1
WHERE EXISTS (SELECT 'x' FROM table2 t2
WHERE t2.col1 = t1.col1
AND t2.col2 = t1.col2
AND .....);
Note that this involves a correlated query and DOES require you enumerate every row.

Now, on to the new problem (BTW, this should really have been started in a new thread):

One approach to this might be:

INSERT INTO table1
SELECT * FROM table2
MINUS
SELECT * FROM table1;

This will insert all of the rows that are in table2 that are not already in table1;

Elbert, CO
0836 MST

Tek-Tips is the largest IT community on the Internet today!

Members share and learn making Tek-Tips Forums the best source of peer-reviewed technical information on the Internet!

SQL distinct values 4

CHUMPY

Programmer

carp

MIS

gunjan14

Programmer

carp

MIS

CHUMPY

Programmer

gunjan14

Programmer

carp

MIS

gunjan14

Programmer

CHUMPY

Programmer

carp

MIS

Similar threads

Part and Inventory Search

Sponsor