Tek-Tips is the largest IT community on the Internet today!

Members share and learn making Tek-Tips Forums the best source of peer-reviewed technical information on the Internet!

  • Congratulations Mike Lewis on being selected by the Tek-Tips community for having the most helpful posts in the forums last week. Way to Go!

SQL Questions looking for duplicate data ...

Status
Not open for further replies.

owenyuen

Programmer
Feb 19, 2003
22
0
0
US
There is the table:
ColA ColB ColC ColD
0001 00011574 111500 Purchases Cost Pool
0001 00010001 280001 Home Office
0002 09910001 280001 Home Office
0001 00010002 280002 Short North
0001 00010003 280003 Affinity Deposits

Is any way I can write a SQL query to pull out the duplicate data that exists on ColC?

Result will be like this:

0001 00010001 280001 Home Office
0001 09910001 280001 Home Office

Thanks
 
Hi

I just created a table like this:

create table aol_euro ( col_a varchar2(4) not null, col_b varchar2(8) not null, col_c varchar2(6) not null, col_d varchar2(30) ) tablespace table_data;

And inserted the data you have:

select * from aol_euro;

Code:
COL_A COL_B    COL_C  COL_D
----- -------- ------ ------------------------------
0001  00011574 111500 Purchases Cost Pool
0001  00010001 280001 Home Office
0002  09910001 280001 Home Office
0001  00010002 280002 Short North
0001  00010003 280003 Affinity Deposits

Before I write the code, I guess you have a mistake in your output, because Col_A is 0001 on 2nd line and it must be 0002.

And my optimal query is:

select A.col_a, a.col_b, a.col_c, a.col_d
from aol_euro A, (select col_c from aol_euro group by col_c having count(*) > 1 ) Multi_c
where A.col_c = Multi_C.Col_c;


The trace show:

Code:
Execution Plan
----------------------------------------------------------
   0      SELECT STATEMENT Optimizer=CHOOSE
   1    0   MERGE JOIN
   2    1     SORT (JOIN)
   3    2       TABLE ACCESS (FULL) OF 'AOL_EURO'
   4    1     SORT (JOIN)
   5    4       VIEW OF 'VW_NSO_1'
   6    5         FILTER
   7    6           SORT (GROUP BY)
   8    7             TABLE ACCESS (FULL) OF 'AOL_EURO'




Statistics
----------------------------------------------------------
          0  recursive calls
          0  db block gets
         32  consistent gets
          0  physical reads
          0  redo size
        618  bytes sent via SQL*Net to client
        499  bytes received via SQL*Net from client
          2  SQL*Net roundtrips to/from client
          3  sorts (memory)
          0  sorts (disk)
          2  rows processed

And the output is:

Code:
COL_A COL_B    COL_C  COL_D
----- -------- ------ ------------------------------
0001  00010001 280001 Home Office
0002  09910001 280001 Home Office


It is possible to use this query – but it is not the optimal query:

select col_a, col_b, col_c, col_d
from aol_euro
where col_c in (select col_c from aol_euro group by col_c having count(*) > 1 );


The trace show:

Code:
Execution Plan
----------------------------------------------------------
   0      SELECT STATEMENT Optimizer=CHOOSE
   1    0   MERGE JOIN
   2    1     SORT (JOIN)
   3    2       TABLE ACCESS (FULL) OF 'AOL_EURO'
   4    1     SORT (JOIN)
   5    4       VIEW OF 'VW_NSO_1'
   6    5         FILTER
   7    6           SORT (GROUP BY)
   8    7             TABLE ACCESS (FULL) OF 'AOL_EURO'




Statistics
----------------------------------------------------------
          0  recursive calls
          0  db block gets
         32  consistent gets
          0  physical reads
          0  redo size
        618  bytes sent via SQL*Net to client
        499  bytes received via SQL*Net from client
          2  SQL*Net roundtrips to/from client
          3  sorts (memory)
          0  sorts (disk)
          2  rows processed

And the output is:

Code:
COL_A COL_B    COL_C  COL_D
----- -------- ------ ------------------------------
0001  00010001 280001 Home Office
0002  09910001 280001 Home Office


So I would use the first query.
Regards
Allan
Icq: 346225948
! If you can't prove it - don't bother try to teach me !
 
Thanks Alan ..... I will try the first Query.
 
select A.col_a, a.col_b, a.col_c, a.col_d
from aol_euro A aol_euro B where a.col_c = b.colc amd a.rowid <> b.rowid;


I tried to remain child-like, all I acheived was childish.
 
Hi Jimbopalmer

If there are few rows in table, then your statement is almost as fast as my query. But if table has many rows, then your statement is much more expensive, because your statement needs to make a full table scan compare. My query also makes full table scan but only compare the few double rows.

Your query is:

select A.col_a, a.col_b, a.col_c, a.col_d
from aol_euro A, aol_euro B
where a.col_c = b.col_c
and a.rowid <> b.rowid;


And the trace output is:

Code:
Execution Plan
----------------------------------------------------------
   0      SELECT STATEMENT Optimizer=CHOOSE
   1    0   MERGE JOIN
   2    1     SORT (JOIN)
   3    2       TABLE ACCESS (FULL) OF 'AOL_EURO'
   4    1     FILTER
   5    4       SORT (JOIN)
   6    5         TABLE ACCESS (FULL) OF 'AOL_EURO'




Statistics
----------------------------------------------------------
          0  recursive calls
          0  db block gets
         32  consistent gets
          0  physical reads
          0  redo size
        618  bytes sent via SQL*Net to client
        499  bytes received via SQL*Net from client
          2  SQL*Net roundtrips to/from client
          2  sorts (memory)
          0  sorts (disk)
          2  rows processed

And the output is:

Code:
COL_A COL_B    COL_C  COL_D
----- -------- ------ ------------------------------
0002  09910001 280001 Home Office
0001  00010001 280001 Home Office


Please note that the output of rows is in reverse order – why? – because Oracle needs to make a full table scan compare of equal sign and after needs to compare not equal sign. (Scan is done by RowId order – so it is reverse of insert order).
Regards
Allan
Icq: 346225948
! If you can't prove it - don't bother try to teach me !
 
Back to Allan's original two queries - since the execution plans and statistics appear identical (unless I've missed something - entirely possible!), how is one better than the other?
 
By the way, Thanks for BOTH input!!!

Best Regards
Owen
 
Hi

THANKS Carp – Unfortantly I copied-and-pasted wrong trace in my first query – here is the 2nd reply with correct trace.

And my optimal query is:

select A.col_a, a.col_b, a.col_c, a.col_d
from aol_euro A, (select col_c from aol_euro group by col_c having count(*) > 1 ) Multi_c
where A.col_c = Multi_C.Col_c;


The trace show:

Code:
Execution Plan
----------------------------------------------------------
   0      SELECT STATEMENT Optimizer=CHOOSE
   1    0   MERGE JOIN
   2    1     VIEW
   3    2       FILTER
   4    3         SORT (GROUP BY)
   5    4           TABLE ACCESS (FULL) OF 'AOL_EURO'
   6    1     SORT (JOIN)
   7    6       TABLE ACCESS (FULL) OF 'AOL_EURO'




Statistics
----------------------------------------------------------
          0  recursive calls
          0  db block gets
         32  consistent gets
          0  physical reads
          0  redo size
        618  bytes sent via SQL*Net to client
        499  bytes received via SQL*Net from client
          2  SQL*Net roundtrips to/from client
          2  sorts (memory)
          0  sorts (disk)
          2  rows processed

And the output is:

Code:
COL_A COL_B    COL_C  COL_D
----- -------- ------ ------------------------------
0001  00010001 280001 Home Office
0002  09910001 280001 Home Office

Regards
Allan
Icq: 346225948
! If you can't prove it - don't bother try to teach me !
 
Oh - that DOES make more sense! Thanks for the clarification; I thought everything I knew was wrong again!
 
Allan
You have assumed there is no index on COL_C, it might be instructive to compare the 3 queries assuming there IS an index on Col_C (it should speed up all 3)

(and I had promised myself I was never going to reply to anyone using that signature...) I tried to remain child-like, all I acheived was childish.
 
Hi Jimbopalmer


Yes – I assume no index on the query that Owenyuen needs. Because this kind of query seldom has indexes – and I do not have the information – so of cause I assume no indexes. But it does not matter – because as you write – the code just runs faster with indexes.

However if we need to go beyond this and want to write save code – and I think it is the smartest then trying to help people around here – then your code is dangers – it has a logical bomb or logical bug.

If I insert one more row – so we have 3 rows of interest, like this:

insert into aol_euro values('0003', '10010001', '280001', 'Home Office');

And make a select like this:

select * from aol_euro;

I get the output:

Code:
COL_A COL_B    COL_C  COL_D
----- -------- ------ ------------------------------
0001  00011574 111500 Purchases Cost Pool
0001  00010001 280001 Home Office
0002  09910001 280001 Home Office
0001  00010002 280002 Short North
0001  00010003 280003 Affinity Deposits
0003  10010001 280001 Home Office

My code:

Code:
select A.col_a, a.col_b, a.col_c, a.col_d
from aol_euro A, (select col_c from aol_euro group by col_c having count(*) > 1 ) Multi_c
where A.col_c = Multi_C.Col_c;

gives this output:

Code:
 COL_A COL_B    COL_C  COL_D
----- -------- ------ ------------------------------
0001  00010001 280001 Home Office
0002  09910001 280001 Home Office
0003  10010001 280001 Home Office

It still good, but your code:

Code:
 select A.col_a, a.col_b, a.col_c, a.col_d
from aol_euro A, aol_euro B 
where a.col_c = b.col_c 
and  a.rowid <> b.rowid;

generates this output:

Code:
COL_A COL_B    COL_C  COL_D
----- -------- ------ ------------------------------
0002  09910001 280001 Home Office
0003  10010001 280001 Home Office
0001  00010001 280001 Home Office
0003  10010001 280001 Home Office
0001  00010001 280001 Home Office
0002  09910001 280001 Home Office

And your code false…..

You do not like my signature? – well maybe I use this signature because I am tired of meeting people who just come up with statements out of the blue air – without documentation and never want to or can prove these statements. I try hard to be scientific and try hard to test my statements, so I do not give false assumes and false statements. It is only possible for others to attack my statements if I also delivers the SQL and proves. Only with an open mind is it possible to discuss statements.
Regards
Allan
Icq: 346225948
! If you can't prove it - don't bother try to teach me !
 
Allan: isn't it a bit simplier:


select A.col_a, a.col_b, a.col_c, a.col_d
from aol_euro A
where exists(select 1 from aol_euro b
where A.col_c = b.Col_c and a.rowid<>b.rowid)

Regards, Dima
 
Status
Not open for further replies.

Part and Inventory Search

Sponsor

Back
Top