Comparing two very large tables and returning ONLY differences?

knowlgo · Dec 31, 2004

I have two tables with same %rowtype. They're about 1.3 million rows each.

I want to return into a third table the column name, and the data that is different (comparison). I don't want to have to loop through each column on each table so I created a minus scenario that returns rows on each table that have been updated, actually, that are different in comparison. From that I want to go through and return only changed values and column names.

This is going to be a fairly complex query. I'm working on a good solution, but want to know if anybody's done this before or knows of a good source for running comparisons of large data that returns non-numeric values?

Thank you.

SantaMufasa · Dec 31, 2004

Knowlgo,

I know that one of your major concerns is the 1.3 million rows in each of your tables. Luckily, Oracle's set operators (UNION, MINUS, and INTERSECT) reflect the fastest per-row processing speeds of any of Oracle's code except for a standard full-table scan.

Provided that you have an index on the Primary Key columns of both your TAB1 and TAB2, I believe I have optimised the code for quickest execution to meet your specifications.

Since you didn't post the DESCRIBEs of your TAB1 and TAB2 tables, I used Oracle Education's EMPLOYEE tables to simulate your TAB1 and TAB2 formats and values; (TAB1 has two rows [26, 27] that are not in TAB2; TAB2 has two rows [28, 29] that are not in TAB1; rows 4 and 5 in both tables differ from one another in several columns):

Section 1 -- TAB1 and TAB2 sample values:

Code:

set echo on
col id heading "ID" format 99
col last_name format a12
col first_name format a10
col userid format a8
col start_date format a9
col comments format a10
col manager_id heading "Mgr|ID" format 99
col title format a20
col dept_id heading "Dept|ID" format 99
col salary format 99,999
col commission_pct heading "Comm|Pct" format 99.99
select * from tab1;

                                                          Mgr                      Dept           Comm
 ID LAST_NAME    FIRST_NAME USERID   START_DAT COMMENTS    ID TITLE                  ID  SALARY    Pct
--- ------------ ---------- -------- --------- ---------- --- -------------------- ---- ------- ------
  1 Velasquez    Carmen     cvelasqu 03-MAR-90                President              50   2,500
  2 Ngao         LaDoris    lngao    08-MAR-90              1 VP, Operations         41   1,450
  3 Nagayama     Midori     mnagayam 17-JUN-91              1 VP, Sales              31   1,400
  4 Quick-To-See Mark       mquickto 07-APR-90              1 VP, Finance            10   1,450
  5 Whipsnap     Audry      awhipsna 04-MAR-90 Name chang   2 VP, Admin.             10   1,600
                                               e via marr
                                               iage
 26 Hunt         David      DHUNT    31-DEC-04              1 VP-IT                  10   2,500
 27 Go           Knowl      KGO      31-DEC-04              1 VP-Development         10   2,500

select * from tab2;

                                                          Mgr                      Dept           Comm
 ID LAST_NAME    FIRST_NAME USERID   START_DAT COMMENTS    ID TITLE                  ID  SALARY    Pct
--- ------------ ---------- -------- --------- ---------- --- -------------------- ---- ------- ------
  1 Velasquez    Carmen     cvelasqu 03-MAR-90                President              50   2,500
  2 Ngao         LaDoris    lngao    08-MAR-90              1 VP, Operations         41   1,450
  3 Nagayama     Midori     mnagayam 17-JUN-91              1 VP, Sales              31   1,400
  4 Quick-To-See Mark       mquickto 06-APR-90              1 VP, Finance            10   1,460  10.00
  5 Ropeburn     Audry      aropebur 04-MAR-90              1 VP, Administration     50   1,550
 28 Bridges      Dave       DBRIDGES 31-DEC-04              1 VP-Marketing           10   2,500
 29 Seminikhyn   Dmytro     DSIMINIK 31-DEC-04              1 VP-Design              10   2,500
******************************************************************************************************

I created three tables to hold differences:
"Row_IDs_In_1_not_2", "Row_IDs_In_2_not_1", and "diffs". To facilitate your testing, the following scripts DROP, CREATE, populate, then SELECT from these tables:

Section 2 -- Script to handle all processing:

Code:

REM ***********************************************************************
REM Drop tables relating to table differences (to allow re-running entire
REM    script with empty "diff" tables).
REM ***********************************************************************
drop table Row_IDs_In_1_not_2;
drop table Row_IDs_In_2_not_1;
drop table diffs;
REM ***********************************************************************
REM Re-create tables relating to table differences
REM ***********************************************************************
create table Row_IDs_In_1_not_2 (id number);
create table Row_IDs_In_2_not_1 (id number);
create table Diffs
	(id		number
	,column_name	varchar2(30)
	,tab1_value	varchar2(4000)
	,tab2_value	varchar2(4000)
	);
REM ***********************************************************************
REM Identify Primary Keys of rows in Tab1 but missing from Tab2
REM ***********************************************************************
insert into Row_IDs_In_1_not_2
	(select id from tab1 minus select id from tab2);
REM ***********************************************************************
REM Identify Primary Keys of rows in Tab2 but missing from Tab1
REM ***********************************************************************
insert into Row_IDs_In_2_not_1
	(select id from tab2 minus select id from tab1);
REM ***********************************************************************
REM Compares columns in only rows with matching IDs but differing column values
REM ***********************************************************************
declare
	t1	tab1%rowtype;
	t2	tab2%rowtype;
begin
	for r in (select id from -- identifies matching IDs but rows that differ
			(select * from tab1
				where id not in
					(select * from Row_IDs_In_1_not_2)
			minus
 			select * from tab2)) loop
		select * into t1 from tab1 where id = r.id;
		select * into t2 from tab2 where id = r.id;
		if nvl(t1.LAST_NAME,' ') <> nvl(t2.LAST_NAME,' ') then
			insert into diffs values (
				t1.id,'LAST_NAME',t1.LAST_NAME,t2.LAST_NAME);
		end if;
		if nvl(t1.FIRST_NAME,' ') <> nvl(t2.FIRST_NAME,' ') then
			insert into diffs values (
				t1.id,'FIRST_NAME',t1.FIRST_NAME,t2.FIRST_NAME);
		end if;
		if nvl(t1.USERID,' ') <> nvl(t2.USERID,' ') then
			insert into diffs values (
				t1.id,'USERID',t1.USERID,t2.USERID);
		end if;
		if nvl(t1.START_DATE,sysdate-2000000) <>
			nvl(t2.START_DATE,sysdate-2000000) then
			insert into diffs values (
				t1.id,'START_DATE',t1.START_DATE,t2.START_DATE);
		end if;
		if nvl(t1.COMMENTS,' ') <> nvl(t2.COMMENTS,' ') then
			insert into diffs values (
				t1.id,'COMMENTS',t1.COMMENTS,t2.COMMENTS);
		end if;
		if nvl(t1.MANAGER_ID,-1) <> nvl(t2.MANAGER_ID,-1) then
			insert into diffs values (
				t1.id,'MANAGER_ID',t1.MANAGER_ID,t2.MANAGER_ID);
		end if;
		if nvl(t1.TITLE,' ') <> nvl(t2.TITLE,' ') then
			insert into diffs values (
				t1.id,'TITLE',t1.TITLE,t2.TITLE);
		end if;
		if nvl(t1.DEPT_ID,-1) <> nvl(t2.DEPT_ID,-1) then
			insert into diffs values (
				t1.id,'DEPT_ID',t1.DEPT_ID,t2.DEPT_ID);
		end if;
		if nvl(t1.SALARY,-1) <> nvl(t2.SALARY,-1) then
			insert into diffs values (
				t1.id,'SALARY',t1.SALARY,t2.SALARY);
		end if;
		if nvl(t1.COMMISSION_PCT,-1) <> nvl(t2.COMMISSION_PCT,-1) then
			insert into diffs values (
				t1.id,'COMMISSION_PCT',t1.COMMISSION_PCT,t2.COMMISSION_PCT);
		end if;
	end loop;
	COMMIT;
end;
/
REM ***********************************************************************
REM Displays row IDs that reside in Tab1 but not in Tab2
REM ***********************************************************************
col id heading "Ids of rows|in Tab1 but|not in Tab2" format 9999999999
select * from Row_IDs_In_1_not_2;
REM ***********************************************************************
REM Displays row IDs that reside in Tab2 but not in Tab1
REM ***********************************************************************
col id heading "Ids of rows|in Tab2 but|not in Tab1" format 9999999999
select * from Row_IDs_In_2_not_1;
REM ***********************************************************************
REM Displays row IDs, Column names, and Tab1 versus Tab2 values for
REM     differing columns only
REM ***********************************************************************
set linesize 200
col id format 9999999999
col column_name format a14
col tab1_value format a25
col tab2_value format a25
select * from diffs;

Section 3 -- Transcript of results:

Code:

SQL> select * from Row_IDs_In_1_not_2;

Ids of rows
in Tab1 but
not in Tab2
-----------
         26
         27

2 rows selected.

SQL> select * from Row_IDs_In_2_not_1;

Ids of rows
in Tab2 but
not in Tab1
-----------
         28
         29

2 rows selected.

SQL> select * from diffs;

Ids of rows
in Tab2 but
not in Tab1 COLUMN_NAME    TAB1_VALUE                TAB2_VALUE
----------- -------------- ------------------------- ------------------
          4 START_DATE     07-APR-90                 06-APR-90
          4 SALARY         1450                      1460
          4 COMMISSION_PCT                           10
          5 LAST_NAME      Whipsnap                  Ropeburn
          5 USERID         awhipsna                  aropebur
          5 COMMENTS       Name change via marriage
          5 MANAGER_ID     2                         1
          5 TITLE          VP, Admin.                VP, Administration
          5 DEPT_ID        10                        50
          5 SALARY         1600                      1550

10 rows selected.

Let us know if all this resolves your need.

[santa]

Mufasa
(aka Dave of Sandy, Utah, USA)

http://www.dasages.com

@ 00:51 (01Jan05) UTC (aka "GMT" and "Zulu"),
@ 17:51 (31Dec04) Mountain Time

SantaMufasa · Dec 31, 2004

And sorry...I just noticed that the column heading for the last SELECT was incorrect...it should have read simply "ID" (and not "Ids of rows in Tab2 but not in Tab1".

Happy New Year,

[santa]

Mufasa
(aka Dave of Sandy, Utah, USA)

http://www.dasages.com

@ 01:18 (01Jan05) UTC (aka "GMT" and "Zulu"),
@ 18:18 (31Dec04) Mountain Time

knowlgo · Jan 1, 2005

Wow that's awesome Dave. Only one glitch in that. If I return a row for every change on 1.3 million rows, I'm going to have a hell of a table!

Im thinking I can return those values into fields in rows rather than columns? What do you think?

-G

SantaMufasa · Jan 1, 2005

G,

Perhaps I should have obtained answers to a few questions before I set out on this adventure:

1) What kind of volume do you expect in terms of changes? (Might all 1.3 million rows differ?)
2) How do the tables become out of synch in the first place? (If keeping both tables is unavoidable, shouldn't you have one or more triggers that keep the tables synchronised?)
3) Which table's data takes precedence? (That is, which tables's data should software write to the other table?)
4) What the heck do you plan to do with all those documented changes?

If you really have a business need to track the differences, then there aren't too many alternatives that I can see to do what you want besides the way, above.

Let me know,

[santa]

Mufasa
(aka Dave of Sandy, Utah, USA)

http://www.dasages.com

@ 16:58 (01Jan05) UTC (aka "GMT" and "Zulu"),
@ 09:58 (01Jan05) Mountain Time

knowlgo · Jan 1, 2005

Happy new year Dave!

I am expecting no more than . . .5 changes per row.

I have already paired down my tables to ~144K rows combined, which is a start.

The business need is, the boss says "I want the differences between these tables, and only the differences." So I create a script that does something similar to what you did and he says "You're going to return a column for each change??????!!!" And I say "Ha ha! its a joke! Everyone laugh at the funny joke! ha ha haaaaa. . . Of course not! I was just testing to see if you were awake mine capytan!
The REAL script isn't finished yet!

But seriously, this project is just killing me. Not only am I in charge of creating and loading, but in between these million row loads, I have to run scripts to compare old v. new.

knowlgo · Jan 1, 2005

That should say "return a row for each change" not "column.

-G

SantaMufasa · Jan 1, 2005

So what you're telling me is that the boss thinks that data is fine horizontally, but is unacceptable when you show it vertically? I would point out to the boss that his "row phobia" is a bit neurotic.

Now, as far as my questions are concerned, you have hinted at answers for question 1...you are 25% done with your assignment. I believe we will benefit from answers to the other three questions.

Also, among your boss's responsibilities (if he is going to "slave drive" you on New Year's Day) is 1) what does s/he to do with the data? Does s/he plan to READ the 1.5 million data points that you will be generating? [144,000 * 5 changes * 2 (old/new)]) and 2) if he has format biases, what are they? (e.g. "I don't like reading data vertically...I want it horizontally.") [...which, again, I believe is an unreasonable phobia.]

So, please offer answers to all the questions I posed. The answers may uncover the fact that your manager is just [unreasonably] making you jump through hoops here if s/he cannot justify a business need that offsets the resources involved in producing the results.

Cheers,

[santa]

Mufasa
(aka Dave of Sandy, Utah, USA)

http://www.dasages.com

@ 17:58 (01Jan05) UTC (aka "GMT" and "Zulu"),
@ 10:58 (01Jan05) Mountain Time

knowlgo · Jan 1, 2005

What we have is data that is generated here, sent to another company, revised for accuracy and sent back. Then we compare what changed with teh new vs. the old. Neither one is more important. Just the Delta so we know what is in error and what is not.

What they plan on doing is looking at them and then when the next run comes saying "hey, tell us what's different." Then me being the good dog I am, will run and fetch the slippers.

knowlgo · Jan 1, 2005

Ok, thanks Mufasa, I just got finished talking him into doing it your way. Thanks for your help and have a great 2005.

-G

Tek-Tips is the largest IT community on the Internet today!

Members share and learn making Tek-Tips Forums the best source of peer-reviewed technical information on the Internet!

Comparing two very large tables and returning ONLY differences?

knowlgo

MIS

SantaMufasa

Technical User

SantaMufasa

Technical User

knowlgo

MIS

SantaMufasa

Technical User

knowlgo

MIS

knowlgo

MIS

SantaMufasa

Technical User

knowlgo

MIS

knowlgo

MIS

Similar threads

Part and Inventory Search

Sponsor