
Initial Load Strategies


jtamma
Oct 3, 2007
In my company we are looking for the best strategy for our initial load to the data warehouse. I would appreciate it if people could share the different strategies they have used to accomplish this. We are moving three years' worth of data, and we have something like this:

Tab1 - 21 million rows
Tab2 - 28 million rows
Tab3 - 122 million rows
Tab4 - 65 million rows
Tab5 - 65 million rows

Tab1 (one-to-many) Tab2
Tab2 (one-to-many) Tab3
Tab3 (many-to-one) Tab4
Tab3 (many-to-one) Tab5

The main target table is at the Tab3 level. There are other targets fed from other tables, but this target is the most difficult to load, as it needs some columns from all five tables.

What would be the best strategy for a one-time load of this target? I tried joining these five tables at the source database with a date condition on Tab2 (where create date is between d1 and d2) to create smaller loads. However, even one month of data is taking too long. What would you recommend?
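Roughly, the join I tried looks like this (column and key names are made up here just to illustrate; the real tables differ):

-- five-way join driven by the date filter on tab2
SELECT t3.key_col, t1.col_a, t2.col_b, t4.col_c, t5.col_d
  FROM tab1 t1
  JOIN tab2 t2 ON t2.tab1_id = t1.id
  JOIN tab3 t3 ON t3.tab2_id = t2.id
  JOIN tab4 t4 ON t4.id = t3.tab4_id
  JOIN tab5 t5 ON t5.id = t3.tab5_id
 WHERE t2.create_date BETWEEN :d1 AND :d2;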

Is sorting and merging in flat files a better alternative, or should we move the source tables to a staging database and use a large temp segment to sort and merge for the join?
 
The first thing that comes to mind is:
how long is too long? You are talking about an initial load, so it may be an option to let that processing take a few days and meanwhile capture (and save) the first incremental loads, to be processed immediately after the initial load.

How many fields of the tables have to be put together in the target? Can you get rid of all the fields you don't need?
Do you need ALL records from tab1, tab2, tab4 and tab5, or can you discard some of them?


I would probably make an intermediate result set of the join of tab1 and tab2.

What I have done in the past is create an intermediate result set with the same structure as the target, and then use a few updates to get all the columns right.
In the first step I would join tab3 to tab4 and tab5; next, update the intermediate table with the values from tab1 and tab2.
Whether this will work depends on the available resources.
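A rough sketch of what I mean (all table and column names invented for illustration):

-- Step 1: intermediate table from the join of tab3, tab4 and tab5,
-- with empty placeholders for the columns that come from tab1 and tab2
CREATE TABLE target_stage AS
SELECT t3.id, t3.tab2_id, t3.col_x, t4.col_c, t5.col_d,
       CAST(NULL AS VARCHAR2(30)) AS col_b,
       CAST(NULL AS VARCHAR2(30)) AS col_a
  FROM tab3 t3
  JOIN tab4 t4 ON t4.id = t3.tab4_id
  JOIN tab5 t5 ON t5.id = t3.tab5_id;

-- Step 2: fill in the tab2 column (and do the same for tab1 via tab2)
UPDATE target_stage s
   SET col_b = (SELECT t2.col_b FROM tab2 t2 WHERE t2.id = s.tab2_id);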
 
Thanks Hans63,

How long: for one month's worth of data, where I moved only a few fields from each table (not all the fields I would need in the actual load), the mapping took five and a half hours to finish and moved 5 million rows. I used a date-between (d1 and d2) condition on Tab2 to get one month. I will try your suggestion of joining Tab3, Tab4 and Tab5 and then updating with Tab1 and Tab2.

I will be moving this data to ODS tables as well as to a star schema. Should these tables be partitioned in both places? Will it help DW performance later?
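By partitioning I mean something like a date-range scheme, for example (table and column names invented):

CREATE TABLE fact_tab3 (
  id        NUMBER,
  create_dt DATE,
  col_x     VARCHAR2(30)
)
PARTITION BY RANGE (create_dt) (
  PARTITION p2005 VALUES LESS THAN (DATE '2006-01-01'),
  PARTITION p2006 VALUES LESS THAN (DATE '2007-01-01'),
  PARTITION p2007 VALUES LESS THAN (DATE '2008-01-01')
);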
 
If you have the storage space available, you might look at staging your data for the initial load. Export and load one month of data from each table into a separate set of tables: no joins, just a straight table-to-table copy of the columns you need. You can index these tables as needed to support the joins required, then load from these stage tables.
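Something along these lines (stage table and column names are just placeholders):

-- straight copy of one month of tab2, only the columns you need
CREATE TABLE stg_tab2 AS
SELECT id, tab1_id, create_date, col_b
  FROM tab2
 WHERE create_date >= DATE '2007-01-01'
   AND create_date <  DATE '2007-02-01';

-- index the stage table to support the joins you will run against it
CREATE INDEX stg_tab2_ix1 ON stg_tab2 (tab1_id);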

Paul
---------------------------------------
Shoot Me! Shoot Me NOW!!!
- Daffy Duck
 
Even though you have identified referential integrity between several of the tabs, loading data is best accomplished with a minimum of keys present. It's probably better to load the data first and then build the relationships and key structure, so each table can be loaded independently of the others. If your data is clean, you should not get problems with the indexing and linkage. If you do, you would have had the same problems loading with the indexes in place, so you would still need to fix the data either way. But in the scenario I suggest, you have already loaded the data; you only need to build the relationships.
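As a sketch (constraint and index names invented):

-- disable keys and drop indexes before the load
ALTER TABLE tab3 DISABLE CONSTRAINT tab3_tab4_fk;
DROP INDEX tab3_ix1;

-- ... run the bulk load here ...

-- rebuild the structures once the data is in
CREATE INDEX tab3_ix1 ON tab3 (tab4_id);
ALTER TABLE tab3 ENABLE CONSTRAINT tab3_tab4_fk;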

-------------------------
The trouble with doing something right the first time is that nobody appreciates how difficult it was - Steven Wright
 
Hi JTamma,

have you looked at the access paths for your queries? In Oracle you can ask the database for the execution plan and use it to identify why your query is taking so long (e.g. does it do full table scans, does it use a logical starting point, or are you missing an index?). I don't know whether other RDBMSs have similar features.
But I think that in 5.5 hours you should be able to load much more than 5 million rows (assuming your hardware can deal with the amount of data).
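In Oracle that looks like this (the query itself is just a placeholder):

-- generate and then display the execution plan
EXPLAIN PLAN FOR
SELECT t2.id
  FROM tab2 t2
 WHERE t2.create_date BETWEEN :d1 AND :d2;

SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY);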

When you try the update solution, make sure to disable as much logging as possible, as logging would really slow down the process.
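For example, a direct-path insert into a NOLOGGING table generates minimal redo (names carried over from my earlier sketch, so still hypothetical):

ALTER TABLE target_stage NOLOGGING;

-- direct-path insert bypasses conventional redo/undo overhead
INSERT /*+ APPEND */ INTO target_stage (id, tab2_id, col_x)
SELECT id, tab2_id, col_x FROM tab3;
COMMIT;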

Good luck,
Hans
 
Thanks to all of you for your suggestions. It looks like we can get extra storage to stage all the tables as-is, so I will be able to use some of these suggestions.

Hans63 - I checked my query; it is using the proper index and is being driven by the table that has the date filter. Still, it needs three nested loops. No one here has a clue how to make this any faster (without staging tables).
 
Hi Jtamma,

As you probably know, nested loops are killing your performance. It can be very hard to find out exactly why the RDBMS is choosing this particular strategy. It may be a (still) missing index, but it may also depend on some RDBMS parameters. I have been puzzling over similar situations for quite some time; it made me feel like I didn't understand at all how an RDBMS works.

Extra storage may give you some breathing room, but it does not mean your performance problem is gone.
Really solving this calls for intimate knowledge of the specific data model, the database parameters, and some creativity.
From your previous posts, I think you should be able to come up with a good solution.

Good luck.
 
The problem here is that the DBAs think this is the best strategy Oracle can choose in the situation, and no one is working on ways to avoid the nested loops. Until now I never bothered to look into the optimizer world; now I find myself digging into the Oracle performance manual.
Anyway, thanks Hans63 for the morale-boosting comment. I will update this post for others' benefit if a smarter solution comes up.
 
Jtamma,

Try to get the DBA to explain to you what a nested loop really does. Maybe then he will see why nested loops on these amounts of records are bad. (On small sets, they are a viable solution.)

If that does not work, create a simple query that you can easily change, so that the query plan switches from nested loops to an index scan (or even a full table scan), a hash join, or some other join method. Maybe you can look at the AskTom website for examples of this.
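For example, a hint lets you force the join method so you can compare the plans and timings (aliases and columns invented):

-- force a hash join between tab2 and tab3 instead of nested loops
SELECT /*+ USE_HASH(t2 t3) */ t3.id, t2.col_b
  FROM tab2 t2
  JOIN tab3 t3 ON t3.tab2_id = t2.id
 WHERE t2.create_date BETWEEN :d1 AND :d2;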
Good luck with this situation.
 
I can't thank you enough, Hans63. Your comment that the nested loop was the problem gave me the confidence to go ahead and play around with my newly acquired knowledge of optimization. Even before I read your latest reply, I had reduced the time from 6 hours to 45 minutes with a hash join, and with a merge hint down to 10 minutes.
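For others' benefit, the hints were of this form (query heavily simplified; the real one joins all five tables):

-- the hash-join hint got it from 6 hours down to 45 minutes;
-- adding a merge-join hint brought it down to 10 minutes
SELECT /*+ USE_HASH(t2 t3) USE_MERGE(t3 t4) */ t3.id, t2.col_b, t4.col_c
  FROM tab2 t2
  JOIN tab3 t3 ON t3.tab2_id = t2.id
  JOIN tab4 t4 ON t4.id = t3.tab4_id
 WHERE t2.create_date BETWEEN :d1 AND :d2;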

Thanks.
 