Incremental Load


JBroadwater (Programmer) - Dec 4, 2009
I initially understood Load (the L part of ETL -- Extract, Transform, Load) to mean copying prepared (complete, not partial) fact+dimension tables from a stage area to a presentation area.

However, the concept of an incremental load appears to conflict with this understanding: you can incrementally update fact/dim tables in your staging area, but how can you then refresh your presentation area incrementally? A load of the presentation area implemented as a swap of the most recent partition of a fact table is one way to resolve this apparent conflict, but frankly, I don't see any way other than swapping a partition (swapping an entire table isn't really incremental, is it?). Maybe incrementally updating a stage-area fact table, then using an RDBMS/Oracle replication facility to apply the same incremental update to the corresponding presentation-area fact table?
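As a minimal sketch of that partition-swap idea, assuming Oracle and entirely hypothetical names (staging.stg_sales_2009_12 holds one prepared month, presentation.fact_sales is range-partitioned by month), the load step could be a single exchange:

ALTER TABLE presentation.fact_sales
  EXCHANGE PARTITION p_2009_12
  WITH TABLE staging.stg_sales_2009_12
  INCLUDING INDEXES
  WITHOUT VALIDATION;

The exchange is a metadata operation, so the presentation fact table is only "in flux" for an instant.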

That is, replacing some rows in a fact or dimension table already sitting on a presentation server would seem to violate the principle of stage/presentation area separation. Adding or changing rows in a fact or dimension table sure sounds like a purely staging-area activity to me.

So, to be clear, I am incrementally updating dim/fact tables in my staging area, but then I'm copying the entire tables from stage to presentation areas. And I'm wondering if other people have a different understanding or can suggest a better approach.
 
Typically, you do not purge and replace an entire dimension or fact table. If you have a product dimension, and someone creates a new product, you will incrementally load that new product into your presentation area.

There are a few variations on how to do this. Here is how I do it:

1. An ETL package performs the "E" or EXTRACT from the source system to the staging area. The staging area only holds what was incrementally extracted from the source system -- not a persistent copy.

1a. (Optional). An ETL package (I am using "ETL package" as a catch-all for a tool-generated package or a reused SQL script) will load incremental data into the persistent copy of the data in the ODS.

2. Another ETL package performs the "T" or transformation of the data from the staging area or the ODS. This can be done with simple SQL statements (joins or RTRIMs to scrub data, for example), or it could employ more advanced transformations in the ETL tool itself. But the basic idea is that the transformation is performed in memory; I don't transform it, write it to disk, and then load it.

3. The same package described in #2 takes the transformed data and loads it into facts and dimensions. It doesn't matter if it is loading an incremental batch or an entire table's worth of data (a rough sketch of this flow follows below).
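To make steps 1-3 concrete, here is a minimal sketch in plain SQL with made-up names (src.product, stg_product, dw.dim_product, and an etl_control table that records the last load time are all hypothetical):

-- 1. EXTRACT: pull only rows changed since the last run into staging.
INSERT INTO stg_product (product_id, product_name, category)
SELECT p.product_id, p.product_name, p.category
FROM   src.product p
WHERE  p.updated_at > (SELECT last_load_ts
                       FROM   etl_control
                       WHERE  table_name = 'PRODUCT');

-- 2 + 3. TRANSFORM in the statement itself (trims, case fixes) and LOAD
--        the increment into the dimension: update changed rows, insert new ones.
MERGE INTO dw.dim_product d
USING (SELECT product_id,
              RTRIM(product_name) AS product_name,
              UPPER(category)     AS category
       FROM   stg_product) s
ON (d.product_id = s.product_id)
WHEN MATCHED THEN
  UPDATE SET d.product_name = s.product_name,
             d.category     = s.category
WHEN NOT MATCHED THEN
  INSERT (product_id, product_name, category)
  VALUES (s.product_id, s.product_name, s.category);

An ETL tool would do the same thing with its own in-memory transforms; the point is that only the delta ever touches the dimension.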

Yes, I know that some keep a persistent copy in "Staging." And yes, some use ODS for a snowflaked version of a Star Schema (basically data that is already somewhat transformed).

But the important thing is to pick a good methodology that works for your company and to stay consistent. It's not necessary to comply with book definitions of acronyms to employ a successful data warehouse, but those definitions are very important as a guideline. The size of your data and your company environment will dictate whether or not you do exactly what "The Data Warehouse Toolkit" suggests, for example. Some companies have the resources to have more layers than I've described above. Some don't. But the one thing for certain is that you should NEVER need to rebuild entire dimensions and fact tables for each load.
 
Thank you, RiverGuy.

I do not rebuild entire dimensions or facts for each load. I do however keep 2 copies of dimensions and facts: one copy in a staging area which is off limits to end users, and a second copy in a presentation area which is available to end users.

Perhaps I should have asked: do you keep two copies of your dimension and fact tables, or do you keep only one copy and directly update the dim and fact tables in your presentation area?
 
I only keep one copy. But I still think you are rebuilding them. If you have an off-limits copy which you update incrementally, and you don't update the presentation copy incrementally, then how would the presentation copy ever get updated if you don't purge and replace the entire thing?
 
Thanks. Yes, I take your point. I did leave out one part of my architecture/design that is significant: I only copy/move parts (partitions) of my stage-area tables to the presentation area. But the copies to the presentation area are unequivocally purge and replace.

I gather it is common, then, for warehouse engineers to think of transform processes as confined to a staging area, but that the last step of incrementally loading transformed dimensions and facts happens across the stage/presentation boundary (this last load step is not confined to the staging area). Consequently, at some point end users would either have access to a fact table that is being updated (in flux), or updates would happen during off hours or at times when end users are excluded.

We also like to run verifications on our dimension/fact tables to certify they are correct before we make them available to end users. Our (perhaps too) conservative approach gives us a chance to do that.

It seems like directly, incrementally updating dimension and fact tables in end-user-accessible presentation areas provides a greater opportunity to accidentally expose data errors to end users (e.g. in the event that the last incremental load step is incorrect or buggy). But really this is just a question of where and when one concentrates QA/testing efforts. A rock-solid load step helps prevent data errors, just as a post-load data verification does.
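The kind of verification I mean is a handful of sanity queries against the staging copy before anything is moved across; a sketch, with hypothetical names:

-- Fact rows whose product key has no matching dimension row.
SELECT COUNT(*) AS orphan_fact_rows
FROM   staging.fact_sales f
LEFT JOIN staging.dim_product d ON d.product_key = f.product_key
WHERE  d.product_key IS NULL;

-- Obviously bad measures.
SELECT COUNT(*) AS negative_amounts
FROM   staging.fact_sales
WHERE  sales_amount < 0;

Only when checks like these come back clean do we copy the partitions to the presentation area.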

I appreciate your comments.
 
Quote:
I gather it is common, then, for warehouse engineers to think of transform processes as confined to a staging area, but that the last step of incrementally loading transformed dimensions and facts happens across the stage/presentation boundary (this last load step is not confined to the staging area). Consequently, at some point end users would either have access to a fact table that is being updated (in flux), or updates would happen during off hours or at times when end users are excluded.

This all depends on many points of the architecture. If you do nightly loads during off-hours, you can safely lock a table and perform your incremental updates. The user would never see the in-flux state since the table would be locked in presentation.
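For the nightly case, a sketch of what I mean, assuming SQL Server and invented names (staging.stg_customer_delta holds the night's increment):

BEGIN TRANSACTION;

-- The exclusive table lock is held until COMMIT, so report queries wait
-- rather than reading a half-applied increment.
MERGE presentation.dim_customer WITH (TABLOCKX) AS d
USING staging.stg_customer_delta AS s
  ON d.customer_id = s.customer_id
WHEN MATCHED THEN
  UPDATE SET d.customer_name = s.customer_name
WHEN NOT MATCHED THEN
  INSERT (customer_id, customer_name)
  VALUES (s.customer_id, s.customer_name);

COMMIT;

Queries see either the old state or the fully committed new one, never the in-between.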

If you do near-real-time loads, it depends on how you define "in flux." You obviously cannot lock a table during business hours. You could load row by row. So is it really in-flux? Or is it just a sequential representation of data that also changed sequentially in the source system?

In other cases, like my particular case, the presentation layer is SQL Server Analysis Services--so a cube. Since the users do not touch the relational database, the cube is processed and they never see an in-flux state.

Quote:
It seems like directly, incrementally updating dimension and fact tables in end-user-accessible presentation areas provides a greater opportunity to accidentally expose data errors to end users (e.g. in the event that the last incremental load step is incorrect or buggy). But really this is just a question of where and when one concentrates QA/testing efforts. A rock-solid load step helps prevent data errors, just as a post-load data verification does.

This again depends on the requirements and architecture. I've been in a situation where most analysis was done on a daily, total aggregate basis on down. So if I encountered errors, I would not load any of the data into the presentation layer until resolved. In other cases, it would be ok to have partial data if total aggregations were not important, and errors were logged to special error tables to be picked up and reloaded later.
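As a sketch of the error-table idea (every name here is hypothetical), rows that fail a check are parked rather than loaded, and only the clean rows continue on:

-- Park fact rows whose product has no dimension row yet.
INSERT INTO etl_error_sales (load_batch_id, product_id, sales_amount, error_reason, logged_at)
SELECT s.load_batch_id, s.product_id, s.sales_amount,
       'Missing product dimension row', CURRENT_TIMESTAMP
FROM   staging.stg_sales s
LEFT JOIN dw.dim_product d ON d.product_id = s.product_id
WHERE  d.product_id IS NULL;

-- Load only the rows that resolve cleanly.
INSERT INTO presentation.fact_sales (load_batch_id, product_key, sales_amount)
SELECT s.load_batch_id, d.product_key, s.sales_amount
FROM   staging.stg_sales s
JOIN   dw.dim_product d ON d.product_id = s.product_id;

A later run re-reads etl_error_sales once the missing dimension rows have arrived.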
 
We also have two "copies" of the data, an Interactive Data Store, which is like an ODS and which is off-limits to end-users (except some power users), and the actual warehouse, which is prepared for end-user access. Kimball likens these to the "kitchen" and the "dining room". Only the staff eats in the kitchen. The guests use the dining room.

-------------------------
The trouble with doing something right the first time is that nobody appreciates how difficult it was - Steven Wright
 
This is the approach I have always taken. The operational area (ODS) contains a complete copy of all data that has ever been loaded into the warehouse, so it does contain copies of all the data that the user accesses. The user data is usually in a separate area, i.e. a "data mart", and how you get the latest data from the ODS to the data marts depends on a number of factors, including complexity, data volumes, batch windows, etc.

One direct method of achieving this is to build views that have the same columns as the data mart dimension but only show data that is new or has changed since the dimension was last updated. It is then relatively straightforward to create an incremental load for the dimension table.
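A sketch of that view-based approach, with hypothetical names (ods.customer, datamart.dim_customer, and a load_timestamp column on the dimension), using Oracle-style SQL:

-- The view hides the "what changed since the last load" logic.
CREATE OR REPLACE VIEW ods.v_dim_customer_delta AS
SELECT c.customer_id,
       c.customer_name,
       c.region
FROM   ods.customer c
WHERE  c.last_updated > (SELECT MAX(load_timestamp)
                         FROM   datamart.dim_customer);

-- The incremental load itself then stays simple: merge the delta view in.
MERGE INTO datamart.dim_customer d
USING ods.v_dim_customer_delta s
ON (d.customer_id = s.customer_id)
WHEN MATCHED THEN
  UPDATE SET d.customer_name  = s.customer_name,
             d.region         = s.region,
             d.load_timestamp = CURRENT_TIMESTAMP
WHEN NOT MATCHED THEN
  INSERT (customer_id, customer_name, region, load_timestamp)
  VALUES (s.customer_id, s.customer_name, s.region, CURRENT_TIMESTAMP);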


John
Online Business Intelligence
 