
Kimball's recommendation for staging


jtamma (IS-IT--Management), Oct 3, 2007
I am designing an ETL process and am really confused about the whole staging architecture. Please provide whatever links you can; I am ready to read anything. My questions are as follows.

(1) The staging area between the source and the data mart (the "back room", as they call it): is it supposed to be a persistent staging area or a non-persistent one, or does it depend on the requirements of the business? Does Kimball say anything about it? I did a lot of reading but could not get a clear answer, except that I gather he recommends persisting conformed dimensions in staging.

(2) If it depends on the requirements, in what cases would you persist structures in staging (meaning the tables are not wiped out with each load), apart from conformed dimensions?

(3) If I decide to persist my source information in staging with a slightly different ER model than the source (still 3NF), will I be deviating too much from Kimball, or is this what Kimball's terminology calls an ODS?

(4) People who do Kimball-style data warehousing, do they usually have
    (a) Source -> ODS -> Data Mart, or
    (b) Source -> Transient Staging -> Data Mart, or
    (c) Source -> Persistent Staging -> Data Mart?
Hans63
If you are ready to read anything, perhaps you can try something outside the DWH field (like the Kama Sutra, The Hitchhiker's Guide to the Galaxy, the Bible, or ... ;-) )

What I have used for staging:
source -> staging -> ODS -> datamart

The staging area was kept for a few days for some parts (like the tables into which the source data was imported) and not kept at all for others (the tables from which the deltas were derived).

The data model of our ODS looked much like the source systems (3NF), with some conforming added, like the codes used for sex, marital status, etc.

The reason we kept the data in the first part for several days is robustness. You cannot afford to leave a day of data unprocessed, so should the ETL process fail, you must be able to correct the cause of the failure and restart the ETL process.
As this may take more than one day (it usually happens on a Friday), we had to be able to keep the data for a few days. We kept it in the database for performance reasons: importing the data could take several hours, so if you do not have to import again, it is easier to catch up with the normal schedule.
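
To make that retention idea concrete, here is a minimal Python sketch of the purge step. The stg_customer table, the load_date column, and the five-day window are all invented for the example; it runs against an in-memory SQLite database just for illustration.

import sqlite3
from datetime import date, timedelta

RETENTION_DAYS = 5  # assumed window; tune to how long a fix may take

def purge_old_loads(conn):
    # Delete staged loads that fell outside the retention window.
    # stg_customer and load_date are invented names for the example.
    cutoff = (date.today() - timedelta(days=RETENTION_DAYS)).isoformat()
    conn.execute("DELETE FROM stg_customer WHERE load_date < ?", (cutoff,))
    conn.commit()

# Runnable demo: the old load is purged, today's load survives.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE stg_customer (cust_id INTEGER, load_date TEXT)")
conn.execute("INSERT INTO stg_customer VALUES (1, '2007-01-01'), (2, ?)",
             (date.today().isoformat(),))
purge_old_loads(conn)
print(conn.execute("SELECT * FROM stg_customer").fetchall())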

The deltas were derived by comparing the imported data with the ODS. The results were stored in a separate set of tables. These tables were emptied at the start of this ETL phase, not afterwards, because their contents can be a valuable asset in tracing why the ETL process failed.
The delta tables contained only the primary keys of two consecutive loads. This kept the record size to a minimum, allowing more records to be held in memory.
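
A rough Python sketch of that key comparison (function and variable names are illustrative only; spotting updated attributes would additionally require comparing the imported rows against the ODS):

def derive_delta(keys_today, keys_previous):
    # Compare the primary keys of two consecutive loads. Working with
    # keys only keeps the record size down, as described above.
    today, previous = set(keys_today), set(keys_previous)
    inserts = today - previous    # keys that appeared in the latest load
    deletes = previous - today    # keys that vanished from the source
    return inserts, deletes

print(derive_delta([(1,), (2,), (3,)], [(1,), (2,), (4,)]))
# -> ({(3,)}, {(4,)})  key 3 is new, key 4 disappeared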

Between the ODS and the datamart we used transient tables for storing intermediate results.
This phase consisted of one part where intermediate results were gathered, possibly from several sources, followed by the actual loading of the datamart.
jtamma
Hans63, have a star for your reading suggestions and for letting the secret of your DW expertise out. :)

Thanks for sharing your staging strategy too. This makes things a little clearer. Can you say something about staging and delivering dimensions? How do you handle them in your environment, especially conformed dimensions, correlated dimensions, and hierarchies in dimensions?

Hierarchies in dimensions are flattened out in my datamart. Is it a good idea to persist one copy of the hierarchy tables in normalised form in staging (a kind of snowflaked version)? Does it help in making consistent updates to the flattened dimension when a hierarchy's attributes change?

I guess conformed dimensions are always permanently staged (if not, let me know). My DW team is also responsible for coming up with conformed dimensions and for maintaining them in the future. We have identified all the important dimensions that should be conformed (essentially all the dimensions of my datamart). I have the same confusion there: should I maintain hierarchies as rolled-up dimensions in the conformed schema so that other datamarts can use them?

On the same note, I have two big dimensions that are highly correlated (1:1). I need just 3-4 attributes from both of them for my data mart. If I combine the two with just the attributes I need in my datamart, should I also make this smaller combined dimension available in the conformed dimension schema, or should it be part of my datamart only? The parent dimensions' full versions would be in the conformed schema, of course.
RiverGuy
My preferred method is similar to what Hans has described: source -> temporary staging tables -> ODS which resembles the source system -> star schema. I have also worked in an environment where we kept a 3NF snowflaked ODS. I don't think you would NEED to have that; your business requirements would dictate it.

I do think it is a good idea to keep a persistent data structure which resembles the source data. This makes ETL much easier should you need to tweak something or reload. It's a lot easier than storing compressed flat files, or pulling directly from the source system.
Hans63
Jtamma,

I am not quite sure I understand your question concerning the dimensions. Let's see if I can come up with an answer anyway.

The dimensions in the datamart are persistent, because we are using slowly changing dimensions of type 2 (Kimball) and we don't want to rebuild the entire dimension every day, certainly not the bigger dimensions. We considered doing so for the small (<100 records) dimensions, but opted for a consistent approach for the sake of maintainability.

The staging area between the ODS and the datamart included a delta determination as well. This posed no problem, because the number of records in the dimensions is much smaller than in the fact tables.
So from the ODS we built the dimension as it looks today. The result was stored in a transient table, again emptied at the start of the ETL process.
We compared this with the situation from yesterday and decided which records to close and which to add.
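
A bare-bones Python sketch of that close-or-add decision; the names and the 9999-12-31 open-end convention are assumptions for the example, not the poster's actual code.

from datetime import date

HIGH_DATE = date(9999, 12, 31)  # assumed convention for a still-open row

def scd2_changes(current, todays_build, today):
    # current and todays_build map natural key -> attribute tuple.
    # Returns which rows to close and which to add (SCD type 2 on all attributes).
    to_close, to_add = [], []
    for key, attrs in todays_build.items():
        if key not in current:
            to_add.append((key, attrs, today, HIGH_DATE))  # brand-new member
        elif current[key] != attrs:
            to_close.append((key, today))                  # expire the old version
            to_add.append((key, attrs, today, HIGH_DATE))  # open the new version
    # Depending on policy, keys present in current but missing from
    # today's build could be closed here as well.
    return to_close, to_add

close, add = scd2_changes(
    current={"C1": ("Smith", "NL"), "C2": ("Jones", "UK")},
    todays_build={"C1": ("Smith", "BE"), "C2": ("Jones", "UK"), "C3": ("Brown", "US")},
    today=date(2007, 10, 3),
)
# close -> the old C1 version; add -> the new C1 version and new member C3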

Hierarchies were flattened out (denormalised). We have not yet encountered the need for a separate table for the hierarchies (and, using Oracle, we can always use materialized views if the need arises).
The source for the hierarchies is the ODS, where some of the hierarchies are stored in separate tables and some in one table (the source system is denormalised there).
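
For illustration, flattening a parent/child hierarchy into the level columns of a denormalised dimension row might look like this in Python (all names invented for the example):

def flatten(leaf, parent_of, name_of, max_levels=4):
    # Walk up from a leaf to the root and return one denormalised row
    # with fixed level columns; unused levels are padded with None.
    path, node = [], leaf
    while node is not None:
        path.append(name_of[node])
        node = parent_of.get(node)
    path.reverse()                              # root first, leaf last
    path += [None] * (max_levels - len(path))
    return tuple(path[:max_levels])

parent_of = {"NL-N": "NL", "NL": "EU"}          # child -> parent
name_of = {"EU": "Europe", "NL": "Netherlands", "NL-N": "North"}
print(flatten("NL-N", parent_of, name_of))
# -> ('Europe', 'Netherlands', 'North', None)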

Doing a consistent update depends on the type of dimension you have (and possibly on which attribute is changed). In an SCD2 dimension you don't update anything apart from the closing date (if you have SCD2 defined on ALL attributes).

About your last question:
If there are reasons why the two big dimensions are kept apart in the conformed dimension schema (master star schema?), I wonder why they are put into one dimension in your datamart (but I do not know the business logic you have to deal with).
My first thought would be to keep it in your datamart only. But if some other datamart needs this same dimension, you should share it.
Hmmm, what I am trying to say is: deal with this in a pragmatic way.

Hope this helps a bit.
jtamma
Thanks RiverGuy and Hans63,

I think the key to your ETL architecture is the "ODS". However, I just want to know what you mean by "ODS" in this context. I am sorry, but "ODS" really confuses me and there are 1001 definitions out there.

Does the "ODS" you mention have data from all the sources of the datamart in normalised (or, in some cases, denormalised) form? So if I have to calculate the delta for dimension changes today, I do not need to get a copy from the datamart, because one copy is already in the ODS and I can compare it with the latest copy from the source system to calculate the delta? Do you also assign surrogate keys to the ODS tables, or is that done only in the datamart?

I think things would be much clearer if you answered this question: "If, for some reason, the schema containing the star schema crashes and you do not have a backup, you can reconstruct the whole datamart from the ODS alone, without going back to the source systems. Right?"

If I design this kind of ODS, I can see that there are many transaction applications in my company that would like to use it, because the data here would be integrated from all the sources, clean, normalised, and useful because of its conformity. So do you allow other applications to use this ODS, or is it part of the staging that Kimball says should be accessible only to ETL? And when do you purge this data, as it will grow every year just as the facts grow in the datamart?

Hans63, you answered my dimension questions thanks.
Hans63
Jtamma,

The ODS as I have used it is a set of normalised tables that closely resembles the source system. It contains timestamped variations of the source data: for storage reasons, every record is given a start date as well as an end date. We have used surrogate keys in this ODS, but so far they have had no practical use; comparison is done on the natural keys as provided by the source system.
Attributes are added when the source system is modified. Fields are (of course) never dropped.
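
As a rough in-memory illustration of that versioning (names invented for the example; a real ODS of course lives in database tables and, as noted above, also carries surrogate keys):

from datetime import date

HIGH_DATE = date(9999, 12, 31)  # assumed "still current" end date

def apply_source_row(ods, natural_key, attrs, load_date):
    # ods maps natural key -> list of (attrs, startdate, enddate) versions.
    versions = ods.setdefault(natural_key, [])
    if versions and versions[-1][0] == attrs:
        return                                         # unchanged: nothing to add
    if versions:
        prev_attrs, start, _ = versions[-1]
        versions[-1] = (prev_attrs, start, load_date)  # close the current version
    versions.append((attrs, load_date, HIGH_DATE))     # open the new version

ods = {}
apply_source_row(ods, "C1", ("Smith", "single"), date(2007, 10, 1))
apply_source_row(ods, "C1", ("Smith", "single"), date(2007, 10, 2))  # no change
apply_source_row(ods, "C1", ("Smith", "married"), date(2007, 10, 3))
print(ods["C1"])  # two timestamped versions of the same natural key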

So yes, if your datamart crashes you can eventually rebuild it from scratch. That is, if you keep the older versions of the code that built the datamart available. Due to changes in the datamart or the ODS, you may have to use older versions of the code to rebuild the older dates.

It is quite natural, and an additional benefit for the DWH team, if an ODS can be used (and is used) for the reasons you describe. If this happens, make sure your team gets the credit.
I think this ODS could be very useful, so no, don't keep it for ETL only. But I have not used it this way, so there may be reasons NOT to allow processes other than ETL access to it.

As to how long we keep data in the ODS: indefinitely. Our system was not so big that it outgrew technological advances in storage. The ODS is in fact rather small, as only changed data is added to it.
Compared to snapshot fact tables, the size of this ODS was no problem at all.
Ultimately, however, it is the business that decides, within the technologically set boundaries, how long to keep data.
jtamma
Hans63,

Thanks for describing your "ODS". After designing my datamart and understanding the source systems, my design instincts suggested having something like what you call an "ODS" to feed my datamart, but I was a little doubtful. When I started researching best practices in this regard, I got caught up in the Kimball vs Inmon arguments. I tried to find what Kimball has to say about this kind of ODS but couldn't find anything. I found one article where he explained that, with the ability to store data at atomic grain in datamarts, an "ODS to report on operational data" is not required, but nothing about this kind of ODS in the staging area.

I was not sure whether, if I designed this ODS along with my datamart, my architecture would be close to Inmon's ("a 3NF data warehouse between the datamarts and the sources"), except that it is called an ODS and, rather than having the whole enterprise in scope, it holds only the data that is important for my datamart.

If my company's future datamarts share not only dimensions but this ODS as well (and we keep extending the ODS model with new sources as required by new datamarts), in the end we will have nothing but an Inmon model. So if we do Kimball with an ODS incrementally, we become Inmon :).

Why doesn't the Kimball philosophy say anything about this kind of ODS? (Is it because then it wouldn't be different from Inmon's?) Per Kimball, I know datamarts should share conformed dimensions, but how can we prevent different datamarts from extracting from the same source systems multiple times if we do not plan a shared ODS? Does he expect old datamarts to be the source for new datamarts for the common sources? I do not know how practical that is, as datamarts are subject-specific and dimensional.

Anyway, I do not know whether my message makes any sense to you. In one line: thanks, and I am going ahead with "your kind of ODS", without worrying whether it is Kimball's or Inmon's architecture.

One more question -

Is it common practice to have a schema dedicated to hosting conformed dimensions? My thought is to have a central place where all the dimensions that are important to the organisation are hosted, and then to populate all the datamarts from these conformed dimensions (even the first datamart that I am designing). Is that not how you do it? How do you maintain conformed dimensions?
 