Tek-Tips is the largest IT community on the Internet today!

Members share and learn making Tek-Tips Forums the best source of peer-reviewed technical information on the Internet!

  • Congratulations strongm on being selected by the Tek-Tips community for having the most helpful posts in the forums last week. Way to Go!

need solution for IIel load late arriving dimension

Status
Not open for further replies.
Oct 22, 2015
4
BE
Hello,

Q. We have implemented a data lake (PDM based on IBM Banking DW) on a Netezza appliance and we use DataStage as the ETL tool.

I need to design multiple ETL flows for data (Events: opening of an account and transactions: pay at the grocery store, recieve ur salary etc…) coming from the same source system which would run every 20 mins and load several different tables and a common table (columns: surrogate key, source system id, unique id in source system =account number).

The trouble lies in

- The ETL flow must be able to run in parallel
- When they run in parallel, they could load the same reference row in the common table.As the insert of a value by one load won’t be seen by another session until that batch insert completes and implicitly commits. So if 2 loads load the same account not existing in the target at the same time, we will have dups.


We have "one commit principle" at our site due to which we can’t load the account_target during the processing of the ETL flow. It can be loaded only as the last step in the ETL flow. Thus, the issue….

I suggested using an intermediate table which we commit/write to during the execution of the job by using the netezza connector’s brute force solution
and then using this intermediate table to load the facts/relationship tables.

So, with the brute force method, I ensure that only one of the conflicting/potential duplicate finds it’s way to the DB and the rest of the jobs which build the relationship tables/sub-types utilize the surrogate key which made it to the intermediate table.

However, my solution was turned down by the architect :( and i am now looking for alternates…..

by experience do you know one apart from

- Having a separate flow for the common table/lookup
- serializing the ETL flows….

Thanks
 
You can load the fact record with a "Late Arriving/Unknown" value for the Dimension. Then, at some later time, maybe the next day, or when you receive notification that the dimension job has completed, run a separate "Late Arriving" process. That process will go across the fact table looking for rows that have the value of "Late Arriving" and replace it (UPDATE the record) with the proper dimension value.

==================================
adaptive uber info galaxies (bigger, better, faster, and more adept than agile innovative cognitive big data clouds)


 
Thanks for the response. However, the decision to apply the design pattern for late arriving dimension is already taken (create the reference/lookup/super type records for late arriving dimension using the fact (Events/Transactions....) records).
 
I see. Due to the massively parallel of Netezza, you probably need to load all the dimensions before the facts (if practical). If not, you need to load as many as you can, then load the fact, then do the Late Arriving Cleanup process. Another method i have used is to let the duplicates happen, then have a Duplicate Resolution process to cleanup.

==================================
adaptive uber info galaxies (bigger, better, faster, and more adept than agile innovative cognitive big data clouds)


 
As the cleanup based on the duplication resolution process is expensive, we want to avoid it in the first place.....
 
If you want to avoid the rework, then design the process so that no rework is involved.

This is a situation that arises in Project Management quite often. You have time/schedule, scope/quality, and budget/cost. These form the "iron triangle" of Project Management. If you want to reduce one leg of the triangle, one or both of the others will grow. I don't see how this situation is any different. If you want to save time, you're going to increase scope (duplicate processing) or increase computing power (cost). Discuss this among your team and decide what's most important: time, quality, or money. Or perhaps some intermediate compromise.

==================================
adaptive uber info galaxies (bigger, better, faster, and more adept than agile innovative cognitive big data clouds)


 
Status
Not open for further replies.

Part and Inventory Search

Sponsor

Back
Top