Iamundercover
MIS
Hello,
Q. We have implemented a data lake (PDM based on IBM Banking DW) on a Netezza appliance and we use DataStage as the ETL tool.
I need to design multiple ETL flows for data (Events: opening of an account and transactions: pay at the grocery store, recieve ur salary etc…) coming from the same source system which would run every 20 mins and load several different tables and a common table (columns: surrogate key, source system id, unique id in source system =account number).
The trouble lies in
- The ETL flow must be able to run in parallel
- When they run in parallel, they could load the same reference row in the common table.As the insert of a value by one load won’t be seen by another session until that batch insert completes and implicitly commits. So if 2 loads load the same account not existing in the target at the same time, we will have dups.
We have "one commit principle" at our site due to which we can’t load the account_target during the processing of the ETL flow. It can be loaded only as the last step in the ETL flow. Thus, the issue….
I suggested using an intermediate table which we commit/write to during the execution of the job by using the netezza connector’s brute force solution
and then using this intermediate table to load the facts/relationship tables.
So, with the brute force method, I ensure that only one of the conflicting/potential duplicate finds it’s way to the DB and the rest of the jobs which build the relationship tables/sub-types utilize the surrogate key which made it to the intermediate table.
However, my solution was turned down by the architect and i am now looking for alternates…..
by experience do you know one apart from
- Having a separate flow for the common table/lookup
- serializing the ETL flows….
Thanks
Q. We have implemented a data lake (PDM based on IBM Banking DW) on a Netezza appliance and we use DataStage as the ETL tool.
I need to design multiple ETL flows for data (Events: opening of an account and transactions: pay at the grocery store, recieve ur salary etc…) coming from the same source system which would run every 20 mins and load several different tables and a common table (columns: surrogate key, source system id, unique id in source system =account number).
The trouble lies in
- The ETL flow must be able to run in parallel
- When they run in parallel, they could load the same reference row in the common table.As the insert of a value by one load won’t be seen by another session until that batch insert completes and implicitly commits. So if 2 loads load the same account not existing in the target at the same time, we will have dups.
We have "one commit principle" at our site due to which we can’t load the account_target during the processing of the ETL flow. It can be loaded only as the last step in the ETL flow. Thus, the issue….
I suggested using an intermediate table which we commit/write to during the execution of the job by using the netezza connector’s brute force solution
and then using this intermediate table to load the facts/relationship tables.
So, with the brute force method, I ensure that only one of the conflicting/potential duplicate finds it’s way to the DB and the rest of the jobs which build the relationship tables/sub-types utilize the surrogate key which made it to the intermediate table.
However, my solution was turned down by the architect and i am now looking for alternates…..
by experience do you know one apart from
- Having a separate flow for the common table/lookup
- serializing the ETL flows….
Thanks