Tek-Tips is the largest IT community on the Internet today!

Members share and learn making Tek-Tips Forums the best source of peer-reviewed technical information on the Internet!

  • Congratulations strongm on being selected by the Tek-Tips community for having the most helpful posts in the forums last week. Way to Go!

A couple of questions about populating a Data Warehouse

Status
Not open for further replies.

d000hg

Programmer
Jun 2, 2007
6
GB
Hi there. I'm new here so if my questions are already answered please let me know.

Anyway, two questions about loading data into a DW from the operational process databases:

1)Do we normally update the DW each time the operational DBs are modified? i.e when a sale is made would the DW be updated, or is it normally a process run onece a day/week/etc (I beleive the first)

2)If I have an operational DB and want to perform my ETL to update the DW each night, how do I stop the same data being added every day? For instance if on Monday the operational DB has 100 records, and 100 more are added on Tuesday, how does the DW update on Tuesday night know that only the new records should be added?

Thanks for any help.
 
1) This is dependend on your specific situation. What do you want in your DW (like: what grain of your facts. Do you want all transactions, or is a daily load enough, in which case you loose records which are created and deleted within one day).
What is available to you. You may be dependend on some external organisation for some of your sources. Not all external organisations are willing to comply with your needs.

So, both variations occur.

And if you mean: do you have a batch load or streaming ETL. Same conclusion, both variations occur.

2) If you have any influence or the source database, you may be able to use the DBMS features to find the changes since a specified moment in time (e.g. last run).
If you do not, you have to compare the 2 datasets with each other, using a full outer join, to find the new, changed and deleted records. If your DBMS has it available, you might use a hashvalue to decide whether records have changed. We have to perform this operation on a daily base.
We compose a string from all non-key fields in a record and compute a hashvalue on this. This hashvalue is stored in our DW so we can compare the hashvalue in order to decide if a record has changed or if all fields are still the same.
 
I'll expand a little on what Hans63 has already listed.

1. The frequency of update of the DW from transaction systems is a user requirement. But be flexible when you design the solution as DW's that were weekly could change to daily or hourly after the end-users start fully utilizing the capabilities of the DW.

2. There are several ways to handle this. First, you need to determine if the prior version of the data record has value. If yes, you need to keep one or more verions of the particular record (whether dimension or fact). If prior versions are not important, you simply overwrite the old version with the new version.

If you have no available "change" field(s) from the source system, you can use a "Checksum" approach to determining if the record has changed. Checksum is too detailed to go into here. However, if you have the "change" fields, such as "last updated on" or "last change timestamp", you can use those fields to only apply the changed records to the DW. I believe Hans63's hashvalue is the same as Checksum - checksum is the industry standard term.

-------------------------
The trouble with doing something right the first time is that nobody appreciates how difficult it was - Steven Wright
 
Well, I think checksum and hashvalue are equivalent. Hash is termed by Oracle and Business Objects.

db000hq: Coming back to your second question, also be aware that source side dependencies on fields like "last_update" or "last_change_timestamp", etc. may cause you NOT to get 100% of all changes, because someone may disable or bypass setting these fields on the source system (even if they are managed by triggers in the database, as they can often be disabled).
So make sure you have some idea of a fall-back plan in case this happens.
 
2/ in case of using oracle you can use the merge statement which is, in case record doesn't exist insert one, else update the existing one.
Most probably other databases have simular thing.
Anyways, you always have a primary key on which you can join.
 
Status
Not open for further replies.

Part and Inventory Search

Sponsor

Back
Top