Tek-Tips is the largest IT community on the Internet today!

Members share and learn making Tek-Tips Forums the best source of peer-reviewed technical information on the Internet!

  • Congratulations IamaSherpa on being selected by the Tek-Tips community for having the most helpful posts in the forums last week. Way to Go!

Retaining purged data

Status
Not open for further replies.

HogHunter

Programmer
Jun 4, 2004
20
US
We have a star schema data warehouse that has an order fact that we update nightly. We use the natural key from the source (Order Number) to locate the fact row and apply changes. The source system is preparing to purge old orders so they can reuse order numbers. They will maintain a table (PurgeTable) that has the order number and the date it was purged. Once an order is purged, no updates can be done to that particular order. However, a new order can be created with that order number.

Our plan is to add a column to our fact table the holds purge date. The ETL process will look at only those fact rows that have a null in the purge date when it processes updates and inserts from the order table. We will also run a periodic job that reads the PurgeTable and marks orders as having been purged by setting the purge date in the fact to the value found in the source table.


My question is does this follow know standards? It seems pretty straight forward to me but I would appreciate input from you guys. Also, can either of you point me in the direction of a documented standard or best practice for dealing with retaining purged data in a data warehouse?

Thanks,
Thom
 
The procedure you describe is adequate, however....

The way to avoid these issues is to have a specific warehouse key generated for each incoming row, referred to as a surrogate or synthetic key. This warehouse key becomes the primary key on the table and provides uniqueness. The natural key would then be the order number concatonated with the purge date.

In general, it is a bad warehouse practice to rely on external systems to provide your primary keys, including those generated by the government such as SSN or EIN.

-------------------------
The trouble with doing something right the first time is that noboby appreciates how difficult it was.
- Steven Wright
 
I totally agree with the concept of using surrogate keys (SKs). We use them exclusively but only on the dimensional side. In the facts, we combine the SKs to make up the primary key (PK). We never, ever use natural keys as primary keys on the warehouse. I guess I didn't make that clear. It is so ingrained in me that I forgot to specify it.

However, in both facts and dimensions, we need a natural key value for linking back to the source systems to faciliate updates. On a dimensional level, we store the natural key and either update the dimensional row or add a new row based upon our slowly changing dimension rules for the given dimension.

For facts, we generally store a degenerative dimensional value that contains the natural key value to the source system for update purposes.

 
Since you are storing the attribute as a degenerate key in the fact table (and assume you want this change to involve the least amount of work), adding a date of purge will serve your purpose which is to be able to see the history of an order by constraining on dates >= the first date posted to the fact table and <= the purge date. In this scenario I default the purge date to the largest date your DBMS of choice will handle to represent not purged. I fell using the null will make the comparitive predicate more complicated.

The other way to do it is to create a mini dimension for order and add the purge data as an attribute but that would result in a lot of additional re-work.

not an industry standard but a defacto standard approach in most shops I have been in.

hope it helps
 
Status
Not open for further replies.

Part and Inventory Search

Sponsor

Back
Top