Tek-Tips is the largest IT community on the Internet today!

Members share and learn making Tek-Tips Forums the best source of peer-reviewed technical information on the Internet!

  • Congratulations SkipVought on being selected by the Tek-Tips community for having the most helpful posts in the forums last week. Way to Go!

Alternative Type 2 Slowly Changing Dimension (Star Schema) 2

Status
Not open for further replies.

HogHunter

Programmer
Jun 4, 2004
20
0
0
US
How Should We Implement A Slowly Changing Dimension?

Currently, our data warehouse has only Type 1 Slowly Changing Dimensions (SCD). That is to say we overwrite the dimension record with every update. The problem with that is when data changes, it changes for all history while this is valid for data entry corrections, it may not be valid for all data. An acceptable example could be Customer Date of Birth. If the date of birth was changed, chances are the reason was that their data was incorrect.

However, if the Customer address were changed, this may and probably does mean the customer moved. If we simply overwrite the address then all sales for that customer will belong to the new address. Suppose the customer moved from Florida to Ohio. If we were trying to track sales patterns by region, all of the customer’s purchase that were made in Florida would now appear to have been made in Ohio.


Type 1 Slowly Changing Dimension
Customer Dimension
Code:
ID    CustKey Name            DOB             City    State
1001  BS001   Bob Smith       6/8/1961        Tampa   FL
1002  LJ004   Lisa Jones      10/15/1954      Miami   FL
Customer Dimension After Edits
Code:
ID      CustKey Name            DOB             City    State
1001    BS001   Bob Smith       6/8/1961        Dayton  OH
1002    LJ004   Lisa Jones      10/15/1954      Miami   FL

In the example above, the DOB change doesn’t affect any dimensional reporting facts. However, the City, State change would have an affect. Now all sales for Bob Smith would appear to come from Dayton, Ohio rather than from Tampa, Florida.


The solution we have chosen for solving this problem is to implement a Type 2 slowly changing dimension. A Type 2 SCD records a separate row each time a value is changed in the dimension. In our case, we are declaring that we will only create a new dimension record when certain columns are changed. In the example above, we would not record a new record for the DOB change but we would for the address change.


Type 2 Slowly Changing Dimension
Customer Dimension
Code:
ID      CustKey Name        DOB        City    St   Curr Effective Date
1001    BS001   Bob Smith   6/8/1961   Tampa   FL   Y    5/1/2004
1002    LJ004   Lisa Jones  10/15/1954 Miami   FL   Y    5/2/2004

Customer Dimension After Edits
Code:
ID      CustKey Name        DOB        City    St   Curr Effective Date
1001    BS001   Bob Smith   6/8/1961   Tampa   FL   N    5/1/2004
1002    LJ004   Lisa Jones  10/15/1954 Miami   FL   Y    5/2/2004
1003    BS001   Bob Smith   6/8/1961   Dayton  OH   Y    5/27/2004


As you can see, there are two dimension records for Bob Smith now. They both have the same CustKey values, but the have different ID values. All future fact table rows will use the new ID to link to the Customer dimension. This is accomplished by the use of the Current Flag. The ETL process looks only at the current flag when recording new orders. However, in the case of an update to an order the Effective Date must be used to determine which customer the update applies to.

The primary issue with Type 2 SCD is the volume of data grows exponentially as more changes are tracked. This can impact performance in a star schema. The principle behind the star schema design is that while facts are few columns, they have many rows but they only have to perform single level joins to resolve their dimensions. The assumption is that the dimensions have lots of columns but relatively few rows. This allows for very fast joining of data.

Conforming Dimensions

For the purposes of this discussion conforming dimensions only need a brief definition. Conforming dimensions are a feature of star schemas that allow facts to share dimensional data. A conforming dimension occurs when two dimensions share the same keys. Often they have different attributes. The goal is to ensure that any fact table can link to the conforming dimension and consume its data so long as the dimension is relevant.


Conforming Dimension
Customer Dimension
Code:
ID      CustKey Name            DOB         City   State
1001    BS001   Bob Smith       6/8/1961    Tampa  FL
1002    LJ004   Lisa Jones      10/15/1954  Miami  FL


Billing Dimension
Code:
ID      Bill2Ky Name            Account Type    Credit Limit    CustKey
1001    9211    Bob Smith       Credit          $    10,000     BS001
1002    23421   Lisa Jones      Cash            $       100     LJ004


In the example above, we could use the ID from the Customer dimension in a fact and in the future a link to the Billing dimension could be established without having to reload the data.

We are considering a slight modification to the standard Type 2 SCD. The idea is to maintain two dimensions one as a Type 1 and one as a Type 2. The problem with this is we lose the ability to use conforming dimensions.


Type 2 and Type 1 Slowly Changing Dimension
Customer Dimension Type 1
Code:
ID      CustKey Name       DOB         City    St  Curr Effective Date
1001    BS001   Bob Smith  6/8/1961    Dayton  OH  Y    5/1/2004
1002    LJ004   Lisa Jones 10/15/1957  Miami   FL  Y    5/2/2004

Customer Dimension Type 2
Code:
ID      CustKey Name       DOB         City    St  Curr Effective Date
1001    BS001   Bob Smith  6/8/1961    Tampa   FL  N    5/1/2004
1002    LJ004   Lisa Jones 10/15/1957  Miami   FL  Y    5/2/2004
1003    BS001   Bob Smith  6/8/1961    Dayton  OH  Y    5/27/2004

As you can see, the current ID for Bob Smith in the Type 1 SCD is 1001, while it is 1003 in the Type 2 SCD. This is not conforming.

Our solution is to create a composite key for the Type 2 SCD.


Type 2 and Type 1 Slowly Changing Dimension
Customer Dimension Type 1
Code:
ID          CustKey Name        DOB        City    St
1001        BS001   Bob Smith   6/8/1961   Dayton  OH
1002        LJ004   Lisa Jones  10/15/1957 Miami   FL

Customer Dimension Type 2
Code:
ID   SubKey CustKey Name        DOB        City    St  Curr Eff Date
1001 001    BS001   Bob Smith   6/8/1961   Tampa   FL  N    5/1/2004
1002 001    LJ004   Lisa Jones  10/15/1957 Miami   FL  Y    5/2/2004
1001 002    BS001   Bob Smith   6/8/1961   Dayton  OH  Y    5/27/2004

In the example above, the Type 1 and the Type 2 dimensions conform on the ID level. If a fact needs the historical data it will consume both the ID and the SubKey.

 
Are you implementing two physical structures? The type 1 "table" could be a view of the type 2 table. Why would you do this? Do you have large numbers of customers such that performance is an issue? If I was to implement your solution (rather than the view) what would I gain for the extra disk space consumed and the extra join required during certain queries? Please help me understand the advantage of this solution in your environment.

Thanks.

Sometimes the grass is greener on the other side because there is more manure there - original.
 
We are implementing this as two physically structures. The reasoning being that there are many facts that will only need the Type 1 dimension. A view still has to deal with the rows by excluding them and this would impact performance. Space is so cheap that we do not mind consuming it. As for the extra join, in fact the Type 2 table would replace the Type 1 table in the join statements where that data was needed. Both tables wouldstill reside in the warehouse.

The primary advantage is that we can maintain conforming dimensions for drill across operations.

Thanks for you input, I hope I have cleared up your questions. If not, I would be happy to elaborate.
 
I still fail to see the need for 2 physical structures. The method to keep them in sync would be more effort than is required. if a type 1 change occurs you have to update the type 2 record and if information in the type 2 record changes you have to update the type 1 record. Having implemented Both Type 1 and Type 2 Slowly Changing dimensions I found it easier to maintain a single physical structure with 2 logic flows.


example:
Code:
VendorID VendorName    VendorPhone   VendorSt Current
-------- ------------- ------------  -------- -------
123      Bob's Widgets  802-555-1212   FL     1
For this example Phone is a TYpe1 change and State would be a type 2 change.

First logic flow would be to update all type 1 changes. so an update to the above would be

Code:
VendorID VendorName    VendorPhone   VendorSt Current
-------- ------------- ------------  -------- -------
123      Bob's Widgets  802-555-1222   FL     1
The second would be to insert all type 2 changes. THis is accomplished through the insertion of a new record and the deactivation of the old record.

Code:
VendorID VendorName    VendorPhone   VendorSt Current
-------- ------------- ------------  -------- -------
123      Bob's Widgets  802-555-1212   FL     0
123      Bob's Widgets  802-555-1212   OH     1
The 3rd would be the insertion of all new dimension members

Code:
VendorID VendorName    VendorPhone   VendorSt Current
-------- ------------- ------------  -------- -------
123      Bob's Widgets  802-555-1212   FL     0
123      Bob's Widgets  802-555-1212   OH     1
124      Jane's Sweets  603-555-1212   CA     1

Then when you report on the data you can report on historical using type 2 changes and still maintain type 1 changes.

This also simplifies issues such as what if a column is determined to change from a type 1 or type 2 but a type 2 change doesn't occur. IT also allows the use of a single dimension key.

I have discovered when it comes to the end user less is usually better. If you have to many dimension or dimensions that represent the same thing but only in a different manner users can easily get confused and then believe the issue is in the data.

"Shoot Me! Shoot Me NOW!!!"
- Daffy Duck
 
In your example, what is the primary key to the table? It cannot be VendorID because there is no uniqueness.

On the point of simplicity, I believe two tables while more difficult to manage from an ETL perspective, are actually easier from a user perspective. The reason is that the facts that require type 1 can be linked to the type 1 dimension via a Foreign Key constraint so the user doesn't have to even know the type 2 table exists. The converse is true as well. The facts that require type 2 can be linked to the type 2 dimension via a Foreign Key constraint. They will not be aware of the type 1 dimension.

 
HogHunter,

Can you not create a view to simulate type one dimension?
Most of the warehouse implementations use views for one reason or another. With proper indexes you can improve execution performance.

Anand
 
HogHunter,

Everyone's jumping all over you on the view issue!

Disk space really isn't an issue. It's cheap. Plus, you can structure the indexes on the two tables differently, which is important since the way you query a current dimension table is different from the way you query a historical dimension table.

Personally, I would use a view but partition the table on the Current flag. But performance management may be difficult since having different local indexes on different partitions leads to unpredictable behavior.

ETL isn't that hard either. Just maintain the logic to populate your historical table, and build your first table with a Current = 'Y' filter. Yes, this eats more ETL time, since you're making forcing Table 1 to be built after Table 2, instead of in parallel, but building Table 1 shouldn't take that much time.

Personally, the only change I would make would be to do away with the SubKey in Table 2 and put an Effective End Date in instead. The Effective End Date would be the Effective Date of the following record.

What good does the SubKey do? An Effective End Date would allow you to easily zoom in on the version you need for a particular time period, with an Effective Date <= Date < Effective End Date filter.

The challenge, as always, is indexing for performance, especially with non-equijoins. Let us know what you find.
 
The uniqueness is in the combination of the vendorid and current status. With only 1 record being current your facts are keyed to this record. As for useability only input from your users can answer that question.


&quot;Shoot Me! Shoot Me NOW!!!&quot;
- Daffy Duck
 
MDXer,

The uniqueness would then be lost after the third edit to the record. Or else, one version of the record would be lost.
 
Views vs. multiple tables

Views are slower even with indexing. If ETL complexity and space are the only costs, I can mitigate those concerns easier then the cost to performance. Space is cheap and dimensions are typically the smaller elements of the warehouse. Therefore multiplying their size doesn't impact storage significantly. The complexity of the ETL process is incremental, as a matter of fact with our tool, (DataStage) the overheard is negligible. Even so, ETL happens once a day, view access would occur many times per day. A savings in performance on the read side is substantial. While I can manage the space on my server and easily increase it, I cannot say the same of processing speed.

Sub-Key

The purpose of the sub key is to extend the concept of using non-meaningful surrogate keys. This also improves performance. If the key were made up of an integer combined with a date, the result would be a rather large string. Then you have a choice, store the string and save the cost of conversion on query or convert on the fly and pay the expense. Either way the key is much larger and more complex which defies the rationale for surrogate keys. Remember the key to the dimension must be stored on every fact row. While dimensions tend to have relatively few rows and can afford to be very wide, facts have many rows and ought to be as narrow as possible.

Thanks for the input,
Thom [americanflag]
 
A traditional "Kimball-like" type 2 dimension table would look like this.

Customer Dimension Type 2


ID CustKey Name DOB City St Eff Date
1001 BS001 Bob Smith 6/8/1961 Tampa FL 5/1/2004
1002 LJ004 Lisa Jones 10/15/1957 Miami FL 5/2/2004
1003 BS001 Bob Smith 6/8/1961 Dayton OH 5/27/2004
1004 BS002 Bob Smith, Jr 5/31/2004 Dayton OH 5/31/2004
1005 BS001 Bob Smith, Sr 6/8/1961 Dayton OH 5/31/2004

Where ID is an integer, a surrogate (or synthetic) warehouse key and CustKey is the natural key. It has been proven that you cannot get any better performance than this way, while still enjoying the benefits of type 2 changing dimensions.

However, since Data Warehouses are all about performance, I can understand Thom's willingness to trade disk space and off-hours processing for greater performance during normal business hours. So, his maintenance of a type-1 or current state of the customer in a separate physical table would appear to be a good solution for his environment.



Sometimes the grass is greener on the other side because there is more manure there - original.
 
My concerns steamed out of a totally different consideration. (Well, I was considering the usual reasons, load performance and additional disk space, also but HogHunter has given suitable rationale for his environment, which fine.)
In this scenario, the customer dimension in no more a ‘conformed dimension’. There are two customer dimensions now. Some day in future both may succumb to different interpretations and follow a separate development path.
You may not see that as a problem but I suggest you may need to keep tighter control over future development process.


Anand
 
Anand,

Thanks for you comments, I wonder why you say the solution I outlined implies there will no longer be a conformed dimension. In fact, that was the very reason I was willing to go through the effort. In the case of, the type 1 side, the dimension conforms to the type side because they both share the major portion of the key. Therefore if you were to drill across from a fact the consumed the type dimension to a fact that consumed the type 2 dimension, could could simply pass the type 1 key value. The type 2 fact would then be required to roll up all of the subkeys. Needless to say, unrolled data can be viewed as well.

On the other side, the type 2 fact can drill across to the type 1 fact by omitting the subkey.

Since in both cases the major portion of the key will have the same values the dimensions conform.

Am I off base here? It wouldn't be the first time a I missed something. :)

Thanks,
Thom
 
I think you actually have a single dimension with a hierarchy if you want to treat it that way, since the type 2 records are "children" of the type 1. Or, to look at it the other way, type 2 records can "roll-up" into their type 1 record.

Sometimes the grass is greener on the other side because there is more manure there - original.
 
Thom,

By definition conformer dimension is a dimension that has exactly the same meaning and content when being referred from different fact tables. Your solution doesn’t necessarily have same contents in these two tables. Granted that we all take architectural liberty on some of these definitions to make them workable in our environment. In that regard it is a brilliant idea.
My concern is that now since you have two customer dimensions there is a potential that they may follow separate development path rendering them to be similar but non-conformed dimensions.
Would you consider separating the type 2 attributes? Say, for example,

Type 1 - Customer Dimension
Code:
ID          CustKey Name        DOB 
1001        BS001   Bob Smith   6/8/1961 
1002        LJ004   Lisa Jones  10/15/1957

and Type 2 - Customer Address Dimension
Code:
ID   SubKey City    St  Curr Eff Date
1001 001    Tampa   FL  N    5/1/2004
1002 001    Miami   FL  Y    5/2/2004
1001 002    Dayton  OH  Y    5/27/2004

You can now include this compound primary key of customer address table in corresponding of the fact tables.

Then again this may impact performance as you are adding extra join in your queries. ;-)


Anand
 
Anand,

Thanks again for your reply. I don't know that I agree with your definition of a conformed dimension.

Laura Reeves, who co-authored Kimball's The Data Warehouse Lifecycle Toolkit gives the following definition in her class at TDWI:

Conformed Dimension - A dimension that is shared amoung data marts. A dimension is conformed between data marts either if it is exactly the same dimension (including all attributes and rollups within the dimension) or one dimension is a strict subset of the other.

Ralph Kimball gives a similar though less explicit definition in the toolkit:

A conformed dimension is a dimension that means the same thing with every possible fact table to which it can be joined

My solution matches his definition since each fact that contains a dimensional key, has what it takes tosuccessfully join to eiter dimension.

In regard to your suggestion of splitting the attributes, we may not be able to do that because the attributes are consumed by fact on both sides. The distinction is that some facts don't use the historical information for aggregation. Addtionally, that method would make the ETL process much more complex and could lead to confusion about where which columns of data might be.

Thanks,
Thom
 
John Herman

I agree with your assessment, having said that, do you see any issues with the design?

Thanks again,
Thom
 
No. It appears to me that you have made a good design decision for your environment, although it violates some standard dimensional modeling techniques. If, however, theory could be applied to each DW environment and scenario, the DW vendors would have long since developed a "cookie cutter" model for DW. Even the examples in Kimball's DW Toolkit, Adamson & Venerable's DW Design Solutions, and Imhoff et al Mastering DW Design need to be customized to each particular situation.

In addition, you are using sound DW principles in trading off hours processing and disk space for increased performance during your users' peak hours.

And, on general principles, I disagree with Anand's suggestion to split out the address, although I admit there may be certain applications where it is better split. In general, for performance reasons, you should avoid snowflaking, especially on large tables, and you had previously indicated that you have a large customer table.

Thanks for posting this very interesting thread.
---------

Sometimes the grass is greener on the other side because there is more manure there - original.
 
Status
Not open for further replies.

Part and Inventory Search

Sponsor

Back
Top