How Should We Implement A Slowly Changing Dimension?
Currently, our data warehouse has only Type 1 Slowly Changing Dimensions (SCD). That is to say we overwrite the dimension record with every update. The problem with that is when data changes, it changes for all history while this is valid for data entry corrections, it may not be valid for all data. An acceptable example could be Customer Date of Birth. If the date of birth was changed, chances are the reason was that their data was incorrect.
However, if the Customer address were changed, this may and probably does mean the customer moved. If we simply overwrite the address then all sales for that customer will belong to the new address. Suppose the customer moved from Florida to Ohio. If we were trying to track sales patterns by region, all of the customer’s purchase that were made in Florida would now appear to have been made in Ohio.
Type 1 Slowly Changing Dimension
Customer Dimension
Customer Dimension After Edits
In the example above, the DOB change doesn’t affect any dimensional reporting facts. However, the City, State change would have an affect. Now all sales for Bob Smith would appear to come from Dayton, Ohio rather than from Tampa, Florida.
The solution we have chosen for solving this problem is to implement a Type 2 slowly changing dimension. A Type 2 SCD records a separate row each time a value is changed in the dimension. In our case, we are declaring that we will only create a new dimension record when certain columns are changed. In the example above, we would not record a new record for the DOB change but we would for the address change.
Type 2 Slowly Changing Dimension
Customer Dimension
Customer Dimension After Edits
As you can see, there are two dimension records for Bob Smith now. They both have the same CustKey values, but the have different ID values. All future fact table rows will use the new ID to link to the Customer dimension. This is accomplished by the use of the Current Flag. The ETL process looks only at the current flag when recording new orders. However, in the case of an update to an order the Effective Date must be used to determine which customer the update applies to.
The primary issue with Type 2 SCD is the volume of data grows exponentially as more changes are tracked. This can impact performance in a star schema. The principle behind the star schema design is that while facts are few columns, they have many rows but they only have to perform single level joins to resolve their dimensions. The assumption is that the dimensions have lots of columns but relatively few rows. This allows for very fast joining of data.
Conforming Dimensions
For the purposes of this discussion conforming dimensions only need a brief definition. Conforming dimensions are a feature of star schemas that allow facts to share dimensional data. A conforming dimension occurs when two dimensions share the same keys. Often they have different attributes. The goal is to ensure that any fact table can link to the conforming dimension and consume its data so long as the dimension is relevant.
Conforming Dimension
Customer Dimension
Billing Dimension
In the example above, we could use the ID from the Customer dimension in a fact and in the future a link to the Billing dimension could be established without having to reload the data.
We are considering a slight modification to the standard Type 2 SCD. The idea is to maintain two dimensions one as a Type 1 and one as a Type 2. The problem with this is we lose the ability to use conforming dimensions.
Type 2 and Type 1 Slowly Changing Dimension
Customer Dimension Type 1
Customer Dimension Type 2
As you can see, the current ID for Bob Smith in the Type 1 SCD is 1001, while it is 1003 in the Type 2 SCD. This is not conforming.
Our solution is to create a composite key for the Type 2 SCD.
Type 2 and Type 1 Slowly Changing Dimension
Customer Dimension Type 1
Customer Dimension Type 2
In the example above, the Type 1 and the Type 2 dimensions conform on the ID level. If a fact needs the historical data it will consume both the ID and the SubKey.
Currently, our data warehouse has only Type 1 Slowly Changing Dimensions (SCD). That is to say we overwrite the dimension record with every update. The problem with that is when data changes, it changes for all history while this is valid for data entry corrections, it may not be valid for all data. An acceptable example could be Customer Date of Birth. If the date of birth was changed, chances are the reason was that their data was incorrect.
However, if the Customer address were changed, this may and probably does mean the customer moved. If we simply overwrite the address then all sales for that customer will belong to the new address. Suppose the customer moved from Florida to Ohio. If we were trying to track sales patterns by region, all of the customer’s purchase that were made in Florida would now appear to have been made in Ohio.
Type 1 Slowly Changing Dimension
Customer Dimension
Code:
ID CustKey Name DOB City State
1001 BS001 Bob Smith 6/8/1961 Tampa FL
1002 LJ004 Lisa Jones 10/15/1954 Miami FL
Code:
ID CustKey Name DOB City State
1001 BS001 Bob Smith 6/8/1961 Dayton OH
1002 LJ004 Lisa Jones 10/15/1954 Miami FL
In the example above, the DOB change doesn’t affect any dimensional reporting facts. However, the City, State change would have an affect. Now all sales for Bob Smith would appear to come from Dayton, Ohio rather than from Tampa, Florida.
The solution we have chosen for solving this problem is to implement a Type 2 slowly changing dimension. A Type 2 SCD records a separate row each time a value is changed in the dimension. In our case, we are declaring that we will only create a new dimension record when certain columns are changed. In the example above, we would not record a new record for the DOB change but we would for the address change.
Type 2 Slowly Changing Dimension
Customer Dimension
Code:
ID CustKey Name DOB City St Curr Effective Date
1001 BS001 Bob Smith 6/8/1961 Tampa FL Y 5/1/2004
1002 LJ004 Lisa Jones 10/15/1954 Miami FL Y 5/2/2004
Customer Dimension After Edits
Code:
ID CustKey Name DOB City St Curr Effective Date
1001 BS001 Bob Smith 6/8/1961 Tampa FL N 5/1/2004
1002 LJ004 Lisa Jones 10/15/1954 Miami FL Y 5/2/2004
1003 BS001 Bob Smith 6/8/1961 Dayton OH Y 5/27/2004
As you can see, there are two dimension records for Bob Smith now. They both have the same CustKey values, but the have different ID values. All future fact table rows will use the new ID to link to the Customer dimension. This is accomplished by the use of the Current Flag. The ETL process looks only at the current flag when recording new orders. However, in the case of an update to an order the Effective Date must be used to determine which customer the update applies to.
The primary issue with Type 2 SCD is the volume of data grows exponentially as more changes are tracked. This can impact performance in a star schema. The principle behind the star schema design is that while facts are few columns, they have many rows but they only have to perform single level joins to resolve their dimensions. The assumption is that the dimensions have lots of columns but relatively few rows. This allows for very fast joining of data.
Conforming Dimensions
For the purposes of this discussion conforming dimensions only need a brief definition. Conforming dimensions are a feature of star schemas that allow facts to share dimensional data. A conforming dimension occurs when two dimensions share the same keys. Often they have different attributes. The goal is to ensure that any fact table can link to the conforming dimension and consume its data so long as the dimension is relevant.
Conforming Dimension
Customer Dimension
Code:
ID CustKey Name DOB City State
1001 BS001 Bob Smith 6/8/1961 Tampa FL
1002 LJ004 Lisa Jones 10/15/1954 Miami FL
Billing Dimension
Code:
ID Bill2Ky Name Account Type Credit Limit CustKey
1001 9211 Bob Smith Credit $ 10,000 BS001
1002 23421 Lisa Jones Cash $ 100 LJ004
In the example above, we could use the ID from the Customer dimension in a fact and in the future a link to the Billing dimension could be established without having to reload the data.
We are considering a slight modification to the standard Type 2 SCD. The idea is to maintain two dimensions one as a Type 1 and one as a Type 2. The problem with this is we lose the ability to use conforming dimensions.
Type 2 and Type 1 Slowly Changing Dimension
Customer Dimension Type 1
Code:
ID CustKey Name DOB City St Curr Effective Date
1001 BS001 Bob Smith 6/8/1961 Dayton OH Y 5/1/2004
1002 LJ004 Lisa Jones 10/15/1957 Miami FL Y 5/2/2004
Customer Dimension Type 2
Code:
ID CustKey Name DOB City St Curr Effective Date
1001 BS001 Bob Smith 6/8/1961 Tampa FL N 5/1/2004
1002 LJ004 Lisa Jones 10/15/1957 Miami FL Y 5/2/2004
1003 BS001 Bob Smith 6/8/1961 Dayton OH Y 5/27/2004
As you can see, the current ID for Bob Smith in the Type 1 SCD is 1001, while it is 1003 in the Type 2 SCD. This is not conforming.
Our solution is to create a composite key for the Type 2 SCD.
Type 2 and Type 1 Slowly Changing Dimension
Customer Dimension Type 1
Code:
ID CustKey Name DOB City St
1001 BS001 Bob Smith 6/8/1961 Dayton OH
1002 LJ004 Lisa Jones 10/15/1957 Miami FL
Customer Dimension Type 2
Code:
ID SubKey CustKey Name DOB City St Curr Eff Date
1001 001 BS001 Bob Smith 6/8/1961 Tampa FL N 5/1/2004
1002 001 LJ004 Lisa Jones 10/15/1957 Miami FL Y 5/2/2004
1001 002 BS001 Bob Smith 6/8/1961 Dayton OH Y 5/27/2004
In the example above, the Type 1 and the Type 2 dimensions conform on the ID level. If a fact needs the historical data it will consume both the ID and the SubKey.