Tek-Tips is the largest IT community on the Internet today!

Members share and learn making Tek-Tips Forums the best source of peer-reviewed technical information on the Internet!

  • Congratulations Mike Lewis on being selected by the Tek-Tips community for having the most helpful posts in the forums last week. Way to Go!

duplicating fields on both dimension and fact table.

Status
Not open for further replies.

ngthompson

IS-IT--Management
Jan 6, 2009
2
0
0
US
Hi all,

Is it a bad idea to have the same field on both a dimension and fact table?

Here's my scenario:
I have about 10 descriptive numeric fields that reside on a dimension table. While the fields are numeric they are truly descriptive and rarely change and so they are dimension attributes. However, these same fields often participate in complex calculations. The dimension table is fairly large (15-20 million loan records).

In our environment it is possible run a meaningful report by simply hitting the fact table. If the fields in question don't exist on the fact table then we'll perform an unnecessary join to the large dimension table.

In order to improve performance, we've decided to duplicate the 10 fields on both the dimension & fact table. This way any reports that don't need other dimensional data can run just from the fact table.

Design-wise, does it seem like a bad idea to duplicate the fields?

Any insight will be greatly appreciated.
Thanks!
 
I wonder hwo you can run a report on just a fact table. You don't use a date dimension? Are you using surrogate keys in teh fact table, or are you using the keys from teh source system?

Design wise I would say, yes, it is a bad idea to duplicate these fields.
Consider the effects of a change of one of these fields as well. Impact on ETL would be significant.

But it depends on the size of the fact table (nr of rows) as well. If this is way less then the 20Million from the dimension you might consider this, if you take the growth of the fact table into account.

How many reports would need these fields? Nearly all or just a few?
 
Hans63 makes some good points, but I believe there are situations where duplicating a column onto the fact table can be beneficial. From a theoretical point of view, one of the tenets of DW is to trade storage (duplicate data) for better performance. One case which might apply: Let's say you have a small dimension, like State, or maybe some company specific attribute like product group. With small reference tables, sometimes performance can be gained by not having to join to the dimension. I would think this is the exception rather than the rule. Also of consideration is whether the attribute which you are denormalizing into the fact is a slowly changing dimenions.

-------------------------
The trouble with doing something right the first time is that nobody appreciates how difficult it was - Steven Wright
 
Hans63 and johnherman thank you both so much for your responses.

The fact table in question contains loan balance information -- the grain is 1 record per loan per month. The large dimension in question is a loan attribute dimension and contains 1 record per loan.

Hans63, I misspoke when I said we can run reports solely from the fact table. What I meant is we can run reports solely from the fact table & date dimension (and other smaller dimensions) without having to go to the large dimension mentioned above. For example, our users have a need to do complex calculations (using the duplicated fields) across all loans in a given month. However, those duplicated fields are also attributes of the loan and can be used in the "where" clause for other reports. Because of the size of the loan attribute dimension (20 million or so records), we thought it best to duplicate the fields to improve performance.

johnherman, all the duplicated fields are SCD1 on the large loan attribute dimension table.

From a design perspective, I would prefer not to duplicate the fields, but the performance consideration makes me unsure... Have either of you had this situation before? Is a bad idea because it complicates the ETL process?
 
The fact table you describe will probably get way bigger then the loan dimension, unless you somehow manage to reduce the number of records. e.g. by holding only 1 year of history. But that will give you 12 * 20 million = 240 million records in the fact table. (this is the space part of the trade-off)

I take it that you need both the balance information and the other attributes from the dimension to perform those complex calculations. Meaning you will have to join 20M dimension records to 20M fact records.

As to the impact on the ETL proces, consider the fact that, for a change in a loan attribute, you will have to update the fact table and the dimension table. As shown above the fact table may get quite big. And this is an extra piece of ETL that needs to be maintained.

Performance, both for the ETL and the report, will depend on available hardware and indexing strategies.

The right choice also depends on the number of reports needing these complex calculations, the available timeframe and the importance of these report for the business. Is it just 1 report, I wouild opt for NOT duplicating these fields. If you have about 20% of your reports needing (some of) those calculations I would consider duplicating.
Please make sure that the business understands that this report may take some time to produce.

Does the business need these calculations on ALL loans every time? Or can you filter on the loan as well?
 
Status
Not open for further replies.

Part and Inventory Search

Sponsor

Back
Top