
Wide fact table vs. multiple fact tables

Status
Not open for further replies.

optimax (Programmer) — Jun 24, 2008
I am currently developing a data warehouse for a telecom company.
We collect several different measures (100+) from the network elements on an hourly basis.
I'm wondering what the optimal data warehouse design would be:
1) Since all measures share the same granularity (1 hour) and originate from similar network elements, it would make sense to put them all in one fact table. I would therefore end up with a fact table of about 5 dimensions and 100+ fact fields. Does a 100+ field table sound like an acceptable design?
2) I could create several fact tables, each with about 5 dimensions and a subset of the 100+ facts, grouped the way the end users are likely to query them. I would then have several fact tables with at most 25 fact fields each (which seems more reasonable?), but I would lose some flexibility: users could no longer query the data in ways that were unplanned or unexpected.
I would greatly appreciate any advice or comment or field experience with the 2 options proposed (or a 3rd alternative...).
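To make the two options concrete, here is a minimal sketch in Python/SQLite. All table, column, and measure names are illustrative inventions (the real warehouse would have its own dimensions and 100+ telecom counters); the point is only the shape of each design and the fact-to-fact join that option 2 forces on cross-subject queries.

```python
import sqlite3

con = sqlite3.connect(":memory:")

# Option 1: one wide fact table -- all measures share the hourly grain.
con.execute("""
    CREATE TABLE fact_network_hourly (
        date_key      INTEGER NOT NULL,
        hour_key      INTEGER NOT NULL,
        element_key   INTEGER NOT NULL,
        -- ... in reality 100+ measure columns, e.g.:
        dropped_calls INTEGER,
        handover_fail INTEGER,
        traffic_erl   REAL,
        PRIMARY KEY (date_key, hour_key, element_key)
    )
""")

# Option 2: several narrower fact tables on the same grain,
# split by (hypothetical) subject area.
con.execute("""
    CREATE TABLE fact_accessibility_hourly (
        date_key      INTEGER NOT NULL,
        hour_key      INTEGER NOT NULL,
        element_key   INTEGER NOT NULL,
        dropped_calls INTEGER,
        handover_fail INTEGER,
        PRIMARY KEY (date_key, hour_key, element_key)
    )
""")
con.execute("""
    CREATE TABLE fact_traffic_hourly (
        date_key    INTEGER NOT NULL,
        hour_key    INTEGER NOT NULL,
        element_key INTEGER NOT NULL,
        traffic_erl REAL,
        PRIMARY KEY (date_key, hour_key, element_key)
    )
""")

# Under option 2, a cross-subject query needs a fact-to-fact join
# on the full grain (all three keys):
rows = con.execute("""
    SELECT a.element_key, a.dropped_calls, t.traffic_erl
    FROM fact_accessibility_hourly a
    JOIN fact_traffic_hourly t
      ON  a.date_key    = t.date_key
      AND a.hour_key    = t.hour_key
      AND a.element_key = t.element_key
""").fetchall()
```

Note that both designs keep the identical (date, hour, element) grain; the split in option 2 only partitions the measure columns, not the rows.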
 
You're right, 100 measures is quite a bit. The first thing I am wondering is if these are all 100 actual different measures, or are there repeated measures stated in different units of measure? For example: WeightSoldPounds, WeightSoldKilograms.
 
Well, these are clearly 100+ different measures (and most of them are already groupings of an original set of 1000+ 'raw' counters). These telecom network elements are quite complex and thus require a fair amount of indicators. This is inherent to most complex radio transmission systems.
From a pure radio engineering point of view, of course not all queries involving any of the 100 facts make sense. However, I expect the end users to ask for the possibility to build unexpected or uncommon query types at certain points in time, for example when troubleshooting a particular network element, or doing root cause analysis.
 
Although 100+ measures is indeed a lot, if they really are related and share the same grain, I would suggest opting for one fact table.
But in that case the expected queries may have considerable impact. Test whether the 100+ measure fact table can answer the expected questions with reasonable performance, or whether you need to split the fact table.

And I wonder which queries you anticipate that you could not answer with several fact tables.
How far do you want to go to meet the users in designing for unexpected questions?
 
You could establish multiple fact tables, each with a 1:1 relation to the others. The facts could be grouped either by subject area or perhaps by frequency of usage. You would want to limit the number of joins from fact to fact (to fact, to fact, ....) to get optimal performance.

-------------------------
The trouble with doing something right the first time is that nobody appreciates how difficult it was - Steven Wright
 
There are also some other potential drawbacks I see with multiple fact tables in this scenario. For example, if a record errors out in your ETL to one fact table, you will want to make sure that the corresponding records in the other fact tables are deleted and staged to your error table so you don't end up with any bogus information. This is especially true if end users have their own calculations in the front end tool.

Secondly, you might need to come up with a scheme to make incremental loads to the fact tables available to the end users only after all the related fact tables are loaded.

Thirdly, and least importantly, multiple fact tables could end up taking more storage space as you will have the foreign keys to the dimensions repeated multiple times.
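The first concern above — keeping several fact tables consistent when a record errors out — can be sketched as loading each hourly batch into all fact tables inside a single transaction. This is an illustrative toy (table names and the error-staging scheme are invented, and a real ETL tool would handle this differently), but it shows the atomicity the post is asking for: either every fact table gets the batch, or none does and the batch is staged to an error table.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE fact_a (element_key INTEGER, m1 INTEGER NOT NULL)")
con.execute("CREATE TABLE fact_b (element_key INTEGER, m2 INTEGER NOT NULL)")
con.execute("CREATE TABLE etl_errors (element_key INTEGER, reason TEXT)")

def load_hour(con, batch):
    """Load one hourly batch into both fact tables atomically."""
    try:
        with con:  # one transaction: commit on success, rollback on any error
            for rec in batch:
                con.execute("INSERT INTO fact_a VALUES (?, ?)",
                            (rec["element_key"], rec["m1"]))
                con.execute("INSERT INTO fact_b VALUES (?, ?)",
                            (rec["element_key"], rec["m2"]))
    except sqlite3.IntegrityError as exc:
        # Stage the whole batch to the error table instead of leaving
        # fact_a populated while fact_b is not.
        with con:
            con.executemany("INSERT INTO etl_errors VALUES (?, ?)",
                            [(r["element_key"], str(exc)) for r in batch])

# The second record is bad (m2 is NULL, violating NOT NULL), so the
# whole batch rolls back from both fact tables and lands in etl_errors.
load_hour(con, [{"element_key": 1, "m1": 10, "m2": 5},
                {"element_key": 2, "m1": 20, "m2": None}])
```

The wide-fact-table design gets this atomicity for free, since one row carries all the measures; that is part of the ETL-complexity trade-off discussed here.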
 
How is the data going to be exposed to the users? For example, if you are using MS SSAS, then multiple fact tables actually become the better option, since you would set the source of each measure group to a specific fact table. With a single fact table you would need multiple queries hitting that one table to build your measure groups.

If the users are reporting directly from the warehouse, then I would probably still go with multiple fact tables, following John's advice of grouping them into related facts. I would place each table on a separate filegroup and drive set to maximize available I/O.

Paul
---------------------------------------
Shoot Me! Shoot Me NOW!!!
- Daffy Duck
 
Following the general line of thought (subject-area fact tables), you could even duplicate data across the subject-area fact tables to avoid joining fact tables at query time. Trading disk space and off-hours CPU time for rapid query response is a major tenet of data warehousing.
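As a tiny sketch of that trade (again with invented names), a frequently queried measure from one subject area can be duplicated into another subject area's fact table, so the common cross-subject question is answered from a single table with no fact-to-fact join:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("""
    CREATE TABLE fact_accessibility_hourly (
        date_key INTEGER, hour_key INTEGER, element_key INTEGER,
        dropped_calls INTEGER,
        traffic_erl   REAL    -- duplicated from the traffic fact table
    )
""")
con.execute(
    "INSERT INTO fact_accessibility_hourly VALUES (20080624, 10, 1, 3, 41.5)")

# Drops per Erlang, answered from one fact table, no join needed:
rate = con.execute("""
    SELECT dropped_calls / traffic_erl
    FROM fact_accessibility_hourly
""").fetchone()[0]
```

The cost is the extra storage and the ETL obligation to keep both copies of the duplicated column in agreement.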

-------------------------
The trouble with doing something right the first time is that nobody appreciates how difficult it was - Steven Wright
 
Thank you all for the great comments!
After discussing the issue thoroughly with my client, it appears that the 'unexpected' queries might not be so infrequent after all (thus probably calling for the wide fact table option). Flexibility is really a key factor for them, and they are ready to lose a bit of response time to keep all analysis options open.
Based on the comments above, my intention is indeed to start with the wide fact table, because it also keeps the front-end applications simple (normally this DW will be queried directly). I will then monitor the main query types closely and introduce the most likely groupings if performance becomes an issue.
Migrating to multiple fact tables thus remains a possible design for the future. At this stage, I don't think the added ETL complexity of keeping data consistent across multiple fact tables under error conditions is worth the effort for the potential performance gain.
The whole project is developed using the Extreme Programming method, so we remain open to changes all along, but we generally go for the simpler solutions in the short term.
Thanks again for your help!
 
