Tek-Tips is the largest IT community on the Internet today!

Members share and learn making Tek-Tips Forums the best source of peer-reviewed technical information on the Internet!

  • Congratulations strongm on being selected by the Tek-Tips community for having the most helpful posts in the forums last week. Way to Go!

OLAP modeling: one cube versus many cubes?

Status
Not open for further replies.

josmi

Programmer
Mar 2, 2012
8
AT
I am still puzzled about the correct way to design a cube vs. several cubes so that reports and queries that compare several different measurements can be made as easily as possible. The problem is that when there are different measurements, these measurements may only share some dimensions or some levels of a hierarchical dimension.
For example (a very artificial example just to illustrate the question) a measurement about working hours may be available on a daily basis, involve the dimension male/female, job category, and also the dimension overtime. Another measurement e.g. production may be available only on a monthly basis, and have the dimension product type. Both measurements share the hierarchies year and month of the time dimension.
These measurements are very different and only share a subset of dimensions, and dimension values, yet we want to create reports and queries where one measurement is compared with the other. For example we want to compare the total female working hours of job category A with the total production of product type X on a monthly basis. Or we want to compare the average daily working hours of all males with total production on a monthly basis.
The usual advice given is not to make the cube to big, avoid "unrelated" measures and avoid too many dimensions, but for some applications, reporting over many measures that are related only by e.g. the time dimension is important. There seems to be no easy, "standard" way to create reports over many cubes in an ad-hoc fashion.
So - how is this kind of problem usually dealt with? Is there some good reading stuff for this?
 
Well, the correct answer is IT DEPENDS. Let me elaborate. Disk is relatively cheap, but the analyst's time is not. And the timing needed to produce the data that the analyst needs is also a factor. If your DW has low levels of granularity, it is possible to create cubes of at different summary levels for the purposes of each analyst or reporting need. Finance may want monthly summaries, but an operations persn may want daily or hourly. Each may also have specific needs of interest with respect to facts and dimensions. My general recommendation is separate cubes. But well planned. If the needs of several groups, reports, or departments can be met with a single cube, and that cube can be created in a timely manner, and the data can be extracted and analyzed in a reasonable timeframe, then one cube serving multiple functions is probably OK. Remember to do "balancing" each time each cube is updated, otherwise different groups will obtain different numbers. For instance, the daily cubes should always balance to the monthly or MTD cubes. And similarly for other dimensions which may have summaries - geography, sales channel, product line, etc. Hope this helps.

====================================
The trouble with doing something right the first time is that nobody appreciates how difficult it was - Steven Wright
 
Thank you for this. Our situation is a bit different though: there is no single, well-defined set of reports but the very purpose of the the modelling should be to allow for a large number of very different ad-hoc reports that usually will compare different measurements (where the different measurements may only share certain dimensions and may have different granularity).
The problem with several cubes then appears to be (I am not certain) that the user (or at least the GUI code that has to be developed) will have to know which cube contains and which queries can be done on which cube.
This is bad.
We would like to come up with a GUI that contains all the measures and all the dimensions and that lets the user naturally navigate to all the combinations of measures and dimensions that make sense while hiding everything else. Since the potential queries/reports may combine so many different measures with so many different dimensions, there is no practical way to really anticipate them all and plan for all. Rather, we would like to describe the compatibilities on a meta-level and then have the GUI and MDX or generated SQL take care of the rest.
 
There is a very concrete reason for not building 1 large cube to hold all data: the element of so-called 'sparsity'. You may need to look into this and its relation with exploding cube sizes. I think ROLAP offers a better solution within the aspect, though some vendors (like IBM Cognos with TM1) offer solutions that have elegant ways to reduce sparsity and cube-sizes.

What are you using at the moment?

Ties Blom

 
We are just in the planning and modelling phase to figure out which combination of modeling and tools would fit best. If at all possible, the preferred toolchain would be open source (e.g. Pentaho plus open source RDBMS).
We are aware of the sparsity problem, but despite the huge number of dimensions, the final fact table would probably not exceed a few hundred million or maybe a billion rows by much.
The real issue that is puzzling here is this: depending on the way how the data is modeled in one or several cubes, there will be situations where a row has a combination of dimension values that apply to measurement A but not measurement B. For another combination of dimensions the reverse may be true and these dimensions may apply to measurement B but not A. In other words, for certain sets of dimensions, some measurements are more granular than others. The most obvious way to deal with this seems to be to make all measurements null where the dimensions combination does not apply. Would that be the standard approach that can also be expected to be supported by the usual tools and e.g. MDX?
 
FYI , Pentaho only offers ROLAP which may be a bit daunting with a billion row fact table. Spend the last 2 months going through the Pentaho suite. The ETL tool (PDI) is really marvelous. The multidimensional tool lacks a LOT of features to, say Cognos or SSAS. Are you considering adding aggregates on the fact table?

Ties Blom

 
Thank you for that info! Pentaho or just some open-source RDBMS is just the vague idea we have for now as we still struggle with the more fundamental questions of how to best model our data. Denormalization and cube-oriented star-schemas sound good on first sight, but there are a number of rather complex problems when the requirement is to allow very flexible ad-hoc queries over several different measurements that are associated with different dimensions or have different granularity. Most books about OLAP and reporting only really consider a rather small set of fixed reporting tasks which are usually well defined in advance. In those scenarios it is usually possible to identify individual cubes for individual users and their reporting needs. But in our case, users want to combine nearly everything with everything else in their reports. So in order to model this without too many sparse entries and without dimensions that simply do not apply to a measurement, we would need a large number of cubes. But then reports that require two or more cubes get rather hard to do and to implement in a GUI, no? Also MDX cannot be used across several cubes.
The dilemma we are in looks rather basic and principal so I wonder if that is something that is discussed in some whitepaper or book or some web-page -- it is hard to believe that we are the first one who face these kinds of problems.
But most stuff I read about BI, reporting and OLAP either ignores this problem completely or assumes that it is possible to factor the data in a way so that each user group gets their own cube that would satisfy all their query and reporting needs.
 
For your needs I doubt you will find open-source solutions.. Multigrain issues are very well handled by Cognos ROLAP (DMR) which allows setting a context for every combination of dimension and fact. It allows for setting budget data measures stored at year level to year in the time dimension and in the same model setting daily sales for all levels in the time hierarchy from day upwards. I hear also very good stories about MS SSAS which I am investigating right now. Well, it is not open source but may be a good alternative:



IBM Cognos also offers TM1 which is basically an in-memory OLAP solution that allows up 90% reduction of sparsity, but it will not be cheap :(


Ties Blom
 
Status
Not open for further replies.

Part and Inventory Search

Sponsor

Back
Top