How to design a fact table to keep track of active dimensions?

josmi · Mar 2, 2012

I would like to design a classic OLAP facts table using a star scheme. The SQL model of the facts table should be independent of any concrete RDBMS technology and portable between different systems.

The problem is this: users should be able to select subsets of the facts based on conjunctive queries on the dimension values defined for the facts. However, the program that provides the interface for doing this to the user should only present those dimensions where anything is still selectable at all. For example, if a user selected year 2001 and for dimension contract code there is only a single value for all records in the fact table for that year, this dimension should not be shown to the user any more.
How can this be achieved in the most performant way? Is there a "classical" way of how to approach this problem?

Any help or pointers to where one could find out more about this would be greatly apreciated - thank you!

johnherman · Mar 5, 2012

Let's look at the scenario. It's kinda like requesting AK and HI not be listed as States for dates in the 1940's. Or Google showing in the Stock Market listing before it began business. But those scenarios have definitive "start dates"

It's more like requesting Feb 29 not be shown as a date in a non-leap year. Or not listing the Ford Thunderbird or Chevy Camaro during the years they weren't manufactured.

But this generic scenario you're describing is fact based. So don't list Microsoft as a vendor if I didn't have any purchase activity with them during some time period.

I think the latter is the scenario you're facing. If contract code is such that it lends itself to a "start date" and/or "end date", then that would be a possible solution. The alternative (the generic fact-based solution) would require a query against the fact table for each dimension as the user refined the search, and then limiting the options based on the domain still available after each "filter" is applied. This might not be bad from a performance point of view since each dimension is indexed against the fact in the star schema (or should me). I don't know of any BI or Analytics software which offers such a solution. You can try asking the various vendors. Otherwise, it seems you need a custom front-end, one that interfaces well with SQL.

====================================
The trouble with doing something right the first time is that nobody appreciates how difficult it was - Steven Wright

josmi · Mar 5, 2012

Thank you for your answer! At least it looks as if I am not missing something totally obvious here ...

Just to clarify: why I want is maybe a bit different as in your examples. The idea is to filter dimensions, not values of dimensions, in situations where that dimension is not applicable for further selection. Lets say for a product database there is a dimension "salescategory" and a dimension "productioncategory" but the dimension "productioncategory" is only available for products from the last two years (for older years, the productioncategory would be a dummy value or null or whatever is used to model missingness). So, once the user has selected a year more than two years ago, the whole dimension "productioncategory" should not be shown any more, because for the current result set (which could still be arbitrarily large) the dimension cannot be used for any useful selection. When dealing with facts tables that model time series, such a situation can be quite frequent, because for the same logical measurement, different sets of metadata may be available for different times or different sources etc.

From a user's point of view this seems like a quite reasonable and frequently necessary requirement: if there are a lot of dimensions that can be used for interactively selecting results sets, clearly one would want to suppress all dimensions which are useless for making the selection?

Technically what I am struggling with is this: this becomes more important as there are more dimensions, but with more dimensions one would need to make even more individual queries to figure out the number of "active" values for each dimension. And those queries would have to go over the full facts table, with the same selection clause as the main facts query. That seems like an awful waste of resources when it could really be done more easily in a single pass by "collecting" the number of different foreign keys (corresponding to values of each dimension) on the fly.

Does anyone know if tool-support (e.g. some extension of SQL) exists for this kind of scenario?

johnherman · Mar 5, 2012

Well, what you describe is a scenario where you have a definitive "Start Date" or "Effective Date". You would need to instruct the front end to ignore any Dimensions where the requested date range is before the "Start Date" of that dimensions. You might be able to accomplish this by temporarily changing the permissions/security of that dimension so that the front end will ignore it. It's a different situation if you don't want to display certain products if the state is Alaska, because you don't sell those products in that state. You would than need a State-Product relation table.

====================================
The trouble with doing something right the first time is that nobody appreciates how difficult it was - Steven Wright

josmi · Mar 6, 2012

The date was just an example. My interest is in some technique or tool that allows to generalize that idea. So no matter what combination of values for some j of n dimensions has already selected (date or other), none of those l dimensions of the k = n-j remaining dimensions (0 <= l <= k) where no further selection is possible/useful should be shown. In this general form, I could not come up with a solution where pre-computation would be feasible, so at the moment the only way I see is to do k additional rather complex and inefficient queries, which is ugly, inelegant and not very satisfying.
So I was wondering if maybe I am still missing some technique to make this less ugly, inelegant and dissatisfying ...

johnherman · Mar 6, 2012

Well, you could use the bridge table concept. Also called a relation table. The bridge table is the intersection of the two (or more) tables within the relationship.

For instance, consider this example.

STATE
StateID
StateName
StateRegion
etc

PRODUCT
ProductID
ProductName
ProductManager
etc

STATE-PRODUCT (or Product-State)
StateProductID
StateID
ProductID

Consider the following Products
AU, Auto Insurance, Al Martin
HO, Homeowners Insurance, Harry Smith
WC, Workmans Comp Insurance, Wanda Jones
LI, Life Insurance, Linda Allen

Now consider this bridge table:
1NYAU
2NYHO
3NYLI
4NCAU
5NCWC
6CAAU
7CAHO
8CAWC
9VTWC

The bridge table shows what products are available in each state, or in reverse, what states have which products. You could join to the bridge table.

Does this help?

====================================
The trouble with doing something right the first time is that nobody appreciates how difficult it was - Steven Wright

josmi · Mar 6, 2012

Thanks for that suggestion, but if I understand correctly this only helps in the case of having two dimensions. I would still have to issue another query in addition to the basic fact table query, but the second query (to figure out if with the current limitation on one dimension, the other still has something to select from) would be more efficient because it goes over a much smaller table.
However, how would this generalize to more than two dimensions? Any combination of values for the j active dimensions can potentially influence the availabilty of each of the remaining k dimensions in a unique way.

After some more thinking the best I could come up with is to pre-compute an aggregated version of the facts table that contains one row for each combination of dimensional values that occur in the original facts table. With this I would still need k queries to figure out which dimensions are still choosable, but since the table could be significantly smaller than the original facts table, the queries could be more efficient.

It still feels like a clumsy solutions when one imagines that the tool could collect all that information on the original facts query internally...

johnherman · Mar 7, 2012

Well, the bridge or relation table could have more than two members. It might be manageable if the intersection of your dimensions that you want to exclude is limited. For instance, if Product is completely unrelated to General Ledger, or Customer completely unrelated to Product, or if that relationship is something you could tolerate as "normal", then no bridge is required. As you add members to the relation table, it will grow quickly, as the combinations grow exponentially, but if you're willing to tolerate a little of the "nonsense" situations, you can probably get by with just a few relation tables. For instance, if you only sell one (of five) products in Alaska, and you're willing to tolerate that Alaska queries will have 4 blank columns, then there may be no need for a relation table between State and Product.

As far as a second query, that can perhaps be avoided with a VIEW or JOIN in the primary query. So, querying against a VIEW rather than directly at the table might have benefits. And, as you mentioned, you could possibly "front end" the actual query with a SELECT to a Temporary table which would be deleted at the end of the session.

====================================
The trouble with doing something right the first time is that nobody appreciates how difficult it was - Steven Wright

Tek-Tips is the largest IT community on the Internet today!

Members share and learn making Tek-Tips Forums the best source of peer-reviewed technical information on the Internet!

How to design a fact table to keep track of active dimensions?

josmi

Programmer

johnherman

MIS

josmi

Programmer

johnherman

MIS

josmi

Programmer

johnherman

MIS

josmi

Programmer

johnherman

MIS

Similar threads

Part and Inventory Search

Sponsor