Count Distinct

brmstr · Feb 12, 2004

Hi all,

we're facing frequent problems with calculating count distincts, and I think they could be of greater value to be discussed here (sorry if this seems too enthusiastic to some of you, but believe me if you're not aware of this you'll run into problems)

First, consider a DWH where you have millions of customers making purchases, and we're lucky because we can identify them all (well, probably it would have been better not being able to, as then the data volume would be smaller

, and there's a table in which the unique customer ID is stored.

However, when you want the full flexibility over time you may face some issues with the data volume and also query speed, so you think of an alternative approach where you precalculate predefined levels, as aggregates won't work fine (how to aggregate efficiently on so many distinct customers ...?! not possible at all...). So you put a LEVEL indicator in the table and try to get this rewritten in MSTR for aggregates, which seems to be virtually impossible.

Another more practical problem is ocurring with consolidations: When a COUNT DISTINCT metric is defined, it seems that MSTR is not applying consolidations correctly, such as on product groups (how many distinct customers have bought products from a product group...?). At least the SQL it creates doesn't count on groups (only consolidation, no attribute on this level) but still on the more detailed level of products (on which attribute the consolidation was defined).

Finally slightly out of topic, is there a way to flexibly define a 'Rest' element in a consolidation?

Thanks for your thoughts!

Best regards

nlim · Feb 13, 2004

brmstr, very valid concerns about distinct counts, they apply for all products. Here I'll try to address just the first 2 since I'm not quite sure what a rest element is.

I think you are saying that you would like to have an some precalculated distict counts show up somewhere so these common ones don't have to be run against the base fact table. These precalculated counts are still "non-aggregatable", ie only when a query includes all the attributes in this summary table can you use this summary table. The same logic I think applies for level indicators.

So you can get decent performance for high level counts by creating summary tables and have reports coded to that table. eg. say you have a product, zip, #distinctcustomers table. You can then build a metric defined as sum(#distinctcustomers), and a specific report with product and zip (even with element filters) on it. You'd have to disable drills along the data hierachy, but allow drilling to another template with different attributes, and another count metric (with a different definition). I would use the datamarting feature to generate these summary tables.

The biggest problem is switching the metric definition at different levels, if you are at a summarized data level, you must use the column with no aggregation function. If you go deeper you'll change it to count(distinct) customer. This dynamic change of metric definition based on aggregate levels required on the query is impossible, and not manageable because you'd have to check for every possible attribute combination.

Consolidations are not advisable with distinct counts, or any non-aggregatable measures. The way consolidations work is more like a dynamic aggregation engine. The SQL pulls back the appropriate groups, and then performs the consolidation in memory. So distinct counts would not be possible. Let me illustrate. Say you have product group A, product group B, product group A+B. The SQL will pull back counts for A and B. But then it will add these 2 together for A+B. By the way, there is no way around this problem, since distict counts must calculated at the lower level, and cannot utilize any summary level counts.

The solution to this is to use custom groups instead. The numbers will look right for you in the above example. But this is less efficient because 3 passes of SQL are issued. The most efficient way is to run one pass for SQL for both product A, Product B; and another pass for product A+B. The bright side is that at least the tool can do the job where others fail. Something, albeit, slow is better than nothing. Can you imagine doing distinct counts with cubes...

my 2 cents.

brmstr · Feb 16, 2004

Hi nlim,

thanks for the very good answer!

Another solution could be to have MSTR include the distinct ID (i.e. customer ID) and then perform the calculation on the OLAP engine, but I don't have a clue how to achieve this or if consolidations would work together with it.

As only consolidations can be edited in the web but not custom groups, I'm afraid we are running out of options and need to accept what we get ...

With 'rest' I mean this: Given your example with products, you may want to have an element 'all other products', especially if today you can't tell what additional products there may be in the future. So you would like to have an element saying 'everything else not already included above' - I know this is possible with custom groups, and I know I can subtract elements in consolidations, but I wonder how to subtract from 'All' - or maybe it needs to be done differently?!

best regards

nlim · Feb 16, 2004

conceptually having MSTR keep the distinct IDs is no different from the database keeping the distinct IDs. So, I think the likelihood of that happening is very low...MSTR likes to reduce duplication of work. I still think databases are better at count distincts than any other product out there...Plus, MSTR would have pull the data out of the rdbms dynamically, not a good performance enhancer.

As for the "rest" problem: I suggest the following. Create the filters F1...F4 in the Custom groups. A "rest" is the same as ALL - filter selections. So I would create another filter with empty AND NOT F1 AND NOT F2 AND NOT...you get the idea.

Good luck.

Z3 · Feb 19, 2004

Log a case with support for enhancement. What you need is a 'framed' metric and metric extension. The idea is that the metric only works when all the attributes it intersects (as defined by dimensionality) are either filtered (with eq, not 'or' or 'in' or others) or group-byed, otherwise default to another metric, in this case then count distinct at base level. I believe currently you can do framed metric, but you can not automatically extend to default when the framed condition is not met. If you don't ask, you will never get it. So cry as loud as you can

Tek-Tips is the largest IT community on the Internet today!

Members share and learn making Tek-Tips Forums the best source of peer-reviewed technical information on the Internet!

Count Distinct

brmstr

Technical User

nlim

IS-IT--Management

brmstr

Technical User

nlim

IS-IT--Management

Z3

Programmer

Similar threads

Part and Inventory Search

Sponsor