Tek-Tips is the largest IT community on the Internet today!

Members share and learn making Tek-Tips Forums the best source of peer-reviewed technical information on the Internet!

  • Congratulations Chris Miller on being selected by the Tek-Tips community for having the most helpful posts in the forums last week. Way to Go!

Process Driven Fact Table Design??? 2

Status
Not open for further replies.

flurk118

Programmer
Feb 8, 2008
21
GB
I have a DW design question that I would like to float out to all you experts out there with the (rather selfish sorry) intention to help me come up with a suitable design with which i am struggling with.

I am required to build a warehouse, and appropriate cubes for mining, for my customer.

The operational database can be described as follows (simplified for first post):-

Product
Customer
Sale
SaleStatus

Now a customer can purchase a product. The purchase process can span a period of time (days/weeks). In the process the sale moves through a number of statuses (submitted, verified, approved, declined, sold etc). The sale can be move forward and backwards though these statuses which means the database holds multiple same statuses per sale. Status changes can happen multiple times in a single day.

My customer needs to report on the sale pipeline - which sales went through which statuses.

Product, Customer, Sale tables all can change daily and have elements that will need to track history.
All status changes are date time stamped and stored in the SaleStatus table.

My initial thoughts are a star design with the central fact driven by the events that occur within SaleStatus table.

My main concern is double counting and historic slicing.

To remove double counting I can set a flag that depicts the current sales status during the ETL and i have also thought about flagging the first and last occurrence of each status per sale.

But this does not help me when slicing by a date - what is the current status of a sale today is not what the current status was last month. Hmmm...



Hopefully that is enough info to drive a discussion but please ask if you need more.

Any help with this would find an extremely grateful person on the receiving end...

 
I donot think the whole design is applicable to you just where it says
"A crucial piece of the design is the second date/time entry: transaction end date/time"

I guess it is the date range that is creating confusion.
I am trying to understand where is the problem with the date range..

Id Status StatusDate Active Flag End Date
S0001 Submitted 20080301 I 20080312
S0001 Submitted 20080312 I 20080501
S0001 Submitted 20080501 A 20991231


So if I change the condition like this

Consider all the records active where

StatusDate >= range start date (March 1)
and StatusDate <= range end date (March 31)
and End Date >= range end date (March 31)


Will it work for date range.

My thought is this..

At any given time only one status would be active so in theory,
.if you use start and end time properly - with your date range..
there is no reason why you should pick two status ..

Line 1 - March 1 to March 12
Line 2 - March 20 to may 1
Line 3 - May 1 to Dec 12

May be there is some business condition that I do not know..and things are not as simple as
I am making it..
 
I think the confusion lies with you slicing on a single date and myself wanting to use a date range.

For example, you’re asking the question… What was the status on 10th March?

Code:
SELECT * FROM StatusTable WHERE StatusStartDate <= ‘20080310’ AND StatusEndDate > ‘20080310’

Whereas I’m asking… What Statuses were created in Q1?
Code:
SELECT * FROM StatusTable WHERE StatusStartDate >= ‘20080101’ AND StatusStartDate > ‘20080331’

With my DW designs I use a standard Date Dimension that allows me to aggregate by Year/Quarter/Month etc. So the fact table I’m proposing will carry the date of status change which will foreign key off to the Dim_Date table.
This allows me to slice using dim_date.quarter = Q1, it will pick up all events that occurred within that range. This is not set in stone of course as it may not be possible to use this dimension if I want to achieve my goal of returning distinct counts.





 
My last reply shows how the same approach can handle date range.
 
Accpeted, but just with the sample data you detailed.

Add another inactive status for that period and then it would return two rows. As follows...

Id Status StatusDate Active Flag End Date
S0001 Submitted 20080301 I 20080312
S0001 Submitted 20080312 I 20080325
S0001 Submitted 20080325 I 20080501
S0001 Submitted 20080501 A 20991231

:(
 
Are you sure ? I think it works even for this one,No matter how many inactive records in the period, there will be only one that would have end date greater than the range end date right ?
I am sorry, if I am missing something. Hope you find a better solution than this. Good Luck.


 
Yup, you're right (Me being blinkered by my thoughts there). I can see that your query would return 1 row.

Your query basically returns what was active on the 31st March. This is not the question I am posing though.

I need "What statuses were active at any point between 1st March and 31st March."
If a sale had a 1 or more statuses of the same type active then I only want to count that sale once.
 
In pseudo SQL the answer would be something like:

select distinct month, sale, status, count( distinct( month, sale, status))
from <a table>
group by month, sale, status

Correct? And now you want to know how to design a data-structure that can readily answer such questions and does not take eternity to get populated.

It is a nice, tricky situation. Finally a challenge :)

Coming back to your question about max() and first().
I am not at all familiar with the features SQL-server offers. I am familiar with Oracle.
In Oracle there is a max() aggregate function which returns the maximum value for a given attribute.
Using this function on the periodic snapshot I described, gives you a 1 for every occurence of sale, quarter (or month) and status.
This will work for the aggregations across time. Aggregations across the other dimensions will probably need a sum.

So actually you have a semi-additive fact here.

The first() function should return the first occurence of a resultset (preferably without computing the entire set). Oracle analytics has features for this, but does (in 10g) compute an entire resultset on the way (at least I think it does).
 
Status
Not open for further replies.

Part and Inventory Search

Sponsor

Back
Top