Tek-Tips is the largest IT community on the Internet today!

Members share and learn making Tek-Tips Forums the best source of peer-reviewed technical information on the Internet!

  • Congratulations strongm on being selected by the Tek-Tips community for having the most helpful posts in the forums last week. Way to Go!

Process Driven Fact Table Design??? 2

Status
Not open for further replies.

flurk118

Programmer
Feb 8, 2008
21
GB
I have a DW design question that I would like to float out to all you experts out there with the (rather selfish sorry) intention to help me come up with a suitable design with which i am struggling with.

I am required to build a warehouse, and appropriate cubes for mining, for my customer.

The operational database can be described as follows (simplified for first post):-

Product
Customer
Sale
SaleStatus

Now a customer can purchase a product. The purchase process can span a period of time (days/weeks). In the process the sale moves through a number of statuses (submitted, verified, approved, declined, sold etc). The sale can be move forward and backwards though these statuses which means the database holds multiple same statuses per sale. Status changes can happen multiple times in a single day.

My customer needs to report on the sale pipeline - which sales went through which statuses.

Product, Customer, Sale tables all can change daily and have elements that will need to track history.
All status changes are date time stamped and stored in the SaleStatus table.

My initial thoughts are a star design with the central fact driven by the events that occur within SaleStatus table.

My main concern is double counting and historic slicing.

To remove double counting I can set a flag that depicts the current sales status during the ETL and i have also thought about flagging the first and last occurrence of each status per sale.

But this does not help me when slicing by a date - what is the current status of a sale today is not what the current status was last month. Hmmm...



Hopefully that is enough info to drive a discussion but please ask if you need more.

Any help with this would find an extremely grateful person on the receiving end...

 
Regarding the lifecycle of a sale.

Does each sale go through the same set of status in the same order? If yes, perhaps the status can be denormalized as a date-time attribute for each of the statuses. For instance, DT-submitted, DT-verified, DT-approved, DT-declined, DT-sold, DT-paid, etc. Thus you associate a date with the time each status was achieved. And, if the lifecycle is consistent, you also know how long each sale stayed in that status, perhaps useful as well.

To avoid nulls which can cause join problems in Star Schemas, a ficticious date is used for DT-declined if the sale is not declined, say "31-DEC-3000"




-------------------------
The trouble with doing something right the first time is that nobody appreciates how difficult it was - Steven Wright
 
Thanks for your response John.

I have considered this option but...

Although, from a high level, each sale does progress through a preset pipeline, certain statuses are split to a finer granularity.

To explain, a sale enters the system at "Submitted" status. It then moves through a rules engine resulting in either a "Accepted", "Referred" or "Declined" status. These statuses make up the "Decision" status.
There are a number of grouped statuses like this and in total there are 20+ distinct statuses. This number is controlled by the operational system and of course can change as business logic evolves. A standard pipeline sees, on average, 8 distinct status changes.

If the sale route was progressed in a linear fashion then I could see the date design working but unfortunately a sale can move back through the pipeline and forward again - resulting in multiple status entries.

My suggested design of all entries existing in a single fact makes data mining very difficult.
If, for example, you wanted to report all "declines" for Q4. The business would want to count all the cases that had seen it move into that status during that period but you have the risk of double counts as individual sales could have entered that status more than once.


I have had previous experience of building data warehouses but this one is really taxing me.

Help...

 
If you store a "movement" attribute (either up, down, neutral) in the fact table, you should be able to get a distinct count of all sales numbers which had a downward movement. This could be something you could calculate in the ETL process.

For historic slicing, you might be able to look at having an "end of day" status fact table in order to compare apples to apples across time.
 
Thanks for your post Riverguy. A couple of interesting points there but could you expand on them for me.

To help with clarity let me put this scenario to you (it's a bit convoluted but real)...

A sale comes in (Id=S0001) to the system on 1st February,
moves through the decision engine which results in a referral.
After an agent reviews the case, on the 2nd March, it is moved back through the decision engine and this time accepted.
Case then moves through "Application" on the 15th March.
Case then sees critical data change (25th March) so is moved back to submitted and back through the engine again which another Accept result. Then, finally, moved through "Application" to "Complete" on 6th April.

Data would look like this...

Id Status StatusDate
S0001 Submitted 20080201
S0001 Referral 20080201
S0001 Submitted 20080302
S0001 Accepted 20080302
S0001 Application 20080315
S0001 Submitted 20080325
S0001 Accepted 20080325
S0001 Application 20080406
S0001 Complete 20080406

If this data was used to poulate the fact then slicing on March would give you double counts for submitted etc.

Hope this helps drive discussion...


 
First question, what is an example of the output you are looking to get?
 
The requirement is to build a cude for data mining.

User will slice by Status ("Submitted", "Application" etc) and by date period (Year, quarter, month etc).


So, example view would look like...

Q1 Q2 Q3 Q4
Application 20 18 32 23

 
So in this example, you are saying that 20 sales were in a submitted status in Q1? Does this mean that at some point in time in Q1, there were 20 sales in submitted status, or, on the last day of the Quarter, there were 20 sales in submitted status? Or 20 sales spent most of their time in the submitted status? Or some other logic? Take the following example:

Code:
                   Q1  Q2  Q3  Q4
Application         5  5   5   5 
Complete            5  5   5   5
Total               6  6   6   6

Five plus five does not equal six. But it's not necessarily incorrect, and can be represented in a cube. Based on what we have to go by, I don't think you are going to be able to assign each sale to only one status for every period of time, unless you divide it into fractional counts.
 
My example was your first dexcription - at some point in time in Q1, there were 20 sales in submitted status. And this included 2 counts for my example "S0001".

Your last statement has hit the nail on the head and is what i'm struggling with - assign each sale to only one status for every period of time.

Do you think this is achievable with fractional counts? If so, could explain how or at least point me in the right direction?

You help is very very much appreciated.

 
I would keep a detailed fact table and a summary table of current status.

Your statements above have steered me away from denormalizing the Status Dimension in the detailed star schema, but I would build a summary star schema, updated as frequently as your reporting demand which contains the current "snapshot" of your sales and their statuses.

-------------------------
The trouble with doing something right the first time is that nobody appreciates how difficult it was - Steven Wright
 
Another question that comes to mind.

Looking at your sample with sale S0001.
This has a state of submitted in februari (Q1), march (Q1) and april (Q2). Do you want to count it once for every month (Quarter)? Or do you want to count it Once only?

For me it is a good example of how business rules can influence a design.

@John: I am not sure the summary star-schema is really suited for this case. This may depend on this: Do you have to report (rebuild youor cube) for some given day in the past? or just for the current period?
 
Thanks for your responses guys.

Hans,
For what ever time period is being sliced on I want to report 1 count per case.
So if a user wanted "Application" figures for March & April it would bring back 1 count for S0001.

I want the cube to be flexible and be able to perform both current view reporting and historic off it.
For the current view I will flag the latest status as current during the overnight ETL - nice and easy - but the historic element is stumping me.


 
Did you consider technique of having active flag. That is along with the Status date, have an
end date when that status was active. As soon as sale goes thru a status again..previous status
becomes inactive. This way a sale will have only one status active at a time.

Id Status StatusDate Active Flag End Date
S0001 Submitted 20080201 A Some Future date
S0001 Referral 20080201 A Some Future date
S0001 Submitted 20080302 I 20080325
S0001 Accepted 20080302 I 20080325
S0001 Application 20080315 I 20080406
--------------------------------------------------March 15th
S0001 Submitted 20080325 A Some Future date
S0001 Accepted 20080325 A Some Future date
-------------------------------------------------March 31st


S0001 Application 20080406 A Some Future date
S0001 Complete 20080406 A Some Future date

Cuurent analysis =

A sale will have only one status active at any time and that would be the current one.

For historical analysis -

Status date is <= slicing date and End date is >= slicing date consider records active

For March 31st ...Reicord marked as I will not be counted.

For March 15th...Records marked as I would count.

Hope this is of some value..
 
Thanks for that jtamma.

Doesn't really help as you can still incur double counts for date ranges.

Take a report that wants all Submitted cases between 1st March and 31st March using the following data...
Id Status StatusDate Active Flag End Date
S0001 Submitted 20080301 I 20080312 S0001 Submitted 20080312 I 20080501 S0001 Submitted 20080501 A 20991231

Both Inactive rows would be included.

In SQL you would simply group by the Id.


I ask... am I attempting the cubing impossible????





 
Cor, that's bad formatting. Try this...

Code:
Id       Status      StatusDate  Active Flag    End Date 
S0001    Submitted   20080301    I              20080312
S0001    Submitted   20080312    I              20080501
S0001    Submitted   20080501    A              20991231
 
Hmm, Flurk.

What size of datamart/cube are we talking?
What environment will you be using?

Can you use a form of a perdiodic snapshot, with a daily grain, using 1 as a measure and sale and status as dimensions? Perhaps fill this fact table with only the occuring records.

And then use the max() (or better a first() ) function to aggregate to week or month level?

Just a quick respons, so I may be completely off.
 
Flurk,

Sorry I dont see problem with date range..

For historical analysis - you will have to ignore the active flag - you will kind of construct the active flag on the fly based on the date ranges..by comparing the date range appropriately with status and end date.

Since you are updating your fact by updating flag ..so only way to see fact table as it was on a historical date is to recreate this flag using dates for historical queries.
 
Thanks for your posts, much appreciated.

Hans,

The if the fact is built using the status table it will hold approx 500,000 rows and will reside in SQL 2005 Analysis Services.

If I understand you correctly you are suggesting a fact less table with a snowflake off to association SaleStatus SCD.

This is going to highlight my inexperience but hey ho... Could you elaborate on the max() and potential defined first() functions? Are these defined with AS? If so, links to example sites would be handy.


Jtamma,
You may be on the same wavelength as Hans with your last post as, yes, what i need to do is recreate the "current" or "last" flag from a historical viewpoint.
Without a function I cannot see how a long date range would not bring back multiple rows.

Great couple of posts guys. Has really put some ideas out there and i think i can see a light in the distance (although it's still just a glimmer!!!)





 
Thanks jtamma for the link. A very interesting article.

If I have understood the design properly the fact table is populated at the end of each month with a single row for each existing employee. The row is keyed to the transaction dimension using the transaction id of the last transaction in the month for that employee.

I can see how this would satisfy the first requirement but, bringing my dilemma to the table, what happens if an employee sees a number of transactions for that month and a report is requested that asks for a count of all promotions in month 1. If an employee had been promoted but then transferred you would not be able to query though the fact/cube???

Also, and even more relevant to my problem, what if the employee was promoted twice in that time period and the report specifically requested the number of people that were promoted (they do not want to count the same employee twice)?
Using a query on the dimension and a date range of 1st of month to 31st (or last day) of month will pick up both transactions.

Am I wrong?
 
Status
Not open for further replies.

Part and Inventory Search

Sponsor

Back
Top