Tek-Tips is the largest IT community on the Internet today!

Members share and learn making Tek-Tips Forums the best source of peer-reviewed technical information on the Internet!

  • Congratulations strongm on being selected by the Tek-Tips community for having the most helpful posts in the forums last week. Way to Go!

Schema question for multiple fact tables 2

Status
Not open for further replies.

datadb

IS-IT--Management
Jan 18, 2007
3
US
Hello,

I have been thinking about the best schema for a DW for clickstream analysis including sales.

What comes to mind first for web behavior is a page_view fact table with dimensions like session, user, and referrer. We also want to track sales, however, in relation to this behavior. So, can sales itself then become a dimension table of page_view?

Also, we want a classic sales_fact table with the usual dimensions like product, time and customer.

My question is though, it seems we have sales information then in two different places, one for analyzing in relation to web surfing, and another as an actual fact table. (In addition "user" for clickstream is also a customer dimension for sales_fact.

How best should we manage this? Is it a bad thing to have two separate star schemas where user and customer and identical and where sales (from page_view) and sales_fact are also quite similar? It seems like we are adding a lot of duplicate information. Is there a more elegant way to handle this?

I appreciate your input.
 
Kimball's ubiquitous masterpiece "The Data Warehouse Toolkit" (Wiley ) devotes an entire chapter to Electronic Commerce, including Clickstream Analysis.

-------------------------
The trouble with doing something right the first time is that nobody appreciates how difficult it was - Steven Wright
 
Hi Datadb,

Our company is looking at doing the exact same thing! Currently we use a product, “Sawmill”, to rip through our Apache web logs to get stats. This program (which I know very little about) is resource intensive and pretty limited from my understanding. As far as transactions go, we have some scripts which extract info from our OLTP system and basically just dump them into an Oracle data store. We then run a process to take the transactional info and model it into a star schema. We have no “formal” ETL tool and I was wondering if I could be so bold as to ask how you extract your web session info?

I plan on picking up Kimball’s book as John has suggested, but this might be cool if we could share our experiences. After all, that’s what this place is for, right?

Hope to hear from you soon……


Nuffsaid.
 
Thanks for the input.

John- Actually, I do have that book. I have only read portions , however, and I don't see anything in the TOC about clickstream analysis. I will need to dig into that deeper (maybe there is a newer edition?).

I guess it comes down to this- is ok for me to have a clickstream-focused star schema with sales as a dimension of page_event, and then a separate sales-fact table that may contain redundant information? One is for analyzing web behavior, the other for analyzing sales. In some cases one might want to do both together though...

This is probably a common issue and perhaps it has already been done in some of these other packages. I will check out Sawmill.

To Nuffsaid- thanks for the tip about Sawmill. It is very early on and I am just researching things, but I will keep you posted!

Thank you.
 
I think I would keep the 2 seperatie fact tables with conformed dimensions and do cross analysis (drill across) on these 2. So one fact table with the clickstream facts and one for the sales. You might want to add an attribute to connect an internetsession with a sale. This could be just an attribute to an existing dimension or a whole new dimension.

I do not understand why the clickstream fact table should hold sales data.
 
datadb - the Clickstream star schema is on page 295 in the Second Edition, which is quite different (and more useful) than the First Edition.

-------------------------
The trouble with doing something right the first time is that nobody appreciates how difficult it was - Steven Wright
 
Thank you very much for the replies.

I have modeled many OLTP databases, but not any for a data warehouse. I did pickup the Dimensional Modeling edition of the DW Toolkit, which is very helpful.

I have read somewhere on the web about including sales information in page_event if a page had a sale occur, so that is where I got the idea. In general, one might want to study the behavior of sessions that result in sales and those that don't, which would seem pretty important for many companies. I am sure others have solved this before.

I like Hans' idea. I think what I have been avoiding is the idea that fact tables would somehow be joined, but it sounds like doing 2 passes of two fact tables and joining based on headers across a common dimension is ok, a drill-across.

What would make the most sense here for a dimension to connect the two? Product does not seem to make sense since we want to tie it to the user. Doing it just on user does not seem correct either. Is it best to use the multiple dimensions of product, customer, and date and the same time of off both fact tables? Or, is it best to create a new dimension that they share, like session?

Thanks in advance for putting up with my questions, but I greatly appreciate the help in learning this stuff.
 
Your two fact tables will definitely have the date/time dimension as a shared dimension. Depending on how much info you gather from browsers, this might be your only shared dimension.

-------------------------
The trouble with doing something right the first time is that nobody appreciates how difficult it was - Steven Wright
 
Like John stated: Date/time dimension is probably one to connect.
If you can identify/connect the session-user and the sales-user dimension, that might be another conformed dimension. Same goes for product information available on a certain webpage.
If you have session information available around your sales facts, you could connect via this attribute as well.

It depends largely on what dimensions you can conform. If you cannot conform the dimensions from the sources available, you will have a very difficult time trying to make this analysis possible.
 
Rather than tracking sales at the pageview level, I'd track it at a session level. Aggregate your pageview data at a session granularity and then you can either create a view over your session data and sales data to show sessions that have sales ("converted sessions") or create another aggregate to combine this data.
 
Status
Not open for further replies.

Part and Inventory Search

Sponsor

Back
Top