Tek-Tips is the largest IT community on the Internet today!

Members share and learn making Tek-Tips Forums the best source of peer-reviewed technical information on the Internet!

  • Congratulations Mike Lewis on being selected by the Tek-Tips community for having the most helpful posts in the forums last week. Way to Go!

Source Fact and Aggregate Fact tables?

Status
Not open for further replies.

TJRTech

Programmer
Apr 8, 2002
411
0
0
US
Okay, we have the need to model a "source" fact table that has discrete facts about people viewing online movies, on demand. This fact table will have one record, per view, including start date/time and viewing duration. The time dimension is broken into a 15 minute grain (96 records per day) and very rich. Let's call this fact: MOVIE_VIEW.

Then, we want to somehow "roll" up these movie views into another star schema with the same time dimension, but such that a single view from the MOVIE_VIEW fact table will "explode" into multiple records in the new MOVIE_DEMAND table. The reason for doing this is we want to be able to track peak demand for movies viewed.

A MOVIE_VIEW fact corresponds to the start of the movie and the movie run length, and is singular for each view, whereas MOVIE_DEMAND is an aggregate of a movies demand, in 15 minute increments. So, if a movie is 90 minutes long, and played to completion, it would generate 6 MOVIE_DEMAND records.

Is this a good way to do this? Are there better ways?

TR

Is this
 
It depends on what questions you are trying to answer.
If the fact explosion (I am using your term for the lack of better one) is with regards to only the time dimension then you can answer only a subset of queries that can be answered by movie_demand.
Can you give us some insight on what type of queries you expect on new exploded fact table? That might help us determine whether this is a good approach.


Anand
 
Yes, I can provide some examples.

With the MOVIE_VIEW fact table and associated dimensions we will be able to track who viewed what, when, and get a feel for what the popular movies are on a per "start" basis. Likewise, we can see what periods of the day, week, month, etc are popular for movie starts.

But, we have the need to track more than just movie starts. We also need to track and report on simultaneous movies being played, and do that for various periods (sub-hour, hour, days, weeks). This is more of a utilization reporting need. I was going to have the MOVIE_VIEWS explode into a MOVIE_DEMAND, which is sort of a reverse aggregation (it generated MORE records net less).

We seem to be able to get away with a time dimension that is every 15 minutes (~35k recs/year)

To my way of thinking, a 90 minute movie played at 1:30pm would generate one record in the MOVIE_VIEW table with a NUM_VIEWS agg fact of 1 and likewise generate 6 records in the MOVIE_DEMAND fact table; one each for the 1:30pm, 1:45p, 2:00pm, 2:15pm, 2:30pm, and 2:45pm time dimensions.

By doing this we can run reports that give us peak demand at various times (months, weeks, days), and hopefully drill down and see what movies generated that demand. Movies play on particular equipment and said equipment is rated as having a max capacity (thus an equipment dim table), so we can get utilization which is aggregated load / capacity.

Is there a better way? Should I just do some sort of a calendar table with the 15 minute periods and join that against my MOVIE_VIEW table, do aggregation that way? Actually, my time dimension table will be that calendar as it will be pre-populated and static.

THANKS,

Regards,
TR
 
Ah polling!!

In my opinion from the query point view you are better off creating another fact table MOVIE_DEMAND.
Other option would be to create view to simulate explosion but that has a potential of negative impact on response time if not structured properly. If the response time is of paramount importance to you then I think easiest way to achieve that wiould be creating exploded table.

Hope this helps.

Best of luck.


Anand
 
Your best bet is to create two separate FACT tables. MOVIE_DEMAND and MOVIE_VIEW.

Going by Books as u are changing the granularity of your fact table us should have separate TIME demension also. But i do not see any added advantable unless u are going to do any factless reporting.
 
Just to put closure on this...

We ended up creating to fact tables. One a source fact that had single records for each of the movie views; and another that was "exploded" from the prior source movie views source fact table to create demand.

What we ended up with is each 15 minute time slice in the movie demand table being an aggregate of movie viewing. If two movies were viewed entirely during a period there would be 30 minutes of movie_view_minutes...if one as viewed the entire time, but another only during the last 5 minutes of that period then there would be 20 view minutes.

The query to agg and populate the aggregate fact really crawls, but we will run that daily, and off-hours.

TR
 
You could have created a view for MOVIE_DEMAND. This would have saved on storage space. It's not necessary to actually populate a MOVIE_DEMAND table, since you can simulate it with a view.

Also, any aggregate tables you need to create don't require the exploded view, you can calculate the numbers from the MOVIE_VIEW table, getting better performance.
 
Yes, we could use a view, and we even populate the aggregate fact table from a view against the source fact table, but its just too slow to do that aggregation as needed during reporting.

MOVIE_VIEWS gets 10s of thousands of entries each day, and each record has a date time column for the view start, and a date time column of the view end (datediff of the two gives duration).

A single MOVIE_VIEW record can contribute to SEVERAL MOVIE_DEMAND facts (thus the explosion). For example, a 90 minute view will contribute to the aggregates of three MOVIE_DEMAND records, since each record summarizes the movie demand for a 15-minute block (there is a TIME_BY_QUARTER_HOUR dimension that has 96 records per day, for several years).

We use the MOVIE_DEMAND star schema independently in some reports and in conjunction with a CAPACITY star schema for reports that need utilization (util = demand / capacity). In this way we can track utilization at 15 minute increments.

Given the volume of the source facts and the rather complex "explosion" that takes place, the view against the source fact table just takes too long to go against at report time. Instead, we have a batch job that runs daily and rolls up the aggregates from the previous day and fills the aggregate fact table. Then, the reports scream.

TR
 
Status
Not open for further replies.

Part and Inventory Search

Sponsor

Back
Top