Tek-Tips is the largest IT community on the Internet today!

Members share and learn making Tek-Tips Forums the best source of peer-reviewed technical information on the Internet!

  • Congratulations gkittelson on being selected by the Tek-Tips community for having the most helpful posts in the forums last week. Way to Go!

Time/Date Dimension, Surrogate Key or not

Status
Not open for further replies.

lbe

Programmer
Oct 29, 2007
6
DE
Hi all,

I'm about to design a Time/Date Dimension (on Day level) for a customer and I'm convinced of doing it the Kimball way is as close as perfect you could get (ie creating a spreadsheet as source). Roughly. I'm not convinced that the use of a surrogate key is the best choice. In Kimball's "The Data Warehouse Toolkit 2ed" he suggests that a surrogate key is preferred.

One reason is size. But a smalldatetime (SQLServer 2k) and an Integer are both 4 bytes. Doesn't that mean that performance is the same?

Another reason is how to handle Unknowns. With Surrogate keys you'd have the value -1 for unknowns but why not have the value '1900-01-01' for unknowns and '1995-01-01' for "Early dates", etc? You'd still have the recommended DATE_TYPE {'Normal','NA',..} field to separate the unknowns from the normal dates.

A reason to use the smalldatetime is that it's much easier as a developer to deal with in the development process (where you don't have to write joins all the time) and, in the end, saves time.

Another benefit of using a smalldatetime instead of a surrogate key is that you don't have to worry about getting the keys in the correct order.

Now my question. Can you come up with any other reason why I should use the surrogate key on my Time Dimension? (I've already searched the forum for answers on this topic.)

Note, I'm all for using Surrogate keys in all the other kinds of dimensions.

Thanks
 
Hi Ibe,

I have encountered this very same issue. As you have already pointed out, there are some pro's and con's to using surrogate keys on a date-dimension.

I am not sure about the performance difference in using a date vs a integer. That might depend on how the RDBMS handles them internally. But for size only, it should not matter.

Reasons for using surrogate keys:
- keeping your model consistent.
- (directly from Kimball) Refrain from putting ANY interpretation in the key.
- You may not know beforehand what range of dates will enter the datawarehouse (due to typo's for instance. I have seen order dates from 2950AD and from 610AD.) In these cases a predefined date for '(negative) infinite' and 'not a date' and 'unknown' may be ambiguous.

reasons to use datevalues are:
- understandability for report developers (or universe developers).
- the ability to quicker perform some actions direct on the date-values (e.g. days between, but not working days between!). This only applies in a very limited vaiety of actions.

I hope this helps a bit.
 
I don't use a surrogate key on date dimensions. I either use a DATETIME/SMALLDATETIME, or an integer that makes visual sense (20071029, etc.). For me, the reason for doing this is development/debugging. Almost all issues where data needs to be validated will have a date component to them, and its so much easier to do this without joining to the date dimension each time.
 
If you need granularity on the date any lower than one per day, you need a surrogate key (in my opinion). If your time granularity will be one day, and you are confident that it will never be any smaller, then go ahead and use an integer. I am not sure about the join efficiency with smalldatetime. I know integers join well, so CCYYMMDD is a good key.

-------------------------
The trouble with doing something right the first time is that nobody appreciates how difficult it was - Steven Wright
 
That's a good point John makes. I favor keeping separate Date and Time-of-Day dimensions (although different data models could require otherwise). In that case, a surrogate key definitely makes sense for a Time-of-Day dimension, instead of using a SmallDateTime with 1900-01-01 hh:mm.
 
Thank you guys.

Actually there is a need for a 15-minute granularity but my guess is that it'll be on just a few fact tables and most will be on the day level. I'll surrogate that dimension.

I'm still not sure what to use for my Day level Time Dimension. John suggests an integer with built in logic and RiverGuy finds that acceptable too.

Let's say that performance between an integer and a smalldatetime isn't an issue. Wouldn't you prefer the smalldatetime because you wouldn't have to transform every incoming Dates to the CCYYMMDD rather just truncate the HH:MM part of the Date and sum those rows for the fact table? Or should you create a CCYYMMDD-key for every interesting Date down in the Normalized Layer?

 
If I had to choose, and performance was not an issue, I would choose smalldatetime.

When you say truncate to the day level and sum the rows for the fact table, are you meaning to build an aggregate fact table on the day level? If at all possible, I would consider building the fact table down to the smallest possible granularity, even if the dimensions it joins to don't reflect that.
 
Yes, I mean that I'd build an aggregated fact table on the day level. In the normalized layer I'd keep the smallest granularity, of course. I could create a FK DATE_ID that points to the corresponding DATE_DIM in the norm layer. This would probably make the ETL to the mart much smoother. But the actual fact table in the mart must have the same granularity that its dimensions have. Or have I missed something essential? Could you please explain to me what you meant in your last sentence?
Thanks
 
The latest Kimball recommendations go with having the fact table grain at the lowest level possible. So, one record for each transaction. Now, if you're going to have one billion records on a 32-bit server with 4 gigs of RAM, then yes, you might want to aggregate the fact table. However, in a normal situation, you should be able to get performance with your fact table down to the transactional level. A record in your fact table might look something like this:

Code:
factMyTransactions
------------------
SourceSystemID / 00001
TransactionTime / 2007-01-01 03:27:13.003
DateKey (to Date Dimension) / 2007-01-01
SalesAmount / $100.0000
etc.

In this scenario, the actual transaction date/time is not joined to your Date Dimension--just the DateKey. Another benefit is that it is extremely easier to code your ETL. When a transaction record is added or changed in your source system, all you have to do is add or update one record in your fact table, instead of creating logic to create a new aggregate record. I've done this both ways (aggregated to business specs and at the transaction level) for the same subject matter. The transaction level was much easier to work with.
 
I've browsed through the "Toolkit" and ofcourse, you're right. My experience told me differently but that says more about those old systems than what's right and recommended.

I've performed a really simple but hopefully accurate test on the two datatypes (int and smalldatetime). For the integer PK I created one table with a built-in logic integer and one as a surrogate key integer.

I joined the tables on themselves and filtered on several ranges of dates and executed the SQLs a few times. The smalldatetime table took longest and the logic integer table took 60% of its time and the surrogate key table took 50% of the time.

Conclusion: I'll probably use a surrogate key for my Date Dimension after all.

Thank you all for valuable input.
 
Kimball does not actually say you have to use a surrogate key for the time dimension. In fact he implies that a date object probably would be used for the time key. Check out his first book Data Warehouse Toolkit where he talks about time dimension. Here's a direct quote in case you don't have the book (italics added):

"Most data warehouses need an explicit time dimension table even though the primary time key may be an SQL date-valued object. The explicit time dimension is needed to describe fiscal periods, seasons, holidays, weekends, and other calendar calculations that are difficult to get from the SQL date machinery."

I don't think there is anything in his later writings that contradicts this, though I could be wrong. Most people don't know his earlier book, which is too bad because he left out a lot of good design stuff like this in his later DW books. Funniest thing about his early book is, Ralph actually got Inmon to write the forward for it hehehe.

 
I really have to disagree with you about Kimball's later writings. I have not read the first Data Warehouse Toolkit but the 2nd edition thoroughly, in parts.

On page 60 (2nd ed) on the subject "Surrogate Keys" he strongly advocates the use of surrogate keys for the Date dimension too. "... your dimensional models have dates that are yet to be determined. There is no SQL date value for 'Date to be determined' or 'Date Not Applicable'. This is another reason we advocate using surrogate keys for your date keys rather than SQL date data types."

I guess he changed his mind on this matter since the first book. I don't think that it's a bad thing that he comes up with new ideas as time goes by. Conditions change and so must best practice advice. I warmly recommend his design tips:
 
Well that is interesting! I don't usually work with facts tables that need a date field where the date is either an unknown or a 'not applicable'. The facts I work with are transactional - financial transactions for banking, wholesale/retail sales, delivery shipments, inventories, etc. and these always have real dates. The reason why I find using actual date values instead of surrogates is that transactional facts tables get huge, into terabyte territory, really fast, and are excellent candidates for table partitioning. You can't partition on meaningless surrogate keys because you don't know the values in advance. Managing partitioning is a real bore if you have to do it after the fact (hehe good pun).

Interesting point though. Maybe its a case of what works best - Ralph never says that any of his rules are carved in stone, site-specific metadata and requirements need to take precedence.
 
If you need a pre-populated Kimball-style date dimension source and the SQL/code used to build it (at least in SQL Server) give my freeware date dimension toolkit a look.


Be sure to send a comment.

Best regards,
Don
 
Status
Not open for further replies.

Part and Inventory Search

Sponsor

Back
Top