Tek-Tips is the largest IT community on the Internet today!

Members share and learn making Tek-Tips Forums the best source of peer-reviewed technical information on the Internet!

  • Congratulations SkipVought on being selected by the Tek-Tips community for having the most helpful posts in the forums last week. Way to Go!

Surrogate key or Natural key

Status
Not open for further replies.

avjoshi

IS-IT--Management
May 12, 2003
221
0
0
I am having tuff time explaining to my client to use surrogate key for the dimension tables in the warehouse.
We are creating a small data mart. Fact table contains roughly 50,000 rows. Daly increment is about 500 rows. The largest dimension table has 3,000 rows.
Client wants to see the natural ‘dimension’ keys into the fact table. According to them ninety percent of the time end users will query the facts based on the natural key and it’s very rare that user will ask for additional information from dimension table. By storing natural key in the fact tables we will save unnecessary joins and complications that may arise because of that. Dimensions are ‘type 1’.
On the face of it the argument seems logical if what they say about user queries is true. My architectural instincts tell me to use surrogate keys.
I wanted your professional opinion. What do you think? Should I go with the flow or are there any other compelling reasons I should avoid doing it?


Anand
 
Use surrogate keys. What happens at the data modeling/database level is of no concern to the end user. The surrogate keys are invisible (or should be) to the end user. Surrogate keys, as integers, allow faster joins. In addition, it is difficult to support slowly changing dimensions without them. The natural keys are part of the dimension and should be indexed if users will be using them for query selection, but the primary unique key should be a surrogate. If they don't like the term "surrogate" key, use "warehouse" key or "synthetic" key.

Let me know if you need more ammunition. Stick to your instincts - the surrogate keys are the right solution - always.
==============================================

-------------------------
John Herman is available for short and long term data warehousing consulting and contracts.
 
Thanks John,

I’ll need more ammo. I am arguing with their DBA. He currently manages their online application systems. Not a strong warehousing background.
His argument is that the source keys do not change and since it is a type 1 dimension we don’t store history. If natural keys are stored in fact table then we don’t have to do any join to the dimension table at all.

Let me illustrate my point further with example.
Source tables,
Provider
Provider_no number,
Provider_name varchar2(30),
Provider_address varchar2(50)

Service
Service_no number,
Provider_no number, --natural key
Service_provided number --number of times service provided

Our survey of the reporting requirements showed that majority of the user queries are against the natural keys. E.g. the queries will be like,
Code:
select provider_no, service_provided from service_table;

It’s very rare that they will need the additional information like ‘provider_name’ or ‘Provider_address’ so, keeping provider_no in service fact will save extra join every time provider_no is queried upon.


Anand
 
I realized, my comment natural key should be on 'Provider' table and not on 'Service' table.

Anand
 
I can understand the natural desire to have natural keys in the fact table. Naturally, users relate to natural keys. Having a DBA with only OLTP experience does not help the matter.

What are the future plans for this datamart/subject area? If the future plans are unclear or will require expansion, consideration must be given to getting the correct base design now. If this will be a static table set, and never ever ever change, you might be OK with the natural keys considering the table sizes. However, with tables as small as these, the performance gains by having the natural key in the fact table if all the users want is the key will be negligible.

In summary, if the future of this application is not well defined, or if its future involves adding more fact tables for cross subject area analysis, I would go with the warehouse/surrogate keys.

---------------

-------------------------
John Herman is available for short and long term data warehousing consulting and contracts.
 
It is sometimes difficult to deal with DBAs especially when their experience is heavier in OLTP designs than data warehousing. Sometimes this is attributed to the thought of additional work (or plain lazy) to be done to maintain surrogate keys.

I just want to ask if you are using an ETL to automate the generation of surrogate keys and monitoring of slowly changing dimensions?
 
Yes, it will be an automatic process. The nature of this data mart doesn’t warrant the type 2 or type 3 SCD.

Anand
 
Hi Anand,
I think the situation that you are in demands you to firstly meet the customer expectation and give him the luxary of accessing provider_no from the fact table.

Now to satisfy your DWH instincts you should build the Provider dimension table and have the Provider Key added to the Service Table. And since you said you only load 500 records a day and type 1 SCD.. I don't think this approach will have an impact. Sounds stupid !!! I know...

This is the only way you can keep your keys independent from the production keys and sleep peacefully in the night...

But remember in future if your needs for the Provider Keys changes then you have to delete the Provider_no column from the fact table...So be careful in what you decide...

And let us know what you did !!!

*******************************************
 
After thinking about this situation some and how I would approach the situation I feel the route I would take is. If the DBA is not the client but rather works for the client I would approach your client with well documenteed arguements on why you feel that your approach is better than the DBA's supported by sound arguements. Include the Benefits of your approach performance scalability etc against the recommendations of the DBA inflexible design, may cause issues if needs change (and they usually do). Your approach should that you are the expert in this technology, after all thats probably why they contracted it out and did not do it internaly. A small mock up of each approach may be a good idea to show them that they can get what they want in the design you are suggesting. Many times internal people are resistant to external influences, and force their ideas rather than trying to be open minded. You have demonstrated your ability to be flexible now you just need to be political and decide with your client not just their DBA what is best for them.

Shoot Me! Shoot Me NOW!!!
- Daffy Duck
 
avjoshi

The use of surrogate keys is "transparent to the user", so, unless the user is technically skilled in DW, I don't think going to them would really help.

I would find some articles written by notable DW individuals that explains the adavantages of using surrogate keys, and forward these papers to the DBA you are working with. Then, let him/her "chew on it" for awhile. If the DBA is interesting in expanding their horizons, they will "see the light".

Good Luck!
 
Thanks guys for your valuable inputs. We finally managed to convince the DBA team the merits of using surrogate key.
We proved them, with actual results, that we can mask the physical implementation with user friendly views and this doesn't affect the query performance if proper indexes are created to support queries.


Anand
 
I bet the DBA group just loves you now.

Shoot Me! Shoot Me NOW!!!
- Daffy Duck
 
I think you always offer the best solution (surrogates)and the reasons - to cover yourself when they invariably change their minds down the track.
However if it is going to cause you grief arguing against a stubborn DBA, just make sure your case is documented and carry on with natural keys if there is no need for history -you will have covered yourself and not gone head to head with an antagonistic DBA.

 
I'd like to get back to one of the first points in the first thread. If, for instance, there is a natural key called ProductCode in the production OLTP source system, and this code is 4 bytes in length, and the users will commonly be searching against this code in their "where" clauses, then what is the advantage of putting this code only in a product dimension table joined by a surrogate product key, and then forcing the user to join to the dimension table to include the natural product code in their "where" search? Doesn't it make more sense to just include the natural product code directly in the fact table and indexing it for searching? Wouldn't this be faster than requiring join processing to do the search?
 
Remember that surrogate keys are not just for join speed. One of the primary purposes of surrogate keys is to deal with slowly changing dimensions. If you can swear or get your users to swear that that particular natural key will never never ever refer to another product and that all the attributes of that product code will forever be constant, then I will feel a little better with you putting a natural key in the fact table.

-------------------------
The trouble with doing something right the first time is that noboby appreciates how difficult it was.
- Steven Wright
 
I fully understand the benefits of the surrogate key, including slowly changing dimensions, its just that this situation regarding users searching very often on a natural key never seems to get addressed in discussion or articles that advocate surrogate keys. In fact, my inclination would be to go ahead and create a surrogate key, but still include the natural key indexed in both the fact table and the dimension table, for those natural keys that are commonly used for searching. Any arguments with this approach?

But, can someone confirm my question? Would it not be faster in a query that searches a small natural key if the natural key is directly in the fact table and indexed rather than isolated to the fact table?
 
With the advent of new database and hardware technologies the difference in response time between primary key/foreign key search and single table index search is minimal.
We can simulate a natural key to be in a fact table by virtue of views.
If the dimension is type one and is guaranteed to remain so one may argue in favor of keeping natural key in the fact table. In my experience, most of the data marts start out that way and during the course of expansion a need to store historical changes to the dimension table also increases. That leads to the creation of type two or three SCDs.
IMHO that’s the reason most of the time storing natural key to fact table is discouraged. But I would welcome other opinions on this topic.


Anand
 
With regard to speed of the natural key in the fact table, I think the "Best" solution is to test it in your own environment. My gut feel is that it will be quite variable based on your server, database, row counts, indexing, and tuning. If it works for you, great.

We must remember that if there was one perfect solution for Data Warehousing, some vendor would have invented and copyrighted it. But it's not that way, so we all have jobs as DW Specialists.

-------------------------
The trouble with doing something right the first time is that noboby appreciates how difficult it was.
- Steven Wright
 
Status
Not open for further replies.

Part and Inventory Search

Sponsor

Back
Top