Tek-Tips is the largest IT community on the Internet today!

Members share and learn making Tek-Tips Forums the best source of peer-reviewed technical information on the Internet!

  • Congratulations Chris Miller on being selected by the Tek-Tips community for having the most helpful posts in the forums last week. Way to Go!

Multiple Sources for dimension tables

Status
Not open for further replies.

tshad

Programmer
Jul 15, 2004
386
US
I have a project where I am building a Data Warehouse and am using SSAS, SSRS and SSIS to build and use my dimension/fact tables. I had it working fine for one database but need to use the same tables for the other databases the company has.

The issues is that there are multiple databases using the same table structures for the different companies it is involved with.

They all have some of the same stores in them. I had thought about using surrogate keys with the actual keys as the alternate keys so I can find the correct store from the base tables. The problem is that if I have 5 databases and is the same store in all 5 database, I will have 5 different store records with different surrogate keys and natural keys.

I was thinking about making a table that cross references one of the stores and use that table to get the correct store from the dimension table but am not sure that is the best way.

I was trying to find a best practices way of doing this but can't seem to find anywhere where they show an example.

 
I often have the same dilemma, whereby I want to do what industry practice recommends. Unfortunately, this business is not an exact science and is full of exceptions. Personally, I don't like exceptions, because it gets difficult to keep track of all of them over time. And then when one exception from the far past comes to light, I often do not remember the details of why.

Typically, If I have 5 sources, I will have 5 dimension records (i.e 5 surrogate keys). That way each source is responsible for its own records, and when updates occur, you know which source made them.

Having multiple records is not usually a problem for dimension tables because dimension tables tend to have few records relative to the number of records in fact tables. So don't be too concerned with tables that have under 100,000 records. However, some dimension tables can get large (especially if they are of the type-2 kind - maintaining generations of historical changes), which may incline me to share records among sources. Even then, I would avoid it unless I was having a performance problem.

I have one case whereby my dimension table is maintained by one source but the records in the fact tables that reference it are from other sources. This means when I am establishing my foreign key relationships, I must find the appropriate dimension record by converting the source-id of the fact table to the one used in the dimension table. So, some sources are converted, while others are not. By default they are not. Those that are not have their own dimension records, while those that do share records.

Before you take this route, be sure that the records are truly the same - not only in structure but in content. For example, one of your sources may be reporting in a different currency. Or what happens when the currencies are the same today, but not in the future? Or maybe another database comes on board that is the same in structure but different in content? In that case you will want to set it up so that it retrieves it own dimension record instead of the shared one.





 
In my case, I want to report on a retailer (Sears for example) or for a particular Vendor (i.e. Sony). Wouldn't having multiple records for the same Vendor or Retailer be an issue with any cubes I set up or reports I generate from the cubes?

I would assume that we would want only one record per vendor in my DimVendor and one record per retailer in my DimRetailer tables. If I had multiple records for each retailer or vendor, wouldn't that screw up the cubes?

I had thought about having cross reference tables for them with the PK as the surrogate key.

For example:

RetailerXref table
1) PK (identity) - This would then be the Surrogate Key for the DimRetailer table
2) RetailerID (from a Source and would be your TransactionalID)
3) Source ID (1 for DB1, 2 for DB2, etc)

VendorXref table
1) PK (identity) - This would then be the Surrogate Key for the DimVendor table
2) VendorID (from a Source and would be your TransactionalID)
3) Source ID (1 for DB1, 2 for DB2, etc)

In my ETL, I would use the RetailerXref table with my RetailerID that I get from my DB to get the Surrogate key which I would use in the Fact tables. This Surrogate key would also be the key for the DimRetailer and DimVendor tables.

I wasn't sure if this were the best way to do this.

Thanks,

Tom


 
I think you will need to use a cross reference table somewhere in the process, but it's a bit bigger than that. Someone in the business, outside of IT development needs to own this data. You might end up creating table(s) and application(s) to assist in mapping this data to source your dimension. However, I wouldn't try to implement some sort of fuzzy matching and do it all yourself. Someone should be in charge of keeping on top of mapping these elements to each other. Look up master data management on Wikipedia.
 
I've encountered this problem in at least two scenarios. First, when you've bought another company and you need to report on both. The other involves subsidiary companies who have enough autonomy to have some different table structures and values.

For the Vendor/Company problem, we created a new master record in the warehouse, sort of a giant cross-reference which we called the family record. Whether you actually translate the incoming key values or store them in a separate column is a design consideration.

MrRiverGuy brings a good point about "data stewardship", or, who ultimately owns the company and vendor data. They should work closely with you as you assemble the "family" records.

-------------------------
The trouble with doing something right the first time is that nobody appreciates how difficult it was - Steven Wright
 
I agree on the xref table. That was what I mentioned in my reply.

The company we are working with is working closely with us on the issue. We know where the Vendor information is and can match them. The issue is only that the VendorID or RetailerID would be different coming from each database for the same Vendor or Retailer. And do I create 5 records for the same Vendor in the Dimension table each having its own Surrogate key which is used in the fact table and each with and alternate key from the source system and a sourceID to determine the db it came from). This is what dkyrtata mentioned and I have seen this in other examples but not for records that would be duplicates - just for tracking where the data came from.

I am not sure whether this would mess up the cubes or whether an extra xref table would be the correct method.

Thanks,

Tom
 
One rule of thumb to remember in DW is that we NEVER rely on an external system to privide our unique keyfields. NEVER. So, all the Customer Families would get their own unique key. So, all the Walmarts would get a new surrogate key, a unified key across all Walmarts while we'd retain the source system keyfield as well. The question would be whether we'd add a column into the Customer table to hold the new (unifying) surrogate key, or have a cross reference table mapping them. Or both. From a performance perspective, less joins are better.

-------------------------
The trouble with doing something right the first time is that nobody appreciates how difficult it was - Steven Wright
 
tshad

I think your proposal of using the 2 xref tables to point you to a single dimension record is reasonable. However I am not clear which of your 5 DBs ends up updating the single dimension record. In other words, which source actually owns it? If all 5 can update it, I would think you would get 5 updates every night with the same information for a given record. That being the case, what if one of your DB's is two days out of date (for whatever reason) and is the last one to update the record? You would end up with an outdated dimension record. If that is a potential risk, I would only accept updates from one DB.
 
dkyrtata

Yes, that would be a possible issue where one could be updated in one DB and not in another. Not sure how we would handle that for the stores if the number should change but it wouldn't happen very often but a store could close and then they reuse the store number but I wouldn't expect that to happen. Vendor and Retailer records shouldn't change at all.

johnherman,

I assume that when you mention all Walmarts would get their own surrogate. Correct, that would be the DimStore table and each store would get its own number.

The xref table would be the one that would hold the different source system ID. The Customer record would use the Surrogate key that comes out of the xref table. This is also where we would retain the Source System ID and the Source ID. We can't have these in the dimension table or we would end up with multiple sets of records in the dimension table. I wouldn't want 3 records with Walmart store# 100 in the dimension table.

Thanks,

Tom
 
Status
Not open for further replies.

Part and Inventory Search

Sponsor

Back
Top