
Advice on designing fact table loading package


Goodie666 (Programmer), Nov 24, 2008
Hello,

I'm designing my first SSIS packages for a small data warehouse I'm building as a sample project. I've already designed my dimension packages, which aren't much of an issue.

Where I'm having trouble is in my fact table package. My fact table contains a number of FKs pointing to dimension tables, and I have to handle early arriving facts. My approach is to insert an empty row for the corresponding dimension entry: I know the dimension data will eventually arrive, I just can't assume the order in which entries show up. So, to determine whether I already have what I need in my dimension tables, my fact package has a lookup for each dimension.

The flow is: Lookup --> error output to an OLE DB Command that calls a stored procedure to insert the empty dimension entry, then a second Lookup, then a Union All to rejoin the main data flow. I repeat this for every dimension I have to look up. I'm wondering whether this is an efficient way of doing things (I use caching in my lookups), and whether people have other ways of achieving the same goal with a less complicated data flow.
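
For reference, a minimal sketch of the kind of stored procedure such an OLE DB Command might call; the table and column names here are hypothetical:
Code:
-- Hypothetical example: inserts a placeholder row for an early arriving fact.
CREATE PROCEDURE dbo.InsertInferredDimMember
    @BusinessKey VARCHAR(50)
AS
BEGIN
    SET NOCOUNT ON;
    -- Only insert if the member is not already present.
    IF NOT EXISTS (SELECT 1 FROM dbo.DimCustomer WHERE BusinessKey = @BusinessKey)
        INSERT INTO dbo.DimCustomer (BusinessKey, Attribute1, Attribute2)
        VALUES (@BusinessKey, 'UNKNOWN', 'UNKNOWN');
END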

Many thanks,

Greg
 
Do you use staging tables on the same database server? If so, the following Execute SQL Task may be more efficient:
Code:
INSERT INTO DimensionTable (BusinessKeyColumn, Attribute1, Attribute2)
SELECT DISTINCT a.BusinessKey, 'UNKNOWN', 'UNKNOWN'
FROM FactStagingTable a
LEFT OUTER JOIN DimensionTable b
    ON a.BusinessKey = b.BusinessKey
WHERE b.BusinessKey IS NULL

That would prepopulate your dimensions before you enter your Data Flow Task. Then, you could use joins in your Data Flow Task instead of Lookups, which may be faster still.

However, it's up to you whether you want to go down the path of debugging a lot of SQL code or maintaining a more flexible, graphical package built from SSIS components.
 
Hi,

That's a very clever trick indeed. I didn't think of it at all; I was so caught up in trying to achieve the result using the built-in transforms.

Another scheme I was considering was to do these lookups in a separate container in my control flow, updating the staging table with the dimension keys; from there I would select the staging data with all the surrogate keys already resolved, so there would be no more lookups/joins to do in the data flow.

I do see the tradeoff in that method, though: as you say, I would have to maintain T-SQL code, but no more than with an OLE DB Command doing the insert on a lookup failure. And I would think the data flow would be much faster to execute without all those lookups. I could have a stored procedure do it for me, keeping as much code as possible managed on the database side rather than directly in the SSIS package (which would be good, since other fact tables use the same dimensions and hence the same trick).
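
As a rough illustration of that scheme, the control flow step could run something like the following; the staging and dimension names are hypothetical:
Code:
-- Hypothetical names: resolve surrogate keys into staging up front...
UPDATE s
SET s.CustomerSK = d.CustomerSK
FROM dbo.FactStaging s
JOIN dbo.DimCustomer d
    ON s.CustomerBusinessKey = d.BusinessKey;

-- ...so the data flow source reads rows with keys already assigned.
SELECT CustomerSK, DateSK, SalesAmount
FROM dbo.FactStaging;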

Thanks a lot for the tip, I'm definitely going to look into it :).

Greg
 
The easiest way to handle this is demonstrated in the "Project REAL" materials.

To handle the early arriving facts you can use the following (a sketch of the stored procedure from step 2c appears after the list).

1) Set the error output of your Lookup tasks to ignore failure.
2) Use a script component that will:
   a. Check whether the SK from the lookup is null.
   b. If null, insert the business key into the dimension table.
   c. To accomplish (b), use a stored procedure that returns the SK created by the insert. Assign this SK value to your fact record and place the key in a List object.
   d. Before inserting a BK, build the key (e.g., with a StringBuilder) and check your List to see whether the BK has already been inserted and assigned an SK.
3) In your dimension processing you will need to add SCD handling to update the dimension records you inserted as part of the early arriving facts.
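
A hedged sketch of the kind of stored procedure step 2c describes, again with hypothetical table and column names; it inserts the inferred member and returns the new surrogate key:
Code:
-- Hypothetical sketch: insert the inferred member and return its new SK.
CREATE PROCEDURE dbo.GetOrInsertInferredMember
    @BusinessKey VARCHAR(50)
AS
BEGIN
    SET NOCOUNT ON;
    DECLARE @SK INT;

    SELECT @SK = CustomerSK
    FROM dbo.DimCustomer
    WHERE BusinessKey = @BusinessKey;

    IF @SK IS NULL
    BEGIN
        -- IsInferred flags the row for the later SCD update in step 3.
        INSERT INTO dbo.DimCustomer (BusinessKey, IsInferred)
        VALUES (@BusinessKey, 1);
        SET @SK = SCOPE_IDENTITY();
    END

    SELECT @SK AS CustomerSK; -- result set read by the script component
END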

There is no need to update your staging tables before moving your data to the fact tables; that would defeat the purpose of using SSIS, as everything is much faster done in memory.
 
Hi MDXer,

I didn't know about Project REAL. I just downloaded the 270 MB-or-so setup file, but I can't find any packages other than the ones handling AMO and partitions. Would you be kind enough to point me to the location of the process you are referring to, please?

#1 and #3 are not an issue for me; #3 is already implemented this way. For #2, I see how to achieve some of it; for the rest I'll look at the Project REAL package and see how it's done. I may go with that, or use RiverGuy's method; I have to admit I rather like it, because it's a process I can easily separate from the rest of the data flow and possibly execute in parallel with other things.

The idea of updating staging tables was not a good one, hence my use of the past tense: I was considering it as a possibility but quickly discarded it for the very reason you mention. Interestingly, my first approach was inspired by the examples in the Kimball Microsoft Data Warehouse Toolkit; I guess it's fine for a couple of lookups, but with more than five the data flow is simply a nightmare...

Regards,

Greg
 
Hi MDXer,

I just found the Word document that details the process you mentioned. Sorry, I don't know why, but it isn't included in the big zip file that contains the actual data, and I'm not sure the SSIS packages I got from it contain that process.

I think I get what you were saying, and it seems like a very decent idea, but I'm not a huge fan of having such big .NET scripts inside SSIS. Arguably this could be an external component, but that becomes more complex to deploy across multiple installs later on, and maintenance would be, IMHO, a bit more complex as well. Mind you, I do think it's a rather smart idea, but I might just stick with RiverGuy's advice for now and see how performance measures up on a sample.

Regards,

Greg
 
Be warned that Merge Joins are significantly slower than Lookup tasks. Also, the size of the script has very little impact on your SSIS process; it's what the code does that impacts performance. I'm all for pushing to the DB what the DB does best, but in the case of lookups I have yet to see a DB assign SK values to fact records faster than an SSIS package.

The only thing you need to control in lookups is how much data you read into the lookup, both in the number of records and in the columns.
1) Limit your columns to those you will introduce to the data flow or those you need to perform the lookup. I have implemented lookups efficiently that contain hundreds of thousands of members.
2) Limit the number of records you bring into the lookup task: use views, and join your stage and dim tables together to bring only the dim members you require into the lookup (a sketch of such a view follows this list).
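
One way such a view could look, with hypothetical table names; it exposes only the lookup columns and only the dimension members actually referenced by the staged facts:
Code:
-- Hypothetical view: only the columns the lookup needs, and only
-- the dimension members actually referenced by the staged facts.
CREATE VIEW dbo.vwCustomerLookup
AS
SELECT DISTINCT d.CustomerSK, d.BusinessKey
FROM dbo.DimCustomer d
JOIN dbo.FactStaging s
    ON s.CustomerBusinessKey = d.BusinessKey;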

Yes, at a basic level you can accomplish the same thing with a Lookup or a Merge Join, but how those tasks work internally affects the performance of your process. Read about synchronous and asynchronous transforms to really understand what the various transforms are doing and when to use them.

Yes, you may be working on a small warehouse today, but what about next year? If you learn to build efficient, optimized processes now, you won't have to worry about fixing them later, or about carrying bad habits into your next, potentially larger, projects.

Good Luck
 
MDXer, thanks for the heads-up. The more I read about SSIS, the more it influences what I'm designing; I guess that's part of the learning process ;). I've looked at the different types of transforms, and I can definitely say it will influence my decisions in the future.

It's easy to get something working in SSIS, but harder to get it working efficiently. When I talked about using joins to load my staging data along with the dimension keys, I meant doing it in the source's SQL statement, not with Merge Join transforms further down the data flow after first fetching the source data from the staging table. Just wanted to make that clear.
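
For instance, the OLE DB source query could look something like this (names are hypothetical), so the surrogate keys arrive already resolved:
Code:
-- Hypothetical source query: joins resolve the SKs before the data flow starts.
SELECT d.CustomerSK, dd.DateSK, s.SalesAmount
FROM dbo.FactStaging s
JOIN dbo.DimCustomer d
    ON s.CustomerBusinessKey = d.BusinessKey
JOIN dbo.DimDate dd
    ON s.OrderDate = dd.CalendarDate;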

Regarding the evolution of my small warehouse, I can guarantee the volume is nothing like Project REAL's data, and never will be. Not that that's a reason to be inefficient, though ;)

Regards,

Greg
 