Tek-Tips is the largest IT community on the Internet today!

Members share and learn making Tek-Tips Forums the best source of peer-reviewed technical information on the Internet!

  • Congratulations IamaSherpa on being selected by the Tek-Tips community for having the most helpful posts in the forums last week. Way to Go!

Using a data warehouse as a source system? 5

Status
Not open for further replies.

danrobertson

IS-IT--Management
Jan 20, 2006
3
IE
How strict should the architecture principles placed on the DW be?

The classic principle that I like to assign to a DW architecture is that it is there for information reporting/analysis and definitely not for information processing/feeding to other systems. The DW is a target system, never a source.

However, I occassionally have a dilemma when a requirement appears for another system that wants to take advantage of the cleansing and integration that has been performed on data in the warehouse. It is difficult to argue that this requirement should be fulfilled independently straight from the source(s) into the new target when that would require replicating the integration and cleansing that has been performed by the DW. I would bend the rules at this point and allow the new target system to use the ODS as the source.

My architectures are almost always Staging Area, Operational Data Store, DW/DM:
- Staging Area would be a copy of source files
- Operational Data Store would be a normalised, relational store of integrated data (based on similar subjects)
- Data Warehouse / Data Marts would be aggregated, star schema stores

The real 'rule-bender' comes when a requirement comes from a target system that wants to take an export from a data mart. In this instance the full length of the DW ETL process has become a dependent part of this new target system. I often find this situation when the request has come from Finance who want to populate their General Ledger (GL) via this route. Their process and reasoning is that this way they can examine the data first before posting it to the GL and if they question any data they have the depth of the DW to drill into it and get their answers (this depth of data would not be available in the GL). My dilemma here is whether I provide the feed direct from the DW (which has been aggregated to the appropriate level, or whether I insist the feed goes from the ODS).

Should I say no to all these requests on the grounds that the dependent system/target will be subject to the validation routines on a bunch of data that it does not need and therefore may not load one day when a piece of data it is not interested fails a check?

What opinions do people have?

 
If they are populating a GL with this data how can they accurately report on the data? That is, unless they are not entering GL data from any other source or their are no Journal entries added at any point outside of their feed from the DW.

This actually seems backwards. Their GL should be populated, they can make entries and corerctions whenever needed and the ETL process pulls the data for reporting.

This way it can be used to see exactly whats in the system. I would fear saying my data is accurate if anyone has access to altering the GL at any time. If they can alter the GL it throws out the "one version of the truth" rule since it is being pushed data from the DW instead of the DW pulling in the data that is there.

 
Your job is to support the business. If business requirements call for a violation of your architecture and its unavoidable, you must change your architecture or standards. The need for clean data by finance is a business requirement, and should be addressed as such. Do it, but make sure they realize that they are now dependent on the DW processing and quality standards, and their data could be delayed or altered when they see it.

It is not unusual for transaction systems to get data from the data warehouse. Consider a datamart which is used to determine rates for auto insurance. The model reads the DW and determines the rates needed to have a good probability to make money. Those rates can then be exported to a transaction system which uses the new rates for customer quotes.



-------------------------
The trouble with doing something right the first time is that nobody appreciates how difficult it was - Steven Wright
 
Thanks for both of these responses.

KingCrab, you make some valid comments and I entirely agree that the DW should be positioned as a company's single source of the truth for reporting.

To add more context to the scenario in response to your question about how the DW can accurately report on the data; The DW loads data into the GL from a source that has a history of providing 'suspect' data. The DW obviously holds more detail than the GL, so any issues with the data can be investigated in detail (and with the investigative flexibilty of the DW) before it is loaded into the GL. This is the key reason for loading the GL from the DW.

If the source data was OK the appropriate process should be:
The data is loaded from source trnxn system directly into the GL at the slightly aggregated level of detail that the GL requires, but the detail would still be loaded directly into the DW (from source trnxn system). Once the GL finance processes have completed month end activities, any adjustments (at aggregated level) are loaded into the DW.

So the key difference in the scenario I am talking about is that the direct feed is removed to allow for a 'quality' check in the DW, and once the month end finishes the GL feeds the financial adjustments into the DW.


Johnherman, your response is interesting and reflects the challenges that I have had in the past from business teams. I agree that the DW team's job is to support the business, however in my experience the business do not always know what the best technical solution is!...

I would use all the facts you state in your response but re-word them slightly to tell the business that because of the dependency that is created and because of the data integrity issues this data should not come in to the DW and potentially be shared with the rest of the business, instead the issues should be resolved at source otherwise it creates more work in the long run both in terms of additional validation routines and manual reconciliation activities.
 

Dan, I'm sorry if you interpreted my remarks to imply that the business should design your technical solution. No way should you ever let that happen. The business, however, drives the technical solution with their requirements.

However you want to deal with the data quality issues and the latency between data origination and its reaching a level of quality acceptable to the business is up to you. If you don't want the DW to be part of the cleansing cycle, then so be it. But I would not let dirty data into the prod uction are of my data warehouse; either its already clean or I would scrub it in the staging area(s).

I like the King's response regarding a single source of the truth, but I'd like to take it one step further and propose that the company's metadata and business rules drive the single source of the truth. If the metadata and rules can be applied throughout the business, then all systems are the source of the truth. This situation is, however, lacking in reality for the early 21st century. Ten or twenty years from now, it might be reality. In the absence of metadata driven systems, I'd be glad to accept the DW as the single source of truth.

-------------------------
The trouble with doing something right the first time is that nobody appreciates how difficult it was - Steven Wright
 
OK, how about viewing this issue from another angle:

Q. Is it possible to categorise, or place a 'rule of thumb' on the types of application that should be allowed to feed from the DW's cleansing/integrating processes?

A. Maybe it would be sensible to put some principles in place to get the right balance of usage (and to stop the design getting out of control and unmaintainable in the future) and stating something like the following:
> Downstream systems should only extract data from the ODS/Staging Area (on a read only basis)
> Downstream systems must NOT be on-line transactional processing systems
> Downstream systems can be 'management transactional' systems (such as planning/budgeting, Accounts payable, General Ledger).

OR, should it simply be:
If another system comes on-line and it requires some data from a single or multiple source systems and the data warehouse:
> ALREADY has ALL of that data (cleansed and integrated);
AND
> the new system is not impacted by the latency of the data in the DW;
AND
> the new system is happy to take this data from a READ ONLY location
THEN
> the new system should take advantage of the work that the DW has already done.

So, I am suggesting that the DW should not create new data cleansing/integrating routines simply for other transaction systems (unless the DW would be planning to use this data for reporting) and it should not be used just because it is there and the data latency some how impacts the new system.

I'm comfortable with this last set of rules, can anyone see any issue with this?
 
I like your last set of rules very much.

I would also, however, suggest that cleansed data might be of use to the OLTP system or to the Data Steward(s) of those systems. Seeing the corrections made to the data might help them identify a bug or missing edit in their processes, etc. In any case, I do not recommend a direct link to the OLTP from the DW.

-------------------------
The trouble with doing something right the first time is that nobody appreciates how difficult it was - Steven Wright
 
Status
Not open for further replies.

Part and Inventory Search

Sponsor

Back
Top