
Agile Data Warehousing?

Status
Not open for further replies.

dmcmunn

Programmer
May 20, 2003
85
US
Given the near-chaos complexity of building business-intelligence-enabled enterprise data warehouses in today's rapidly changing multi-organization, multi-merger/acquisition, multi-national, multi-lingual corporate world, what agile methodologies or practices have practitioners successfully applied to produce a meaningful, incremental data warehouse architecture capable of rolling with the punches with a minimum of rework?

Suggestions for some practical books, websites, authorities or other resources would be much appreciated.

Thanks in advance.
 
The most agile we can be is to capture, in a basic star schema, all dimensions and facts at their most granular level. This permits us to aggregate, forecast, and analyze (data mining, OLAP, etc.) in whatever flavor-of-the-week third-party or custom software is in vogue.
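As a toy illustration of why storing the atomic grain buys that flexibility, here is a minimal Python sketch (the fact table and column names are invented for the example):

```python
from collections import defaultdict

# Hypothetical granular fact rows: one row per order line (the grain),
# keyed to date and product dimensions.
fact_sales = [
    {"date_key": "2003-05-01", "product": "widget", "qty": 2, "amount": 20.0},
    {"date_key": "2003-05-01", "product": "gadget", "qty": 1, "amount": 15.0},
    {"date_key": "2003-05-02", "product": "widget", "qty": 3, "amount": 30.0},
]

def aggregate(facts, dim, measure):
    """Roll granular facts up to any dimension on demand --
    possible only because the facts were stored at the lowest grain."""
    totals = defaultdict(float)
    for row in facts:
        totals[row[dim]] += row[measure]
    return dict(totals)

# The same atomic rows serve daily totals, product totals, or any
# other rollup a future tool asks for.
by_product = aggregate(fact_sales, "product", "amount")
by_date = aggregate(fact_sales, "date_key", "qty")
```

Had the facts been stored pre-aggregated by day, the rollup by product would already be impossible.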

-------------------------
The trouble with doing something right the first time is that nobody appreciates how difficult it was - Steven Wright
 
Well,

a star schema at the most granular level allows for the most flexibility on the reporting side (the datamart). For the ETL process, however, this is not necessarily the case.
I read dmcmunn's question as: how do you architect your ETL process so that it stays flexible when a new source system (or a change to a source system) needs to be added?

dmcmunn, can you comment on this?

Oh, in addition to johnherman: some practical guidance can be found in Ralph Kimball's Data Warehouse Toolkit books (the ETL Toolkit and the Lifecycle Toolkit).
 
Hans63,
I have read the ETL book from the Kimball Group and the MS DW Toolkit as well. Both very helpful reference guides for practitioners. I believe many of the agile software development tenets such as:
(1) short(2-4 Week), regular time periods between deliverables,
(2) initial broad proof of concept,
(3) team prioritized work queues,
(4) continuous customer involvement,
(5) develop functional tests
can be very helpful for all aspects of data warehousing projects (including, as you point out ETL, where much of the heavy lifting of a data warehouse / BI project is done).

I worry that of late there have been vendors purveying alternative BI solutions to avoid the perceived "headaches" associated with data warehousing (data cleansing, encrypt/decrypt, consensus business definitions), suggesting instead that the end result is achievable with much less formality and structure. The "quick fix". I believe agile data warehousing has the potential to provide better-quality results in such an environment.

What I am interested in learning about are the proven tips and methods that have been used in practice with desirable outcomes, versus those that have failed and why. Incidentally, I agree with John that dimensional approaches have proven the most flexible when there are known business questions to answer. However, in the case of predictive analytics or data mining, relationships may be unknown at the time one seeks an answer. I digress. What I am interested in is what has worked in practice, not in theory.
 
My comments on the tenets you wrote down:
1) Do you mean you should have 2-4 weeks to implement a change from inception to production? This would seem very short for some cases.
2) Broad, but not too broad. I had to deal with a POC where ALL parts of the DWH had to be implemented. This was disastrous. We implemented this stage by stage for all source tables and all datamarts. You should build a POC for a small set, but with enough added value for the organisation. Allow your team to learn from past mistakes without having to redo a major part.
3) No comment, as I do not readily understand what it means.
4) MUST have. Without this, chances are that your DWH will not be used (as much as it could be).
5) MUST have. Test every stage in the ETL process. Build a separate set of SQL scripts that should do the same transformation your actual code does. This is time consuming, but really pays off.
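A minimal Python sketch of that parallel-test idea (the transformation rule and column names are hypothetical; in practice the reference implementation would typically be a separate SQL script):

```python
# Implement the same transformation twice -- once as the "real" ETL code,
# once as an independently written reference -- then compare row for row.

def etl_transform(rows):
    # Production transformation: derive a margin column.
    return [dict(r, margin=round(r["revenue"] - r["cost"], 2)) for r in rows]

def reference_transform(rows):
    # Independent check, deliberately written a different way.
    out = []
    for r in rows:
        r2 = dict(r)
        r2["margin"] = round(r["revenue"] - r["cost"], 2)
        out.append(r2)
    return out

def compare(actual, expected):
    """Return a list of (row_index, actual_row, expected_row) mismatches."""
    return [(i, a, e) for i, (a, e) in enumerate(zip(actual, expected)) if a != e]

source = [{"revenue": 100.0, "cost": 60.0}, {"revenue": 80.0, "cost": 85.0}]
mismatches = compare(etl_transform(source), reference_transform(source))
```

An empty mismatch list is the pass signal; anything else points at the exact rows where the two implementations disagree.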

For a description on how I have implemented my latest DWH, see this topic: Kimball's recommendation for staging.
 
OK. I confused agile development techniques with flexible DW designs.

1. As far as short periods between deliverables; nice if you can do it. Depends on your level of staffing and the degree to which you can get business to agree on requirements well in advance of the development cycle. Short is preferred, however, since a change in business strategy or in the marketplace can require rapid changes to Decision Support systems like DW.

2. Broad, enterprise-wide requirements and design, but everyone agrees that the POC for a DW should be a small subject area with a measurable ROI. This gets the momentum going.

3. Nice if you have a large staff. I can rarely keep 2 development paths moving at one time.

4. Absolutely. The DW is a Business project, not a Technology project. See also #1 regarding changes in the market and strategy.

5. I find that most of the testing involves validation and verification of counts and amounts between source systems, intermediate structures (staging, ODS), and star schemas and datamarts.
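That count-and-amount validation between stages can be sketched as follows (the stage names and the amount field are placeholders for real staging tables and datamarts):

```python
# Reconcile row counts and summed amounts between two pipeline stages,
# e.g. staging vs. datamart, in the spirit of point 5 above.

def totals(rows, amount_field="amount"):
    return {"count": len(rows),
            "sum": round(sum(r[amount_field] for r in rows), 2)}

def reconcile(stage_a, stage_b):
    """True when counts and summed amounts agree between two stages."""
    return totals(stage_a) == totals(stage_b)

staging = [{"amount": 10.0}, {"amount": 5.5}, {"amount": 4.5}]
datamart = [{"amount": 10.0}, {"amount": 5.5}, {"amount": 4.5}]
ok = reconcile(staging, datamart)
```

In a real DWH the same check would run per load batch, between source extracts, staging/ODS, and the star schemas.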



-------------------------
The trouble with doing something right the first time is that nobody appreciates how difficult it was - Steven Wright
 
An addition to point 5 - testing.

Full-scale testing proves to the business that the data in your DWH is in fact correct, even if the outcome of some queries or reports is not what they expected. Testing thus gains consensus among your clients (business analysts, managers, etc.), which is very important.

If your business is not ready to trust your DWH, you keep explaining the differences to them (usually by pointing out data inconsistencies in the source system). This eats away valuable time.

I can only stress this: test every stage, every transformation.
 
johnherman and Hans63,

First, thank you very much for your very thoughtful responses. This is just the sort of dialogue I would like to pursue.

I would suggest that getting agreement on requirements well in advance of the development cycle is nearly impossible in all but the simplest of data warehousing / DSS projects due to complexity, lack of decision-maker understanding, poor knowledge transfer from business to DW team or even simple requirement misinterpretations, resource re-allocation, budget fluctuations, mergers / acquisitions, and as you mentioned new business rules. These are precisely the reasons we must find ways to "test" for these discrepancies as early as possible in the process. I absolutely agree with the use of a POC, especially with new technologies combined in new or unfamiliar ways, but not POC in the more generally accepted, mainstream DW view. I will call it a "DW POC micro-cycle".

1. Prove the technology, architecture, and function, first.

If unacceptable, rework and iterate until success is achieved.

2. Require integrated testing and quality feedback communication in every milestone step of the process (as suggested by Hans63 above, too). Milestones may be defined through previous points of failure, or purely based on experience, or even to meet regulatory and/or compliance requirements (Sarbanes-Oxley, statutory, federal, Medicaid, copyright, etc.).

Testing means nothing if no one monitors and acts on the results, so test results must be published and viewable by the appropriate team members (DW developers, DW managers, source system developers/managers, etc.), and even summarized for the customer into an overall DW quality assessment.

Every inbound atomic piece of data, every data stream, every transformation, every user access, every query, every data profiling run, every business rule change... where does it end? How do we decide when "enough is enough" when it comes to testing? The cheap answer is: "when the desired quality reaches an acceptable level". So we should have quality thresholds integrated into our DW solutions to serve as early-warning bellwethers at the boundaries of, or even in the bowels of, the data warehousing machinery. Not just counts of inbound rows and summarized dollars, people, and transactions, but ways to move data into the data warehouse and still record its quality scores (as described in "The Data Warehouse ETL Toolkit" by Kimball and Caserta). This calls for a data warehouse quality decision support system, eh? But I'm way off on a completely different rabbit trail now...
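The quality-threshold idea can be sketched roughly like this (the completeness rules and the 0.95 threshold are assumptions for illustration, not a prescription):

```python
# Score each inbound batch against simple rules and record the score
# alongside the load -- an early-warning bellwether rather than a
# hard reject of the whole batch.

def score_batch(rows, required_fields=("customer_id", "amount")):
    """Fraction of rows passing all completeness rules (1.0 = perfect)."""
    if not rows:
        return 0.0
    passed = sum(
        1 for r in rows
        if all(r.get(f) not in (None, "") for f in required_fields)
    )
    return passed / len(rows)

def load_with_audit(rows, threshold=0.95):
    """Load the batch but attach its quality score; warn below threshold."""
    score = score_batch(rows)
    return {"rows": len(rows),
            "quality_score": round(score, 3),
            "warning": score < threshold}

batch = [
    {"customer_id": "C1", "amount": 10.0},
    {"customer_id": "C2", "amount": 5.0},
    {"customer_id": None, "amount": 7.0},   # fails completeness
]
audit = load_with_audit(batch)
```

The audit record is what would feed the published test results and the customer-facing quality assessment described above.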

If you can't test it, you shouldn't be doing it. As Hans63 pointed out...it probably isn't going to be easy or cheap to produce a robust testing culture in a DW effort. As you point out, testing of ETL counts and hash totals is necessary, but insufficient. Agile purists with whom I've worked have said, "Testing should be done only to the level required by the customer." (Translation: what they are willing to pay for via time/resources!). I am certain the establishment of professional testing (technical, performance, functional) consultant tiger teams for data warehouse architectures / solutions is a field ripe for the taking by the experienced practitioner!

There are many more items to be discussed here, but let's delve a bit into the POC before I ramble further...

PROOF OF CONCEPT

Pre-requisites:
1. All system and application software and hardware have been installed, tested, and are functioning correctly. There can be a HUGE amount of elapsed time from start to finish, in addition to the hands-on effort required to get to this point.
2. Point persons (names, phone, email) for all required data access and authentication needs have been identified.
3. Network, server and application credentials and accounts have been established, documented and shared as necessary.

A suggested POC...

I would suggest approaching the proof-of-concept phase of a new DW / BI effort with a traditional broad-but-shallow micro-cycle of 2-4 (perhaps 6?) weeks, typical of agile development.

1. Identify the highest-value goal attainable in the desired micro-cycle that will allow the team to:
a. Identify a single, data source central to the effort
b. Confirm security credentials are accurate, complete and provide the required permissions to this data source
c. Exercise the ETL tools, applications, adapters, and activity logging to retrieve, stage, and load the data source into a mock staging area, tracking each step.
d. Exercise the version control mechanism chosen for the data warehousing platform to "check-in" all ETL scripts, test scripts, report definitions, database DDL, and any/all other project documentation.
e. Deliver a simple data quality assessment of this single source of data.
f. Install and configure automated data warehousing communication / notification mechanisms and confirm they are functioning at least at some minimal level.
g. Design, develop, test, and deliver a simple but formatted master data report using the main information delivery tool chosen for the largest population of information consumers.
h. Configure the information delivery tool authorization scheme to allow a single information consumer to view the report and/or export the report into one of the tools (if any) required by power users or other downstream applications (predictive analytics, statistical package, or SOA-style apps) in scope for the project.
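Step (c) above in miniature: a hypothetical retrieve/stage/load pass that records each step in an activity log (the in-memory source and step names stand in for real adapters and tables):

```python
# Move one source through retrieve -> stage -> load while writing an
# activity log entry for every step, so the POC proves both the data
# path and the tracking mechanism.

activity_log = []

def log(step, detail):
    activity_log.append({"step": step, "detail": detail})

def retrieve(source_rows):
    log("retrieve", f"{len(source_rows)} rows read from source")
    return list(source_rows)

def stage(rows):
    staged = [dict(r, staged=True) for r in rows]
    log("stage", f"{len(staged)} rows written to mock staging area")
    return staged

def load(rows):
    warehouse = list(rows)
    log("load", f"{len(warehouse)} rows loaded")
    return warehouse

source = [{"id": 1}, {"id": 2}]
warehouse = load(stage(retrieve(source)))
```

The point of the POC version is not the trivial data movement but confirming that every step leaves an auditable trace.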

As you can see, we will have performed a relatively broad scope of functions to provide a credible POC. What else should be done (or left out) to improve this based on your experiences?

Resource Criteria:
Three (3) highly skilled, experienced resources, trained on the tools, techniques, and applications, must be able to complete these tasks in less than 4 weeks with ancillary assistance from supporting staff.

Alternatives:
If they can't do it in less than 4 weeks, make sure the scope hasn't crept, or split the process into two portions:
1. one internal set of deliverables to the team, and
2. one with functional deliverables the customer can identify.

This allows confidence in (or reassessment of) the infrastructure and architecture prior to investing many more hours required to deliver an entire subject area.

There are many more tasks to consider, these are only based on my experience.

I look forward to your responses.

Best regards,
dmcmunn
 
Sorry for the delay.

I take it that your POC is done at (the start of) development.
You can also include some data-quality checks in the daily (nightly) run. I have not done any checks other than counting (and summing).

As we had all source systems in maintenance by the same organisation, we were able to shift the data-quality issues back to the source system. We were however able to show them where exactly the data-quality was an issue. This sometimes led to changes to the source system.
We always allowed erroneous data to move all the way to the datamarts (but flagged it in the datamart). We did this because usually only one attribute of a record is wrong. So a few reports may be off by a couple of records, but many other reports still show the right figures.
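A minimal sketch of that flag-but-keep approach (the validity rule and column names are invented for the example):

```python
# Let rows with a single bad attribute flow through to the datamart,
# but mark them so reports can include or exclude them deliberately.

def flag_rows(rows):
    out = []
    for r in rows:
        r2 = dict(r)
        # Illustrative rule: a negative amount marks the row as suspect.
        r2["dq_flag"] = "suspect" if r.get("amount", 0) < 0 else "ok"
        out.append(r2)   # note: nothing is dropped
    return out

datamart = flag_rows([{"id": 1, "amount": 12.0},
                      {"id": 2, "amount": -3.0}])
clean = [r for r in datamart if r["dq_flag"] == "ok"]
```

Reports that need pristine figures filter on the flag; everything else still sees the full record set.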

For the rest:
Your approach to the POC looks pretty thorough. This is good for quality, but bad for getting management support as it takes a long (what is long?) time to get going.

So far for my 2 cents.
 