
What is data warehousing? 2


tara

Programmer
May 16, 2000
Hi!
I have to choose a final-year project and I am interested in data warehousing. I know very little about this concept. Please let me know what data warehousing is and what its uses are. I look forward to your reply.
Thanks

 
We are in the motor manufacturing industry and use our data warehouse to support management reporting applications and systems, interrogated through front-end tools such as Cognos, Business Objects and Hummingbird.

Adam.
 
Hi,
I know a company that really hates the term Data Warehouse, because they feel warehouses are big, bulky, cumbersome places where no one can ever find exactly what they are looking for.

When it comes to their data, they know that is an unacceptable situation.
 
I fail to see what the last response to tara's question will do for her.

So, here is a bit of an on-the-fly description of DWH for Tara's sake:

"Extraction,cleansing,aggregation and denormalization of transactional data to a new platform with the purpose of creating sources for reporting and analysis".

Some antagonists:

Ralph Kimball vs Bill Inmon
Virtual data warehousing vs Physical data warehousing
Data marts vs Corporate data warehouse
Real-time data warehousing vs Traditional data warehousing
Coded extraction vs Graphical ETL tools
Star schema vs Snowflake schema

Probably says nothing to you right now, but you will come across these topics.

Extraction:
-----------
Establish which sources and fields from the transactional system are needed to create the new sources in the DWH, and choose how to transfer the data to the DWH (using hand-written code or an ETL tool).
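For illustration, such an extraction step often boils down to a query roughly like this (the table and column names are just made up for the example; an ETL tool would generate something equivalent):

-- Hypothetical extraction: copy only the fields the DWH needs from the
-- OLTP source into a staging area (all names are assumptions).
INSERT INTO stg_sales (order_id, customer_id, product_id, order_date, quantity, amount)
SELECT o.order_id, o.customer_id, l.product_id, o.order_date, l.quantity, l.line_amount
FROM   oltp_orders o
JOIN   oltp_order_lines l ON l.order_id = o.order_id
WHERE  o.order_date >= DATE '2000-01-01';   -- incremental load window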

Cleansing:
----------

Is the data still valid? Are entries missing? The aim is to repair obviously faulty information from the transactional database before it is stored again in the DWH.
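Again purely as an illustration (the rules and names are made up), a cleansing step often comes down to statements like:

-- Hypothetical cleansing of the staging table before loading into the DWH.
UPDATE stg_sales
SET    quantity = 0
WHERE  quantity < 0;                 -- negative quantities are obvious entry errors here

DELETE FROM stg_sales
WHERE  customer_id IS NULL;          -- rows that cannot be tied to a customer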

Aggregation:
------------

Most often, reporting and analysis is not done at the atomic level (each individual transaction/mutation) but on lightly or heavily aggregated data. If sales revenue is needed per customer per month, creating an aggregate target at that level will give much better report performance. The same data is often stored at different aggregation levels for reporting: the higher the aggregation level, the smaller the target and the faster the reporting, but the greater the loss of detail.
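As a rough sketch, the customer-per-month aggregate could be filled like this (the names are made up, and the date-truncation syntax differs per database):

-- Hypothetical aggregate: sales revenue per customer per month.
INSERT INTO agg_sales_cust_month (customer_id, sales_month, revenue)
SELECT customer_id,
       DATE_TRUNC('month', order_date) AS sales_month,   -- e.g. PostgreSQL syntax
       SUM(amount)                     AS revenue
FROM   fact_sales
GROUP  BY customer_id, DATE_TRUNC('month', order_date);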

Denormalization.
----------------

In a transactional system, customer-related data such as addresses are stored in separate tables to reduce redundancy: every change to an address should preferably be made just once. DWH tables can store such details at row level (historic data is usually not updated) to create fewer (and larger) tables, reducing the number of joins between tables and thus increasing reporting performance.
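To make the idea concrete, a denormalized DWH table might look roughly like this (a sketch only; the names are made up):

-- Hypothetical denormalized sales table: customer details are copied onto every
-- row at load time, so reports need no join back to a separate customer table.
CREATE TABLE dwh_sales (
    order_date     DATE,
    customer_id    INTEGER,
    customer_name  VARCHAR(100),
    customer_city  VARCHAR(100),
    product_id     INTEGER,
    quantity       INTEGER,
    amount         DECIMAL(12,2)
);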

Success with exploring the DWH field!

T. Blom
Information analyst
Shimano Europe
tbl@shimano-eu.com
 
blom0344,

What do you mean when you say Virtual Data Warehousing vs Physical Data Warehousing, and Real-time Data Warehousing vs Traditional Data Warehousing? Can you give a more detailed insight into that?

When you say aggregate data to the level based on one's need, say customers per month, one might also want to look at it other ways, say products per month, etc. In that case, do you mean one has to create a separate aggregated table for each level at which the end user might look at the data? I do agree that aggregation enables faster querying, but what if aggregations are required at different levels?

regards,

sridharan
 
Virtual vs Physical

In the first case, the user runs a report that actually triggers a number of queries on separate sources in real time, without connecting to a physical data warehouse with pre-assembled data structures in it. You can argue that this is not data warehousing at all, since no warehouse actually exists, but to the user it may look like getting information from one central place.
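For example, a rough sketch of what the 'virtual' idea could look like in SQL, assuming the platform can reach both source systems from one connection (all names are invented):

-- Hypothetical virtual warehouse: a view that queries the live source systems
-- at report time instead of a physically loaded warehouse table.
CREATE VIEW v_all_sales AS
SELECT order_id, customer_id, order_date, amount FROM erp_europe_sales
UNION ALL
SELECT order_id, customer_id, order_date, amount FROM erp_america_sales;

The user's report queries the view as if it were one table, but every refresh hits the live source systems directly.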

Realtime versus Traditional.

In the traditional set-up there is always a 'lag' between the contents of the warehouse and the latest information based on the most recent transactions in the OLTP system(s). It is quite common to look at yesterday's data, because the update of the DWH only runs during the night. Real-time data warehousing reverses the process: instead of pulling data from the OLTP system, agents run and push the data from the OLTP system to the DWH, so there is no data lag. This is still partly science fiction, but developments are underway.
For a lot of reporting, traditional data warehousing is good enough. Monthly reporting (reports on the first of the month) won't benefit from a real-time approach, because there is no real-time effect involved.

You got me wrong about the different aggregation levels. If I have masses of reports that only show sales revenue for a given region/sales rep, and the user is not interested in drilling down to lower levels like customers, orders or products, the report can run on an aggregate with only region and sales rep as dimensions. This aggregate can easily contain a hundred times fewer records, giving instant reporting speed, while running the same report on the aggregate with five dimensions (almost no aggregate at all) can mean very long refresh times and processing a very large number of records quite needlessly.

In reporting there is always a balance: if you want to be able to drill through the data, you accept slower refresh speeds, because the actual data set is so much larger. Simple revenue reports for higher management can then run on much smaller aggregates.
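As a sketch of the small aggregate I mean (the names are invented, and the create-table-as syntax varies per database); the same report against the five-dimension detail table would scan vastly more rows:

-- Hypothetical two-dimension aggregate for the standard revenue report.
CREATE TABLE agg_revenue_region_rep AS
SELECT region, sales_rep, SUM(amount) AS revenue
FROM   fact_sales
GROUP  BY region, sales_rep;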

Is this enough insight?

T. Blom
Information analyst
Shimano Europe
tbl@shimano-eu.com
 
Sridharan,
You have raised a legitimate concern on aggregation. It is impractical to create an aggregate table for each level of detail (or more accurately, for every COMBINATION of dimension levels). The general workaround, as Blom suggested, is to attempt to know in advance which aggregates might be most important.

However, the ugly little secret of aggregate tables is that it's pretty much guesswork most of the time. If you base them on existing reports, you'd better make sure that is the summary level the users really NEED, as opposed to just what they are getting now. You need good business analysis to attack that problem. But that's a given.

The bigger problem I have with aggregate tables is what I call the problem of entropy: the database only gets more complex; it never gets simpler.

It's easy to create an aggregate table ("Oh, you need sales by product LINE, by WEEK? No problem!"). But once an aggregate table is built, it is very hard to get rid of it, even if you discover it is hardly ever used. ("Maybe it's the CEO that only hits that table once a year. I'd better leave it.")

As a result, more and more aggregate tables get created, each one used by fewer and fewer people.

The problem is that most data warehouse design focuses almost exclusively on SQL reporting. A very reasonable compromise is to create a few (carefully chosen) aggregate tables for SQL reporting, but use an OLAP (On-Line Analytical Processing) tool to give the users all the different levels of summary. OLAP tools let you "slice and dice" through any combination of aggregation levels you want.

Some do that by pre-aggregating totals at all combination levels, others by summarizing up to a certain "base" level (like a single aggregate table) and then summing all the level combinations "on the fly". (This is a very basic description.)
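Very loosely, the "base level plus on-the-fly" variant behaves as if the tool ran something like the following for each slice the user picks (the names are invented; real OLAP engines do this internally rather than through SQL you write yourself):

-- Hypothetical roll-up from a single base-level aggregate to the level the user asked for.
SELECT product_line, SUM(revenue) AS revenue
FROM   agg_sales_base              -- pre-summarized to product / week level
WHERE  sales_week BETWEEN 1 AND 13 -- the quarter the user is looking at
GROUP  BY product_line;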

As long as you keep the details in the warehouse, the users can perform most of their analysis using the OLAP tool but still be able to drill down to find the details when they need them. Don't try to use one tool (SQL) to solve all problems.

Just my 2 cents.
Matt :)
PS. The "vendor" I work for sells both data warehousing tools and OLAP tools so I'm not trying to "push" the discussion in one direction or another. I just see a lot of folks driving "screws" with "hammers".
 
Matt,

I do agree that it's impractical to do aggregations at all levels beforehand. The general belief is that aggregation helps one do faster reporting and analysis. Yes! But it also has its own limitations...

You've said that an OLAP tool will help users "slice and dice" across different combinations of aggregation levels, but these queries still have to run against the database where only detailed data is stored, and unless the database is fast enough for these tasks, the users are again going to experience very slow reporting...

Sridharan
 
I think we have each told part of the ultimate compromise that data warehousing is. I agree with Matt that different aggregation levels are a bit of a blind alley in a way. But with key users needing dozens of reports that actually show the same information every month, for the current month and year-to-date, they want SPEED...
I work with OLAP tools as well, and my first approach was not to give them all sorts of aggregation tables. I pointed out that all the benefits of multi-dimensional analysis, drilling, etc. would be lost with pre-aggregates. Sure, they said, we like that, but it should be fast too...
Waiting two minutes for new data to be displayed causes managers to start questioning whether the tool at hand was such a good idea after all...
In my opinion there are two basic types of reports.

1. The periodically recurring standard report.

There are lots of these in house; the structure stays the same, it just displays new data. It is hard to sell to a user that these will be slow every time he wants a refresh.
Here aggregates at database level can be valuable.

2. The ad-hoc report based on unexpected demand.

There is no way of telling what users will come up with next, so there is no point building aggregates. They should run it on a detailed source and accept that it can be slow. Usually this is acceptable, because we are dealing with a possibly unique event.

In fact, asking users what they want and expect is quite a hopeless mission in the first place. Only extremely structured thinkers hand you a list of requirements. Most of the time you build, they comment, you adapt, and so on...
Setting up a company-wide DWH from scratch has failed at my company before and seems to have a very low success rate overall. To get people to accept the DWH concept you have to go for SPEED, otherwise you'll lose them along the way, and you can't do that without some level of aggregation...

T. Blom
Information analyst
Shimano Europe
tbl@shimano-eu.com
 
Sridharan and T,
There are many flavors of OLAP tools (ROLAP, HOLAP, MOLAP, E-I-E-I-OLAP). Not all of them require pre-aggregation or reading the data from the detail level every time. If you don't like what yours is doing, look at some other options.

In my experience, as long as you use the OLAP tool for its intended purpose (FAST analysis of many variations) and SQL for what it does well (detailed reporting), you can have the best of both worlds AND keep the customers happy.

Matt :)
 
Hello Matt,

I am sure there are very fancy flavors of OLAP around. But we live in a real world where it takes an AWFUL lot of persuading to draw users away from their habit of re-inventing the wheel with custom-made SQL queries hogging the OLTP systems.
The concept of using a data warehouse as a source for reporting and analysis is not something that is easily swallowed, especially by key users who sometimes derive status from the fact that they know their way around SQL. I would not even dare explain the alternatives to them in theoretical terms (the only way to persuade is to show flashy results).
Anyway, what would your solution look like for very standard monthly reporting on, let's say, sales, without using any form of SQL (perhaps with one of the OLAP flavors)? Even if you do not see the SQL, is there no SQL generated beneath the surface...?

T. Blom
Information analyst
Shimano Europe
tbl@shimano-eu.com
 
My view is to hold the data in the DWH de-normalised (at its actual level, with no aggregation) and use that for a while.

THEN, if you start getting problems or delays with your aggregated reporting, create summary tables at the level required for the reports that are causing the problems/delays.

Just my opinion...

Regards,
Matt

Matt Jenkins
Information Engineer - EDS
 
Hi T,
I agree, data warehousing is not easy or inexpensive. And if you want to get your users thinking of alternatives to direct SQL reporting from OLTP, then a flashy demo of a good OLAP tool, with the users' data, is a good way to do it.

As for your question on standard reporting (e.g. Monthly Sales) without any SQL, there is some SQL involved. The question is, how much?

My favorite is a HOLAP solution. In your sales example, a SQL query loads up sales into a cube at some slightly aggregated level of detail; say by sales rep, by customer, by product, by month. The users can then slice and dice in the cube to their heart's content at that level of detail or any combination of higher levels (e.g. Branch office, by Customer, for all products, for the last 3 quarters) without any SQL or any hits on the database. The cube automatically aggregates and displays the results (usually with nice, pretty graphics). One cube can replace about a thousand reports.

If the user needs to see specific detail they can "drill through" to a supporting report that automatically filters to the level of detail they were examining in the cube (e.g. "Just show me the specific order lines for 50 megawatt widgets sold to Bob's Fish and Tackle during the first 2 months of this year.")

The nice thing about this approach is that SQL is only needed to create the cube (once a month, once a week, whatever) and whenever the user needs to see specific detail. Cubes won't ever replace all reporting, but they can sure eliminate about 80% of it.
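To make that concrete, here is a rough sketch of the two places SQL shows up in such a set-up (all table, column and filter values are invented for the example):

-- Hypothetical cube-load query: the cube's base level is rep / customer / product / month;
-- the OLAP tool aggregates everything above this level itself.
SELECT sales_rep, customer_id, product_id,
       DATE_TRUNC('month', order_date) AS sales_month,
       SUM(amount) AS revenue
FROM   fact_sales
GROUP  BY sales_rep, customer_id, product_id, DATE_TRUNC('month', order_date);

-- Hypothetical drill-through query: automatically filtered to the slice being viewed.
SELECT *
FROM   fact_sales
WHERE  customer_id = 123                -- the customer the user drilled on
  AND  product_id  = 456                -- the product the user drilled on
  AND  order_date BETWEEN DATE '2001-01-01' AND DATE '2001-02-28';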

Now, in theory, you can use this approach without building a data warehouse, but just as your reports will be better and cleaner in a warehouse, so will your cubes.

As always, just my opinion.
Matt :)
 
Hello Matt,

Since I use Business Objects, I know about the slicing and dicing / multidimensional analysis. The point is that while this is met with much enthusiasm during presentations, in reality most people seem to favor very complicated but static reports. A typical sales report consists of fields like month, year-to-date, month last year, year-to-date last year, fields with the difference between current and last year, etc. This can easily lead to 20 or 30 columns of data (very often one is just rebuilding the format they used in Excel). No manager here favors a dynamic approach; they want it fast and exactly the same as last time. Fast means a limited execution time (fast SQL), and a pre-aggregate with 50,000 records wins over the detail table with 2,000,000 records hands down.
To make matters worse, there is the issue of paper consumption escalating with the advent of the computer. Every report is sent to the printer, and every manager starts making his notes on it in pencil. That works fine with a static report, but is kind of useless with drilling.
The technology is there; it seems people are not ready yet...

T. Blom
Information analyst
Shimano Europe
tbl@shimano-eu.com
 
Perhaps you should look at a different tool. Check out PowerPlay and Impromptu by Cognos. Everything you describe is very common and the users seem pretty comfortable with it.

If you can't change BI tools then SQL access directly against aggregate tables may indeed be your best option. Sorry.
Matt
 
Can you please tell me whether there is any difference between an ERP and a DWH, or are they complementary to each other?
 
Hello Guru,

An ERP system is a transactional (read/write) system aimed at processing day-to-day transactions and producing such things as orders. The DWH is a read-only (for end users) database which can contain information from the ERP system (historic information), along with information from other systems. It is not a matter of the two systems being complementary; they serve different purposes. ERP (and other OLTP systems) deal with real-time mutations, while the DWH contains 'dead' information.

Imagine going to a library, being allowed to look up information with a very advanced indexing system, but not being allowed to write a book yourself or change the contents of the books available. This is the DWH in essence. Now write a book, get it published, and one day it will end up in the library. The process of writing it and getting it published can be seen as some sort of OLTP (ERP) action; the copy of your book in the library can be read by others. Changing its contents means making changes to the first edition, publishing a revised second edition and asking the library to replace the first edition with the second. That would be a rare occurrence of updating your data warehouse for a specific topic. To stay with the analogy:

Data warehouse = Library
ERP = Writer/Publisher

T. Blom
Information analyst
Shimano Europe
tbl@shimano-eu.com
 
Thanks Blom,

That made my life easy. It was a really good analogy; it helped me understand the difference better. Can you suggest a URL or book that is a good starting point for learning DWH?

Thanks,

Gururaj
Digital Globalsoft. (Digital EQ. Corp.)
Bangalore, India.
gururaj_blr@hotmail.com
 