A case for denormalisation?

nathanstevenson

IS-IT--Management
Oct 4, 2002
Hi all,

I was following a thread recently discussing the merits of normalisation at:


I think there is a specific case in which it could be useful to include redundant data: it relates to queries.

Say, for example, you have 15 tables, all normalised down into one linked chain. The problem is that you want field c from table 9 and field a from table 1. To run a query that compares the two, you have to include all the intermediate tables in order to join those two fields.
Let's take it even further. Say that for tables 1-8 there is a "top-down" relationship structure all the way to table 9, BUT tables 1-5 all have direct relationships with table 9 too, and you want to compare them with table 9 directly, without taking the other tables into account.

E.g. which timesheets belong to which assignments, when the chain runs assignment, manager, contractor, job type, job details and then timesheet. It's much quicker if you simply include the assignment ID in the timesheet table, and likewise for the keys of tables 1-5.
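To put that in rough SQL (all table and column names here are hypothetical, invented just for illustration): the first query has to walk the whole chain of intermediate tables, while the second exploits a redundant assignment_id stored directly on the timesheet table.

    -- fully normalised: join through every intermediate table
    SELECT a.assignment_name, t.hours
      FROM assignment a
      JOIN manager m      ON m.assignment_id  = a.assignment_id
      JOIN contractor c   ON c.manager_id     = m.manager_id
      JOIN job_type jt    ON jt.contractor_id = c.contractor_id
      JOIN job_details jd ON jd.job_type_id   = jt.job_type_id
      JOIN timesheet t    ON t.job_details_id = jd.job_details_id;

    -- with assignment_id copied onto timesheet: a single join
    SELECT a.assignment_name, t.hours
      FROM timesheet t
      JOIN assignment a ON a.assignment_id = t.assignment_id;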

Surely it allows for faster DB queries and simpler coding?

What say the jury?

-Nathan Stevenson
Data Analyst
 
I agree, there are cases where query performance is more important than proper normalization. But I don't think we can generalize and create a set of rules for keeping redundant data in the database.

Each scenario has to be taken individually. If you need better query performance and it requires some redundant data or extra look-up tables, then so be it. You just have to make sure that all updates are done properly.

I like to normalize the data and then see how the database performs. If the client(s) complain about performance then I can look at using redundant data to increase performance.

This redundant data will allow faster query performance, but I don't know about simpler coding. Simpler SQL queries, yes. However, you will need to add code to properly update or add all the redundant data. That, to me, means more and/or more complicated code.
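For example, if a manager's name were duplicated onto the timesheet table to speed up reads, every write would need something like this sketch (hypothetical names; the point is that both copies must change together, ideally in one transaction):

    BEGIN TRANSACTION;

    -- change the value at its source...
    UPDATE manager
       SET manager_name = 'J. Smith'
     WHERE manager_id = 42;

    -- ...and remember to change every redundant copy, or the data drifts apart
    UPDATE timesheet
       SET manager_name = 'J. Smith'
     WHERE manager_id = 42;

    COMMIT;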

There may be other things to consider as well. How fast will the database grow? How many users will be accessing at once? What kind of hardware is available? How much reporting is needed?

Bottom line - you have to look at the specific individual needs of your project and do what's best for your project. Thanks and Good Luck!

zemp
 
I definitely agree with regard to tailoring your rules to the situation.

I suppose there is a trade-off when it comes to complexity and development time. For bigger DBs with many users, the speed gained through fewer joins and simpler queries could be offset by the added programming overhead and the complexity of keeping the redundant data up to date.

Nathan
 
You state that you want to denormalize the table, but when you say:

'It's much quicker if you simply include assignment ID in the timesheet table.'

I don't consider that to be denormalizing; it's including a foreign key (the assignmentid) in the timesheet table.

Leslie
landrews@metrocourt.state.nm.us

There are 10 types of people in the world -
those who understand binary
and
those who don't!
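For what it's worth, Leslie's point can be sketched in DDL (hypothetical, simplified table definitions): a plain foreign key column on timesheet is still normalised, because assignmentid is a fact about the timesheet row itself rather than a value copied from another table.

    CREATE TABLE timesheet (
        timesheetid  INT PRIMARY KEY,
        assignmentid INT NOT NULL,  -- a foreign key, not redundant data
        workdate     DATE,
        hours        DECIMAL(5,2),
        FOREIGN KEY (assignmentid) REFERENCES assignment (assignmentid)
    );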
 
trust me. duplicate data = BAD BAD BAD.

troubles in the long run.

you might feel clever now - but there is a reason for all this field theory, dude!

if you still don't believe me, go and read something about relational predicates or Codd's classic paper.

:)
 
I agree that in some cases denormalization for the sake of performance can be right. But such cases are few and far between and must be analyzed on a case-by-case basis.

Before I choose to introduce a duplicate data item, I need to know statistics about the volatility of that data. I'm much more likely to allow duplicate data if I know that the item is typically updated once a month than to duplicate an item that is updated daily.
Good Luck
--------------
As a circle of light increases so does the circumference of darkness around it. - Albert Einstein
 
I do know one thing... BAD BAD BAD = lots of production support. Lots of production support = angry customers. Angry customers = frustrated programmer.
 
topher0303 - I'm not sure what your BAD BAD BAD refers to, but I'm assuming redundant data. Yes, it should be avoided, but there is another equation that you need to take into account.

slow response times = angry customers
angry customers = frustrated programmer

and in some critical system cases

slow response times = system failure

You must realize that it's not a perfect world, and there are times when we have to make trade-offs. Normalization vs. performance can be one of those cases, especially when dealing with multi-valued dependencies, or mission-critical systems where response times are paramount.

Believe me, I insist that initially all databases be normalized, at least to 3rd normal form. But on a case-by-case basis, as other requirements dictate, and after due diligence, we can back off that position when necessary.

Sometimes we must accept the lesser of two evils and deal with it.

I applaud your adherence to the tried-and-true principles of relational database design, but there are times when we must use our experience and our judgement to apply some flexibility in our profession and do what is best for the customer. Good Luck
--------------
As a circle of light increases so does the circumference of darkness around it. - Albert Einstein
 
I have deviated from fully normalized databases from time to time, but those occasions are few and far between. When I do, I make every attempt to keep the integrity code in the database using foreign keys, triggers, etc. It always seems that if you let the application account for the consistency of duplicated data, you end up getting support calls that result in manual data cleanup.
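As a rough sketch of keeping that integrity code in the database (SQL Server-style trigger syntax; the tables and columns are hypothetical): if a customer's name were duplicated onto an orders table, a trigger could propagate updates so the application can't forget to.

    CREATE TRIGGER trg_sync_customer_name
    ON customer
    AFTER UPDATE
    AS
    BEGIN
        -- push the new name out to every duplicated copy on orders
        UPDATE o
           SET o.customer_name = c.customer_name
          FROM orders o
          JOIN inserted c ON c.customer_id = o.customer_id;
    END;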

In my experience, app response issues are often the result of poorly written SQL, an indexing issue, or the programmer trying to do things in the app code that should be handled on the database side.

 
tables that support OLTP (online transaction processing) should be normalized

but OLTP is not all that databases are used for

summary tables that support OLAP (online analytical processing) are frequently denormalized with much redundant data

on purpose

both types of table can exist within the same database

in my experience, summary tables are not "few and far between"

they do not need to be justified on a case by case basis, rather, they are built that way for a reason

OLAP strategy (OLAP does not necessarily mean a half-million-dollar reporting package, it's a concept) allows users to query, for example, a few hundred category sales totals rows for the previous month instead of having to do GROUP BYs on detail sales rows filtered from the millions of transactions in the OLTP table
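a minimal sketch of that strategy, with made-up table and column names: the summary table is loaded once at month end, and users query it directly

    -- run once at month end: a few hundred rows instead of millions
    INSERT INTO monthly_category_sales (sales_month, category_id, total_sales)
    SELECT '2002-09-01', category_id, SUM(sale_amount)
      FROM sales
     WHERE sale_date >= '2002-09-01'
       AND sale_date <  '2002-10-01'
     GROUP BY category_id;

    -- users then hit the small summary table, no GROUP BY over the detail rows
    SELECT category_id, total_sales
      FROM monthly_category_sales
     WHERE sales_month = '2002-09-01';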

if you have lots of users querying sales totals, then redundant data is GOOD, GOOD, GOOD

and as far as triggers are concerned, you don't need them, and in fact they would be wrong, because summary tables are loaded once, e.g. at month end

users love summary tables

i try to keep an open mind about redundancy

rudy
 
I won't pretend to be an expert on OLAP, r937, but the concept you described is not exactly redundant data... it is, in your own words, summary data.

If all the OLAP table did was copy the data over so it could be GROUP'd by there instead of somewhere else, that'd be denormalized, redundant, and plain silly.

What you've described is storing the results of a specific averaging/summing/whatever query so that when someone wants to run that same query again, it's already done.

You're also stating ahead of time that this is a summary, snapshotted at time=X... it's not expected to remain current if someone goes and adds another sale right now; or if it is, it's because adding a sale causes a trigger to re-run that time-consuming query.

Redundant data is simply storing the exact same information in two or more places, and as has been stated here... it is a bad idea except in the very few situations where performance suffers too greatly in a normalized environment.

The problem with redundant data is exactly that: keeping it synchronized through additions, deletions, and updates.

-Rob
 
By all means be flexible. Make sure that your primary data storage/management is compact, normalised, current, and robust. If your day-to-day needs require rapid analysis or querying of the whole dataset by lots of departments simultaneously, by all means produce snapshots of your data in a denormalised form at the intervals you require, and store these locally for ease of use. For god's sake don't compromise on the way that your data is stored, just because that is not the way you want to use it.
 
I should hope you are not housing OLTP and OLAP in the same database. They are two totally different beasts, requiring two totally different design patterns and queried by two totally different tools. If you are throwing a couple of summary tables into an OLTP database, that is not OLAP. Then there is always ROLAP, but then you are sacrificing performance on both your OLAP and OLTP sides.
 