
Normalization...what are the rules?


gusbrunston
Programmer
Feb 27, 2001
Hi.

One frequently asked question is about storing calculated values. The consistent answer: "Don't do it...it's against the rules of normalization." I'm pretty sure that I understand the reason for not storing calculated values: if one of the factors used in a calculation is changed for whatever reason, the stored result is rendered inaccurate. This is avoided by recalculating every time the target value is displayed, printed or used in another calculation.
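If I have that right, then (using invented OrderDetails field names here) you would do the calculation in the query every time, instead of keeping an ExtendedPrice field in the table:

SELECT OrderID, Quantity, UnitPrice,
       Quantity * UnitPrice AS ExtendedPrice
FROM OrderDetails;

-- ExtendedPrice is derived on every read, so it can never
-- disagree with the Quantity and UnitPrice it came from.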

O.K. Is this the only "Rule of N"? Where do you find them? Access 2000 help refers to normalization as the process of splitting a table into multiple tables to store data more efficiently.

Curious :cool: Gus Brunston
An old PICKer
padregus@home.com
 
The best place, as ever, is google.com. You'll find everything you need to know from thousands of sites.

Storing calculated fields is not really a normalisation issue, although it involves the same principles: how do you avoid inconsistent data, and how do you minimise disruption to applications when you adjust the data model?

The rule is: It's dead easy to normalise. It takes skill to de-normalise.

There are clear advantages to storing calculated data: it allows faster responses and supports more users. You can also search on stored calculated fields, whereas values calculated at run-time force you to extract every record before you can select on them.
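For example (table and field names invented here), a search on a stored total can use an index directly, whereas the run-time version has to compute the value for every order before it can filter anything:

-- Stored calculated field: a straightforward, indexable search.
SELECT OrderID FROM Orders WHERE OrderTotal > 1000;

-- Calculated at run-time: every detail row must be read and
-- summed before any order can be selected.
SELECT o.OrderID
FROM Orders AS o
INNER JOIN OrderDetails AS d ON o.OrderID = d.OrderID
GROUP BY o.OrderID
HAVING SUM(d.Quantity * d.UnitPrice) > 1000;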

Normalisation is a good guide but does not give you the optimum solution.
mike.stephens@bnpparibas.com
 
Gus,
It boils down to an intelligent analysis of the data and its usage--if it's a data warehouse with static data, then by all means denormalize: store redundancies, calculations, etc. If it's a live database where changes are frequent, then the cost/benefit of storing a calculated field or redundancy shifts to argue against it.
--Jim
 
I tend to agree with JimHorton about denormalizing. Sometimes that's better than normalizing a database completely. The idea of normalization is that every field in the table should be dependent on the primary key only. I think it may be more efficient to denormalize when the resulting tables have a one-to-one relationship. Why create a primary key and use it as a foreign key just for the sake of normalization?

I'm a graduate student right now and I'm taking a database management class. This is the fourth database class I've had, and all of the books we've used have covered normalization pretty well. My advice to you would be to go to a library and pick up any database book. The principles of normalization and third normal form were tough for me at first, but then the light went on. There's really nothing to it.

As for storing calculated values: usually it's better to recalculate than to store a calculated value, for accuracy's sake. However, if you need to know the previous answer for comparison after the formula or some variable has changed, you may find it necessary to store the calculated value. It will depend on your project and how you set it up.
 
I would have to disagree with Jim and JJOHNS. I have yet to find a compelling reason to throw out the Forms of Data Normalization. In fact, the more I study their usage in the real world, the more I realize how pertinent and correct they are to follow. The closer you can come to fully normalizing your database, the more efficiently it handles the data. The more efficiently it handles the data, the less likely you are to have update anomalies. An added bonus is performance. I've found that the better a db is normalized, the better the performance.
 
Jerry,
The performance issue isn't true at all. Take an example of a table where, under 3rd normal form, you'd be required to break a field into a new table with primary/foreign key, where there's just a single text field that's being normalized. Now *every* time you need that field, you're doing another disk i/o. So what if you have some extra space involved in carrying a 50-char text field instead of a 4-digit key. Big deal. The extra disk i/o is the killer. I'll spend the money on the hard drive space and break normalization rules for something that minor. When bulk updates to a foriegn field are required, then yes, updating the child table offers a performance benefit. But in the real-world, day-to-day OLTP systems that's less common. In my opinion, this 'ultra-normalization' costs more than it saves.

That said, I do believe that the aim should be normalization, of course--I'm not saying go to flat files, but to go for the theoretical maximum normalization is just not practical. As I said above, what's needed is intelligent analysis when looking at a field or fields that are normalization candidates.

It may be nice to hear a professor--who hasn't been off the ivy-encrusted campus in years--talk and wax theoretical about normalization, but I know of no real-world, mission-critical database that has achieved full 3rd-normal form normalization, and rightly so.
--Jim
 
...to continue (and clarify)...lest it be thought that I'd design a db with dozens of redundancies as in my example above--as I said--analysis is needed. If the field in question has 2 different values, and it's in a customer table of 2 million, then yes, normalize. But if the field has 1.9 million different values out of 2 million, then under the rules you must still normalize--here is where I'd say 'to hell with 3rd normal form' and put the entire field in the table. Otherwise you'd have 2 million 4-digit keys in the customer table, plus 1.9 million 4-digit keys in the child table, plus 1.9 million 50-char text values in the child table.
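In concrete terms (hypothetical table and field names), the two designs look like this:

-- Normalized: the description lives once in a lookup table,
-- but every fetch pays for the extra join.
SELECT c.CustName, ct.TypeDesc
FROM Customer AS c
INNER JOIN CustType AS ct ON c.CustTypeID = ct.CustTypeID;

-- Denormalized: the 50-char description is carried in the
-- customer table itself--one read, no join, redundant storage.
SELECT CustName, CustTypeDesc FROM Customer;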

This example should make it clear that there *is* a line to be drawn at when to normalize and when to not.
--Jim
 
Jim, what you're stating as an example is contrary to the Forms themselves. Your example (which may not be a good one, nor the one you're really trying to invoke) appears to be an attribute and not a candidate entity. I understand in principle what you're trying to say, and I've never advocated going beyond 3NF or Boyce-Codd. But, in general, following the Forms (up to and including 3NF) will give you a much more robust, and often far smaller and more efficient, physical db with much better performance. From an I/O perspective, it is far more efficient to have more tables with fewer fields than fewer tables with more fields.
 
Hi.

Thanks. I'll hold your coats.

:) Gus Brunston
An old PICKer
padregus@home.com
 
Jerry,
I'm not sure what you mean by 'contrary'. If, say, a Customer Type field has the same value more than once in two records in a table, then by the rules it should be normalized. But if (in my obviously exaggerated example) the number of customer types was nearly the number of customers, one would either re-think what 'customer type' really means, or decide not to normalize--even though technically is should be normalized.

But as far as i/o goes, I guess it would depend on what one was doing, for the i/o to be more or less efficient with normalized versus redundant fields. If you're fetching a single record, and using my example, there is no way it can be faster when you have to do the extra i/o (most likely two--an index read, then the table read) to get that extra field. An extra 46 bytes (as per my example) isn't going to require an extra disk i/o to fetch--it will exist in the result set when the sql is run. Bam--the user has data, no waiting to fetch the child table's data. Like I said, after analysis you may or may not decide to take the field out and normalize it.

If I'm looking at a high-traffic table in an oltp app with hundreds of users, I'll look at things like this from the perspective of the disk-head. How many i/o's to fetch the data? Can the dba stripe the data on the outer disk tracks, spreading it across several disks so I've a good chance that no head-movement at all is needed? I've got 500 people waiting in line for the disk head, and I don't want to be like the guy on the phone, with a line waiting--he hangs up, but then he says--wait, I've got another call.

That's how I'd look at normalization in this case--hang up the phone after the first call and let the next guy in. I don't care if the nightly processes take longer due to my decision--if the app dictates that the 500 users getting sub-second response from the table is more important than a nightly process taking an extra 20 minutes...well, that's where the proper analysis comes in, and I don't care if Codd is disappointed in my design--I care that my client is happy because I met a usability goal.
--Jim
 
I'm sure you're aware that SQL DB engines do not work in the same way as traditional flat file I/O. Table size has a much higher impact on I/O than does table volume and/or number of tables.

I also thought that your example was a bit exaggerated and therefore not really a good one to illustrate your point. But, I daresay, you would find that actual throughput would be improved utilizing 3NF over redundant storage of non-normalized data.
 
Jerry,
Yes, it was a bit exaggerated. And I must admit I'm not very familiar with the low-level i/o characteristics of flat-file dbs.

But maybe in making my point you've gotten the wrong idea of how I would design a db...my projects have all been quite normalized; it's just that I don't look at it as a design failure if someone can point out a field that, in theory, should be normalized into its own table, when I had a real-world issue where I felt breaking the rules of theory was the proper avenue to take.

However, I still can't see how--even in a less exaggerated example, where the child table only has a dozen records (the disk head still has to make an extra move)--making that extra fetch to the child table takes less time than just reading an extra field (which most likely is in the same data block) from the original table.
--Jim

 
I think we're both on the same page (for the most part). I think there may be some misconception about where the bottleneck lies. In general, the bottleneck is not so much disk I/O as it is parsing and processing the SQL. DB engines (from large SQL data stores to the lowly Jet) make prodigious use of caching for actual disk I/O. And in a highly normalized data structure, the actual I/O is less because the database itself is physically smaller. This is possible because you have far fewer "empty" fields. Remember, any field that gets assigned takes up space in the underlying table, whether it's holding any real data or not.
 
I have a project I'm working on right now. I'm a database programmer for a hospital (not just a student), so this example comes straight from my job. I'm automating a four-page nurse's assessment for our nursing services dept. The assessment consists of questions the nurse asks about the patient upon admission to the hospital. The assessment is divided into about 20 sections, and each section has about four or five questions associated with it. One section concerns nutrition; another, disabilities; and so on. Following normalization rules, I should have created a table for each section, since each of the questions depends on the section it is in. However, I would have ended up with about 24 tables with one-to-one relationships. I chose to create four tables, one for each page, to simplify my database. The queries that create the output are much simpler this way. That's a real-world example of a slightly denormalized database.
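Roughly speaking (the table and field names here are made up, just to show the shape), the choice was between these two layouts:

-- By-the-book: one table per section, each 1-1 with the assessment.
-- CREATE TABLE Nutrition  (AssessmentID INTEGER, NutritionQ1 VARCHAR(50), ...);
-- CREATE TABLE Disability (AssessmentID INTEGER, DisabilityQ1 VARCHAR(50), ...);
-- ...and about 20 more like these.

-- What I built instead: one table per page of the form.
CREATE TABLE AssessmentPage1 (
    AssessmentID INTEGER,
    NutritionQ1  VARCHAR(50),
    NutritionQ2  VARCHAR(50),
    DisabilityQ1 VARCHAR(50)
    -- ...remaining page-1 questions...
);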
 
No, you would not create a table for each section. Nor would you create tables with one-to-one relationships. This is not normalization. If attributes are true descriptors of the entity, then they belong in the table for that entity, not separated into other tables linked back to the entity in 1-1 relationships. This is poor db design.

There are two approaches you can take for your scenario.

1) "hard code" the questions into the table(s). That is, have a field to capture the value of each question. This is non-flexible and if the hospital needs to change or add questions they must call you (or some other db programmer).

2) Create tables and relationships that allow the user (i.e. the hospital) to define the questionnaire, both questions and answers. This method is flexible and allows the hospital to have any number of questionnaires (not just the single one you hard code) with an essentially unlimited number of questions for each questionnaire. You can even have it so that the user can define the type of question and the allowable answers for each. A minimal sketch of that structure follows.
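(The table and field names below are mine, not a prescription.)

CREATE TABLE Questionnaire (
    QuestionnaireID INTEGER PRIMARY KEY,
    Title           VARCHAR(100)
);

CREATE TABLE Question (
    QuestionID      INTEGER PRIMARY KEY,
    QuestionnaireID INTEGER,      -- FK to Questionnaire
    QuestionText    VARCHAR(255),
    AnswerType      VARCHAR(20)   -- e.g. 'YesNo', 'Number', 'FreeText'
);

CREATE TABLE Answer (
    AnswerID     INTEGER PRIMARY KEY,
    QuestionID   INTEGER,         -- FK to Question
    AssessmentID INTEGER,         -- identifies the patient assessment
    AnswerValue  VARCHAR(255)
);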

Normalization (at least in my experience) has NEVER called for the use of 1-1 table relationships. That type of relationship is usually used as a matter of convenience or for security purposes.
 
As a general rule, joins are fantastically more expensive than single-table accesses. Jet uses various strategies, and yes, sometimes a join will be cheaper, but very rarely.

This is particularly true where two or more joins are involved.
mike.stephens@bnpparibas.com
 
So, Gus. Does that answer your question?

I think one thing is painfully clear now. There are two types of information technology professionals in the world:
1. Those that think their way is the ONLY correct way to do things, and
2. Those that think their way is the BEST way, but not necessarily the only way to do things.

Personally, I don't trust anyone in that first group. They should know better.

 
One would have to be very stupid to think their way is the only way. I, myself, have violated the Forms of Data Normalization when I felt it expedient to do so. I have also regretted that decision many times and have found through painful personal experience that the Forms were written and designed to save me that pain.

I regret that you feel compelled to take our discussion personally. I found it very enlightening. You make some excellent points that are worth considering.
 
Yes, JJ, I got more answer than I bargained for. The benefit of this discussion to me was that it drove me to a book I've skimmed, but never read, for this digest of the "Rules of Normalization":

1) Field Uniqueness:
Each field in a table should represent a unique type of information.
2) Primary Keys:
Each table must have a unique identifier, or primary key, that is made up of one or more fields in the table.
3) Functional Dependence:
For each unique primary key value, the values in the data columns must be relevant to, and must completely describe, the subject of the table.
4) Field Independence:
You must be able to make a change to the data in any field (other than a field in the primary key) without affecting the data in any other field. (The crux of the "storing calculated fields" problem.)

From Running Microsoft Access 2000, Microsoft Press.

Self-instructing tyros (like me) want to "get started", sometimes at the expense of subsequent difficult and time-consuming repair and re-design. A general understanding of the rules (and the reasons behind them) would have saved me countless hours and expense 10 years ago when I learned to use Dick Pick's ahead-of-its-time system, and ended up with hundreds of unnecessary tables with repetitious (and often conflicting) data.

While tek-tips has been great for quick help at the point of attack, this discussion is one I'll keep for reference.

Thanks to all :)
Gus Brunston
An old PICKer
padregus@home.com
 