
Design Question: Too much normalization? Where do you draw the line?

Status
Not open for further replies.

flugh (Technical User, US) · Aug 23, 2002 · 655 posts
I am trying to take my database (currently Access 2000, soon to be MySQL 3.53, hopefully PostgreSQL once I get a grip on it) from the dark ages of weekend hacks to a solid, well-formed and normalized pile of data. Something I want to do is be able to perform some cost-analysis by taking revenue generated and subtracting payroll, truck cost, etc. This should get me to a day-by-day representation of income versus cost of doing business. Hopefully.

Anyway! I am considering having a "profit center" with daily entry per client at the top of my "food chain" in one table. Then each truck's daily run linked to that row in the "profit center" table. I'll also have another table with non-truck related payroll (warehouse hours, etc) linking each entry to the profit center table. Just a thought here, on to the question...

How far do you go to get extreme normalization? I am looking at the date for deliveries made, and getting that about 3 related tables down to payroll (does that make sense?). If the date is in the "profit center" row, and the truck's daily trip is related to that profit center record, do I put the date in the delivery record (3 related tables down) in order to make an easier query (that's 3 LEFT JOINs right?) to find out a day's payroll, do cost analysis, etc?

Sorry if this makes no sense. I'm not formally educated in this stuff, self-taught from help files and the internet at large (Tek-Tips being a big chunk of my learning!).

Ugh, previewed and couldn't make sense of it myself!
Code:
ca_pcenter
ID, client, date

man_trips
ID, pcenter, truck, s_miles, e_miles

inv_stops
ID, trip, name, zip

inv_bill
ID, OID<inv_stops.ID>, client, type, fee
Hope that helps. See, inv_bill is "Complete, billing the client for" stuff. In order to generate an invoice for a date range, I'd have to do JOINs across 4 tables to get back to my original date. But having the same date repeated in multiple places is "just plain wrong" and should be avoided if possible.

So, where do you draw the line between normalizing, and the aggravation and overhead of getting all the details for your data?
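The four-table walk described above can be sketched concretely. Below is a minimal illustration using SQLite in memory (Python's built-in `sqlite3`), with table and column names taken from the schema in the post; the sample rows are made up for illustration only.

```python
import sqlite3

# Build the four tables from the post's schema, in memory.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE ca_pcenter (ID INTEGER PRIMARY KEY, client TEXT, date TEXT);
CREATE TABLE man_trips  (ID INTEGER PRIMARY KEY, pcenter INTEGER, truck TEXT,
                         s_miles INTEGER, e_miles INTEGER);
CREATE TABLE inv_stops  (ID INTEGER PRIMARY KEY, trip INTEGER, name TEXT, zip TEXT);
CREATE TABLE inv_bill   (ID INTEGER PRIMARY KEY, OID INTEGER, client TEXT,
                         type TEXT, fee REAL);

INSERT INTO ca_pcenter VALUES (1, 'acme', '2002-08-01');
INSERT INTO man_trips  VALUES (1, 1, 'truck7', 100, 180);
INSERT INTO inv_stops  VALUES (1, 1, 'Smith', '44101');
INSERT INTO inv_bill   VALUES (1, 1, 'acme', 'delivery', 25.00);
""")

# Billing lines for a date range: three joins walk inv_bill back up
# to the date, which is stored exactly once, in ca_pcenter.
rows = conn.execute("""
    SELECT p.date, b.client, b.type, b.fee
      FROM inv_bill b
      JOIN inv_stops  s ON b.OID     = s.ID
      JOIN man_trips  t ON s.trip    = t.ID
      JOIN ca_pcenter p ON t.pcenter = p.ID
     WHERE p.date BETWEEN '2002-08-01' AND '2002-08-31'
""").fetchall()
print(rows)   # [('2002-08-01', 'acme', 'delivery', 25.0)]
```

This is the cost being weighed: the date lives in one place, so every invoice query pays for three joins to reach it.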

----
JBR
 
For Access databases, third normal form is usually optimal. Beyond that you will just be building queries to join your data back together so you can get the results you need.

HTH,

Steve
 
Please excuse my ignorance :) If third normal form is optimal for Access, what would you suggest for a 'real' RDBMS like PostgreSQL, MySQL, MS SQL, etc? I am really looking for a good long-term strategy here, as I can see shortcomings and want to tame this animal for the big things the future holds (multiple clients, types of work, etc). I am at about 60,000 records now, and am under the impression Access implodes after 100,000 or so. Plus internet connectivity is being set up (by me, the GM/IT guy :) ) as time permits (and as the boss relents and gives me internet access at remote locations).

Would the impending migration to a different SQL backend change your way of looking at this? There's a good chance Access will stay on the front-end in a couple locations, but the end result is going to be a SQL server back-ending Apache and PHP-driven scripts.

----
JBR
 
JBR,

The reason third normal form is where I stop in the case of Access is performance. This will hold true for all RDBMS applications as well.

The trade-off is this: if normalization is carried out to the nth degree, which would be Boyce-Codd, then the overhead of the multitude of queries needed to denormalize the data so it can be related in a logical manner becomes huge.

The best approach regardless of the system being used is common sense.

Understanding that the data needs to be normalized so it will maintain its integrity and usefulness and can be related, while not breaking the data down into atomic single records, is where the balance has to be found.

The other part of the equation for you would be knowing if and where any bottlenecks exist as you are pumping data through the system.

The number of records that Access can handle will depend greatly upon the number of fields and how calculations in queries on those fields occur. That being said, 100K is a reasonable benchmark to go by.

I know that this does not give you any hard and fast rules, but that is the nature of normalization: the forms are defined, while the level to which you apply them is not.


Steve
 
SteveR77 said:
the nth degree which would be Boyce-Codd
Care to provide a reference to substantiate that statement? To the best of my knowledge, BCNF does not take into account the multi-valued dependencies that are addressed in the 4th and 5th normal forms.

Good Luck
--------------
As a circle of light increases so does the circumference of darkness around it. - Albert Einstein
 
That reminds me of an exchange that took place during the Q&A session after a lecture by the late Dr. Codd himself. He had just finished addressing the rarity of 5th normal form situations when someone asked a question about a "fully normalized database."

Dr. Codd's response was "As soon as I know what a fully-normalized database is, I'll let you know."

Good Luck
--------------
As a circle of light increases so does the circumference of darkness around it. - Albert Einstein
 
Wow, you folks sure sound like you know your stuff! I appreciate the information. Some good search terms to throw at Google :)

I've been a 'hack-of-all-trades' for several years now, never really finding my niche in the IT field. But the last year or so working on some small personal projects with MySQL and this work project based on Access has me thinking that database design, etc is where I can excel. It seems to come pretty naturally, just that I tend to get too bogged down in little details and end up not getting much done because I spent all day worrying over a little minor thing ;-)

Well, enough of that. Back to the drawing board. As if there's not enough pain to endure already, I'm tinkering with the idea of rewriting the front-end of this animal in Visual Basic 6. Woo-hoo!

Thanks again!

----
JBR
 
Normalization was designed for the purpose of minimizing storage space and for ensuring data integrity.

If your data is more often queried than entered, then normalization even at the third level will slow down your application. This is not to say you shouldn't normalize, because data integrity is critical. But it is to say you should always consider whether this can be done a different way, without normalization, to improve performance.

If you look at data warehouses, they are almost never normalized, because they receive their data from another, normalized database, so integrity is not a problem, but query efficiency is. (Data storage space has gotten cheap enough that it is not the concern it used to be.)

We now have other tools for ensuring data integrity as well. So what you need to do is balance all the competing needs as you design. I find it a good rule of thumb to consider how many levels of joins I will need to get the data back out. If it is more than 3 or 4, I will start to look at ways to denormalize which can minimize my joins.

Also, if the only information I might need from the parent table in many of my queries on the child table comes from only two or three fields in addition to the key, I will probably transfer the information and store it in two places to speed up querying. Make sure when you do denormalize that you use triggers to ensure the data remains in sync. I find this is especially true when using lookup tables like a list of states. If I'm going to use the long name of the state, I store it in the data table rather than (or in addition to) the 2-letter abbreviation, to remove a join to the lookup table. When I first learned databases, space was at a premium and I would not do this, but now it is so much more efficient to avoid a join to a lookup table that I almost always do so. (I still use lookup tables extensively, as they are good data sources for combo boxes.)
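The trigger-based sync described above can be sketched in SQLite (via Python's `sqlite3`). The tables here are hypothetical: a `states` lookup plus a `customers` table carrying a denormalized copy of the state's long name.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE states    (abbrev TEXT PRIMARY KEY, state_name TEXT);
CREATE TABLE customers (ID INTEGER PRIMARY KEY, name TEXT,
                        state_abbrev TEXT, state_name TEXT);

-- If the lookup row ever changes, push the new long name into every
-- customer row that duplicates it, so the copies never drift apart.
CREATE TRIGGER sync_state_name AFTER UPDATE OF state_name ON states
BEGIN
    UPDATE customers
       SET state_name = NEW.state_name
     WHERE state_abbrev = NEW.abbrev;
END;
""")
conn.execute("INSERT INTO states VALUES ('OH', 'Ohio')")
conn.execute("INSERT INTO customers VALUES (1, 'flugh', 'OH', 'Ohio')")

# Change the name in one place; the trigger repairs the denormalized copy.
conn.execute("UPDATE states SET state_name = 'The Buckeye State' "
             "WHERE abbrev = 'OH'")
name = conn.execute("SELECT state_name FROM customers WHERE ID = 1").fetchone()[0]
print(name)   # The Buckeye State
```

The same idea carries over to any backend with triggers (PostgreSQL, SQL Server, etc.): the denormalized column buys a join-free read, and the trigger keeps the redundancy from corrupting integrity.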

As in all program design tasks, the critical element is to use your judgement; there is no one-size-fits-all. As DBAs and programmers, judgement is one of the main things we are paid for. If you have given thought to why you make a choice and are clear on the reasons, then the choice is probably correct. Doing something just to arbitrarily meet some artificial rule (like third normal form) without understanding the reasons why the rule was formulated in the first place will often give you a poorly designed system, even if it appears to meet the rules.

Always question the assumptions behind the rules you know, and ask if they still apply or if there is a different means to the same end. If you look at something you did five years ago and can't see ways you would improve it now, you aren't learning enough.
 
Excellent help folks, I thank you all.

SQLSister said:
If your data is more often queried than entered, then normalization even at the third level will slow down your application. This is not to say you shouldn't normalize, because data integrity is critical. But it is to say you should always consider whether this can be done a different way, without normalization, to improve performance.

That's interesting. A couple years ago someone pointed me in the direction of a Google search for 'third form normalization', and led me to believe that was the way EVERY database should be: each piece of data stored in exactly one place, and all linked together by keys (that's not a verbatim quote, just the general message I got from him). Sounds like I may be sacrificing some performance by being over-efficient with my data.

I will be entering some basic deliveries, entering some payroll, then linking the two via a middle table. That's nice, but where I'm stressing is where to keep the date the deliveries were done. It seems trivial, but my impression is that to have the date in two different places is just plain wrong. The method of updating it via a trigger is excellent, although I suppose that would involve moving from MySQL to PostgreSQL for my long-term solution (something I already wanted to do, but hadn't really pushed yet).

It sounds like SteveR77 hit it on the head with "The best approach regardless of the system being used is common sense". I guess I'm just suffering from a severe case of "3rd normal form of over-analyzing the situation" and need to just sit down and do it.

----
JBR
 
Mixing apples and oranges here. The relational data model and the process of normalization are a logical model and process; they are independent of the physical storage layout.

You should always normalize your logical database schema, to at least third normal form, agreeing with SQLSister that you should understand what you're doing in addition to following the rules. You need that to understand the schema and the semantics of the data.

When it comes to how best to physically implement that schema, you then bring into account usage factors, performance requirements, and the like to achieve efficiency. This also facilitates the ability to create a logical-schema-to-physical-storage mapping, making it much easier to identify those areas where physical implementation decisions could affect data integrity and the like.

Good Luck
--------------
As a circle of light increases so does the circumference of darkness around it. - Albert Einstein
 
I want to make it clear that I'm not saying you shouldn't normalize (my databases are mostly normalized), just that you should consider other factors as well. Some people use normalization as the goal, but it isn't - performance and data integrity are the main goals. Normalization is just a tool used to achieve the goal of data integrity. It is good to design a normalized structure first and then determine what, if any, deviations you want to make. As Cajun said, it is easier to identify which changes to make and make sure integrity is maintained if you do it this way.

Never denormalize without an alternate way to ensure data integrity in place. Corrupted data integrity is the second worst data disaster that can befall you (hard drive failure with no backup being the first).
 
Here is something about Normalization written by Jerry Dennison (UtterAccess Guru). It has helped me many times...

Strictly speaking, addresses do not describe the entity that is people. And many would espouse that you separate addresses into their own table and relate them back to the people that reside at that address at that time. This is one of those grey areas where we all tend to violate the forms to a degree. This in no way implies that we should violate them with impunity. Now, when it comes to the definition of entity, attribute, and relationships these are actually quite clearly defined (they're just extremely difficult to understand in the raw form). Database objects have absolutely nothing to do with the data model. It is the data model that the definitions detail as far as I'm concerned. I may as well try to spell out my understanding of the Forms in as simple terms as possible.

First Normal Form:
Every table should have a Key <---- this means that each record must be uniquely identified
All ATTRIBUTES must be atomic <----- this means that there should be no repeating groups within a field (i.e. multiple values within a field separated by a comma or other delimiter) Strictly speaking, this phrase has nothing to do with repeating groups of fields or tables, that is actually covered (however esoterically) by other Forms. BUT, I tend to expand the meaning of this Form to include repeating groups wherever they may be found.

Second Normal Form:
Must be in First Normal Form <---- the forms are hierarchical; each is dependent on the one before it
A RELATION is in second normal form if each ATTRIBUTE is fully functionally dependent on the ENTIRE primary key <---- this means that no subset of the key can determine an attribute's value

Third Normal Form:
Must be in Second Normal Form
A RELATION is in third normal form when no non-key attribute is dependent on any other non-key attribute <---- this and 2NF are the primary Forms that prohibit the storage of calculated values or transitive dependencies.

BCNF
Must be in Third Normal Form
All candidate keys must satisfy the test for third normal form <---- a candidate key is of itself a potential unique identifier of the entity; generally speaking, candidate keys are multi-field constructs. This does not mean you should use a candidate key as the PK, it means that it could satisfy the requirements of uniqueness. For most entities, there are many candidate keys.

Fourth Normal Form
Must be in 3NF/BCNF
There can be no nontrivial multivalued dependencies in a relation <---- This is a fairly common reduction that most people achieve without even knowing it. This form prohibits independent multivalued components of the key. For example, if an employee can have many skills and many dependents you would move the skills and dependents to separate tables as they are not related in any way.

Fifth Normal Form
Must be in Fourth Normal Form
This is the Nirvana of DB design and is seldom reached. Basically, it advocates that you continue splitting the structure down until one of two states exists: either you've split so far that the resulting tables could not be joined to reconstruct the original, or further splitting would be trivial.

Naturally, this doesn't even come close to really describing what you're trying to accomplish. There are also definitions that need to be understood, specifically around what an entity is, what an attribute is, what a relation is, and functional dependency (a tough one; the formal definition of Functional Dependence is: for any relation R, attribute A is fully functionally dependent on attribute B if, for every valid instance, the value of B determines the value of A). Then of course there's multivalued dependency, trivial dependency, and last but not least candidate key.

Now, to boil all of this down to something usable. You must ask yourself the following:

Am I repeating groups? These can be multiple values in a single field, repeated TYPES of fields that share the same datatype and a common root name or root structure (these are the multiple date fields in Drew's db), or repeating tables of the same entity type (these can be noticed because they generally have the same fields but in different tables).

Am I trying to store derived or calculated values? (we should all know by now not to store calculated values)

Do I have multiple tables with more than one index based on multiple fields? (this usually indicates combining of entities; remember: a table is an entity and an entity a table)

Do I have a large number of values being repeated in a column (field)? This does not include FKs linked to another table's PK. It does include just about anything else. If you find that you're repeating a lot of values, then you MAY need to move this to its own table. This particular question is lowest in priority.
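The first question in the checklist above (repeating groups) can be shown with a small before/after sketch in SQLite via Python's `sqlite3`. The employee/skills tables are hypothetical: the "before" table packs skills into numbered columns sharing a common root name, and the "after" design moves them to a child table so any number of skills fits without schema changes.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Before: repeated TYPES of fields sharing a common root name (1NF violation).
CREATE TABLE emp_flat (ID INTEGER PRIMARY KEY, name TEXT,
                       skill1 TEXT, skill2 TEXT, skill3 TEXT);

-- After: one row per skill, keyed back to the employee (1NF).
CREATE TABLE employees  (ID INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE emp_skills (emp INTEGER REFERENCES employees(ID), skill TEXT);

INSERT INTO employees  VALUES (1, 'flugh');
INSERT INTO emp_skills VALUES (1, 'Access'), (1, 'MySQL'), (1, 'PHP');
""")

# A fourth skill is just another row here; in emp_flat it would have
# required adding a skill4 column to the table.
skills = [r[0] for r in conn.execute(
    "SELECT skill FROM emp_skills WHERE emp = 1 ORDER BY skill")]
print(skills)   # ['Access', 'MySQL', 'PHP']
```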


"I know what you're t'inkin', ma petite. Dat Gambit... still de suave one, no?"
 
jfgambit: you should paste that into a FAQ for this forum ;-)

Thanks for all the input. It's helped me rethink a lot of my layout, moving away from the 'kill yourself trying to be super-normalized common-sense be damned' approach I've been taking so far.

It's late, I'm tired. Y'all have a good night. Thanks again!

----
JBR
 
jfgambit has it correct. The 2nd and 3rd normal forms are the mantra that NON-KEY attributes must depend "on the key, the whole key and nothing but the key, so help me Codd".

4th-5th normal forms are intimidating because of terms like multi-valued dependency (MVD). However, I think that the 4th/5th forms basically boil down to this: "don't finesse 2nd and 3rd form compliance by moving NON-KEY attributes into the key." That's it; note that the mantra doesn't say anything about KEY attributes. If you use the minimum key (not a superkey) and follow the mantra, you should be in good shape. I agree with jfgambit when he said that this is "a fairly common reduction that most people achieve without even knowing it."

Bill Kent, one of the fathers of normalization theory, has a nice, accessible article entitled "A Simple Guide to Five Normal Forms in Relational Database Theory".
Regards,
-- Duke Ganote
 
SQLSister,
Normalization was designed for the purpose of minimizing storage space and for ensuring data integrity.

This might be nitpicking, but I would take issue with the assertion "...for the purpose of minimizing storage space." More accurately, this should be stated as "to eliminate data redundancy." Although this might seem to be a fine distinction, the two statements are not equivalent: eliminating redundancy itself promotes data integrity, while a similar benefit cannot be derived from physical efficiencies.

And, after all, if the goal was to minimize storage space, we wouldn't be using random access methods to begin with.
 
Hi there, I understand Normalisation to a degree, as I've always worked with data in 3NF. However, I do not understand BCNF, 4NF or 5NF, and would love to. If anyone could explain them in terms of a beginner, I'd be eternally grateful. Thanks :)
 
"Hack-of-all-trades."

That's me. I love it, flugh.

Best I've had up till now as a generalist, is "wearing many hats".
 