Tek-Tips is the largest IT community on the Internet today!

Members share and learn making Tek-Tips Forums the best source of peer-reviewed technical information on the Internet!

  • Congratulations IamaSherpa on being selected by the Tek-Tips community for having the most helpful posts in the forums last week. Way to Go!

SQL blanking column data when creating composite PK 3

Status
Not open for further replies.

1DMF

Programmer
Jan 18, 2005
8,795
GB
Hi,

Can someone explain why SQL would blank a column on a table, when I try to create a composite PK?

I have a table with 1.5M records (RecID, UserID, Area, Used), I want to replace the identity insert PK (RecID) with a PK made up of the three columns in the table and delete the old identity insert PK.

However, when I edit the table in Management Studio , remove the old PK and select the three columns (UserID, Area, Used) that I want for the composite PK, the UserID is blanked?

Basically it is completely corrupting the table when the PK index is created, why?

Thanks,
1DMF

"In complete darkness we are all the same, it is only our knowledge and wisdom that separates us, don't let your eyes deceive you."

"If a shortcut was meant to be easy, it wouldn't be a shortcut, it would be the way!"
Free Electronic Dance Music
 
Ignore me! - it wasn't, I had some records that had blanks in them and of course the composite key meant it brought them all to the top (ordering), and on initial visual inspection, I imagined the worse!

"In complete darkness we are all the same, it is only our knowledge and wisdom that separates us, don't let your eyes deceive you."

"If a shortcut was meant to be easy, it wouldn't be a shortcut, it would be the way!"
Free Electronic Dance Music
 
I would prefer do that from command line and do in steps
1. create tempararry table for with structure you need (new PK and without RecID).
2. copy data from original table into your new table (create and run insert into query)
3. if everything copied and no errors, drop original table
4. rename tempararry to original table name

in this case you will have everything under control. The GUI in Management Studio generating and executing sql scripts and sometime if script fail you can lost data...
 
Hi gk53,

That was exactly what I did, copy table, process duplicate table (several times till I got the code right)!

Then do a swap with the live table, I was just thrown by the blank UserID's when the compsite PK was applied and I quickly scanned the top records, I panicked as I thought they were all blank, as in my mind I wasn't expecting to have any blank UserID's in the table at all!

Turned out it was @ 200 records out of 1.5M, so I lost 200 records in the refactoring / restructuring, not bad at all really, especialy when those 200 records were useless anyway as they weren't linked to a user!

Those records whould never have even been created if the table had been in proper normal form with the correct constraints applied in the first place!

It all worked out beautifully in the end :)

1DMF



"In complete darkness we are all the same, it is only our knowledge and wisdom that separates us, don't let your eyes deceive you."

"If a shortcut was meant to be easy, it wouldn't be a shortcut, it would be the way!"
Free Electronic Dance Music
 
and sometime if script fail you can lost data...

I find it hard to believe that anyone has lost data this way. All generated scripts are within transactions that would be rolled back in case of failure.



-George
Microsoft SQL Server MVP
My Blogs
SQLCop
twitter
"The great things about standards is that there are so many to choose from." - Fortune Cookie Wisdom
 
I have to ask why you'd get rid of a surrogate primary key and replace it with a composite, information-filled primary key. Everything I know about database design says that unitary surrogate keys are the way to go.

Tamar
 
Well that isn't what the Open University just taught me?

The table wasn't in normal form, and wasn't even indexed, I am also archiving over 1M records to potentially data warehouse and aggregate. So this operational data table has just gone on a crash course diet!

If doing this yields issues, I can always re-introduce an auto-insert.

Can you explain why you believe introducing a unitary surrogate PK is preferable.

Thanks,
1DMF.

"In complete darkness we are all the same, it is only our knowledge and wisdom that separates us, don't let your eyes deceive you."

"If a shortcut was meant to be easy, it wouldn't be a shortcut, it would be the way!"
Free Electronic Dance Music
 
Like Tamar, I prefer surrogate keys.

Mostly this is a preference thing. There are people with very strong opinions on this (for both sides). If you're interested, I would suggest a google search on:

Natural Key vs Surrogate key



-George
Microsoft SQL Server MVP
My Blogs
SQLCop
twitter
"The great things about standards is that there are so many to choose from." - Fortune Cookie Wisdom
 
I tend to think along the same lines as George. My primary keys should have no actual meaning to the database. So a surrogate key (usually a seeded integer) works best. That way...if I wanted to replace the Primary Key value of 346 with a real, physical banana, my database and apps wouldn't care (except that might mess up my server when I shove a banana in it). [bigsmile]

=======================================
People think it must be fun to be a super genius, but they don't realize how hard it is to put up with all the idiots in the world. (Calvin from Calvin And Hobbs)

Robert L. Johnson III
CCNA, CCDA, MCSA, CNA, Net+, A+, CHDP
C#.NET Programmer
 
Reasons I prefer a unitary, surrogate key:

1) Natural keys change, no matter how much the client says they won't. Then, you have to cascade changes down to multiple tables.

2) Surrogate keys are guaranteed to be unique. Many items that people use as natural keys aren't really unique.

3) Gaps don't matter for surrogate keys. When users can see keys (as with natural keys), they sometimes get worked out about missing values.

4) Surrogate keys tend to be smaller than natural keys, especially composite keys.

5) Composite keys require complex join conditions.

The only real downside to surrogate keys is that you can't always just look at the data and read it. You have to join a few tables sometimes to see the data you want.

Is the info you got suggesting natural, composite keys available online? I'd be interested to see what they say.

Tamar

 
>Well that isn't what the Open University just taught me?
What do you mean with that? Earlier threads (I didn't read)?

I'd say the camp pro natural and contra surrogate keys is a dying species. Just look at the wikipedia article on natural keys having only three links as references, very few, very old links.

I prefer newid() as default of a uniqueidentifier field over identity, because of the ease to create those keys offline and replicate data or have distributed data bases or offline data and later merge data back together. Uniqueidentifiers get rid of reason 3) of Tamars list, too.

Nothing speaks against combining a natural key from several columns to a secondary key, which also has the constraint to be unique, but would never be used as foreign key.

Some people pro natural keys highlight the documenting nature, they don't cascade pk changes to a foreign key by intention, to keep order data as printed, eg. It's a wrong argument for natural keys, but it's a valid concern. If you legally have to maintain 10 years of order data of customers, you also have to keep delivery addresses etc. as they were at the time of the order to reproduce them in case it would matter years later. You solve that problem in different ways, eg by keeping audit trail data.

Bye, Olaf.
 
I scoured the OU unit documentation (5 units of 912 pages) and found 1 tiny extract regarding surrogate keys on page 167-168 of unit 4, this course is a 10 month level 3 honours degree course, and this concept was mentioned once, in passing as a side note and certainly not endorsed as a requirement in any of the assignments, course work or exam?

Primary keys
In Table 4.1 where we summarised the transformation of an entity–relationship model into an SQL database definition, we stated that the identifiers of entity types became the primary keys of the corresponding tables. However, in the case of composite identifiers that comprise several attributes, this direct transformation may result in an inefficient database design. This is because of the overheads associated with indexing primary key values to facilitate, for example, the joining of tables to realise the relationships between those tables.

During the database design task, we may decide to replace a composite primary key with a single-column primary key to improve the overall efficiency of the design. Such primary keys are known as surrogate keys because they are invented solely for the purpose of distinguishing rows of a table and have no meaning within the enterprise for which the database is being built. Their values are often generated by the DBMS. SQL Anywhere enables you to define a numeric column as DEFAULT AUTOINCREMENT. This has the effect that when a new row is inserted, it does not need a value for this column because a new unique value is generated automatically by adding one to the current maximum for the column. A surrogate key is usually immutable, that is, the value does not change while the row exists....

...The main advantage of surrogate keys is improved indexing, which facilitates DBMS operations such as searching and joining tables. The main disadvantage is that surrogate keys are usually unrelated to the other data stored in a row of a table because they have no meaning within the enterprise for which the database is being built.

I have highlighted the fact that all this was caveated with the word may, it may be a problem and we may use a surrogate key. I certainly don't get from this tiny mention in a massive in-depth 10 month course that this is a standard way of doing things as most of you seem to imply?

2) Surrogate keys are guaranteed to be unique. Many items that people use as natural keys aren't really unique.
The fact someone incorrectly designed the table and set as the PK something that wasn't unique is not a reason to implement a surrogate key, a key is a key and is unique if it isn't unique it isn't a key.

I tend to think along the same lines as George. My primary keys should have no actual meaning to the database.
According to the OU that's what is wrong with surrogate keys, it's not a reason to use them, it's a reason not to.

Also the refactor I performed was on a table that has temporal data , by its nature the temporal attribute makes each record unique and is also non-operational data. The table is simply 'A member accessed B part of the members extranet on C date and time'.

(A = UserID) and is an FK to the members table, the area used is free text (B = Area) and there is a date/time (C = Used), this table when accessed is for historical purposes and will only ever be joined to the Members table for user / company name details.

To me a surrogate key in this situation is superfluous, has no meaning and won't affect system performance due to the historical / temporal nature of the data and the limited amount of joins and periodic access required.

Also the composite key ensures uniqueness and completeness of the data being inserted into the table, problems I noticed occurred when I had a surrogate key. Yes of course I know a unique index would need to be implemented across the main attributes to enforce this unique constraint, but that is creating a unique index, no different than a PK isn't it? So the composite key negates the need to create a unique index constraint as the PK will do it anyway. Also does having two unique index constraints (one PK and one across the composite attributes) cause performance issues or perhaps enhance them?

Of course, if you disagree or have any input I would be very grateful as I am keen to ensure I correctly apply the knowledge gained from my OU course, and of course forget anything they may have taught incorrectly!

Kind regards,
1DMF




"In complete darkness we are all the same, it is only our knowledge and wisdom that separates us, don't let your eyes deceive you."

"If a shortcut was meant to be easy, it wouldn't be a shortcut, it would be the way!"
Free Electronic Dance Music
 
[quote}
2) Surrogate keys are guaranteed to be unique. Many items that people use as natural keys aren't really unique.
[/quote]

The fact someone incorrectly designed the table and set as the PK something that wasn't unique is not a reason to implement a surrogate key, a key is a key and is unique if it isn't unique it isn't a key.

Imagine for a moment that you pick a natural key that you think would work perfectly as a primary key. For example, suppose you choose to use Social Security Number for the primary key in the employees table. Now fast forward 5 years... your company just purchased a subsidiary in Spain, where employees don't have a social security number. What do you do? The other problem with social security numbers is that you cannot always trust that you get the "real" number. I was in a meeting with the New Jersey Department of Education. Someone suggested using SSN (for students) as a primary key. To paraphrase the response, "we don't always get legitimate SSN's, particularly from illegal immigrants".

Picking natural keys is a lot more difficult than most people realize. It's easy to get it wrong, and difficult to fix. Since surrogate keys don't have any real meaning, and every DBMS has an auto number column type, they are easy to implement and never need fixing.

There is another problem with natural keys not mentioned in the documentation that you shared. Arguably, this would only affect large tables, but it can be a real problem for scalability. By default, SQL Server creates a clustered index on your primary keys. This means that the data is physically sorted by your primary key. If you are dpoing a lot of inserts in to large tables, then you will end up with a lot of page splits, which is bad for database size and bad for performance. This problem can be mitigated by setting a reasonable value for your index fill factor (something like 80%). This will leave 20% blank space but will also leave room for row inserts.

With surrogate keys, rows are sorted by an identity column so new rows are always added to the end of the table. This will not cause any page splits but will open you up to "hotspots" in your database. If you are doing multiple simultaneous inserts, then multiple processes will be trying to write to the same 8k page on disk at the same time. This can lead to locking & blocking conditions.

All things considered, I still prefer surrogate keys. As I stated earlier, the natural vs. surrogate topic is almost like politics and religion. Conversations to be avoided if possible.


-George
Microsoft SQL Server MVP
My Blogs
SQLCop
twitter
"The great things about standards is that there are so many to choose from." - Fortune Cookie Wisdom
 
Thank you George, as always your input is most appreciated.

I can see that choosing the correct natural key may be difficult and awkward to alter, but always using a surrogate key was definitely not the way the OU teaches, nor what normal form / relational theory teaches.

As I stated earlier, the natural vs. surrogate topic is almost like politics and religion. Conversations to be avoided if possible.
I didn't realise this was such a bone of contention, it certainly seems at complete loggerheads with what the theory and the OU are teaching, but as I am constantly finding, academia and the real world of actual implemented usage seem to be at opposite ends?

If the theory doesn't work in practice, why is it taught?

Though de-normalising tables is a prime example of the theory not working, but we were told only to optimise when problems occur, not to anticipate them by creating de-normalised tables in advance. I can see issues if you have a composite key of say 'four' attributes for example and then you want to implement a relationship between another entity, now the refer3encing table would require an additional four attributes to be stored just so it could implement the referential FK integrity, which is creating a lot of duplicate data. So yes, careful design, consideration of data usage and future relational integrity needs to be considered. I must also admin, I don't like the idea of a PK changing, so the usually immutable nature of a surrogate key feels better in some circumstances.

I'm just curious why in my particular usage scenario, removing the surrogate key was considered a bad move? I'm not sure it's an issue when you look at it in context of the data being modelled and the usage of the entity in question. Though I could be wrong and it may come back to bite me, but as I mentioned, re-introducing an auto-increment surrogate key to the table would be a 2 second job!

and every DBMS has an auto number column type
Not strictly true, PostgreSQL doesn't have an 'auto increment', it has a 'serial' or 'big serial' equivalent to 'Int' and 'BigInt, but actually the way it does this is by using sequences. You can set an attribute as serial, but if you removed the sequence, it wouldn't 'auto increment'. Keeping track of all those surrogate key sequences I feel could be rather tricky when dealing with a large amount of tables.

Does anyone have any experience of this with PostgreSQL?




"In complete darkness we are all the same, it is only our knowledge and wisdom that separates us, don't let your eyes deceive you."

"If a shortcut was meant to be easy, it wouldn't be a shortcut, it would be the way!"
Free Electronic Dance Music
 
1DMF - I'm curious about when the material you cited was written.

Tamar
 
I join Tamar, your material must be very old.

For example:

The main disadvantage is that surrogate keys are usually unrelated to the other data stored in a row of a table because they have no meaning within the enterprise for which the database is being built.
This is totally wrong, it's an advantage a pk has no other meaning than the key it is for the record. You're learning a wrong thing. Outdated.

I can understand your position, as you don't know better, but this learning material should be replaced.

In regard to "Many items that people use as natural keys aren't really unique.": Of course the database also verifies the uniqueness of a natural key and so you can be sure your data has no double value, but it's not that, what Tamar addresses, it's the general idea. The most famous example is take first name and last name as a natural key. This can be unique for a small number of persons, but you know there are common first and last names and once a person with same name needs to be entered into your database you have not only the problem to extend it in the main table, eg with birthdate, but also need to add this in tables where the natural key is used as foreign key, those tables now need to store the birthdate, too. Now comes people migrating from countries often using 1.1. as there birhdate, you still have no real unique key, because the possibility exists, you get a conflict. You will need to store a wrong name or birthdate to keep the key uniqueness, this is nonsense, isn't it?

Bye, Olaf.
 
>Also does having two unique index constraints (one PK and one across the composite attributes) cause performance issues or perhaps enhance them?

Typically the index on a identity or serial surrogate primary key can be used to optimize joins, the key on your composite secondary key can be used to optimise sorting and filtering, you typically introduce many indexes to optimize queries anyway and so this is no disadvantage, too.

Like George mentioned, the clustered index and paging are an argument for MSSQL, but not every databaase organises it's data in pages. This is a disadvantage of this construct of a clustered index.

But don't take clustered index and primary key as synonyms, you can have a primary key, that is unclustered and chose composite attributes to make a clustered index better for querying data. Also see
Bye, Olaf.
 
1DMF - I'm curious about when the material you cited was written.
Well the front cover shows
First published 2007, second edition 2009.
Copyright ª 2007, 2009 The Open University

I also know that 2014 is the last year you can take this course as it is being retired.

This is totally wrong, it's an advantage a pk has no other meaning than the key it is for the record. You're learning a wrong thing. Outdated.

I don't know how you came to this conclusion, though I can accept that some of the course may be dated, BCNF, set theory and data modelling concepts taught on this course have been around since the 70's (that's 18 & 19).

On an ERD you don't show FK's and only model the domain of discourse, surrogate keys have no meaning and therefore have no existence in the domain of discourse.

A primary key is selected from the candidate keys of the relation, the relation is made up of just the data being modelled in the domain of discourse.

I've not come across a single example in any of the OU courses covering normal form and functional dependencies where it state you should introduce a surrogate alternate key and make it the primary key?

Of course, data modelling and relational theory are separate to actual implementation and implementation is environment specific.

So is this type of implementation of creating surrogate alternate keys and setting them as the primary key MS SQL implementation specific?

They also state the following in Unit 2 page 12...

Note that the choice of primary key is determined by the meaning of the relation rather than the particular values of the data.

The accompanying information on the meaning of a relation is as follows...

The meaning of a relation can be defined by specifying when a given tuple belongs to the relation by means of a natural language predicate. This defines feasible tuples in terms of the value of the primary key and is best illustrated by an example, as below.

<a, b, c> is a tuple of Enrolment if and only if a student with a StudentId of ‘a’ enrolling on a course with code ‘b’ does so on date ‘c’.

How can an attribute that has no meaning in the relation be a primary key based on this specific concept?

The most famous example is take first name and last name as a natural key

Why? these cannot be natural keys, there are loads of people called John Smith, Joe Blogs etc.. This was all covered as part of the basics on this course. I can see from George's example that things such as N.I. numbers being an issue and I think the biggest problem is ensuring you have a valid one!

However typo's and invalid data doesn't form part of relational theory. It has to assume, if you provide a value for something, the data provided is accurate as long as it comes from the correct data set, of course, it is probable you would create a domain for these values and have data validation check constraints applied, which would help eliminate some errors, but of course not all.

However, data validation / integrity wasn't really dealt with on this course, if it looks like an N.I. number, is formatted as a valid N.I. number and is unique in the entity for the domain of discourse being modelled, how can you possibly know it is wrong?

But when you model your data, you have to ask questions like, "Will I always accurately know this piece of information", if the answer is 'no' then it cannot be a PK by the definition of what a PK is.

Not everyone knows what their N.I. number is, and as PK's can't be NULL... it's ruled out from being a PK.

But then again in relational theory, nothing is allowed to be NULL, the only time NULL plays a part is during implementation of posted foreign keys for optional participation in a relationship.

All things considered, I still prefer surrogate keys. As I stated earlier, the natural vs. surrogate topic is almost like politics and religion. Conversations to be avoided if possible.

as they say, never a truer word spoken in jest!

I cannot believe a reputable organisation such as the OU, give out computing degree qualifications for courses they have taught for years and it is all totally wrong?

Considering the extremely high scores I achieved through out the course , I can't believe I didn't understand what they were teaching, and it certainly isn't what is being suggested here?

I can see the need at times to introduce a surrogate PK if and only if no there are no obvious candidate keys to select from for the data being modelled. However, I can't see the need to arbitrarily introduce a surrogate alternate key on every single entity in the domain of discourse as common practice, based on the set theory and relational modelling I have been taught.

But seriously do I need to ask the OU for my money back for having just wasted 10 months of my time doing a course that teaches something that isn't correct or usable in the real world of computing?

I understood what they taught, so the only explanation can be what they taught was wrong, if that's the case!

"In complete darkness we are all the same, it is only our knowledge and wisdom that separates us, don't let your eyes deceive you."

"If a shortcut was meant to be easy, it wouldn't be a shortcut, it would be the way!"
Free Electronic Dance Music
 
I don't think you wasted 10 months learning wrong or outdated material. There is nothing inherently wrong about natural keys. In my opinion, the only thing WRONG is learning one thing and applying it to everything. Natural keys have their place and surrogate keys have theirs.

-George
Microsoft SQL Server MVP
My Blogs
SQLCop
twitter
"The great things about standards is that there are so many to choose from." - Fortune Cookie Wisdom
 
Status
Not open for further replies.

Part and Inventory Search

Sponsor

Back
Top