
SQL blanking column data when creating composite PK


1DMF

Programmer
Jan 18, 2005
8,795
GB
Hi,

Can someone explain why SQL Server would blank a column's data on a table when I try to create a composite PK?

I have a table with 1.5M records (RecID, UserID, Area, Used). I want to replace the identity PK (RecID) with a PK made up of the other three columns, then delete the old identity column.

However, when I edit the table in Management Studio, remove the old PK and select the three columns (UserID, Area, Used) that I want for the composite PK, the UserID column is blanked.

Basically, it completely corrupts the table when the PK index is created. Why?
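
For reference, here's the change I'm attempting expressed as a script rather than through the designer (the table and constraint names are illustrative; I've called the table Tracking_Log here):

Code:
-- Illustrative only: table and constraint names are assumptions.
BEGIN TRANSACTION;

-- Drop the old identity-based primary key.
ALTER TABLE dbo.Tracking_Log DROP CONSTRAINT PK_Tracking_Log;

-- Add the composite primary key across the three remaining columns
-- (they must already be NOT NULL).
ALTER TABLE dbo.Tracking_Log
    ADD CONSTRAINT PK_Tracking_Log_Composite PRIMARY KEY (UserID, Area, Used);

-- Remove the now-redundant identity column.
ALTER TABLE dbo.Tracking_Log DROP COLUMN RecID;

COMMIT TRANSACTION;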

Thanks,
1DMF

"In complete darkness we are all the same, it is only our knowledge and wisdom that separates us, don't let your eyes deceive you."

"If a shortcut was meant to be easy, it wouldn't be a shortcut, it would be the way!"
Free Electronic Dance Music
 
Thanks George,

So the skill is knowing which one to use and when! (Just like your signature says!)

I emailed my tutor for clarification of this, and for guidance on further learning, so I can be better equipped to make such decisions during the design phase.

Happy new year,
1DMF.
 
I also don't think you wasted your time, just because of this detail.

We may conclude it's your choice where to use natural and where surrogate keys, but it's my choice, and the choice of a great many people, to use a separate single field, whether uniqueidentifier, identity or serial.

The reasoning is short and quite simple: you never know which piece of data in your table is wrong or will change at some point. You may argue a change can be cascaded and that's good enough, but once databases are decentralized and distributed, that can become a very hard problem.

Here is an example of an article tending towards surrogate keys without being religious about it:

Preceding the 10 tips, he writes:
Be practical, be reasonable, be realistic, and use the key you can apply with the most confidence.

I don't have 100% confidence in any data: not only could it be wrong, but data and its meaning can change over time. Even if you have a corporate product code or other unique key, some day you're asked to maintain data about products before they get a product code, and for some reason assigning a product its code is not the first step taken. You simply can't go wrong with a surrogate key in every table. There will never be a need to change something that doesn't belong to the real data, so there will never be a need to cascade a change of it, as it never changes. A single extra field and a single additional index don't introduce a size or performance problem.
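
As a minimal sketch of what I mean (all names invented), the surrogate key insulates every reference from changes to the real data:

Code:
-- Sketch: the natural attribute (ProductCode) may be missing at first
-- or change later; nothing references it, so nothing cascades.
CREATE TABLE Products (
    ProductID   INT IDENTITY(1,1) PRIMARY KEY,  -- surrogate, never changes
    ProductCode VARCHAR(20) NULL,               -- natural, assigned later
    Name        VARCHAR(100) NOT NULL
);

CREATE TABLE OrderLines (
    OrderLineID INT IDENTITY(1,1) PRIMARY KEY,
    ProductID   INT NOT NULL REFERENCES Products (ProductID),  -- unaffected by code changes
    Quantity    INT NOT NULL
);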

Bye, Olaf.
 
I also don't think you wasted your time, just because of this detail.
Well, it's not just a little detail; it was fundamental to the set theory and relational modelling taught, and one of the first parts of the course: defining and selecting a key, then representing it in the ERD and describing it in relational syntax.

Interestingly, the snippet you show is exactly what I have been saying, and what George surmised: it's about making the right choice for a particular circumstance. It is also about understanding your data.

If no other entity has a relationship with the table in question, there isn't a posted foreign key to worry about changing (in value or meaning). The PK is then for that table's use only, so optimising for joins that won't take place and data that won't change can't be the right thing to do; to me a natural key is the logical choice there.

Hence my confusion and concern when comments such as "why did you remove a surrogate key" and "that's totally wrong" were made.

I care about having learnt correctly and implementing the knowledge accurately, so I appreciate all the help and support TT provides.

Incidentally I have had the following response from my tutor...

I will try to answer briefly as this is one of those issues that will always have 2 camps. It is a bit like the music question... is vinyl better than CD? (I am a vinyl man :) )

There are advantages and disadvantages of both natural keys and surrogate keys. You are correct that the module focuses on natural keys. That is because the module focuses on designing databases that model business requirements so it uses business attributes and keys. Surrogate keys do not have any business meaning.

So, yes, if you are using natural keys and your business requirements change, your keys may have to change, which will be an overhead. Natural keys, however, are particularly useful for lookup/reference tables where you can be confident the key values will not change. A disadvantage would be if, let's say, you changed a StaffNo key from numeric to alphanumeric - you may have to change things in lots of places. However, you can get the same problem with poorly designed surrogate keys that run out of digits and have to be enlarged.

The main disadvantage of surrogate keys is that they're often not human readable, making them difficult for end users to work with - especially if the system is not well documented and the designers have already moved on to pastures new (a not uncommon situation).

In essence, there is no clear answer as to whether or not you should use natural keys or surrogate keys, regardless of what the zealots on either side of this religious argument may claim. The best approach is to apply one strategy or the other whenever it makes sense.

As I said at the start M359 teaching uses natural keys because the focus of the module is to teach how to model business requirements.

It seems we all came to the same conclusion: use what's right for the current situation, which is what I did when I removed the surrogate key; it wasn't needed and should not have been there!

This also clarifies that the course is deliberately biased towards natural keys, and the reasons why; so in context, what the OU is teaching isn't wrong at all.

I have found this a very useful discussion, thank you.

1DMF
 
Sorry that I'm not stopping here.

>If no other entity has a relationship with the table in question so there isn't a posted foreign key to worry about changing (value or meaning).
The only tables I can imagine with no relation to any other table are log tables, and maybe temp or staging tables - uninteresting tables for database design.

>The main disadvantage of surrogate keys is that they're often not human readable making them difficult for end users to work with
Human readability of keys is unnecessary: you don't show surrogate keys to users, it's not their concern how records are identified by your system, and as a developer you have the knowledge to join data for your own browsing.

>especially if the system is not well documented and the designers have already moved on to pastures new.
There is referential integrity info in databases specifying parent/child table and primary/foreign key and that's all you need to know.

>you can get the same problem with poorly designed surrogate keys that run out of digits and have to be enlarged.
That comes from a time when bytes were expensive. Of course you don't design surrogate keys too short. If you join on composite keys, you lose performance there and store data redundantly, which wastes space, too. The additional index on a 4-byte integer can save a lot of space in tables that reference the data with just a 4-byte value instead of strings of perhaps several tens of bytes. Even a 16-byte uniqueidentifier can be considered short today.
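
As a rough, invented illustration: a 30-byte composite natural key repeated as the foreign key in a 10-million-row child table costs about 10,000,000 x 30 B = 300 MB in that one table, where a 4-byte integer costs about 40 MB, plus just one extra index on the parent.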

Last but not least, 'natural' sounds good, organic; 'surrogate' sounds bad, artificial.

My conclusion, and you are of course welcome to answer, again comes from someone who doesn't reject natural keys.

Quoted from
When people ask me which makes a better primary key—a natural value or a surrogate identifier—I answer that the choice depends on the situation. In all the modeling work that I've done, I've rarely been able to identify a suitable natural key as a candidate for the job. I've been able to identify natural search keys, both unique and nonunique. But when I apply the prerequisite tests—is it unique? does it apply to all rows? is it minimal? is it stable over time?—to these natural keys, most natural keys fail. My personal preference is to use a surrogate key unless I can identify an appropriate natural key that meets the four criteria for the primary key.

This is not the only conclusion I have read about how unlikely it is that a natural key fulfils all the requirements.

Bye, Olaf.
 
Don't be sorry, Olaf; if you feel you need to say something, go right ahead - that's what these forums are for!

It seems you are reiterating the basic conclusion of the thread, so having further citations confirming this is appreciated.

I think you misunderstood a little of what my tutor said, as when you say
Human readability of keys is unnecessary, you don't show surrogate keys to users...
you are confusing which humans they refer to. End users and user processes don't see half the data in an enterprise; these are not the people they mean, as end users don't model, design, implement and maintain the corporate domain of discourse.

You also seem to imply that every database has an ERD and DB schema, and that all relationships and weak entities are well defined and documented with appropriate constraints in place. That simply isn't the case; perhaps it should be, but that isn't the reality.

I also don't believe they were referring to byte-specific issues; a poorly defined auto-increment surrogate key could simply be using a 'smallint' data type and then run out of numbers. As bad an example as Firstname / Lastname for a key, I know, but remember the course was about modelling an enterprise using set theory and relational theory, and many examples in the course used PK IDs such as...
Code:
CREATE DOMAIN Student_IDs AS CHAR(3)
    CHECK (LEFT(VALUE, 1) = 'S'
       AND CAST(RIGHT(VALUE, 2) AS INTEGER) BETWEEN 1 AND 99
       AND LENGTH(VALUE) = 3);
I just threw the above together, but you should get the gist: student IDs run from 'S01' to 'S99', so what happens when you get 100 students?

This is the type of poorly designed PK I think they refer to. Remember, not one column in the entire course was set to an auto-increment identity! As explained, that has no meaning in the context of the course; it is not data belonging to the domain of discourse and therefore isn't modelled.

Regards,
1DMF
 
Thanks, 1DMF

It sounded like you wanted to end the discussion. In fact, we can now only repeat what has been said, I think. You can always find bad examples of both surrogate and natural keys, but I haven't yet seen a good example of a natural key.

I may have one from my developer life. The main table of one database is a projects table; projects are the highest level in that database's hierarchy. Projects have a 10-character short name which has to be upper case and unique. They have been used this way for over 25 years and that hasn't changed. For the most part they have no meaning, though some users tend to start their projects with their initials, for example. I could use this as a natural key, but I still refuse to, even though it's just a single field. Instead I always introduce a GUID primary key column.

There are four locations in the world using that database: three of them share one instance (two of those use the app via Terminal Services) and one has its own separate instance. All the data could be merged easily, with the GUIDs not needing to change. Of course there could be duplicate project short names, but that would be mended by adding an 11th character. I'd have the same problem using a natural key - I don't avoid it - but I would only need to mend it in the one projects table; the foreign keys would stay and not need to change. Only the duplicated short names would need to change, and we might not even need the additional character, as not all projects use up all 10 characters anyway. That problem would be harder if the short name were also used as a foreign key.

Bye, Olaf.
 
You can always find bad examples of both surrogate and natural keys, but I haven't yet seen a good example of a natural key.
hmm, I thought I posted one with my original question?

An audit table is immutable, temporal in nature (creating uniqueness), and has no relationship with any other entity other than the posted foreign keys it may reference.

It will also only ever be joined to the strong entity types it references via the posted foreign keys, so no other entity would require a duplication of attributes and data to maintain a relationship.

Therefore the natural key of all attributes is correct usage for this particular circumstance, which I concluded based on the OU course and this discussion.

It is also by definition in BCNF, as every attribute is either part of the primary key (which is irreducible, with no proper-subset or transitive dependencies) or is an alternate key.

Of course I can see there could be serious issues using a composite natural key on dynamic operational data where other entities have relationships: it would create data bloat and duplication, and make joins a complete nightmare, before efficiency / performance is even considered.

Which is why I appreciate this discussion and thank all those who have contributed.

It has confirmed I used a natural key in the right place, but it also gave a big heads-up about the serious potential issues if one is used inappropriately, and highlighted a real danger with natural keys (especially composite ones) - something that wasn't explained or even discussed on the course, which surprises me given the serious implications of using the wrong approach. PKs/FKs are fundamental to relational theory, yet this aspect wasn't even mentioned (other than in that small snippet I posted).

I'm so glad I was sent on a bum steer with my original question, as it has ended in a discussion that has highlighted a serious issue that needs careful consideration when developing databases, something that could do with being emphasised on the course!

Regards,
Craig.
 
The word 'audit' now appears for the second time; the first instance was my mention of how to cope with change history. We might be talking about different things here - I assume rather low-level auditing of record changes, the way CDC (change data capture) does it in MS SQL Server.

So, okay, an audit table to me is just a heap; it doesn't need any primary key, neither surrogate nor natural, if you ask me. It's not really part of the ERM either; it's a system table for the audit functionality.

Whether you build the audit trail by copying whole records or just the changed values, you will store the primary key of the originating table plus a timestamp or two (valid from/until), but no PK of its own. Even the PK of the original table can't be a PK here, because you can store different states of the same record, so that value is duplicated. Adding a PK isn't needed anyway. One scenario in which a PK may be handy is a system that also audits whether a record was rolled back to a previous state from the audit table, and to which audit record, identified by a (surrogate) PK ID on the audit table. Then you have a relation between this audit table and the restore/rollback functionality of the application - a relation between two system data tables.
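
A sketch of the kind of audit table I mean (all names invented):

Code:
-- No PK of its own: several states of the same source row coexist here.
CREATE TABLE Orders_Audit (
    OrderID    INT           NOT NULL,  -- PK value of the audited row, duplicated over time
    ValidFrom  DATETIME      NOT NULL,
    ValidUntil DATETIME      NULL,      -- NULL = state still current
    Amount     DECIMAL(10,2) NULL       -- ...copies of the audited columns
);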

You might mean something completely different by auditing, e.g. quality assurance.

Bye, Olaf.
 
The audit is simply a record created every time a member visits any section of the extranet; it stores the user ID, the area they visited and a datetime. So yes, you are probably correct that there is no real requirement for a PK - the same situation as the 'Logins' audit table: a user logged in / out at this datetime.

The PK in question is a composite of all the attributes, as a user cannot visit the same place at the same time twice. And I guess in retrospect 'area visited' isn't necessarily part of the PK, as a user can't visit two different places at the same time either! (This highlights how tricky finding the right natural key, or defining the right constraints, can be - though in context it's not an issue.)

This is, in my mind, still part of the domain of discourse, as the enterprise requires this data (established during the requirements-gathering stage), and it has a relationship with the user entity in the conceptual data model (CDM).

Yes, it isn't operational data, but the course did cover temporal and historical entities, showing them as two identical tables (one storing current operational data, the other historical) and how to create a transaction to move data from one to the other. It also included them in the ERD / CDM as part of the domain of discourse, and we touched on data warehousing / data mining, including snowflake / galaxy schemas, albeit as an afterthought at the tail end of the course.

I was considering not having a PK and just having a unique index constraint, hence my earlier question about the difference (if there is one) between a PK index and a standard unique-constraint index.

And guess what: the course didn't cover heap entities either (IIRC). It teaches that every entity has a PK to identify a unique tuple, and incidentally the only time we were guided to introduce a surrogate PK was when no candidate key could be found - hence how this entire discussion has snowballed!

So in this instance I guess the correct solution is whichever is most efficient for searching the table (see the DDL sketch after the list):

1. No PK and a standard unique index constraint across composite attributes.
2. PK across composite attributes. -> (Current implementation)
3. Surrogate (auto) PK and a standard unique index constraint across composite attributes.

Plus possibly a clustered index across the searchable attributes? But would that help, given there is already a unique index across those attributes?

Of course if I decide to refactor and exclude 'area visited' from the unique constraint (PK or not), then I would need to include that as part of a separate index for searching purposes.
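
Expressed as DDL, the options would look something like this (one per cloned table; index and constraint names are made up):

Code:
-- Option 1: no PK, just a unique composite index (clustered or non-clustered).
CREATE UNIQUE CLUSTERED INDEX UQ_TLA_Composite
    ON Tracking_Log_Archive (UserID, Area, Used);

-- Option 2: composite PK (clustered by default) - the current implementation.
ALTER TABLE Tracking_Log_Archive
    ADD CONSTRAINT PK_TLA PRIMARY KEY (UserID, Area, Used);

-- Option 3: surrogate identity PK plus a non-clustered unique constraint.
ALTER TABLE Tracking_Log_Archive
    ADD RecID INT IDENTITY(1,1) CONSTRAINT PK_TLA_RecID PRIMARY KEY;
ALTER TABLE Tracking_Log_Archive
    ADD CONSTRAINT UQ_TLA_Composite UNIQUE NONCLUSTERED (UserID, Area, Used);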

Your input is most welcome.

Regards,
1DMF
 
Ok, I see.

Actually, in MSSQL the term 'heap' means a table without a clustered index. But some people argue that a clustered index on an identity integer is effectively a heap too, as it doesn't sort the data: new data is simply appended, since it always has the highest ID. Whether to use a clustered index is a question of how the data is physically sorted and whether that makes sense; the leaf nodes of the clustered index are not record pointers but the records themselves, which is why there can only be one clustered index.

If you often need access to data in a certain order, a clustered index costs you more time when inserting new rows into filled pages, but reading nearby records / siblings is faster; in the normal case, where you read much more often than you write, the performance hit on inserts and updates is very acceptable for the much better overall read performance. Some people argue that only a clustered index defines a primary key in SQL Server, but a unique, non-nullable index enforces the primary key characteristics, too.

The physical sorting a clustered index does is wasted on a uniqueidentifier, which sorts data randomly, or an identity, which sorts the data in merely chronological order. It is better used on a natural key, not only for ordering purposes but also for fetching a block of records that are close to each other in that order. A clustered index can also be used on fields that don't compose a fully unique value throughout the table, as the server then adds a so-called uniqueifier ("If the clustered index is not a unique index, SQL Server makes any duplicate keys unique by adding an internally generated value called a uniqueifier"). That means an almost-unique compound index is enough. Of course, a clustered index is wasted on a bit field or a foreign key with a very low number of distinct values.

A good reason to treat the clustered index and primary key as synonyms is that, once a record can be located by its primary key value, the index leaf node contains all columns, not just the indexed ones. You can compensate for that by adding frequently queried fields to your "normal" unique primary key index, too; if the natural key is unique, it will stay unique if you add more columns.

For example, I store formula (recipe) ingredients - a set of about 20 ingredients per formula on average, perhaps. If I use a GUID (a random number) with a clustered index, the advantage of the clustered index leaf nodes containing all columns is negligible next to the disadvantage that the 20 ingredients are randomly spread through the database file. Having the clustered index on the formula FK plus a primary and secondary ordering of the ingredients stores the ingredients of the same formula close to each other, which matters more for fetching a full formula quickly, if you ask me, considering there are about a million formulas (variants). I haven't had the chance to do extensive tests and measurements, though.
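
In DDL terms, roughly (all names invented):

Code:
-- Non-clustered GUID surrogate PK; the clustered index goes on the FK
-- plus sort order, so one formula's ingredients sit physically together.
CREATE TABLE FormulaIngredient (
    RowGUID   UNIQUEIDENTIFIER NOT NULL
        CONSTRAINT DF_FI_RowGUID DEFAULT NEWID()
        CONSTRAINT PK_FormulaIngredient PRIMARY KEY NONCLUSTERED,
    FormulaID UNIQUEIDENTIFIER NOT NULL,
    SortOrder INT NOT NULL,
    Substance VARCHAR(100) NOT NULL,
    Quantity  DECIMAL(10,4) NOT NULL
);

CREATE CLUSTERED INDEX CX_FormulaIngredient
    ON FormulaIngredient (FormulaID, SortOrder);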

Bye, Olaf.
 
Just want to add that since I learned about surrogate keys, I've found no situation in which I would consider a natural key better than a surrogate key. I've changed natural keys to surrogates in databases I inherited. I have never encountered, and can't imagine, a situation in which I would replace a surrogate key with a natural key.

Tamar
 
One more thought about relations (warning: in database theory the term 'relation' is a synonym for a table, but here I am talking about the links between tables relating to each other):

1:1 relations aside, two tables relate to each other via the primary key of one and the foreign key of the other. You model n:m relations with a table having two foreign keys, and then have two 1:n relations from both sides into this table. Relations other than 1:n, n:m and 1:1 exist, but that's another topic.

The point is, if a table is only involved in relations via its foreign keys, as in your example, it doesn't need a primary key at all. If you're taught that any table involved in an ERM needs a primary key, that's not totally wrong, but it isn't the right reasoning. You can say a primary key is what makes a table a table, but relations can also end at a foreign key only, in some tables.

I understand you are learning a theory and doing so well, and that's all right. I doubt you will work this way later, though, unless you plan to do your own development as a freelancer.

Bye, Olaf.
 
Hi Olaf,

Heap -> I took this to mean no ordering: simply stick the next record on the end (like a stack), which is how a table with no PK or index would store its records, though as you say a PK index on an auto-increment column technically has the same effect.

Relation/ship -> I too am using the terminology of relation to mean a relationship; you will notice I use the relational-theory term 'entity' to mean 'table', as in ERD (Entity Relationship Diagram) ;-)

Yes, you are correct: the only relationship is a single foreign key to the user table, and as I said, I was actually considering that I don't need a PK and probably just want a unique index constraint and a clustered index for searching.

I have over 1M records to play with, so in the interest of science I'm going to do some testing.

I'll clone the table, implement the various options (natural PK with index, surrogate PK with index, no PK with index), perform some queries and see which one runs best.

I understand you are learning a theory and doing so well, and that's all right. I doubt you will work this way later, though, unless you plan to do your own development as a freelancer.

Well, sort of: I run my own hobby website, which I am totally rewriting, including a schema redesign porting from MySQL to PostgreSQL using this newly acquired knowledge.

I don't see the point of spending years studying something only to then ignore what was taught and do something completely different.

As we have concluded, it's knowing which method to use and when that is key, and I plan on using my own site to hone these skills and learn when to break the theory, as well as when not to :)

Once I've completed my tests, I'll post the results.

"In complete darkness we are all the same, it is only our knowledge and wisdom that separates us, don't let your eyes deceive you."

"If a shortcut was meant to be easy, it wouldn't be a shortcut, it would be the way!"
Free Electronic Dance Music
 
Well I got some interesting results!

I tried 4 permutations with the following query...

Code:
SELECT * 
FROM Tracking_Log_Archive t
JOIN Users u on t.UserID = u.UserID
WHERE t.UserID LIKE 'U00%-%'
ORDER BY Used ASC

1. Natural Composite PK - (PK clustered index) - took 23 seconds.

2. Surrogate [Identity] PK - (PK clustered index), with a non-clustered unique composite index - took 27 seconds.

3. No PK, with clustered unique composite index - took 23 seconds.

4. No PK, with a non-clustered unique composite index - took 24 seconds.

I repeatedly ran the test and consistently got the same results.

To me at least, this highlights that using a surrogate PK here appears to be the wrong option, as it consistently performs noticeably slower!

I'm assuming that a composite PK index and a clustered unique composite index across the same attributes are basically the same thing, hence the consistently matching results.

I believe I was on the right track when thinking I probably didn't need a PK and could just use a clustered unique composite index.

What are your thoughts on...

Natural PK composite index vs. No PK and a clustered unique composite index?

Regards,
1DMF
 
I question the tests. 23 seconds is an eternity in terms of query performance, especially with a table that only has a million rows.

If this was a production query, you would probably not return all of the columns, and you would probably create a supporting index, like this:

Code:
Create Index idx_TrackingLogArchive_UserId On Tracking_Log_Archive(UserId)

Can you create the index shown above and then re-run your tests? I think you will find that the query runs a lot faster (milliseconds vs. seconds). Also, can you run this:

Code:
sp_spaceused 'Tracking_Log_Archive'

The command above will show the size of the table as well as the index size.

After changing the structure of the table, can you run this:

Code:
DBCC CleanTable(0, 'Tracking_Log_Archive')

This command will reclaim unused space after dropping columns. Basically, if you have a surrogate key (identity), it will take 4 bytes per row to store. By dropping the column, you are not reclaiming the space used by it. Dropping the column and then cleaning the table will reclaim the unused space.
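
With your roughly 1M rows, that identity column alone is about 1,000,000 x 4 B = 4 MB of row data, plus its share of every index that carries it.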

Basically, I want you to:

1. Create the index I suggest above.
2. Drop the identity column.
3. Clean the table.
4. Run the space-used command.
5. Run your first test (multiple times).
6. Set up your table for the next test.
7. Clean the table.
8. Run the space-used command.
9. Run the next test multiple times.

Repeat steps 6 through 9 for each of your other tests.

I think this will give a more realistic view of performance.

-George
Microsoft SQL Server MVP
My Blogs
SQLCop
twitter
"The great things about standards is that there are so many to choose from." - Fortune Cookie Wisdom
 
Hi George,

Thanks for the guidance. I actually have 4 cloned tables, one for each test.

Yes, I thought 23 seconds was a little slow, which is why I considered the 4-second gain a big saving, as queries usually take milliseconds!

If this was a production query, you would probably not return all of the columns
True, I guess, but as the SELECT is logically the last thing processed in a query (after the FROM and WHERE intermediate tables have been created from a Cartesian product of all columns and records, before any selection / filtering), I thought using the wildcard in the SELECT was neither here nor there for this particular test.

Here are the results..

Natural PK composite index. (+ your UserID index)
Code:
name	                rows	        reserved	data	        index_size	unused
Tracking_Log_Archive1	1010786    	98512 KB	51896 KB	46552 KB	64 KB
Took 23 seconds

Surrogate PK + unique index (+ your UserID index)
Code:
name	                rows	        reserved	data	        index_size	unused
Tracking_Log_Archive2	1010786    	129368 KB	55936 KB	73232 KB	200 KB
Took 25 seconds

No PK + unique non-clustered index (+ your UserID index)
Code:
name	                rows	        reserved	data	        index_size	unused
Tracking_Log_Archive3	1010786    	133272 KB	51896 KB	81280 KB	96 KB
Took 23 seconds

No PK + unique clustered index (+ your UserID index)
Code:
name	                rows	        reserved	data	        index_size	unused
Tracking_Log_Archive4	1010786    	98512 KB	51896 KB	46552 KB	64 KB
Took 23 seconds

All this seems to have achieved is to improve the surrogate PK's time by 2 seconds, but it is still slower than the others, which haven't changed at all.

Though what are you calling slow? I get a return in the results pane in milliseconds, but the output in the Management Studio GUI takes time to populate with all the returned records.

I'm using the counter in the SSMS GUI as a guide; is this the wrong 'timer' to be using?

Regards,
1DMF
 
How many rows are returned by your query? Is it a lot? It may take milliseconds for SQL Server to "get" the data, but then 20+ seconds to send it to your workstation through the network and display it in the SSMS grid. If there are only a dozen or so rows returned, this time will be negligible.

If you are returning a large portion of your table, then it makes sense that the performance isn't improved by adding the additional index. It also accounts for the extra time because there's additional data to pass through the network and also to display on the screen.

True, I guess, but as the SELECT is logically the last thing processed in a query (after the FROM and WHERE intermediate tables have been created from a Cartesian product of all columns and records, before any selection / filtering), I thought using the wildcard in the SELECT was neither here nor there for this particular test.

Unfortunately, the columns you return do make a difference. There's a concept called a "covering index". Take a look at this:
The SQL engine, behind the scenes, takes into account the columns returned prior to generating the execution plan. If all of the data a query needs is contained within an index, SQL Server can ignore the larger table and use only the data in the index to speed up the query.
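
For example, a hypothetical index like this would cover everything your test query needs from Tracking_Log_Archive (the key satisfies the WHERE, and INCLUDE carries the remaining columns):

Code:
-- Hypothetical covering index: the query can be answered from the
-- index alone, without touching the base table.
CREATE INDEX idx_TLA_UserId_Covering
    ON Tracking_Log_Archive (UserID)
    INCLUDE (Area, Used);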

Bottom line... the data returned by a query absolutely makes a difference. If you want to time the internal execution while ignoring the network and display issues, you can do this:

Code:
DECLARE @Trash Int,
        @Start DateTime

Set @Start = GetDate()

SELECT @Trash = 1
FROM Tracking_Log_Archive t
JOIN Users u on t.UserID = u.UserID
WHERE t.UserID LIKE 'U00%-%'
ORDER BY Used ASC 

Select DateDiff(Millisecond, @Start, GetDate())

By selecting the data into a throwaway variable, you are really just capturing the time it takes the SQL engine to process the query, without regard to displaying it. The code above is how I usually test performance: I create a variable, set the start time, and then display the elapsed time in milliseconds.



-George
 
Hi George,

There is no network involved here; sorry, perhaps I didn't make myself clear. Obviously I understand that when transmitting the data, the amount selected is a factor, but yes, the query returns most of the records (874,954 out of 1,010,786).

I'm running this directly on the physical SQL Server box, outputting directly on the server itself, hence I didn't think the SELECT list made much of a difference.

Though it seems there are further gaps in the course content, as it didn't mention 'covering indexes'; well, to be fair, it didn't cover indexes at all. We were simply given a waterfall-type diagram showing the order in which query clauses are evaluated (including correlated subqueries), with the SELECT shown as the last thing performed against the final intermediate table produced by the other clauses. I guess they oversimplified the process for the course.

Results for each test type are as follows...
1. 5592 - Natural PK
2. 6105 - Surrogate PK
3. 5783 - No PK , Clustered
4. 5846 - No PK, Non-Clustered

But I should point out that each time I ran the query I got different results, so I ran the test 10 times in a loop for each configuration and used the mean average for the result.

I used this code to produce the above result averages...
Code:
DECLARE @Trash Int,
        @Start DateTime,
        @sum Int,
        @cnt Int

SET @sum = 0
SET @cnt = 0

WHILE @cnt < 10
BEGIN
    -- Reset the start time inside the loop so each run is timed individually
    SET @Start = GetDate()

    SELECT @Trash = 1
    FROM Tracking_Log_Archive t JOIN Users u ON t.UserID = u.UserID
    WHERE t.UserID LIKE 'U00%-%'
    ORDER BY Used ASC

    SET @sum = @sum + DateDiff(Millisecond, @Start, GetDate())
    SET @cnt = @cnt + 1
END

SELECT @sum / 10 AS average, @cnt AS cnt

But again the results varied. They were consistent enough that on average the surrogate PK performed the slowest, albeit at times only by hundreds of milliseconds.

The results shown come from running each test one final time and posting the result regardless of the winner, so the surrogate PK result isn't the best I've seen; but the same goes for the others - I saw better and worse as I repeatedly ran the tests.

I guess this also goes to show that even outputting directly to the console on the server takes time regardless of table configuration, and the effect a single superfluous integer column can have.

So what's the conclusion regarding what I shall implement? Well, interestingly enough, the answer differs depending on whether the table is the 'production' gathering table or the archive table.

As the production table is written to constantly, insert speed is my priority there, so a heap-type table would be preferred since it isn't queried that often. For the archive, however, where more analysis, data mining and reporting is performed, query speed is the priority.

So the question is: do I simply have no PK in either, with a non-clustered unique index constraint on the production table and a clustered one on the archive?

Or do I include a surrogate PK on the production table, which effectively acts as a heap?

I have no requirement for this additional PK data - it's not used as an FK anywhere and will be lost when the data is archived - so with regard to inserts, what's the difference between a surrogate auto-identity PK table and a heap with a non-clustered unique index?

Regards,
1DMF
 
>23 seconds is an eternity in terms of query performance especially with a table that only has a million rows.
Well, as the query is a full query of all fields, it merely measures the speed of getting all the data across, doesn't it?

Also, a WHERE clause with LIKE 'U00%-%' will never be optimized by an index, will it?

The query has no value for measuring index speeds.

>There is no network involved here
Even if you query on the server's physical machine, there is a transport mechanism involved, and the data is copied via shared memory, named pipes or TCP/IP packets, so in the end the local NIC can still be involved. In particular, if you query from a 32-bit SSMS against a 64-bit SQL Server, there is no chance of using shared memory.

Bye, Olaf.
 
Also, a WHERE clause with LIKE 'U00%-%' will never be optimized by an index, will it?

Yes, it will be. If an appropriate index exists, the query optimizer will use it to speed up the query. This happens because (in this case) the first three characters are known. If the percent symbol were the first character in the LIKE pattern, then an index would not be used. Think of it this way... if you had a phone book and were told to get all the people with last names like 'Mas%', you would be able to build the list pretty quickly. If you were told to get a list of people from the phone book whose names are like '%ros', you would need to look at everyone to find your matches.
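
To see the difference for yourself, compare the execution plans of these two (hypothetical) predicates:

Code:
-- Leading characters known: the optimizer can seek on an index.
SELECT UserID FROM Tracking_Log_Archive WHERE UserID LIKE 'U00%';

-- Leading wildcard: every key must be examined (a scan).
SELECT UserID FROM Tracking_Log_Archive WHERE UserID LIKE '%-01';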

-George
 