Tek-Tips is the largest IT community on the Internet today!

Members share and learn making Tek-Tips Forums the best source of peer-reviewed technical information on the Internet!

  • Congratulations Mike Lewis on being selected by the Tek-Tips community for having the most helpful posts in the forums last week. Way to Go!

clustered index vs nonclustered inde 1

Status
Not open for further replies.

sds814

Programmer
Feb 18, 2008
164
US
I read this post: and just want to make sure I've understood the difference between clustered index and nonclustered index.

Clustered Index: actually sorts the data on the field(s) indicated.

Nonclustered Index: doesn't sort the data, but has a pointer that can find which page and row the data is located. For example, if I have a table with employee id, first name, last name. If the nonclustered index is on first name and last name then it's easier to fetch the data from the page and record.

Please let me know if I've understood correctly.
 
Your understanding of indexes is not quite right, so let try to explain.

Clustered Index: actually sorts the data on the field(s) indicated.

This is true. But what you need to understand is that the clustered index really IS the table. If your table has 20 columns, with one of them the index key, all of the data will be in the clustered index, and the data will be sorted by the key. By default, SQL server will create a clustered index with the primary key of the table as the index key.

Nonclustered Index: doesn't sort the data, but has a pointer that can find which page and row the data is located. For example, if I have a table with employee id, first name, last name. If the nonclustered index is on first name and last name then it's easier to fetch the data from the page and record.

This is kinda right, but kinda wrong too. You can think of a non-clustered index as another table. The columns that make up the key columns are actually stored in the non-clustered index. So, in your example, if you created a non-clustered index on firstname and lastname, that data is actually stored in the non-clustered index, and it IS sorted. The sort order is based on the column order of the index keys.

There is also a pointer in the non-clustered index that points back to the row in the clustered index. SQL Server can use the non-clustered index to quickly find the row and then use the pointer to get additional information.

The thread you pointed to talked about covering indexes. There is nothing in SQL Server for "covering" indexes. This is more of a concept than a syntax/object within the database. Going back to your example, if you had a non-clustered index on the table, and it included firstname, lastname and you wrote a query that selected only firstname and lastname, SQL Server would be able to get all of the data it needed from the non-clustered index. When this happens, the index is said to "cover" the query. On the other hand, if you wrote a query that returned firstname, lastname, and ShoeSize, the index would NOT cover that query. SQL server may still use the index but would need to to a bookmark lookup to the actual data in the clustered index so that it can return the ShoeSize column.

When you are creating indexes, there's a lot of things to think about. For example, you should be thinking about the way SQL Server stores data. Data is stored on the hard drive in 8K chunks. So, when SQL Server needs to get data from the hard drive, it always reads 8K at a time (even if it only needs part of that data). The same thing happens when data is written to the drive. Disk drive access is MANY times slower than memory access, so it is in your best interest to keep your tables narrow. You see, if you have a wide table (lots of columns and columns have a lot of data), then you will only get a couple rows per 8K page (maybe even just one row). With a narrow table, you will be able to get more rows in an 8K page.

This concept applies to indexes too. If you had a table with 20 columns, and many of those columns stored large strings, you will likely get only a small number or rows per 8K page. When you add an index to the table, the data in the columns is included in the index. If you use really wide columns, you will have a similar problem with the size of the index, and the index will be slow-ish too. However, if you include narrow columns you will get more rows per 8K page and the performance will be better.

You also need to consider what happens when you have multiple columns in an index. For example, if you have FirstName, LastName, the data in the index will be sorted that way. Then, depending on your query, the index may or may not be useful.

For example:
[tt]
Select FirstName, LastName
From YourTable
Where LastName = 'Smith'
[/tt]

Since LastName is the 2nd column in the index, the index would not be useful for this query. If the Where clause also included the first name, then it would be used.


[tt]
Select FirstName, LastName
From YourTable
Where FirstName = 'John'
And LastName = 'Smith'
[/tt]

Think of if this way. If I gave you a phone book and said find all the phone numbers for people that had Smith as the last name, it would be pretty easy. Knowing the names are ordered alphabetically by last name, you would quickly find the first entry and the last entry. Everything in between is the data you are looking for. Now, suppose I gave you a phone book and said "give me all the numbers for anyone with a first name of John". This would take you considerably longer because you would need to look at every entry to determine if it was the data you were looking for.

Indexing is a huge topic. Let me know if you have any additional questions.

-George
Microsoft SQL Server MVP
My Blogs
SQLCop
twitter
"The great things about standards is that there are so many to choose from." - Fortune Cookie Wisdom
 
You can use the "Include" option when adding an index to provide a list of fields in addition to the indexed fields to store with the index. It takes up additional space but can improve query performance if all of the fields being selected in the query are "covered" by the indexed fields + included fields for an index.
 
Sorry George I'll have to take this step by step.

In regards to the clustered index, I'm not sure what you mean by "the data will be in the clustered index." Isn't a clustered index assigned to fields? For example, if I have 20 columns in a table and the clustered index is on first name and last name aren't the rest of the columns being sorted on the first name and last name?
 
What I really mean is...

There is no difference between a clustered index and the table itself.

Isn't a clustered index assigned to fields? For example, if I have 20 columns in a table and the clustered index is on first name and last name aren't the rest of the columns being sorted on the first name and last name?

In any index, the data is sorted based on the index keys.

With a clustered index, the other 18 columns are included in the index data (but not used for sorting).

With a non-clustered index, only the index keys and any included columns are stored in the index.

Maybe an example would help. Suppose you had this table.

[tt]
People

PersonId FirstName LastName ShoeSize EyeColor
1 George Washington 9 Blue
2 Abraham Lincoln 11 Brown
3 Theodore Roosevelt 8.5 Brown
[/tt]

If you had a clustered index on FirstName,LastName the data in the clustered index would look like this.

[tt]
PersonId FirstName LastName ShoeSize EyeColor
2 Abraham Lincoln 11 Brown
1 George Washington 9 Blue
3 Theodore Roosevelt 8.5 Brown
[/tt]

If you had a non-clustered index on ShoeSize, the data would look like this:

[tt]
PersonId ShoeSize
3 8.5
1 9
2 11
[/tt]

I'm lying a little here because the PersonId isn't necessarily stored in the non-clustered index, but there is an extra piece of data there that points back to the original table.

Anyway, the point is, with a clustered index, all of the data in the table is included in the index because the table itself and the clustered index are really the same thing (under the SQL Server hood).

With a non-clustered index, there is a row pointer (similar to the personid) and the data in the index key columns.

--------------------------
In my database, I have a MapStreets table with 30,000 rows. This data is primarily read from, so I have a lot of indexes on this table. When I run the following code....

Code:
SELECT  i.name AS IndexName,
        SUM(page_count * 8) AS IndexSizeKB
FROM    sys.dm_db_index_physical_stats(
             db_id(), object_id('[!]dbo.MapStreets[/!]'), NULL, NULL, 'DETAILED') AS s
        Inner JOIN sys.indexes AS i
           ON  s.[object_id] = i.[object_id] 
           AND s.index_id = i.index_id
GROUP BY i.name

exec sp_spaceused '[!]MapStreets[/!]'

The output is this...

[tt]

IndexName IndexSizeKB
------------------------------------------------ -----------
idx_mapStreets_EndLatitude 464
idx_MapStreets_EndLongitude 464
idx_MapStreets_MajorClassification_FeatureId 344
idx_MapStreets_StartLatitude 464
idx_MapStreets_StartLongitude 464
MapStreets_Distance 464
MapStreets_FeatureIdMajorClassification 344
MapStreets_FromNode 856
MapStreets_ToNode 584
MapStreets12 856
PK_MapStreets 3264

--------------

name rows reserved data index_size unused
----------- ------- -------- -------- ---------- -------
MapStreets 30465 9752 KB 3256 KB 5400 KB 1096 KB
[/tt]

Notice the index named PK_MapStreets. This is the clustered index, and it is (roughly) the same size as the table data. There are slight differences in the calculations to account for the minor differences.

If you add up the index sizes for all the non-clustered indexes, you get 5304 which is roughly the same as the index_size column (rounding errors and such).

Does this help you to understand?

-George
Microsoft SQL Server MVP
My Blogs
SQLCop
twitter
"The great things about standards is that there are so many to choose from." - Fortune Cookie Wisdom
 
Yes, it does. The examples helped a lot especially with the Mapstreets table. Couple of follow up questions:

1) So the benefit of joining tables on primary keys is because sql server has already sorted the table on that clustered index. And therefore the performance for the query is faster. Is this correct?

2) The query below would take advantage of having a nonclustered index on shoesize. Would there be any advantage of having a clustered index on firstname and lastname?

Code:
select *
from people
where shoesize = 8.5


3) Would I create nonclustered indexes on fields that I use commonly in the where condition? What other reasons would be to create nonclustered indexes?

4) Why does the total size of the nonclustered index include the size of the clustered index? Is it because the pointer of the nonclustered index is pointing to a table that is sorted?

Thanks for your help George.
 
1) So the benefit of joining tables on primary keys is because sql server has already sorted the table on that clustered index. And therefore the performance for the query is faster. Is this correct?

Yes

2) The query below would take advantage of having a nonclustered index on shoesize. Would there be any advantage of having a clustered index on firstname and lastname?

select *
from people
where shoesize = 8.5


For this query, there would not be any advantage to having a clustered index on firstname and lastname.


3) Would I create nonclustered indexes on fields that I use commonly in the where condition? What other reasons would be to create nonclustered indexes?

Yes. You would also want to create non-clustered indexes on columns that you use to join to other tables. Often times these are foreign keys, but not always.

4) Why does the total size of the nonclustered index include the size of the clustered index? Is it because the pointer of the nonclustered index is pointing to a table that is sorted?

The size of each non-clustered index has nothing to do with the size of the table or the size of the clustered index. The size of a non-clustered index is based on the size of the index column(s) and the pointer back to the row with the rest of the data.

[tt]
name rows reserved data index_size unused
----------- ------- -------- -------- ---------- -------
MapStreets 30465 9752 KB 3256 KB 5400 KB 1096 KB
[/tt]

Take another look. The index_size is the sum of the sizes for each non-clustered index. The size of the data is the same as the size of the clustered index. Reserved is the total size for storing the table (clustered + non-clustered + unused).


-George
Microsoft SQL Server MVP
My Blogs
SQLCop
twitter
"The great things about standards is that there are so many to choose from." - Fortune Cookie Wisdom
 
In regards to question #4:

"The size of a non-clustered index is based on the size of the index column(s) and the pointer back to the row with the rest of the data."

Isn't the nonclustered index pointing back to all of the rows in the table?

For example, the nonclustered index shoe-size where ShoeSize equals 8.5 is pointing back to the row:

3 Theodore Roosevelt 8.5 Brown

And the nonclustered index shoe-size where ShoeSize equals 11 is pointing back to its respective row in the table.

Couldn't we say that the size of the nonclustered index is the size of the index + size of the table?
 
Couldn't we say that the size of the nonclustered index is the size of the index + size of the table?

No.

The size of the table includes all of the data in all the rows. The pointer that I was referring to is a simple integer that points back to the original table.

According to wikipedia:
The disk space required to store the index is typically less than that required by the table since indices usually contain only the key-fields.

Going back to the shoe size example... the table has 3 rows and 5 columns. There is 1 int column, 3 varchar columns, and a decimal column. Each row would be (approximately)"

int = 4 bytes
3 varchars = approximately 30 bytes
decimal = approximately 5 bytes.

So, each row would take approximately 39 bytes. With 3 rows, the table would take approximately 117 bytes to store the table.

A non-clustered index on ShoeSize would be 1 int (for the row lookup) and 1 decimal for the ShoeSize data. That would be 4 bytes + 5 bytes = 9 bytes per row. Multiply by 3 rows and you have 27 bytes.

As you can see, the index is considerably smaller than the table itself.

Taking another look at my map streets table.

[tt]
IndexName IndexSizeKB
------------------------------------------------ -----------
idx_mapStreets_EndLatitude 464
idx_MapStreets_EndLongitude 464
idx_MapStreets_MajorClassification_FeatureId 344
idx_MapStreets_StartLatitude 464
idx_MapStreets_StartLongitude 464
MapStreets_Distance 464
MapStreets_FeatureIdMajorClassification 344
MapStreets_FromNode 856
MapStreets_ToNode 584
MapStreets12 856
PK_MapStreets [!](Clustered Index)[/!] 3264
[/tt]

All of the non-clustered indexes are smaller than the clustered index.

-George
Microsoft SQL Server MVP
My Blogs
SQLCop
twitter
"The great things about standards is that there are so many to choose from." - Fortune Cookie Wisdom
 
You didn't mention the Order By clause. Doesn't this also affect what index would be used?
In your example(having an index on FirstName,LastName):
Code:
Select FirstName, LastName
From   YourTable
Where  LastName = 'Smith'

   The index would not be useful, but what if you had this:

Select FirstName, LastName
From   YourTable
Where  LastName = 'Smith'
Order By FirstName
Would That in anyway make the index more useful?
Or does the Order By clause not affect what index is used?

Thanks,
Jim
 
Does the Order By clause affect what index is to be used?

Yes.

If you have an index where the first column is the same as your order by, SQL Server can use that index to determine the order of the output so that it can eliminate a potentially "expensive" sort operation.

In the example you provided, where the filter is on the LastName and the order by is on the FirstName, SQL server would *probably* use an index on LastName and then sort the remaining rows.

It is tricky (and often times impossible) to guess the indexes that SQL Server will use when executing a query. You can turn on the execution plan and run the query to see what it uses, but preemptively guessing is tricky as your query increases in complexity.

SQL Server will use statistics to help determine which indexes should be used. Going back to the example with the filter on LastName and the order by on FirstName, I mentioned that SQL Server would probably use an index on LastName for the query. I say this because LastName is a good filter criteria. Suppose your table had 1,000,000 rows. A filter on last name would probably get you down to a couple hundred rows. Then, a final sort on the results.

Other things can affect which indexes are used too. Particularly JOIN clauses when you are using multiple tables in your query. Often times, it's the join clause the dictates which index is used.

One mistake I see often is to over-index a table. If you look at the information I posted about one of my tables, there are probably several indexes there that I don't really need. However, this table is rarely updated so I am not too concerned about having too many indexes. Periodically, it makes sense to take a look at the indexes you have on your tables to determine if they are really needed.


-George
Microsoft SQL Server MVP
My Blogs
SQLCop
twitter
"The great things about standards is that there are so many to choose from." - Fortune Cookie Wisdom
 
Status
Not open for further replies.

Part and Inventory Search

Sponsor

Back
Top