To Cluster or not to cluster? That is the Question...


m0nkey

MIS
Mar 20, 2003
I have some large log/history tables. They contain anywhere from 10 million to 50 million records each and do not have a unique key (one isn't necessary), but they do have a clustered index on the account number. That column is not unique on its own, though it could be if I combined it with the date_entered column. I do daily inserts of 1 million+ records into this table and am always deleting records that have been there longer than 90 days. I run SELECT queries against this table of the form "where acct_no not in (select acct_no from log_repository)", and those run well, but the inserts into this log/history table take forever: about 1.5 hours. I have dropped the other indexes and kept the clustered index, because when I drop both, recreating the clustered index takes even longer (twice as long). The table only has 5 columns: acct_no, zipcode, insert_date, batch_no, campaign_no. I have a clustered index on acct_no and an index on insert_date and campaign_no. acct_no is a char(10) field and zipcode is a char(5).
Like I said, this table gets a daily select and insert. It is not open to the public, just for development. It is a log/history table. I insert/delete in batches, using the zipcode as the start and end variables, starting at 00000 and incrementing by 1000 each batch.

Does anyone have any suggestions for optimizing the updates/inserts/deletes on this table? Recreating the clustered index on acct_no is a chore, but from what I've read it seems like the most efficient approach.
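
For reference, the table and indexes look roughly like this (I've simplified the names, the datatypes on the last three columns are approximate, and the insert_date/campaign_no index may really be a single composite index):

CREATE TABLE log_repository (
    acct_no     char(10) NOT NULL,
    zipcode     char(5)  NOT NULL,
    insert_date datetime NOT NULL,
    batch_no    int      NOT NULL,
    campaign_no int      NOT NULL
)

CREATE CLUSTERED INDEX ix_log_acct_no ON log_repository (acct_no)
CREATE INDEX ix_log_insert_date ON log_repository (insert_date)
CREATE INDEX ix_log_campaign_no ON log_repository (campaign_no)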

Thanks in advance,
Mark
 
First
"where acct_no not in (select acct_no from log_repository)"
will NEVER use an index, because you are asking SQL to check every record against the inner query and then return everything that doesn't match.

NO WINNER for that query.
My advice: find a different way to write your query.
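
For example, the same kind of filter written as a left join (table names here are just placeholders -- adjust them to your schema):

SELECT s.acct_no
FROM source_table s
LEFT JOIN log_repository r
       ON r.acct_no = s.acct_no
WHERE r.acct_no IS NULL    -- keep only accounts with no match in the log

Same result as the NOT IN (as long as acct_no is never NULL in the log table), but it gives the optimizer a join shape it can usually handle better.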

 
Thanks for the suggestion... This is what the query looks like for one of the inserts. There is a batch wrapper around this that walks the zipcode in 1000 increments to break it up... any suggestions to help optimize it? I greatly appreciate it.

INSERT INTO prod_22 (acct_no,zipcode)
SELECT TOP 1000000 acct_no,
zipcode
FROM [20054_dup] With (tablockx, holdlock)
WHERE timezone = 5
AND acct_no NOT IN (select acct_no from import_repository WHERE create_date > dateadd(dd, -90, getdate()) and campaign = 1)
AND acct_no NOT IN (select acct_no from problem_data)
AND zipcode NOT IN (select zipcode from avoid_areas)

 
You said that you have a clustered index on the account number column. When you insert data into the table, are you putting it into the middle of the table or appending it to the end? In other words, are the account numbers for the new records higher than the account numbers already in the table, or are they within the range of the existing account numbers?

What is the fill factor of your clustered index?

Take a look at your execution plans to find your weak spots.

What is the net effect you are looking for from the clustered index? Clustered indexes are most effective when used in BETWEEN statements.
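
If it helps, something like this will show the fragmentation and fill on the clustered index, and the estimated plan for one of your inserts (SQL 2000 syntax; the index name is just an example):

DBCC SHOWCONTIG ('log_repository', 'ix_log_acct_no')

SET SHOWPLAN_TEXT ON
GO
-- paste the INSERT ... SELECT here to see its plan without running it
GO
SET SHOWPLAN_TEXT OFF
GO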

Denny
MCSA (2003) / MCDBA (SQL 2000)

--Anything is possible. All it takes is a little research. (Me)

[noevil]
(Not quite so old any more.)
 
The history table that the data is imported into may already contain some of the account numbers being inserted. The table just houses the account numbers, when they were accessed, and with which daily batch. They are accessed about every 90 days or more. The data gets inserted into the middle of the table because of the clustered index, and that accounts for some of the performance loss, but the table is so big (20 to 30 million records) that dropping and recreating the clustered index is painful.
The fill factor on that index is 90.
The net effect I want from that index is the ability to pull data from the source table while making sure I am not pulling data that has already been accessed in the last 90 days for that particular campaign, so I keep this repository table for that reason. The clustered index helps pull that data faster when a query references the account number, since the history table is clustered on it.

The insert above is an example of one of the inserts into the table... there is more to the query, but that is the abridged version...

Thanks again...
 
Mass inserts with unordered clustered keys can be quite intensive... is it possible to swap indexes, e.g. cluster on insert_date and noncluster on acct_no?

I guess common SELECTs on that table will get slower (more bookmark lookups etc.), but the table is very narrow (only 5 columns) and the I/O won't get too bad.
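
Roughly, the swap would look like this (index names are guesses, and keep in mind that dropping/creating a clustered index on a 20-30 million row table rebuilds the nonclustered indexes as well, so do it in a quiet window):

DROP INDEX log_repository.ix_log_acct_no        -- the current clustered index on acct_no
DROP INDEX log_repository.ix_log_insert_date    -- the existing nonclustered date index, if there is one
CREATE CLUSTERED INDEX ix_log_insert_date ON log_repository (insert_date)
CREATE NONCLUSTERED INDEX ix_log_acct_no ON log_repository (acct_no)

With the cluster on insert_date, the daily inserts become mostly appends at the end of the table, and the 90-day deletes sweep a contiguous range of pages.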

Also: how many INSERTs are typically performed per single batch?

------
"There's a man... He's bald and wears a short-sleeved shirt, and somehow he's very important to me. I think his name is Homer."
(Jack O'Neill, Stargate)
[banghead]
 
In that case, is it necessary to even have a clustered index on the table? Is it doing more harm than good?

My batches are probably around 10,000 each, give or take... they are different every time...

thanks for the replies...
 
Could anyone please lead me in the right direction to make my query a bit more efficient, using joins instead of the NOT IN subqueries? The query is below, followed by my rough stab at a join version...

INSERT INTO prod_22 (acct_no,zipcode)
SELECT TOP 1000000 acct_no, zipcode
FROM [20054_dup] With (tablockx, holdlock)
WHERE timezone = 5
AND acct_no NOT IN (select acct_no from import_repository WHERE create_date > dateadd(dd, -90, getdate()) and campaign = 1)
AND acct_no NOT IN (select acct_no from problem_data)
AND zipcode NOT IN (select zipcode from avoid_areas)
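
Here is my rough stab at a join version (untested, and I'm assuming acct_no/zipcode are never NULL in those lookup tables, since NOT IN and a left join can treat NULLs differently):

INSERT INTO prod_22 (acct_no, zipcode)
SELECT TOP 1000000 d.acct_no, d.zipcode
FROM [20054_dup] d WITH (tablockx, holdlock)
LEFT JOIN import_repository ir
       ON ir.acct_no = d.acct_no
      AND ir.create_date > dateadd(dd, -90, getdate())
      AND ir.campaign = 1
LEFT JOIN problem_data pd
       ON pd.acct_no = d.acct_no
LEFT JOIN avoid_areas aa
       ON aa.zipcode = d.zipcode
WHERE d.timezone = 5
  AND ir.acct_no IS NULL     -- not accessed in the last 90 days for campaign 1
  AND pd.acct_no IS NULL     -- not in the problem list
  AND aa.zipcode IS NULL     -- not in an avoided area

Is that the right idea, or would NOT EXISTS be a better fit here?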


thank you much...
 
> In that case, is it necessary to even have a clustered index on the table? Is it doing more harm than good?

For INSERTs - more harm than good, just like with any type of index. SELECTs are a different story, though.

Btw, post some additional info about the tables involved - how many rows, available indexes, etc. An estimated execution plan would also be nice.

------
"There's a man... He's bald and wears a short-sleeved shirt, and somehow he's very important to me. I think his name is Homer."
(Jack O'Neill, Stargate)
[banghead]
 
With regard to your clustered index, I would recommend rebuilding the index with a lower fill factor, something like 60-70%. Give yourself a lot of room to work with in those pages. That way, as data is inserted into the index, less page splitting will need to be done.
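
For example (the index name here is just a placeholder):

-- rebuild just the clustered index at a 70% fill factor
DBCC DBREINDEX ('log_repository', 'ix_log_acct_no', 70)

-- or rebuild every index on the table at 70%
DBCC DBREINDEX ('log_repository', '', 70)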

Denny
MCSA (2003) / MCDBA (SQL 2000)

--Anything is possible. All it takes is a little research. (Me)

[noevil]
(Not quite so old any more.)
 
Do you have CREATE TABLE privileges on the database? If so, would there be a benefit to creating tables that each cover a limited timespan, say one day? Tables containing aged-out data could then be dropped without expensive DELETE operations, and index operations on the smaller tables would also be quicker.
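
A rough sketch of the idea (table and view names are just examples; a UNION ALL view over the daily tables would let the existing lookups keep working unchanged):

-- one table per day, e.g. log_20030317, log_20030318, ...
-- aging data out becomes a quick metadata change instead of a huge DELETE:
DROP TABLE log_20030317
GO

-- a view over the remaining daily tables keeps the lookups unchanged:
CREATE VIEW log_repository_all AS
SELECT acct_no, zipcode, insert_date, batch_no, campaign_no FROM log_20030318
UNION ALL
SELECT acct_no, zipcode, insert_date, batch_no, campaign_no FROM log_20030319
UNION ALL
SELECT acct_no, zipcode, insert_date, batch_no, campaign_no FROM log_20030320
GO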
 