fuzzyocelot
Programmer
We’ve been having intermittent performance problems on our main production SQL server. It’s an active/passive cluster running SQL 2000 SP3a on Windows Server 2003 SP2. Each node has 8 GB of RAM. This server has a mix of 100+ databases. Some were developed by outside vendors and some were developed in-house. Ever since we put one particular vendor-developed database on this server, we’ve been having performance problems every so often. How can we prove if it’s because of this database or not?
Some days the response time of the server is extremely slow. At times it gets so bad we can’t even use Management Studio or Enterprise Manager to connect remotely. That’s what happened today: we couldn’t connect to the server remotely to figure out what was going on, and by the time we got to the server room and looked at it there, it seemed to be running fine. Users were getting timeouts left and right before that, though. What’s really bad is that it tried to fail over to the other node and that failed! Our hardware people have looked at the error messages and still don’t know what happened. That’s a separate issue, though. Right now I’m trying to figure out how to prove whether or not this database is causing problems on the server, independent of the failover problem (which is probably due to a hardware issue).
I tried running several basic SQL traces on the server (at different times) with the following events: SP:Recompile, SQL:StmtCompleted, and SQL:BatchCompleted, for all databases, to see what I got. I filtered on CPU > 5000 (ms). I’m not even sure whether CPU is the problem or not.
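To cut the overhead of running Profiler interactively while the server is already struggling, I’ve been thinking of switching to a server-side trace that writes straight to a file. This is only a sketch using the documented SQL 2000 sp_trace_* procedures; the event/column numbers are from the sp_trace_setevent reference, and the file path is just a placeholder:

```sql
-- Server-side trace: SP:Recompile + SQL:BatchCompleted to a rollover file
DECLARE @traceid int, @maxsize bigint
SET @maxsize = 100   -- MB per trace file
EXEC sp_trace_create @traceid OUTPUT, 2, N'D:\traces\recompiles', @maxsize
-- option 2 = TRACE_FILE_ROLLOVER

-- SP:Recompile (event 37): TextData(1), DatabaseID(3), SPID(12), EventSubClass(21)
EXEC sp_trace_setevent @traceid, 37, 1, 1
EXEC sp_trace_setevent @traceid, 37, 3, 1
EXEC sp_trace_setevent @traceid, 37, 12, 1
EXEC sp_trace_setevent @traceid, 37, 21, 1

-- SQL:BatchCompleted (event 12): TextData(1), DatabaseID(3), Duration(13), CPU(18)
EXEC sp_trace_setevent @traceid, 12, 1, 1
EXEC sp_trace_setevent @traceid, 12, 3, 1
EXEC sp_trace_setevent @traceid, 12, 13, 1
EXEC sp_trace_setevent @traceid, 12, 18, 1

-- Keep only batches over 5000 ms of CPU: column 18 (CPU), 0 = AND, 2 = greater than
EXEC sp_trace_setfilter @traceid, 18, 0, 2, 5000

EXEC sp_trace_setstatus @traceid, 1   -- start the trace
SELECT @traceid                       -- keep this id to stop/close the trace later
```

That way the trace keeps collecting even when we can’t connect remotely, and we can read the file afterward.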
I ran performance counters on the server covering Memory: Page Faults/sec, Processor: % Processor Time, Physical Disk, and SQL Server: Buffer Manager (Buffer cache hit ratio and Page life expectancy). I didn’t see anything unusual. Page faults/sec were between 70 and 200. The physical disk reads/writes were less than 1. Processor time was between 10 and 30 percent. The buffer cache hit ratio was around 99, and page life expectancy was between 300 and 2000. I hate having to wait for the problem to occur again before we can figure out what’s going on. By then, it’s usually too late and the server’s down.
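For the next incident, I’ve jotted down a quick snapshot query I could run locally in Query Analyzer on the node itself (since remote connections time out). This is just a sketch against master..sysprocesses, which is what SQL 2000 exposes for blocking and waits:

```sql
-- Who is waiting or blocking right now (run locally when remote access is dead)
SELECT spid, blocked, waittype, lastwaittype, waittime,
       cpu, physical_io, db_name(dbid) AS dbname, status
FROM   master..sysprocesses
WHERE  spid > 50                       -- skip system spids
  AND  (blocked <> 0 OR waittime > 0)  -- only sessions that are stuck
ORDER BY waittime DESC
```

If the problem database keeps showing up in dbname during the slowdowns, that would be one more piece of evidence.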
From the results of the traces, I could see that there were a LOT of recompiles for two stored procedures in this particular database. Within a 10-minute span, they were recompiled a total of 93 times: the first one 60 times and the second one 33 times. I looked at the stored procedures, and they create/drop several temp tables and do a lot of updates to those tables. Some of the “create temp table” statements occur in the middle of the procedure, which from what I’ve read can cause recompiles. The physical tables they use contain millions of records, too.
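To illustrate, here is a hypothetical sketch of the pattern (not the vendor’s actual code; dbo.BigTable is made up). Creating the temp table mid-procedure means the plan can’t be fully compiled up front, and the later activity on the temp table keeps invalidating it:

```sql
CREATE PROCEDURE dbo.usp_example
AS
BEGIN
    DECLARE @n int
    SELECT @n = COUNT(*) FROM dbo.BigTable        -- compiled with the procedure

    CREATE TABLE #work (id int PRIMARY KEY, val varchar(50))  -- mid-proc DDL:
    -- reference to #work above/below is deferred (reason 3) and the DDL
    -- itself forces a recompile (reason 5)

    INSERT INTO #work (id, val)
    SELECT id, val FROM dbo.BigTable              -- big row counts change the
                                                  -- temp table's statistics (reason 2)
    UPDATE #work SET val = UPPER(val)

    DROP TABLE #work
END
```

Moving the CREATE TABLE statements to the top of the procedure is the usual advice I’ve seen, but of course I can’t touch the vendor’s code.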
The reasons (integer codes) for the recompiles were 2 (“statistics changed”), 3 (“object not found at compile time, deferred check to run time”), and 5 (“temp table schema, binding, or permission changed”).
In one case, I ran the trace for 4 minutes and there were 570 exceptions alone! Mixed in with those were 38 recompiles for the same database. The error number was 208, which I understand means “Invalid object name” (an object wasn’t found). All that for only 2 SPIDs.
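Once a trace file is saved, I figure I can summarize it instead of eyeballing Profiler. A rough sketch using SQL 2000’s ::fn_trace_gettable, assuming the trace captured SP:Recompile (event 37) with the ObjectID column (22); the path is a placeholder:

```sql
-- Recompile counts per object and per reason code from a saved trace file
SELECT ObjectID, EventSubClass, COUNT(*) AS recompiles
FROM   ::fn_trace_gettable(N'D:\traces\recompiles.trc', default)
WHERE  EventClass = 37                -- SP:Recompile
GROUP BY ObjectID, EventSubClass
ORDER BY COUNT(*) DESC
```

That would let me put hard numbers per procedure in front of the vendor if it comes to that.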
There were a couple of other databases where quite a bit of processing was going on too. This is the one that stood out today because of all the recompiles.
So am I on the right track or is there something else I need to do? I can’t dink around with the database because it’s vendor-developed and supported. So it’s hands off unless the vendor gives permission. I don’t want to contact the vendor unless I know for sure it’s this database causing problems.
Any advice is appreciated! Thanks!!!