Tek-Tips is the largest IT community on the Internet today!

Members share and learn making Tek-Tips Forums the best source of peer-reviewed technical information on the Internet!

  • Congratulations SkipVought on being selected by the Tek-Tips community for having the most helpful posts in the forums last week. Way to Go!

Excessive pagiolatch_sh wait times on SAN

Status
Not open for further replies.

moonshadow

Programmer
Oct 9, 2001
181
0
0
GB
We have a 3rd party application which runs on a SQL Server box with a RAID drive. We attempted to move the database in question to a new SQL Server with all it's disk (apart from pagefile) on our NetApps SAN. One query in particular went from taking 35minutes to 2 hours 30 minutes. I have managed to create a SQL script which mimics this 3rd party query, and is entirely repeatable. I have done extensive testing and found that the query runs in about the same time on the SAN SQL Server as to the other server, if I move the database to it's local D: drive. When the DB is on the SAN, the query produces an explosion of activity for 10 minutes, and then everything on server & SAN goes quiet for the remainder of the duration (50 minutes for my query). Enterprise manager is reporting the query as waiting with a wait type of PAGEIOLATCH_SH. Using performance monitor I notice that the average latch wait time for this 2nd phase is 1000ms, with about 100 latches/sec. As I understand it a PAGEIOLATCH_SH occurs when a page of data isn't in the buffer pool (often due to insufficient RAM), and it must be requested from disk. These delays don't happen when run on local disk.
We've had a SAN Consultant in, and although he has corrected one configuration problem, and increased max throughput, the duration of the query stays the same.

I've watched the MS webcast on performance ( which is very helpful, and indicates we have an I/O subsystem issue. I've run SQLIO, and got some amazing throughput from the SAN (700M/s).
We're running on SQL 2000 Enterprise Edition SP4 on Windows 2003 Standard SP1 with 2GB RAM (clearly not enough, but if we add more, we'll just have the same problem when the database gets bigger).
We can't persuade the supplier to change their query, so I need to get to the bottom of this. I've run a short test on a similar SQL 2005 SAN server, and it ran in 10 minutes. I hope to install SQL 2005 on identical hardware as our SAN SQL Server to do a proper comparison.
Anyone any ideas where else I can look, we're completely stuck.
 
I post my script here for completeness. Although I worked hard to write such bad SQL, believe me it is nowhere near as bad as the original query (3 pages of A4!!).
I recreate the database, and data tables at the beginning but this is quick - about 5 minutes. Most of the time is spent running the crazy update statment. I've included the MAXDOP option as parallellism was clouding the issue. I also defragged the tables using SHOW CONTIG and DBCC DBREINDEX in the original system but this halved processing time on all servers involved.

Use Master

/****** Object: Database PerfTest Script Date: 23/06/2006 12:07:28 ******/
IF EXISTS (SELECT name FROM master.dbo.sysdatabases WHERE name = N'PerfTest')
DROP DATABASE [PerfTest]
GO
print 'Starting DB Create at ' + convert(varchar(20),GetDate(),113)
GO

CREATE DATABASE [PerfTest] ON (NAME = N'PerfTest', Filename='e:\mssql\data\PerfTest_Data.mdf', SIZE = 4500, FILEGROWTH = 10%) LOG ON (NAME = N'PrefTest_log', FileName='e:\mssql\data\PerfTest_log.ldf', SIZE = 500, FILEGROWTH = 10%)
GO

exec sp_dboption N'PerfTest', N'autoclose', N'false'
GO

exec sp_dboption N'PerfTest', N'bulkcopy', N'true'
GO

exec sp_dboption N'PerfTest', N'trunc. log', N'true'
GO

exec sp_dboption N'PerfTest', N'torn page detection', N'true'
GO
exec sp_dboption N'PerfTest', N'autoshrink', N'true'
GO
Use PerfTest
go
set nocount on

Create Table TableOfNumbers(NumberField float)

Create Table TableOfStrings(StringField char(1900))


declare @count int
Declare @str varchar(10)
declare @strlong varchar(1900)
print 'Creating small tables ' + convert(varchar(20),GetDate(),113)
set @count=1000
while @Count>0
begin
Insert into TableOfNumbers(NumberField) values (@Count)
set @Count=@Count-1
end
set @count=1000
while @Count>0
begin
set @str=convert(varchar(10),@count)
set @strLong=replicate(@str, 1900/len(@str))
Insert into TableOfStrings(StringField) values (@strLong)
set @Count=@Count-1
end
print 'Creating Big Table1 ' + convert(varchar(20),GetDate(),113)
Select NumberField, StringField, rand(NumberField)*10000 as RandomNumber
Into BigTable1
from TableOfNumbers , TableOfStrings
print 'Creating Big Table2 ' + convert(varchar(20),GetDate(),113)
Select NumberField, StringField, rand(NumberField)*10000 as RandomNumber
Into BigTable2
from TableOfStrings, TableOfNumbers

print 'Starting update query at ' + convert(varchar(20),GetDate(),113)
Update BigTable1 set Bigtable1.RandomNumber=
case when BigTable1.RandomNumber>BigTable2.RandomNumber then BigTable1.RandomNumber else BigTable2.RandomNumber end
FROM BigTable1 inner join BigTable2 on BigTable1.NumberField=BigTable2.NumberField and
BigTable1.StringField=BigTable2.StringField
option(maxdop 1)
print 'All done at ' + convert(varchar(20),GetDate(),113)
USE MASTER

 
Some more information after further testing. The average latch wait time of 1000ms is happening on the Server regardless of whether the database is on the SAN or the local drive. However, on the SAN the latch waits per second and page reads per second are both around the 100 mark, whereas on the local disk, the latch waits per second and page reads per second are about 300. It seems that when the database is on the local drive, SQL seems to be requesting, and getting 3 times as many pages per second. Why is this? Has SQL decided the SAN can't cope and is requesting less pages per second?
 
Run perfmon when you run this and see what the disk queue is at. Please post the results.

Denny
MCSA (2003) / MCDBA (SQL 2000) / MCTS (SQL 2005) / MCITP Database Administrator (SQL 2005)

--Anything is possible. All it takes is a little research. (Me)
[noevil]
 
The average disk queue length peaks at about 100 for the first phase, but after that it's sitting at just under 1 for the remainder of the query (the slow part) - this is true for when db is on SAN or local disk
 
Status
Not open for further replies.

Part and Inventory Search

Sponsor

Back
Top