Tek-Tips is the largest IT community on the Internet today!

Members share and learn making Tek-Tips Forums the best source of peer-reviewed technical information on the Internet!

  • Congratulations gkittelson on being selected by the Tek-Tips community for having the most helpful posts in the forums last week. Way to Go!

SQL 2k5 Performance Issues: Replication Fix or Server Re-build? 2

Status
Not open for further replies.

fuzzyocelot

Programmer
Jul 22, 2003
333
US
Hey everyone!

We’re still having performance problems with one of our new SQL 2005 servers that uses merge replication. I realize there is a LOT of documentation out there on performance tuning. My head is swimming from trying to digest it all so I would really appreciate some advice. :)

There are people out in the field who have tablets. At night they plug the tablets in and the merge replication process kicks off for several hours. There are less than 30 tables plugged in at night. The performance problems occur all throughout the day, though. There are about 7 databases on the server that are used. The main one used as the Publisher is 7 GB. The others are much smaller. There is one distribution db on the same server. Other applications connect to the databases throughout the day. I’d say on average there’s less than 20 connections to the server. All connections are remote connections.

This is the only server that uses this kind of replication so we’re still trying to figure it all out. The vendor who upgraded the application also set up the replication settings. So we’ve been trying to get them to trouble-shoot the performance problems too. The last advice we heard from them was to read Microsoft’s technet site about replication. Lovely, huh?

I don’t know the specs of the last server prior to the upgrade as I wasn’t really involved in it much. I do know they had performance problems there as well and had to build the server up by putting more RAM on it. I don’t know if we need to do that here as well.

The server itself is running Windows 2003 Server SP2. We have SQL 2005 SP2 on it as well. It also has 4 GB of RAM DDR2, Dual CPU 3.8 GHz Xeon with 2 MB cache, and 15k rpm disks.

We thought about enabling AWE and re-configuring the max server memory to something like 3.5 GB to see if that helps. Right now it’s using the defaults. We would also like to try to set the databases compatibility levels to SQL Server 2005 (90) if they are compliant. The vendor said the server has to be run with SQL 2000 compatibility, but replication needs to stay at the SQL 2005 compatibility level.

I ran DBCC MEMORYSTATUS to see if anything showed up. Here are some of the results. I’ve been trying to analyze it but am having some trouble figuring it out. This kind of thing is not my forte. Let me know if anyone want to see anything specific:
Code:
Buffer Counts                  Buffers
------------------------------ --------------------
Committed                      202776
Target                         202776
Hashed                         98425
Stolen Potential               88662
External Reservation           0
Min Free                       240
Visible                        202776
Available Paging File          415793

Memory Manager                  KB 
------------------------------ --------------------
VM Reserved                    1682420
VM Committed                   1678732
AWE Allocated                  0
Reserved Memory                1024
Reserved Memory In Use         0

Memory node Id = 0              KB 
------------------------------ --------------------
VM Reserved                    1678260
VM Committed                   1674724
AWE Allocated                  0
MultiPage Allocator            24568
SinglePage Allocator           831800

Buffer Distribution            Buffers
------------------------------ -----------
Stolen                         2585
Free                           376
Cached                         101390
Database (clean)               86890
Database (dirty)               11532
I/O                            0
Latched                        3

(7 row(s) affected)

Buffer Counts                  Buffers
------------------------------ --------------------
Committed                      202776
Target                         202776
Hashed                         98425
Stolen Potential               88662
External Reservation           0
Min Free                       240
Visible                        202776
Available Paging File          415793

(8 row(s) affected)

Procedure Cache                Value
------------------------------ -----------
TotalProcs                     4448
TotalPages                     88619
InUsePages                     92

Optimization Queue             Value
------------------------------ --------------------
Overall Memory                 1331306496
Target Memory                  451420160
Last Notification              1
Timeout                        6
Early Termination Factor       5

I haven’t run Performance Monitor yet as I’m not familiar with it. If I did run it, what counters would be useful?

I tried to do a trace but couldn’t see any queries or anything else of interest.

I was in training for the last week so I haven’t been checking it lately. Yesterday I did a check on the Job Activity Monitor and found one job that’s been running since 5/30. It’s called "Replication monitoring refresher for distribution". According to one of the system tables it has been running since 5/30 and appears to be taking up a lot of CPU (6908843) and physical I/O (930565).

According to the sysprocesses table, the status for this job is "suspended". What I found strange is that this job is disabled yet it's running. I know this job runs the sp_replmonitorrefreshjob stored procedure which appears to be a system procedure for replication. So I searched through TechNet for the job name and found out that it refreshes the cache for replication. So I suppose we can try to reset the properties for it or something to see if it helps. Does anyone know much about this process? I’m waiting for the data owner to give the go-ahead to try resetting the properties for it. Does anyone think it’ll help? Should we rebuild the server? The data owner thinks that will help but we'd like to exhaust all our other options first.

I really appreciate any advice! TIA!!! :)

 
I haven’t run Performance Monitor yet as I’m not familiar with it. If I did run it, what counters would be useful?
Here are the counters you want to look at.
CPU
-Processor % time
-Processor Interrupts/sec

Memory
SQL Server Buffer Manager - Page Life Expectancy (should be about 300)
-Pages/sec
-Page Faults/sec

Disk
-Avg.Disk read/write
-% Disk time read/write

If you google these counters you will find acceptable limits.

Here are some very helpful performance queries to run to help find the bottle neck.
Code:
--highest CPU
select  
    highest_cpu_queries.plan_handle,  
    highest_cpu_queries.total_worker_time, 
    q.dbid, 
    q.objectid, 
    q.number, 
    q.encrypted, 
    q.[text] 
from  
    (select top 50  
        qs.plan_handle,  
        qs.total_worker_time 
    from  
        sys.dm_exec_query_stats qs 
    order by qs.total_worker_time desc) as highest_cpu_queries 
    cross apply sys.dm_exec_sql_text(plan_handle) as q 
order by highest_cpu_queries.total_worker_time desc


--queries running in parallel
select  
    r.session_id, 
    r.request_id, 
    max(isnull(exec_context_id, 0)) as number_of_workers, 
    r.sql_handle, 
    r.statement_start_offset, 
    r.statement_end_offset, 
    r.plan_handle 
from  
    sys.dm_exec_requests r 
    join sys.dm_os_tasks t on r.session_id = t.session_id 
    join sys.dm_exec_sessions s on r.session_id = s.session_id 
where  
    s.is_user_process = 0x1 
group by  
    r.session_id, r.request_id,  
    r.sql_handle, r.plan_handle,  
    r.statement_start_offset, r.statement_end_offset 
having max(isnull(exec_context_id, 0)) > 0

-- 
-- Find query plans that may run in parallel 
-- 
select  
    p.*,  
    q.*, 
    cp.plan_handle 
from  
    sys.dm_exec_cached_plans cp 
    cross apply sys.dm_exec_query_plan(cp.plan_handle) p 
    cross apply sys.dm_exec_sql_text(cp.plan_handle) as q 
where  
    cp.cacheobjtype = 'Compiled Plan' and 
    p.query_plan.value('declare namespace  
p="[URL unfurl="true"]http://schemas.microsoft.com/sqlserver/2004/07/showplan";[/URL] 
        max(//p:RelOp/@Parallel)', 'float') > 0

[b]

SELECT *
FROM sys.dm_os_wait_stats
ORDER BY waiting_tasks_count DESC

[/b]
--Use this to find wait type of susspended spids
SELECT *
FROM master.dbo.sysprocesses
WHERE spid = [b]your spid[/b]

--Will show query running
declare @handle binary(20)
select @handle = sql_handle 
from master.dbo.sysprocesses where spid = [b]your spid[/b]
select * from ::fn_get_sql(@handle)


--If you are seeing lots of SOS_SCHEDULER_YIELD in 
--your Wait States, that is a very stong indicator of CPU pressure.
--You can run the DMV query to confirm that:

-- Check SQL Server Schedulers to see if they are waiting on CPU

[b]
SELECT scheduler_id, current_tasks_count, runnable_tasks_count
FROM sys.dm_os_schedulers
WHERE scheduler_id < 255
[/b]
--If you see the runnable tasks count above zero, 
--that is cause for concern, and if you see it in double digits 
--for any length of time, that is cause for extreme concern!



--This query will give you an idea of 
--how many tasks are waiting in the system. 
--You can use this information to understand 
--blocking characteristics of your load
 
select count(*)
from sys.dm_os_waiting_tasks
where wait_type <> 'THREADPOOL'


select wait_type, count (*)
from sys.dm_os_waiting_tasks
group by wait_type
order by count (*) desc

--CPU BottleNeck

select  
    scheduler_id, 
    current_tasks_count, 
    runnable_tasks_count 
from  sys.dm_os_schedulers 
where scheduler_id < 255

--The following query gives you a high-level view of which 
--currently cached batches or procedures are using the most CPU
--The query aggregates the CPU consumed by all statements with the same
--plan__handle (meaning that they are part of the same batch or procedure).
--If a given plan_handle has more than one statement, you may have to drill in
--further to find
--the specific query that is the largest contributor to the overall CPU usage 

select top 50  
    sum(qs.total_worker_time) as total_cpu_time,  
    sum(qs.execution_count) as total_execution_count, 
    count(*) as  number_of_statements,  
    qs.plan_handle  
from  
    sys.dm_exec_query_stats qs 
group by qs.plan_handle 
order by sum(qs.total_worker_time) desc

--To find more details about plan run this query
-- takes either spid or plan_handle
SELECT *
FROM sys.dm_exec_query_plan(0x06000700185B93054043ED7F010000000000000000000000)

--   4. Q. Does my load have an active resource bottleneck?
--You can answer this question by looking at the resource address 
--that tasks are blocked on.  
--Keep in mind that not all wait types have resource associated with them. 

select resource_address, count (*)
from sys.dm_os_waiting_tasks
WHERE resource_address <> 0
group by resource_address
order by count (*) desc

--
--sys.dm_os_schedulers
--1. Q. Do I need to by more CPUs?
--In order to answer this question you have to find out if your 
--load is really CPU bounded.  Your load is really CPU bounded if a 
--number of runnable tasks per each scheduler always greater than 1 
--and all of your queries have correct plan.  
--The latter statement is very important, your load can be CPU bounded 
--due to the fact that somehow optimizer generated bad plan – 
--it can happen if your statistics out of date or you tried to 
--perform handcrafted optimization. In this case you don’t want to run to 
--Circuit City to buy more CPUs right a way – you want to fix the plan. 
--Here is the query to find out average length of a runable queue on the system:

select AVG (runnable_tasks_count)
from sys.dm_os_schedulers
where status = 'VISIBLE ONLINE'

--1. Q. How many sockets does my machine have?

select cpu_count/hyperthread_ratio AS sockets
from sys.dm_os_sys_info


--Q. How many either cores or logical CPU share the same socket?

select hyperthread_ratio AS cores_or_logical_cpus_per_socket
from sys.dm_os_sys_info
 
--4. Q. How much physical memory my machine has?

select physical_memory_in_bytes/1024 AS physical_memory_in_kb
from sys.dm_os_sys_info



select os.task_address, os.state, os.last_wait_type, clr.state, 
clr.forced_yield_count 
from sys.dm_os_workers os join sys.dm_clr_tasks clr 
   on (os.task_address = clr.sos_task_address) 
where clr.type = 'E_TYPE_USER'


--Parallelism can cause problems on very transaction databases.  Microsoft recommends turning it off.

sp_configure 'max degree of parallelism', 1;
GO
RECONFIGURE WITH OVERRIDE;
GO

- Paul
10qkyfp.gif

- If at first you don't succeed, find out if the loser gets anything.
 
AWESOME!!! Thank you for such a quick response!!! I will try it ASAP!!! :)
 
No problem. Post back if there is something you don't understand.

- Paul
10qkyfp.gif

- If at first you don't succeed, find out if the loser gets anything.
 
I ran the first query that shows the highest CPU. I tried it for the top 5 instead of top 50 for now. The results are below (minus the entire text field). To make it easier to look at, I also removed the plan_handle below and just put numbers in their place.

Code:
plan_handle	total_worker_time	dbid	objectid	number 	encrypted	text
1	          5270048979	        32767	213326272	1	    0	        create procedure sys.sp_MShelpmergeselectarticles
2	          3355058642	        32767	664463323	0	    0	        create function sys.fn_GetArticleSchemaVersionGuid 
3	          3347328066	        32767	884703858	1	    0	        create procedure sys.sp_MSmovegenzerochanges
4	          1211757400	        32767	103774808	1	    0	        create procedure sys.sp_MSmakegeneration
5	          503681700	         32767	701426807	1	    0	        create procedure sys.sp_MSenumchanges_belongtopartition

The dbid must be from one of the tablets because it doesn't match any of the db ids on the server. I don't know how to pinpoint which one, though. The tablets are all off site as is the server. I didn't think they were supposed to be connected now anyway. I checked the sysprocesses table and it doesn't look like any tablets are connected. I didn't even see that dbid (32767) listed. It's not listed in sysdatabases.

So it appears that the function and procedures shown from the highest cpu are for replication. It's interesting that it shows create statements. I'm not sure what this means. Any thoughts? :)
 
If you run this what do you get?

Code:
select db_name(32767)

- Paul
10qkyfp.gif

- If at first you don't succeed, find out if the loser gets anything.
 
Based on the memory info that you showed at the top SQL needs access to more memory. If the system is an x86 system then enable AWE as well as the /PAE and /3GB switches in the boot.ini and reboot. Then configure the memory to 3.5 Gigs.

If it's an x64 machine set the max memory to 3.5 Gigs.
How much memory does the host OS show as available?

Denny
MCSA (2003) / MCDBA (SQL 2000)
MCTS (SQL 2005 / Microsoft Windows SharePoint Services 3.0: Configuration / Microsoft Office SharePoint Server 2007: Configuration)
MCITP Database Administrator (SQL 2005) / Database Developer (SQL 2005)

--Anything is possible. All it takes is a little research. (Me)
[noevil]
 
Paul, when I run "select db_name(32767)", it returns a NULL value. So it doesn't exist on the main SQL server.

mrdenny, we are planning to enable AWE to see what happens. To do so, we have to go through change management which is a weekly meeting that takes place on Mondays. So we won't be able to try it until next week. When we do so, we're thinking of trying to set the max server memory to 3.5 GB. However, the memory shown on the server via My Computer properties is 3.50 GB. I could have sworn the other day is showed 3.75 GB. Interesting. I guess we need to set the max server memory to 3 GB then.

I did run the query regarding the server schedulers to see if they're waiting on CPU and the runnable tasks came to 0. Which I believe is good, right?

The max degree of parallelism is set to 0, which I believe means that it will use the actual number of available CPUs depending on the current system workload. I believe that is a good thing.

In the meantime, we'll probably also change the settings for the "Replication monitoring refresher for distribution" job so it doesn't run so often. It may help a little bit until we can enable AWE and so forth.

Thanks for your help so far! We're having a meeting tomorrow to discuss our options. Feel free to pass on any other ideas!

I'm still working on the other queries Paul suggested. I haven't had a lot of time to work on it as we're having performance problems with a different server. That one is running SQL 2000 with 1 GB RAM. Oh how fun! :)

Thanks!
 
After some more digging and thinking I am wondering... since the server is running SQL 2005 and has only 4 GB of RAM total, wouldn't enabling AWE be kind of pointless? Or am I misunderstanding the BOL? From what I've read about 2005, using the \3GB switch should be sufficient, that the max server memory is set only if you're enabling AWE, and that AWE is only set if you need to go over 4 GB. Is this accurate or am I "off my rocker"? :)

Thanks!
 
AWE is used to go past the 2 Gig limit.

You can probably get away without the /PAE switch as that is to get you over the 4 Gig limit.

Denny
MCSA (2003) / MCDBA (SQL 2000)
MCTS (SQL 2005 / Microsoft Windows SharePoint Services 3.0: Configuration / Microsoft Office SharePoint Server 2007: Configuration)
MCITP Database Administrator (SQL 2005) / Database Developer (SQL 2005)

--Anything is possible. All it takes is a little research. (Me)
[noevil]
 
So if I have to enable AWE to go over the 2GB limit, then do I also have to specify the max server memory as well?
 
So you always specify the max server memory.

If you are running SQL Server 2005 Standard Edition the "lock pages in memory" will not do anything for you. SQL Server 2005 Standard Edition does not support the "lock pages in memory" setting.

If you have SQL Server 2005 Enterprise Edition then you should at the very least test having the "lock pages in memory" setting enabled and see if it helps you out.

(I don't see where I've asked what edition of SQL Server 2005 you are running.)

Denny
MCSA (2003) / MCDBA (SQL 2000)
MCTS (SQL 2005 / Microsoft Windows SharePoint Services 3.0: Configuration / Microsoft Office SharePoint Server 2007: Configuration)
MCITP Database Administrator (SQL 2005) / Database Developer (SQL 2005)

--Anything is possible. All it takes is a little research. (Me)
[noevil]
 
This particular server is running SQL Server 2005 Standard Edition. So we won't worry about the "lock pages in memory" then. Whew!

One of our network guys put the /3GB switch in the ini file last night and rebooted. I didn't have time to enable AWE or the max server memory yet. I'm hoping to do that today or tomorrow. I will let you know how it goes! :)

Thank you guys so much for your help so far! I REALLY appreciate it! :)
 
By the way... when we enable AWE and set the max server memory, do we need to reboot the server? The /3GB switch is already in the boot.ini file and the server was rebooted afterwards.

Thanks!
 
You "shouldn't" need to restart SQL. A reboot of the server is not needed.

Denny
MCSA (2003) / MCDBA (SQL 2000)
MCTS (SQL 2005 / Microsoft Windows SharePoint Services 3.0: Configuration / Microsoft Office SharePoint Server 2007: Configuration)
MCITP Database Administrator (SQL 2005) / Database Developer (SQL 2005)

--Anything is possible. All it takes is a little research. (Me)
[noevil]
 
Last night I enabled AWE and set the max server memory to 3072 MB. So far so good! I'm waiting to hear from the client to see if they experience any changes in performance.

I ran performance counters this morning for an hour. I set it to poll the info every 1 minute. There is a big spike in the memory page faults/sec counter every 15 minutes. The average is between 1 and 600. The spikes are around 1400. This happened using the same counters last week. We were able to determine these spikes were due to replication merge jobs that are set to run every 15 minutes. These replication jobs were set this way as a recommendation from the vendor to help with performance. We're going to meet with the client later this week to discuss the current replication settings and other things.

I also noticed in the performance counters that the time for the read and writes to the physical disk has improved quite a bit. For example, the "PhysicalDisk(_Total)\% Disk Read Time" is currently running from about 0.003 to 7.9. Prior to the memory changes, it was running from about 28.75 to 634.27. So hopefully it's a good thing. :)

I'll post an update once I hear back from the client to see if they've noticed any performance changes.

Thanks for your help so far! :)
 
That's good news Fuzzy, Things are starting to look better.

- Paul
- If at first you don't succeed, find out if the loser gets anything.
 
Status
Not open for further replies.

Part and Inventory Search

Sponsor

Back
Top