performance issue 5.3

Status
Not open for further replies.

balvey27

Technical User
Feb 24, 2009
24
US
I'm sorry for the lack of detail about the system; I'm an Oracle DBA and don't have complete knowledge of the OS. Here's a general description of what I am seeing:

Our system has progressively gotten slower; it's been up for 65 days. Since early February the load average has crept up from about 3.5 to around 12. When I look within the database, nothing is out of sorts: no runaway processes or unusual activity. CPU usage seems to have grown gradually over this time as well. We are using virtual servers, and CPU usage used to stay around 70% of assigned capacity; now it's consistently over 110%. Looking at the various user processes, none of them stands out as a CPU hog; they're all just collectively using more CPU, it seems. Any thoughts?
 
Here's a look at the top CPU users. There is one large Oracle database running on this server, and two small ones. What are the 'wait' command processes?

Code:
jent1# ps aux | head -1; ps aux | sort -rn +2 | head -20
USER         PID %CPU %MEM   SZ  RSS    TTY STAT    STIME  TIME COMMAND
oracle   2863120  1.3  0.0 113532 154492      - A    13:59:36  0:46 oraclejde (LOCA
root       61470  0.6  0.0  384  384      - A      Dec 20 6340:15 wait
root       53274  0.5  0.0  384  384      - A      Dec 20 5633:46 wait
oracle   2441344  0.5  0.0 93136 134096      - A    13:59:39  0:18 oraclejde (LOCA
root       69666  0.4  0.0  384  384      - A      Dec 20 4823:49 wait
oracle   6410296  0.3  0.0 51624 51636  pts/6 A      Feb 23 52:20 /japp/oracle/pr
root       77862  0.2  0.0  384  384      - A      Dec 20 2738:27 wait
root       16392  0.2  0.0  640  640      - A      Dec 20 2361:43 lrud
root        8196  0.2  0.0  384  384      - A      Dec 20 2162:52 wait
oracle   6262972  0.2  0.0 91508 132468      - A    09:56:10  5:51 oracledsi (LOCA
root       94254  0.1  0.0  384  384      - A      Dec 20 658:04 wait
root       86058  0.1  0.0  384  384      - A      Dec 20 1420:33 wait
root       73764  0.1  0.0  384  384      - A      Dec 20 628:17 wait
root       65568  0.1  0.0  384  384      - A      Dec 20 1082:16 wait
root       57372  0.1  0.0  384  384      - A      Dec 20 1507:10 wait
oracle   6582306  0.1  0.0 92784 133744      - A    07:15:32  4:08 oraclejde (LOCA
oracle   5591110  0.1  0.0 101272 142232      - A    03:00:06  4:31 oraclejde (LOCA
oracle   5439510  0.1  0.0 92072 133032      - A    14:02:15  0:01 oraclejde (LOCA
oracle   4276246  0.1  0.0 92224 133184      - A    09:56:09  2:55 oracledsi (LOCA
oracle   3395786  0.1  0.0 94332 135292      - A      Feb 23 26:03 oraclejde (LOCA
 
here you go

Code:
vmstat 3 3

System configuration: lcpu=12 mem=32768MB ent=3.00

kthr    memory              page              faults              cpu
----- ----------- ------------------------ ------------ -----------------------
 r  b   avm   fre  re  pi  po  fr   sr  cy  in   sy  cs us sy id wa    pc    ec
 4 13 5101391 27439   0   1   0 19301 89995   0 4221 20147 13124 35 47  6 12  3.35 111.7
 7 22 5098614 27375   0   3   1 17013 77399   0 4643 11640 10695 28 47 15 10  2.52  84.0
 7  7 5098360 27361   0   0   0 19311 101953   0 3956 7758 12290 29 51  7 13  3.22 107.3
 
Your CPU run queue seems to be high! How many processors do you have on this machine?
 
Please post the output of the following commands:

vmstat 1 10
vmo -x| grep -iE "minperm|maxperm|lru_file|minfree|maxfree"
lsdev -C| grep aio
lsattr -El aio0
vmstat -v| tail -7
ioo -x| grep -iE "pv_min_pbuf|j2_maxPageReadAhead|j2_nBufferPerPagerDevice|j2_dynamicBufferPreallocation"

laters
zaxxon
 
Sorry, I forgot the code tags. Anyway, the wait processes are just kernel processes that fill up the idle time. They don't consume any noticeable performance. Nothing to worry about.

laters
zaxxon
 
And a 3rd one
@khalida

So the run queue should be OK. The many entries in the blocked queue might come from locks in the database, I guess. I've observed something similar on our heavy-traffic Oracle DB.
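
One rough way to put the run queue in context is to divide the average `r` column by the `lcpu` value from the vmstat banner. The following is only a sketch: the function name `runq_per_lcpu` is made up here, and it assumes the 19-column vmstat layout shown earlier in this thread.

```shell
# Hypothetical helper: reads "vmstat N M" output on stdin and prints the
# average run queue (column r) divided by the number of logical CPUs.
runq_per_lcpu() {
  awk '
    # Banner line, e.g. "System configuration: lcpu=12 mem=32768MB ent=3.00"
    /lcpu=/ {
      for (i = 1; i <= NF; i++)
        if ($i ~ /^lcpu=/) { split($i, kv, "="); lcpu = kv[2] }
    }
    # Data lines: 19 columns, first column numeric (skips the headers)
    NF == 19 && $1 ~ /^[0-9]+$/ { r += $1; n++ }
    END { if (n > 0 && lcpu > 0) printf "%.2f\n", (r / n) / lcpu }
  '
}

# Usage on the AIX box (read-only):  vmstat 3 3 | runq_per_lcpu
```

For the three samples posted above (r = 4, 7, 7 with lcpu=12) this gives 0.50, i.e. on average about half a runnable thread per logical CPU.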

laters
zaxxon
 
We are using virtual machines, and I know it's an IBM 550 with 8 processors. 3 are assigned to this particular server, which shows up as 6 virtual processors, or 12 with multithreading.

The database sees 12 CPUs, and there is no locking going on in the database. The database looks fine overall. Processes are taking longer, though, because the server can't service them as quickly as it should.

Could you explain the requested commands before I run them and possibly put them in a code block so I can see exactly what I'm supposed to type? Sorry, I just don't want to run anything detrimental on our production box.
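
For reference, here is how the requested commands could be laid out with a comment on each. This is a sketch, not the original poster's script: the wrapper name `collect_perf_snapshot` is made up, and the comments reflect my understanding of these AIX tools. As written, every command only displays information and changes nothing, though some (vmo, ioo) may need root to run.

```shell
# Read-only AIX diagnostics, wrapped in a function so nothing runs until it is called.
collect_perf_snapshot() {
  # 10 samples, 1 second apart: run/blocked queues, memory, paging, CPU split.
  vmstat 1 10
  # Virtual memory tunables (current value, default, limits) for the
  # file-cache and free-list settings.
  vmo -x | grep -iE "minperm|maxperm|lru_file|minfree|maxfree"
  # Is the asynchronous I/O device configured (Available) or only Defined?
  lsdev -C | grep aio
  # Current attributes of the legacy AIO device: maxreqs, minservers, maxservers.
  lsattr -El aio0
  # Counters, cumulative since boot, of I/Os that were blocked for lack of buffers.
  vmstat -v | tail -7
  # JFS2 and LVM buffer tunables related to those counters.
  ioo -x | grep -iE "pv_min_pbuf|j2_maxPageReadAhead|j2_nBufferPerPagerDevice|j2_dynamicBufferPreallocation"
}

# Usage on the AIX box:  collect_perf_snapshot > /tmp/perf_snapshot.txt 2>&1
```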

 
OK, I tried them on our test machine first; here you go.

Code:
vmstat 1 10

System configuration: lcpu=12 mem=32768MB ent=3.00

kthr    memory              page              faults              cpu
----- ----------- ------------------------ ------------ -----------------------
 r  b   avm   fre  re  pi  po  fr   sr  cy  in   sy  cs us sy id wa    pc    ec
 6 14 5743849 37050   0   0   0   0    0   0 2746 20905 8325 54 12 12 22  3.00 100.1
 1 16 5745436 33314   0   1   0   0    0   0 2227 20878 6592 35  8 45 12  1.96  65.3
 3 14 5742243 34861   0   0   0   0    0   0 1950 10124 5329 47  8 31 14  2.83  94.3
 2 11 5746757 28637   0   0   0 128  341   0 1906 14146 5408 67 12 16  6  3.37 112.3
22  6 5751165 27353   0   0   0 4640 23057   0 1869 12483 4611 90 10  0  0  4.66 155.5
17  9 5747297 32347   0   0   0 2975 16857   0 2575 21496 6189 80 12  1  6  5.58 186.0
 8 14 5749972 28238   0   0   0 529 3547   0 2258 12850 6341 56 11  6 28  3.44 114.8
 3  5 5743270 32441   0   0   0   0    0   0 2183 14096 6413 28  8 47 16  1.60  53.4
 5 10 5748774 27373   0   0   0 2328 23608   0 1915 12380 5527 65 14 18  3  3.38 112.6
17  9 5754490 27378   0   0   0 7237 96255   0 1373 8289 3988 89 11  0  0  4.64 154.6


vmo -x | grep -iE "minperm|maxperm|lru_file|minfree|maxfree"

lru_file_repage,0,1,0,0,1,boolean,D,
maxclient%,15,80,15,1,100,% memory,D,maxperm% minperm%
maxfree,5500,1088,5500,8,204800,4KB pages,D,minfree memory_frames
maxperm,7223414,,7223414,,,,S,
maxperm%,90,80,90,1,100,% memory,D,minperm% maxclient%
minfree,5468,960,5468,8,204800,4KB pages,D,maxfree memory_frames
minperm,401300,,401300,,,,S,
minperm%,5,20,5,1,100,% memory,D,maxperm% maxclient%
strict_maxclient,1,1,1,0,1,boolean,D,strict_maxperm
strict_maxperm,0,0,0,0,1,boolean,D,strict_maxclient

lsdev -C| grep aio

aio0        Available                 Asynchronous I/O (Legacy)
posix_aio0  Defined                   Posix Asynchronous I/O


lsattr -El aio0

autoconfig available STATE to be configured at system restart True
fastpath   enable    State of fast path                       True
kprocprio  39        Server PRIORITY                          True
maxreqs    4096      Maximum number of REQUESTS               True
maxservers 10        MAXIMUM number of servers per cpu        True
minservers 1         MINIMUM number of servers                True


vmstat -v| tail -7

                  744 pending disk I/Os blocked with no pbuf
                 4054 paging space I/Os blocked with no psbuf
                13823 filesystem I/Os blocked with no fsbuf
                    0 client filesystem I/Os blocked with no fsbuf
                 2486 external pager filesystem I/Os blocked with no fsbuf
                    0 Virtualized Partition Memory Page Faults
                 0.00 Time resolving virtualized partition memory page faults


ioo -x| grep -iE "pv_min_pbuf|j2_maxPageReadAhead|j2_nBufferPerPagerDevice|j2_dynamicBufferPreallocation"

j2_dynamicBufferPreallocation,16,16,16,0,256,16K slabs,D,
j2_maxPageReadAhead,128,128,128,0,65536,4KB pages,D,
j2_nBufferPerPagerDevice,1024,512,1024,512,262144,,M,
pv_min_pbuf,1024,512,1024,512,2147483647,,D,
 
You could try using 'crush' to clear out your mem.

Mike

"Whenever I dwell for any length of time on my own shortcomings, they gradually begin to seem mild, harmless, rather engaging little things, not at all like the staring defects in other people's characters."
 
Looks fine so far. Maybe check the number of active AIO servers with nmon -A and the AIO requests with iostat -A. Higher values for maxservers and minservers might help a bit with I/O, depending on what you see with nmon -A at busy times. Also, maxreqs should be set to a higher value, say 32768, just in case you hit the limit; hitting it would cause a crash with an error message saying that you reached the AIO maxreqs maximum.

Otherwise I don't see any performance issue apart from samples like this one:
22 6 5751165 27353 0 0 0 4640 23057 0 1869 12483 4611 90 10 0 0 4.66 155.5

Do you have SMT activated? You can check this with "smtctl". IIRC, having SMT turned on is recommended for Oracle DBs.

It might be worth trying to add 1-2 logical CPUs to see if the load eases a bit.

Is AIXTHREAD_SCOPE=S set in the profile/environment of your instance user?


laters
zaxxon
 
What does that line you quoted mean? Is the 22 the run queue, and the 6 the wait queue?
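
A small sketch that labels the main fields of such a vmstat line may help here. The function name `explain_vmstat_line` and the label names are made up for illustration, and the column positions assume the 19-column header shown in this thread (r b avm fre re pi po fr sr cy in sy cs us sy id wa pc ec).

```shell
# Hypothetical helper: reads vmstat data lines on stdin and prints the
# queue and CPU columns with descriptive labels.
explain_vmstat_line() {
  awk 'NF == 19 && $1 ~ /^[0-9]+$/ {
    # $1=r (runnable threads), $2=b (threads waiting on a resource or I/O),
    # $14-$17=us/sy/id/wa CPU percentages, $18=pc (physical CPUs consumed),
    # $19=ec (percent of entitled capacity consumed)
    printf "run_queue=%s blocked=%s user=%s%% sys=%s%% idle=%s%% iowait=%s%% phys_cpu=%s entitled=%s%%\n",
      $1, $2, $14, $15, $16, $17, $18, $19
  }'
}

# Usage on the AIX box (read-only):  vmstat 1 10 | explain_vmstat_line
```

For the quoted line this prints run_queue=22 blocked=6 user=90% sys=10% idle=0% iowait=0% phys_cpu=4.66 entitled=155.5%. The last two values are consistent with each other: with ent=3.00, 4.66 / 3.00 is roughly 155%, so in that second the partition was using about one and a half times its entitlement.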

Code:
smtctl

This system is SMT capable.

SMT is currently enabled.

SMT boot mode is not set.
SMT threads are bound to the same virtual processor.

proc0 has 2 SMT threads.
Bind processor 0 is bound with proc0
Bind processor 1 is bound with proc0


proc2 has 2 SMT threads.
Bind processor 2 is bound with proc2
Bind processor 3 is bound with proc2


proc4 has 2 SMT threads.
Bind processor 4 is bound with proc4
Bind processor 5 is bound with proc4


proc6 has 2 SMT threads.
Bind processor 6 is bound with proc6
Bind processor 7 is bound with proc6


proc8 has 2 SMT threads.
Bind processor 8 is bound with proc8
Bind processor 9 is bound with proc8


proc10 has 2 SMT threads.
Bind processor 10 is bound with proc10
Bind processor 11 is bound with proc10


AIXTHREAD_SCOPE=S

Around the same time that performance started to go down, our network group made some domain controller server changes, and there was a change to our Unix servers to have them sync their time with the NT servers. I'm not really sure what impact, if any, that could have, or even how to check those processes; I just thought I'd mention it.
 
Is there any way to view the cumulative amount of CPU that processes have used?
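
One possible approach, sketched under assumptions: the TIME column of `ps aux` already holds the CPU time each running process has accumulated, so it can be summed per command. The function name `cpu_time_by_command` is made up, and the parsing assumes the minutes:seconds TIME format seen in the ps output earlier in this thread.

```shell
# Hypothetical helper: reads "ps aux" output on stdin and prints total
# accumulated CPU seconds per command name, largest first.
cpu_time_by_command() {
  awk '
    {
      # STIME is either one field (13:59:36) or two (Dec 20), so locate
      # TIME as the first mm:ss-shaped field from column 9 onward.
      for (i = 9; i <= NF; i++) {
        if ($i ~ /^[0-9]+:[0-9][0-9]$/) {
          split($i, t, ":")
          secs[$(i + 1)] += t[1] * 60 + t[2]
          break
        }
      }
    }
    END { for (c in secs) printf "%d %s\n", secs[c], c }
  ' | sort -rn
}

# Usage on the AIX box (read-only):  ps aux | cpu_time_by_command | head -20
```

Two caveats: the wait processes will top the list because they accumulate idle time, and processes that have already exited are not counted at all.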
 
Your maxservers setting is low. If you look at your blocked queue and wait time, you will see that the numbers are high. There is also a very high scanned:freed ratio sustained for brief periods of several seconds.

I have written a roughly 13-page document on tuning AIX servers and another, around 10 pages I believe, on AIX with databases; they were compiled after working with AIX for 14 years. I plan to put them on my website, which I hope to have up by the end of this month.
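
The scanned:freed ratio is the `sr` column divided by the `fr` column of vmstat. A minimal sketch for computing it per sample follows; the function name `scan_free_ratio` is made up, and it assumes the 19-column vmstat layout posted in this thread.

```shell
# Hypothetical helper: reads vmstat data lines on stdin and prints pages
# freed (fr), pages scanned (sr) and sr/fr for every sample where fr > 0.
scan_free_ratio() {
  awk 'NF == 19 && $1 ~ /^[0-9]+$/ && $8 > 0 {
    printf "fr=%s sr=%s ratio=%.2f\n", $8, $9, $9 / $8
  }'
}

# Usage on the AIX box (read-only):  vmstat 1 10 | scan_free_ratio
```

For the last sample of the `vmstat 1 10` run above (fr=7237, sr=96255) the ratio comes out at about 13.3, meaning the page stealer examined roughly 13 pages for every page it managed to free.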
 
VOILA!!!

There was a script stuck in a loop on another server that was continuously trying to make sqlplus connections. This was gradually bringing the machine to its knees.
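
A script that reconnects in a loop can be kept from running away by capping the number of attempts and pausing between them. The following is a generic sketch, not the script from this thread: the function name `retry_with_limit` and the user, database and file names in the usage comment are all hypothetical.

```shell
# Hypothetical helper: run a command up to max_attempts times, sleeping
# delay_seconds between failures, then give up instead of looping forever.
retry_with_limit() {
  max_attempts=$1
  delay_seconds=$2
  shift 2
  attempt=1
  while [ "$attempt" -le "$max_attempts" ]; do
    if "$@"; then
      return 0
    fi
    attempt=$((attempt + 1))
    if [ "$attempt" -le "$max_attempts" ]; then
      sleep "$delay_seconds"
    fi
  done
  echo "giving up after $max_attempts attempts: $*" >&2
  return 1
}

# Usage (placeholder names): at most 5 connection attempts, 30 seconds apart.
#   retry_with_limit 5 30 sqlplus -L -S scott@mydb @check.sql
```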
 
Glad you found it :) I would still raise the aio0 maxreqs to 32768, though, just in case you hit the current limit of 4096.

Also, raising minservers to 10 and maxservers to 100 can improve performance at peak times. But as I said, maybe monitor AIO with nmon -A and iostat -A.
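
To review such a change before touching anything, one option is to generate the chdev commands from the current `lsattr -El aio0` output and only print them. This is a sketch under assumptions: the function name `aio_chdev_plan` is made up, the target values are the ones suggested in this thread, and `-P` (record the change in the device database and apply it at the next restart) is assumed to be the appropriate way to change the legacy aio0 device on this AIX level.

```shell
# Hypothetical helper: reads "lsattr -El aio0" output on stdin and prints
# (does not run) a chdev command for each attribute below its target.
aio_chdev_plan() {
  awk '
    BEGIN {
      # Target values taken from the suggestions in this thread
      want["maxreqs"] = 32768
      want["minservers"] = 10
      want["maxservers"] = 100
    }
    ($1 in want) && ($2 + 0) < want[$1] {
      printf "chdev -l aio0 -a %s=%s -P\n", $1, want[$1]
    }
  '
}

# Usage on the AIX box (prints commands only):  lsattr -El aio0 | aio_chdev_plan
```

With the lsattr output posted above (maxreqs 4096, maxservers 10, minservers 1) this prints three chdev lines, which can then be checked by the system administrator before anyone runs them.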

laters
zaxxon
 