OK, so I have a system that was on a P5 p570 last week and has been migrated to a P6 p570 server this week.
Since the tech refresh, we've begun seeing periodic system hangs: for up to 30 seconds we lose all response from the system, and then it recovers. This appears to happen when the system begins to page heavily and then stops. Tuning parameters were left the same through the migration. AIX 5.3 TL6 0811
48GB of real memory
8GB page space
It runs an Oracle Documentum application and some Java apps. All working storage for the apps is hosted on NetApp filers. CIO is enabled on the NetApp mounts.
Now for the details, and why I need help understanding them. The box doesn't seem overcommitted on real memory; my free list remains fairly healthy. Page faults in nmon are almost constantly at the 10k mark or higher. odio/s in my sar output seems high, and the system isn't just slowing down, it's occasionally freezing up outright:
my output:
vmo -a
cpu_scale_memp = 8
data_stagger_interval = 161
defps = 1
force_relalias_lite = 0
framesets = 2
htabscale = n/a
kernel_heap_psize = 4096
kernel_psize = 16777216
large_page_heap_size = 0
lgpg_regions = 0
lgpg_size = 0
low_ps_handling = 1
lru_file_repage = 1
lru_poll_interval = 10
lrubucket = 131072
maxclient% = 17
maxfree = 150000
maxperm = 694322
maxperm% = 20
maxpin = 2961549
maxpin% = 80
mbuf_heap_psize = 65536
memory_affinity = 1
memory_frames = 3670016
memplace_data = 2
memplace_mapped_file = 2
memplace_shm_anonymous = 2
memplace_shm_named = 2
memplace_stack = 2
memplace_text = 2
memplace_unmapped_file = 2
mempools = 4
minfree = 125000
minperm = 173579
minperm% = 5
nokilluid = 0
npskill = 16384
npsrpgmax = 131072
npsrpgmin = 98304
npsscrubmax = 131072
npsscrubmin = 98304
npswarn = 65536
num_spec_dataseg = 0
numpsblks = 2097152
page_steal_method = 0
pagecoloring = n/a
pinnable_frames = 3219621
pta_balance_threshold = n/a
relalias_percentage = 0
rpgclean = 0
rpgcontrol = 2
scrub = 0
scrubclean = 0
soft_min_lgpgs_vmpool = 0
spec_dataseg_int = 512
strict_maxclient = 1
strict_maxperm = 0
v_pinshm = 0
vm_modlist_threshold = -1
vmm_fork_policy = 1
vmm_mpsize_support = 1
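To put the page-count tunables above into more familiar units, here is a minimal Python sketch that converts a few of them to MB. It assumes vmo reports these values in 4 KB frames; that assumption is at least consistent with the rest of this post, since memory_frames works out to the 14336MB that vmstat reports, and minperm/maxperm come out at the 678MB/2712MB that nmon shows. The helper name pages_to_mb is only illustrative.

```python
PAGE_KB = 4  # assumption: vmo page counts are in 4 KB frames


def pages_to_mb(pages: int) -> float:
    """Convert a count of 4 KB pages to megabytes."""
    return pages * PAGE_KB / 1024


# Values copied from the vmo -a output above.
vmo_pages = {
    "memory_frames": 3670016,
    "minfree": 125000,
    "maxfree": 150000,
    "minperm": 173579,
    "maxperm": 694322,
}

for name, pages in vmo_pages.items():
    print(f"{name:14s} {pages:>9d} pages = {pages_to_mb(pages):9.1f} MB")
```

Read this way, minfree and maxfree correspond to a free-list target of very roughly 490 to 590 MB, if the 4 KB assumption holds.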
vmstat -v
3670016 memory pages
3471618 lruable pages
1404554 free pages
4 memory pools
449877 pinned pages
80.0 maxpin percentage
5.0 minperm percentage
20.0 maxperm percentage
11.9 numperm percentage
415321 file pages
0.0 compressed percentage
0 compressed pages
11.9 numclient percentage
17.0 maxclient percentage
415321 client pages
0 remote pageouts scheduled
82192 pending disk I/Os blocked with no pbuf
8925690 paging space I/Os blocked with no psbuf
2484 filesystem I/Os blocked with no fsbuf
0 client filesystem I/Os blocked with no fsbuf
328 external pager filesystem I/Os blocked with no fsbuf
0 Virtualized Partition Memory Page Faults
0.00 Time resolving virtualized partition memory page faults
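The "blocked with no ..." counters above are, as far as I understand it, cumulative since boot, so one snapshot can't show whether they are still climbing during a hang. Below is a rough Python sketch for pulling them out of saved vmstat -v output and diffing two captures (for example, one taken before a freeze and one after). The function names are hypothetical, and the sample text is just the lines from the output above.

```python
import re

# Matches lines such as "8925690 paging space I/Os blocked with no psbuf"
BLOCKED_RE = re.compile(r"^\s*(\d+)\s+(.+ blocked with no \w+)\s*$")


def blocked_counters(vmstat_v_text):
    """Return {description: count} for every 'blocked with no ...' line."""
    counters = {}
    for line in vmstat_v_text.splitlines():
        match = BLOCKED_RE.match(line)
        if match:
            counters[match.group(2)] = int(match.group(1))
    return counters


def counter_deltas(before, after):
    """Growth of each counter between two captures (missing keys count as 0)."""
    return {name: after[name] - before.get(name, 0) for name in after}


sample = """\
82192 pending disk I/Os blocked with no pbuf
8925690 paging space I/Os blocked with no psbuf
2484 filesystem I/Os blocked with no fsbuf
0 client filesystem I/Os blocked with no fsbuf
328 external pager filesystem I/Os blocked with no fsbuf
"""

for name, count in blocked_counters(sample).items():
    print(f"{count:>9d}  {name}")
```

Passing two such dictionaries to counter_deltas would show which counter grows across a hang; in this snapshot the psbuf counter is by far the largest.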
vmstat snapshot:
System configuration: lcpu=4 mem=14336MB ent=1.10
kthr  memory          page                    faults            cpu
----- --------------- ----------------------- ----------------- ----------------------
 r  b     avm     fre re  pi po   fr    sr cy   in     sy    cs us sy id wa   pc    ec
 2  3 2746337 1317200  0 197  0  569  3537  0 7709  92532 37534 49 46  2  3 1.53 139.3
 1  2 2744825 1328270  0 198  0 5455 51059  0 6707 244644 26960 46 51  1  2 1.69 153.3
 3  2 2744651 1327869  0 252  0    0     0  0 8165 142894 36313 59 37  2  2 1.77 161.3
 5  0 2744799 1327302  0 159  0    0     0  0 6462 227474 26351 68 30  1  1 1.77 161.0
 5  1 2748046 1323540  0 159  0    0     0  0 7553 167745 31993 65 33  1  1 1.92 174.2
 8  1 2748656 1317482  0 204  0  764  5616  0 8015  61377 26428 75 24  0  1 1.95 177.4
 3  1 2740395 1325212  0 206  0    0     0  0 8489  67894 38443 67 30  1  2 1.81 164.2
 4  1 2738443 1326707  0 151  0  450  2696  0 8443 223273 34811 69 29  1  1 1.85 168.3
 3  1 2738874 1325913  0 149  0    0     0  0 7396 225690 28995 66 32  1  1 1.84 167.4
10  1 2743489 1321071  0 164  0 1389  9392  0 6101 173810 18045 77 22  0  1 1.91 173.5
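To make the paging columns in that snapshot easier to compare, here is a small Python sketch that averages pi and po and computes the ratio of pages scanned (sr) to pages freed (fr) over the ten samples. The column order is taken from the vmstat header; the function names and the cpu_sy key are my own, and this is only a quick summary of the numbers above, not a diagnosis.

```python
# Column order from the vmstat header; 'sy' appears twice, so the CPU one
# is renamed cpu_sy here to keep the dictionary keys unique.
COLUMNS = ["r", "b", "avm", "fre", "re", "pi", "po", "fr", "sr", "cy",
           "in", "sy", "cs", "us", "cpu_sy", "id", "wa", "pc", "ec"]

# The ten data rows from the vmstat snapshot above.
ROWS = """\
2 3 2746337 1317200 0 197 0 569 3537 0 7709 92532 37534 49 46 2 3 1.53 139.3
1 2 2744825 1328270 0 198 0 5455 51059 0 6707 244644 26960 46 51 1 2 1.69 153.3
3 2 2744651 1327869 0 252 0 0 0 0 8165 142894 36313 59 37 2 2 1.77 161.3
5 0 2744799 1327302 0 159 0 0 0 0 6462 227474 26351 68 30 1 1 1.77 161.0
5 1 2748046 1323540 0 159 0 0 0 0 7553 167745 31993 65 33 1 1 1.92 174.2
8 1 2748656 1317482 0 204 0 764 5616 0 8015 61377 26428 75 24 0 1 1.95 177.4
3 1 2740395 1325212 0 206 0 0 0 0 8489 67894 38443 67 30 1 2 1.81 164.2
4 1 2738443 1326707 0 151 0 450 2696 0 8443 223273 34811 69 29 1 1 1.85 168.3
3 1 2738874 1325913 0 149 0 0 0 0 7396 225690 28995 66 32 1 1 1.84 167.4
10 1 2743489 1321071 0 164 0 1389 9392 0 6101 173810 18045 77 22 0 1 1.91 173.5
"""


def parse_rows(text):
    """Turn whitespace-separated vmstat rows into dicts keyed by column."""
    rows = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) == len(COLUMNS):
            rows.append(dict(zip(COLUMNS, map(float, fields))))
    return rows


def summarize(rows):
    """Average page-ins/outs and the overall scanned-to-freed page ratio."""
    count = len(rows)
    total_fr = sum(row["fr"] for row in rows)
    total_sr = sum(row["sr"] for row in rows)
    return {
        "avg_pi": sum(row["pi"] for row in rows) / count,
        "avg_po": sum(row["po"] for row in rows) / count,
        "scan_to_free": total_sr / total_fr if total_fr else None,
    }


print(summarize(parse_rows(ROWS)))
```

For these ten samples that comes to an average pi of about 184 with po at zero, and roughly 8.4 pages scanned for every page freed.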
snapshot of sar -r
System configuration: lcpu=4 mem=14336MB ent=1.10 mode=Uncapped
09:48:56 slots cycle/s fault/s odio/s
09:48:58 732791 0.00 2031.46 4241.20
09:49:00 733213 0.00 2861.50 5468.00
09:49:02 733719 0.00 3018.41 6308.96
09:49:04 734687 0.00 3265.50 5636.50
09:49:06 735312 0.00 5361.89 4921.54
09:49:08 735992 0.00 4902.63 4995.24
09:49:10 736791 0.00 3792.04 5480.60
09:49:12 737377 0.00 8157.50 6745.00
09:49:14 738008 0.00 4363.00 7096.50
09:49:16 738654 0.00 7523.88 5018.41
Average 735654 0 4528 5591
nmon snap:
. Memory ..............................................................................
. Physical PageSpace | pages/sec In Out | FileSystemCache .
.% Used 63.3% 55.1% | to Paging Space 170.5 0.0 | (numperm) 8.5% .
.% Free 36.7% 44.9% | to File System 212.5 206.0 | Process 45.2% .
.MB Used 9068.8MB 4511.8MB | Page Scans 58679.8 | System 9.6% .
.MB Free 5267.2MB 3680.2MB | Page Cycles 0.0 | Free 36.7% .
.Total(MB) 14336.0MB 8192.0MB | Page Steals 12500.4 | ------ .
. | Page Faults 7499.4 | Total 100.0% .
.------------------------------------------------------------ | numclient 8.5% .
.Min/Maxperm 678MB( 5%) 2712MB( 19%) <--% of RAM | maxclient 16.1% .
.Min/Maxfree 125000 150000 Total Virtual 22.0GB | User 47.9% .
.Min/Maxpgahead 2 8 Accessed Virtual 10.5GB 47.9% Pinned 12.2% .
. .
. Kernel ..............................................................................
.RunQueue= 3.0 | swapIn = 1.0 | Directory Search | Kernel Processes .
.pswitch = 19424.8 | syscall= 54227.9 | iget = 0.0 | ksched= 0.0 .
.fork = 8.5 | read = 2326.0 | dirblk= 0.0 | koverf= 0.0 .
.exec = 6.5 | write = 659.5 | namei = 3442.0 | kexit = 0.0 .
.msg = 2.5 | readch = 11379617.4 | Load Averages .
.sem = 91.0 | writech= 424038.1 | 1 min = 3.31 .
.HW Intrp= 4770.9 | R+W(MB/s)= 11.3 | 5 min = 4.50 .
.SW Intrp= 382.5 | Up Time=5.0 days (max=497) | 15 min= 4.64 .
Also, during the lockups I run lvmstat and iostat: tm_act% is 90-100% for both hdisks and, not surprisingly, lvmstat shows my hd6 paging space as by far the biggest consumer of that time. The /tmp mount comes in a distant second.
So, I inherited these boxes from someone who is no longer with the company. I'm concerned about my minperm/maxperm settings and lru_file_repage=1. From what I've read and understand, setting minperm/maxperm at the low end of the range, as they are here, was historically a good idea for Oracle DB workloads where the database handles its own caching. But from AIX 5.3 on, the recommendation seems to be to leave minperm/maxperm at the defaults and rely on the lru_file_repage tunable instead (set to 0, if I'm reading the guidance correctly). Am I hurting myself with the current combination?
And a note on the workload: I'm told the Oracle instance on this box doesn't actually do very much; most of the load comes from the other Java apps. They're moving data around on my NetApp filers, and those mounts use CIO. With CIO enabled, would I still be using much of the memory range governed by minperm/maxperm?