
plenty of real mem, paging baaad.

exsnafu (Technical User), Apr 25, 2008
OK, so I have a system that last week was on a P5 P570 and this week has been migrated to a P6 P570 server.

Since the tech refresh, we've begun seeing periodic system hangs where, for up to 30 seconds, we lose all response from the system and then it recovers. This appears to happen when the system begins to page heavily and then stops. Tuning params were left the same through the migration. AIX 5.3 TL6 0811.

48GB of real memory
8GB page space

It runs an Oracle Documentum application and some Java apps. All working storage for the apps is hosted via NetApp filers, and CIO is enabled on the NetApp mounts.

Now, for the details and why I need help understanding this: the box doesn't seem overcommitted on real memory, and my fre list remains fairly healthy. Page faults in nmon are near-constantly at the 10k-or-higher mark, odio/s in my sar output seems high, and the system isn't just slowing down; it's occasionally downright freezing up:

my output:

vmo -a
cpu_scale_memp = 8
data_stagger_interval = 161
defps = 1
force_relalias_lite = 0
framesets = 2
htabscale = n/a
kernel_heap_psize = 4096
kernel_psize = 16777216
large_page_heap_size = 0
lgpg_regions = 0
lgpg_size = 0
low_ps_handling = 1
lru_file_repage = 1
lru_poll_interval = 10
lrubucket = 131072
maxclient% = 17
maxfree = 150000
maxperm = 694322
maxperm% = 20
maxpin = 2961549
maxpin% = 80
mbuf_heap_psize = 65536
memory_affinity = 1
memory_frames = 3670016
memplace_data = 2
memplace_mapped_file = 2
memplace_shm_anonymous = 2
memplace_shm_named = 2
memplace_stack = 2
memplace_text = 2
memplace_unmapped_file = 2
mempools = 4
minfree = 125000
minperm = 173579
minperm% = 5
nokilluid = 0
npskill = 16384
npsrpgmax = 131072
npsrpgmin = 98304
npsscrubmax = 131072
npsscrubmin = 98304
npswarn = 65536
num_spec_dataseg = 0
numpsblks = 2097152
page_steal_method = 0
pagecoloring = n/a
pinnable_frames = 3219621
pta_balance_threshold = n/a
relalias_percentage = 0
rpgclean = 0
rpgcontrol = 2
scrub = 0
scrubclean = 0
soft_min_lgpgs_vmpool = 0
spec_dataseg_int = 512
strict_maxclient = 1
strict_maxperm = 0
v_pinshm = 0
vm_modlist_threshold = -1
vmm_fork_policy = 1
vmm_mpsize_support = 1

vmstat -v
3670016 memory pages
3471618 lruable pages
1404554 free pages
4 memory pools
449877 pinned pages
80.0 maxpin percentage
5.0 minperm percentage
20.0 maxperm percentage
11.9 numperm percentage
415321 file pages
0.0 compressed percentage
0 compressed pages
11.9 numclient percentage
17.0 maxclient percentage
415321 client pages
0 remote pageouts scheduled
82192 pending disk I/Os blocked with no pbuf
8925690 paging space I/Os blocked with no psbuf
2484 filesystem I/Os blocked with no fsbuf
0 client filesystem I/Os blocked with no fsbuf
328 external pager filesystem I/Os blocked with no fsbuf
0 Virtualized Partition Memory Page Faults
0.00 Time resolving virtualized partition memory page faults
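
(A side note on the vmstat -v output above: the 8,925,690 paging space I/Os blocked with no psbuf and the 82,192 disk I/Os blocked with no pbuf both stand out. A minimal sketch of how one might chase those on AIX 5.3; lvmo exists from 5.3 on, and the pv_pbuf_count value below is illustrative, not a recommendation:)

# psbufs are allocated per paging-space device, so a huge blocked
# count usually argues for more (or better-spread) paging devices
# rather than a tunable. Check the current layout first:
lsps -a

# pbufs are per volume group in 5.3; watch pervg_blocked_io_count:
lvmo -v rootvg -a

# If the blocked count keeps climbing, raise the pool (illustrative value):
lvmo -v rootvg -o pv_pbuf_count=1024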

vmstat snapshot:
System configuration: lcpu=4 mem=14336MB ent=1.10

kthr memory page faults cpu
----- ----------- ------------------------ ------------ -----------------------
r b avm fre re pi po fr sr cy in sy cs us sy id wa pc ec
2 3 2746337 1317200 0 197 0 569 3537 0 7709 92532 37534 49 46 2 3 1.53 139.3
1 2 2744825 1328270 0 198 0 5455 51059 0 6707 244644 26960 46 51 1 2 1.69 153.3
3 2 2744651 1327869 0 252 0 0 0 0 8165 142894 36313 59 37 2 2 1.77 161.3
5 0 2744799 1327302 0 159 0 0 0 0 6462 227474 26351 68 30 1 1 1.77 161.0
5 1 2748046 1323540 0 159 0 0 0 0 7553 167745 31993 65 33 1 1 1.92 174.2
8 1 2748656 1317482 0 204 0 764 5616 0 8015 61377 26428 75 24 0 1 1.95 177.4
3 1 2740395 1325212 0 206 0 0 0 0 8489 67894 38443 67 30 1 2 1.81 164.2
4 1 2738443 1326707 0 151 0 450 2696 0 8443 223273 34811 69 29 1 1 1.85 168.3
3 1 2738874 1325913 0 149 0 0 0 0 7396 225690 28995 66 32 1 1 1.84 167.4
10 1 2743489 1321071 0 164 0 1389 9392 0 6101 173810 18045 77 22 0 1 1.91 173.5

snapshot of sar -r

System configuration: lcpu=4 mem=14336MB ent=1.10 mode=Uncapped

09:48:56 slots cycle/s fault/s odio/s
09:48:58 732791 0.00 2031.46 4241.20
09:49:00 733213 0.00 2861.50 5468.00
09:49:02 733719 0.00 3018.41 6308.96
09:49:04 734687 0.00 3265.50 5636.50
09:49:06 735312 0.00 5361.89 4921.54
09:49:08 735992 0.00 4902.63 4995.24
09:49:10 736791 0.00 3792.04 5480.60
09:49:12 737377 0.00 8157.50 6745.00
09:49:14 738008 0.00 4363.00 7096.50
09:49:16 738654 0.00 7523.88 5018.41

Average 735654 0 4528 5591

nmon snap:
. Memory ..............................................................................
. Physical PageSpace | pages/sec In Out | FileSystemCache .
.% Used 63.3% 55.1% | to Paging Space 170.5 0.0 | (numperm) 8.5% .
.% Free 36.7% 44.9% | to File System 212.5 206.0 | Process 45.2% .
.MB Used 9068.8MB 4511.8MB | Page Scans 58679.8 | System 9.6% .
.MB Free 5267.2MB 3680.2MB | Page Cycles 0.0 | Free 36.7% .
.Total(MB) 14336.0MB 8192.0MB | Page Steals 12500.4 | ------ .
. | Page Faults 7499.4 | Total 100.0% .
.------------------------------------------------------------ | numclient 8.5% .
.Min/Maxperm 678MB( 5%) 2712MB( 19%) <--% of RAM | maxclient 16.1% .
.Min/Maxfree 125000 150000 Total Virtual 22.0GB | User 47.9% .
.Min/Maxpgahead 2 8 Accessed Virtual 10.5GB 47.9% Pinned 12.2% .
. .
. Kernel ..............................................................................
.RunQueue= 3.0 | swapIn = 1.0 | Directory Search | Kernel Processes .
.pswitch = 19424.8 | syscall= 54227.9 | iget = 0.0 | ksched= 0.0 .
.fork = 8.5 | read = 2326.0 | dirblk= 0.0 | koverf= 0.0 .
.exec = 6.5 | write = 659.5 | namei = 3442.0 | kexit = 0.0 .
.msg = 2.5 | readch = 11379617.4 | Load Averages .
.sem = 91.0 | writech= 424038.1 | 1 min = 3.31 .
.HW Intrp= 4770.9 | R+W(MB/s)= 11.3 | 5 min = 4.50 .
.SW Intrp= 382.5 | Up Time=5.0 days (max=497) | 15 min= 4.64 .


Also, during my periods of lockup I'll run lvmstat and iostat: my tm_act% is 90-100% for both hdisks, and not surprisingly lvmstat shows my hd6 paging space as the big winner of all that time. The /tmp mount comes in a distant 2nd.
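
(For the record, a minimal capture one could leave running to bracket the next freeze, using only stock AIX tools; the 2-second interval and the /tmp log paths are arbitrary choices:)

# Background samplers so the freeze window shows up with context:
vmstat 2 >> /tmp/vmstat.log &
sar -r 2 1800 >> /tmp/sar_r.log &
iostat 2 >> /tmp/iostat.log &

# One-shot global memory picture (work/pers/clnt split) on demand:
svmon -G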

So, I inherited these boxes from another admin who is no longer with the company. I'm concerned about my minperm/maxperm settings and lru_file_repage=1. From what I've read and understand, setting minperm/maxperm at the lower end of the spectrum like this has historically been a good idea for Oracle DB workloads where the database handles its own caching, etc., but on AIX 5.3 and later it seems the recommendation is to leave min/maxperm at the defaults and disable lru_file_repage. Am I hurting myself with the current combination?

And a note on the workload: the Oracle instance on this box, I'm told, is actually not much of it; it's more some of these other Java apps. They're moving data around on my NetApp filers, and those mounts are using CIO. So with CIO enabled, would I still be using much of my memory range in min/maxperm?
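
(On the CIO question: CIO bypasses the VMM file cache entirely, so numperm should mostly reflect whatever is not mounted with CIO, which squares with numperm sitting near 12% here. A hedged sketch of how to see the split with stock tools:)

# Global working/persistent/client page breakdown; with CIO doing
# its job, the clnt figures should stay comparatively small:
svmon -G

# Top 5 processes by real memory, to see who owns the working pages:
svmon -P -t 5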

 
I would suggest you change minperm/maxperm to lower values!

Have a look at this link; it has nice comments on memory tuning:

Another good link, on the DB CIO performance guide:

And here is one more interesting comparison, between CIO and buffered JFS2:


Regards,
Khalid
 
Whoops, I noticed up there I said the system has 48GB real mem. I meant to say 14GB.
 
What are the ulimit values set for the oracle user account? Have you set them to unlimited, or to a value that is more than the sum of real memory and paging space?
 
FYI, oracle is using the default ulimits for the system, and nothing is set higher than real memory.
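
(A quick way to sanity-check that, assuming the instance owner is an account literally named oracle; the account name is a guess:)

# Effective limits as the oracle account sees them:
su - oracle -c "ulimit -a"

# Stored limits from /etc/security/limits; note fsize/data/stack/rss
# are expressed there in 512-byte blocks:
lsuser -a fsize data stack rss nofiles oracle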

I did decide on Friday to lower my minperm to 3%, based on the numperm/numclient numbers sometimes dipping below the previous 5%. I also disabled lru_file_repage.

Now, what I see is that numperm will drop below 3%, and sometimes when we drop below 3% on numperm we also witness the heavy pi/po that results in a temporary lockup until things free themselves.

Basically, lockups seem to occur while numperm is lower than minperm, but not always while it's very low.

So my question: should I just go ahead and drop minperm further, to 1%, since I often see numperm around 1.5-2% during periods of lockup? Or am I barking up the wrong tree / getting tunnel vision?
 
Here you have set low values for maxperm% (20) and maxclient% (17). I would recommend increasing these to 90%, or at least 80%. You need not drop the minperm value further. On AIX 5.3, IBM recommends keeping these values high, since AIX 5.3 memory management is different from earlier versions.
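
(A sketch of applying that advice with vmo, using the 80% figure from the post above; the -p flag makes the change persistent across reboots:)

# Raise the file-cache ceilings (maxclient% must not exceed maxperm%):
vmo -p -o maxperm%=80 -o maxclient%=80

# Then confirm numperm/numclient now sit well below the caps:
vmstat -v | egrep -i "perm|client"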
 
Turns out, for what it's worth, that the issue is someone had jacked around with the minfree/maxfree values on this (and other) systems.

Setting them back to the defaults resolved the issue.
 
Good news! I hope maxperm is at 80% now, which is the default value in AIX 5.3.
 
Yeah, I had already set minperm/maxperm back to the defaults. The piece I didn't understand about minfree was that it applies to each individual memory pool. I'd seen before that it was a high number, 512MB, but since my fre list was never getting down to that I didn't think it was affecting things.
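
(Putting numbers on that per-pool point, since it is the crux of the fix; the AIX 5.3 defaults of minfree=960/maxfree=1088 and the arithmetic below are the only assumptions:)

# minfree=125000 frames/pool x 4 KB = ~488 MB per pool, and with
# mempools=4 that is ~1.9 GB lrud tries to keep free overall.
# Any one pool dipping below its own threshold triggers stealing,
# even while the global 'fre' column still looks healthy.

# Reset to the AIX 5.3 defaults, persistently:
vmo -p -o minfree=960 -o maxfree=1088

# Confirm the pool count used in the math:
vmo -o mempools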
 