
Phoenix on H70

Status
Not open for further replies.

promise1

IS-IT--Management
Sep 3, 2000
46
NG
I would like help with this issue. I have the Phoenix banking application running on AIX 4.3.3 on H70 servers.

Every so often, the banking application hangs, and the H70 servers have to be rebooted and the Phoenix services started again.

The H70 is a dual-processor server, but only one processor was used for the Phoenix configuration (this setup existed before my employment).

Any help is needed ASAP, as we can't keep rebooting the server.
 
Whether it is dual processor or not, you are probably stuck. It depends on why your system hangs. It could be the app (I would definitely check with the app vendor to see if there are any patches for it). I would also check for patches for AIX; if it is 4.3.3, they are up to maintenance level 7. Or you could be out of paging space. Anyway, the notes below may help you narrow it down.

============================
System Hangs


Several things can cause a hang of the system, but it is important to try and figure out what changed.

Did the system recently get updated? Are there now more users than before? A new program? A new UPS with software? Additional users added?

Does it hang all the time? End of the month?

What happens to the console? Do all ports and telnet sessions hang? Is the console still working? What was required to make it come back?

A system can hang for a number of reasons, including: lack of paging space, running out of space in a root filesystem (/ or /etc or /dev), not enough resources or mbufs, downlevel AIX software, not applying the latest patches at the recommended maintenance level, memory leaks, hardware going bad, etc.

Some things that may happen when a system hangs because of paging space:
Processes requesting additional memory are killed once the system runs low on paging space. The system appears hung as new processes and telnet connections are terminated. Error messages such as "Not enough memory" or "Fork function failed" are generated.
1. Add additional paging space. To know how much paging space is "enough", run the lsps -s command often to get a feel for the %Used of the paging space. As a guideline, a system at its maximum workload should have no more than 80% of paging space used. Example output of lsps -s looks like the following:
Total Paging Space   Percent Used
      200MB               51%
Anything over 51% is suspect, and I would consider adding paging space.
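
The paging space check above can be scripted so it does not depend on someone remembering to look. The following is only a minimal sketch: it assumes the two-line lsps -s output format shown above, and the function names paging_pct and paging_check are made up for illustration (they are not AIX commands).

```shell
#!/bin/sh
# paging_pct: read `lsps -s` output on stdin and print %Used as a bare number.
# Assumes the two-line format shown above (header line, then e.g. "200MB 51%").
paging_pct() {
    awk 'NR == 2 { sub(/%/, "", $2); print $2 }'
}

# paging_check [LIMIT]: warn when %Used is above LIMIT (default 80).
# Both function names are hypothetical helpers, not part of AIX.
paging_check() {
    limit=${1:-80}
    pct=$(paging_pct)
    case "$pct" in
        ''|*[!0-9]*) echo "could not parse lsps -s output"; return 2 ;;
    esac
    if [ "$pct" -gt "$limit" ]; then
        echo "WARNING: paging space ${pct}% used (limit ${limit}%)"
    else
        echo "OK: paging space ${pct}% used"
    fi
}

# On AIX this would be run as:  lsps -s | paging_check 80
```

The limit is an argument, so a stricter value can be passed if 80% feels too late for your workload.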

2. Systems often have plenty of paging space (sometimes 3-4 times RAM) and can still run out. This could be due to a memory leak. The question then is which process is causing the memory leak. Discussed below are ways to find out what process is causing the memory leak and the tools used to accomplish this task.

a. The command ps vg provides useful information. In this case the data in the column labeled SIZE is needed. The SIZE column reports virtual memory (paging space) usage on a per-process basis, in 1KB units. Sample output from ps vg | pg looks like the following:

  PID TTY STAT      TIME PGIN SIZE  RSS LIM TSIZ TRS %CPU %MEM COMMAND
    0   - A        87:42    6   20    8  xx    0   0  0.1  0.0 swapper
    1   - A       191:58   94  240  240  xx   25  28  0.3  0.0 /etc/init
  516   - A     70228:47    0   16   20  xx    0   0 97.0  0.0 kproc
  774   - A         5:53    1   24   28  xx    0   0  0.0  0.0 kproc
 1032   - A        28:40    0   56   56  xx    0   0  0.0  0.0 kproc
 1866   - A         0:00    0   24   20  xx    0   0  0.0  0.0 kproc
 2454   - A         1:32   62  272  224  xx   96  60  0.0  0.0 /usr/dt/b

Collect ps vg output at several points during the period in which %Used from lsps -s grows toward 99%. The output can then be examined for large increases in the SIZE column. A leaking process would show extraordinarily large increases in the amount of paging space it uses between ps vg readings.

There is a tool that creates delta reports of ps vg over any designated period of time. The script is called ps_ and is located in /usr/sbin/perf/pmr. It is only available at AIX 4.1.x and above, and it is not installed by default. The fileset name is bos.perf.pmr and can be installed from the install media. The ps_ script is run with the following syntax:
ps_ <#seconds to run>

It takes a ps vg snapshot at the beginning and end of a designated time period and creates a delta report (final values minus initial values). The output file for ps_ is called ps.sum and is created in /var/perf/tmp. For example, a system user notices the %Used value from lsps -s rises from 40% to 80% in a few hours, eventually reaching 99% and freezing all activity on the system. The user realizes that this is not normal and that there may be a memory leak at hand. Running ps_ 600 every half hour during the time paging space became consumed would most likely reveal the process causing the memory leak. The following is a sample reading of ps_ (as seen below from ps.sum):

       DELTA DELTA DELTA DELTA DELTA DELTA    BEFORE     AFTER
   PID  PGIN  SIZE   RSS   TRS   DRS     C      TIME      TIME CMD
     0     0     0     0     0     0     0     10:58     10:58 swapper
     1     0     0     0     0     0    -1     71:31     71:31 init
   516     0     0     0     0     0     0  17136:33  17137:29 kproc
 50328     1    78  -124     0  -124     1      0:00      0:00 ksh
 50450     0     0     0     0     0     0      0:00      0:00 telnetd
 50724     0   -20     0     0     0     0      0:29      0:29 ttsession
 53746     0     0     0     0     0     0      0:00      0:00 ksh

From the DELTA SIZE column, we can see that PID 50328 allocated 78KB of paging space while ps_ was running. PID 50724, however, deallocated 20KB during this time. Any process showing zero allocated no additional paging space.
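
If the ps_ script is not installed, a similar SIZE delta can be approximated from two saved ps vg snapshots. This is a rough sketch, not the ps_ script itself: it assumes the ps vg column order shown earlier (PID in field 1, SIZE in field 6, COMMAND in field 13), and the function name size_delta and the file names in the usage note are made up for illustration.

```shell
#!/bin/sh
# size_delta BEFORE AFTER: compare two saved `ps vg` snapshots and print
# "PID DELTA_SIZE COMMAND" for every process whose SIZE (1KB units) changed.
# Assumes the column order PID TTY STAT TIME PGIN SIZE ... COMMAND shown earlier.
# size_delta is a hypothetical helper, not an AIX tool.
size_delta() {
    awk '
        FNR == 1 { next }                      # skip the header line of each file
        NR == FNR { before[$1] = $6; next }    # first file: remember SIZE per PID
        ($1 in before) && ($6 != before[$1]) { # second file: report changed SIZE
            print $1, $6 - before[$1], $13
        }
    ' "$1" "$2" | sort -k2,2nr                 # largest growth first
}
```

Usage would be along the lines of ps vg > /tmp/ps.before, wait a while, ps vg > /tmp/ps.after, then size_delta /tmp/ps.before /tmp/ps.after. Processes that started or exited between the two snapshots are ignored, since there is nothing to compare them against.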

b. Another tool that can be used to track a memory leak is svmon. NOTE: PAIDE/6000 must be installed in order to use svmon (and others, such as tprof, netpmon, and filemon). To check whether it is installed, enter: lslpp -l perfagent.tools

If you are at AIX Version 4.3.0 or higher, this fileset can be found on the AIX Base Operating System media. Otherwise, to order PAIDE/6000, contact your AIX support center.

As root, enter the following command:
svmon -Pau 10 | more
This will list the top 10 memory consumers in decreasing order, the first process being the largest consumer. The rest of the report shows memory and paging space usage for each segment of each process. Sample output looks like the following:


  Pid  Command  Inuse  Pin  Pgspace
13794  dtwm      1603    1      449
Pid: 13794
Command: dtwm
Segid  Type  Description          Inuse  Pin  Pgspace  Address Range
  b23  pers  /dev/hd2:24849           2    0        0  0..1
 14a5  pers  /dev/hd2:24842           0    0        0  0..2
 6179  work  lib data               131    0       98  0..891
 280a  work  shared library text   1101    0       10  0..65535
  181  work  private                287    1      341  0..310:65277..65535
 57d5  pers  code,/dev/hd2:61722     82    0        0  0..1

In each process report, find items in the Type column identified as work and in the Description column identified as private, and check how many 4KB (4096-byte) pages are used under the Pgspace column. This is the minimum number of working pages this segment is using in all of virtual memory. A Pgspace number that grows but never decreases may indicate a memory leak.
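
The "grows but never decreases" test can be applied mechanically once you have written down the Pgspace value of the work/private segment from several svmon runs. A small sketch, assuming the samples are supplied by hand, one page count per line, oldest first; the function name pgspace_trend is hypothetical and this does not parse svmon output itself.

```shell
#!/bin/sh
# pgspace_trend: read Pgspace page counts on stdin (one sample per line, oldest first).
# Prints "growing <first>KB -> <last>KB" when the value never decreases and ends
# higher than it started, otherwise "no steady growth".
# Pages are converted to KB at 4KB per page, as described above.
pgspace_trend() {
    awk '
        NR == 1 { first = $1 + 0; prev = $1 + 0 }
        NR > 1 && ($1 + 0) < prev { dropped = 1 }   # any decrease breaks the pattern
        { prev = $1 + 0; last = $1 + 0 }
        END {
            if (NR > 1 && !dropped && last > first)
                printf "growing %dKB -> %dKB\n", first * 4, last * 4
            else
                print "no steady growth"
        }
    '
}
```

For example, samples of 341, 350, 350 and 420 pages would be reported as growing from 1364KB to 1680KB. A steady climb is only a hint, so the process still needs to be looked at with the vendor.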

3. The system may be reaching its Maximum number of PROCESSES allowed per user, or maxuproc. Depending on what maxuproc is set to (default is 40), if a user has already forked a number of processes equal to maxuproc, the system will not allow that user to fork any more processes. The maxuproc parameter can be increased via SMIT. Enter SMIT and proceed in sequence through the panels System Environments and then Change / Show Characteristics of the Operating System. The first line on this last screen is maxuproc. Increasing this number by a conservative increment (50-100 at a time) allows users to fork more processes, thus avoiding any Out of memory or Cannot fork messages.
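
To see whether any user is actually close to maxuproc, the process count per user can be compared against the limit. A minimal sketch, assuming standard ps -ef output with the user name in the first column; the function name procs_near_limit is made up for illustration.

```shell
#!/bin/sh
# procs_near_limit LIMIT: read `ps -ef` output on stdin and print "USER COUNT"
# for each user whose number of processes is at or above LIMIT.
# procs_near_limit is a hypothetical helper, not an AIX command.
procs_near_limit() {
    awk -v limit="$1" '
        NR > 1 { count[$1]++ }     # skip the header, count processes per user
        END {
            for (u in count)
                if (count[u] >= limit) print u, count[u]
        }
    ' | sort
}

# On AIX the current limit can be read with:  lsattr -E -l sys0 -a maxuproc
# and the check run as:                       ps -ef | procs_near_limit 40
```

Passing a value a little below the real maxuproc gives some warning before users start seeing fork failures.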

Check errpt -a | more to look for entries that may show that the system is busy with tty overruns or hogs. Is there missing hardware?

df (is the system full?)

no -a | more and look for "thewall". What is it set at?

Microcode has caused problems on some machines; check for the latest level for your machine.

On the same site you can also check for the latest patches for your AIX.
Microcode
Common filesets that hang machines are bos.up, bos.mp, bos.net.tcp, and bos.rte.tty. I would be sure that I had the latest and greatest, with prereqs and coreqs, before I checked further.

Most patches require a still system with no one on it. You must be in multiuser mode, but without users. Also, most patches require a reboot of the system after they are applied.
You can run the instfix command to see what level of AIX you are at:
instfix -i | grep ML
or
instfix -ik 4320-02_AIX_ML
instfix -ik 4330-01_AIX_ML
If none of the 4210-0x_AIX_ML levels are found, your AIX level is 4.2.1.0.

Get the latest patches and recommended maintenance levels for your operating system.
Do an lppchk -v and a lppchk -c -m3 [fileset]
diag -a
Any broken filesets?
errpt -a | more
Any errors?

Try to get a more accurate history of when these hangs started and what changed. It is really hard to track down problems. You can put script files in cron to run the paging space check, ps, and other commands to see what is happening before the machine hangs, but applying the recommended maintenance and microcode will also help.
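
A cron-driven snapshot script along these lines might look like the following. It is only a sketch: the function name snapshot, the log path, and the script path in the usage note are assumptions, and the list of commands is just the ones mentioned in this post.

```shell
#!/bin/sh
# snapshot LOGFILE: append a timestamped block of diagnostics to LOGFILE.
# Commands that are not installed on the host are skipped rather than failing.
# snapshot is a hypothetical helper; adjust the command list to taste.
snapshot() {
    log=$1
    {
        echo "=== $(date '+%Y-%m-%d %H:%M:%S') ==="
        for cmd in "lsps -s" "df -k" "ps vg"; do
            set -- $cmd                       # split "lsps -s" into words
            if command -v "$1" >/dev/null 2>&1; then
                echo "--- $cmd"
                "$@" 2>&1 || echo "(command failed)"
            fi
        done
    } >> "$log"
}

# Example call (path is an assumption):  snapshot /var/adm/hangwatch.log
```

Saved as, say, /usr/local/bin/hangwatch.sh with the example call uncommented, a crontab entry such as 0,10,20,30,40,50 * * * * /usr/local/bin/hangwatch.sh would record a snapshot every ten minutes, so the last entry before a hang shows what the system looked like. Keep an eye on the log size, since ps vg output is long.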
#-)
 
Promise 1,

If you have a support contract with IBM it may be worth forcing a system dump at the time of system hang to send for analysis:

In order to do this, set the following option:

smitty dump

ALWAYS ALLOW SYSTEM DUMP = true

Then when the server hangs again, press ctrl-alt-1 and a dump will be created. The dump is a snapshot of the kernel at the time the system hangs, and so should show what is causing the server to hang. If you want to get your hands dirty you can use dbx and crash to interrogate the dump, or just create a snap tape to send to IBM:

snap -a -o /dev/rmt0

Hope that gives you another avenue to explore.

PSD
HACMP Specialist
 
Hi all, I have read your suggestions. I am currently noting the system status in order to get the readings when the hang occurs.

Total paging space is 564MB (1% utilisation).

iostat normally shows a waiting time of 99%, but 56% during a hang; sometimes the %usr or %sys accounts for this.

I'll keep in touch.

Promises
 