
System dumps with AIX 4.3.3 ML10

Status
Not open for further replies.

MoshiachNow

IS-IT--Management
Feb 6, 2002
1,851
IL
Hi,

Since we installed ML10 we have been experiencing frequent system dumps, which are related to our own in-house PCI boards (with our own drivers).

However, everything was stable with ML8.
I know that the AIX PCI drivers were changed in ML10. Has anybody had the same experience?
"Long live king Moshiach !"
 
Hi,

What we see is some problem with the PLX chip on our board, which controls the PCI bus.

The chip is a standard one, and our drivers do not deal with the PCI interface at all.
All the problems started only after loading ML10.
"Long live king Moshiach !"
 
Hi,

We have two systems in HACMP that have been more 'unstable' since the last upgrade (to ML10, from an earlier level we are not sure of). But the dumps are never the same, and they occur on both servers.
I am not familiar enough with dump analysis, but are all your dumps similar? By the way, if you know of good documentation on dump analysis, other than the Redbooks, I would welcome it!

Thanks
 
Gileb,

Please find below the method I personally use for a fast and efficient analysis of the cause of a system dump.
I would appreciate it if you would use it to analyze your system dumps and post the results here:
================================
1.1 Analyzing system dump
If the customer complains that his system froze with 888 on the display, check errpt for an entry like this:
C0AA5338 0614145601 U S SYSDUMP SYSTEM DUMP

This means that the system dump was logged on 14 June at 14:56 (the errpt timestamp is in MMDDhhmmyy format, so the trailing 01 is the year 2001).
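For anyone who wants to decode these timestamps in a script, here is a minimal Python sketch. It assumes the MMDDhhmmyy layout of the errpt summary line; the function name `parse_errpt_timestamp` and the two-digit-year pivot (below 70 means 20xx) are my own assumptions, not something errpt defines.

```python
from datetime import datetime


def parse_errpt_timestamp(stamp: str) -> datetime:
    """Decode an errpt summary timestamp, assumed to be MMDDhhmmyy."""
    if len(stamp) != 10 or not stamp.isdigit():
        raise ValueError("expected 10 digits (MMDDhhmmyy), got %r" % stamp)
    month = int(stamp[0:2])
    day = int(stamp[2:4])
    hour = int(stamp[4:6])
    minute = int(stamp[6:8])
    yy = int(stamp[8:10])
    # Assumption: two-digit years below 70 belong to the 2000s.
    year = 2000 + yy if yy < 70 else 1900 + yy
    # datetime() itself rejects impossible values such as month 13.
    return datetime(year, month, day, hour, minute)


if __name__ == "__main__":
    # Timestamp taken from the SYSDUMP entry quoted above.
    print(parse_errpt_timestamp("0614145601"))
```

Running it on the entry above gives 14 June 2001, 14:56, matching the reading in the text.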

Run the following command to verify the status of the last system dump:

# sysdumpdev -L

0453-039

Device name: /dev/hd6
Major device number: 10
Minor device number: 2
Size: 63952384 bytes
Date/Time: Thu Jun 14 14:43:11 CST 2001
Dump status: 0
dump completed successfully
Dump copy filename: /var/adm/ras/vmcore.0
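If you need to check this on several machines, the output can be read by a small script. The following Python sketch assumes the "Key: value" layout shown above; the function names `parse_sysdumpdev` and `dump_completed` are hypothetical, and the sample text is simply the output pasted from this post.

```python
SAMPLE_OUTPUT = """\
0453-039

Device name: /dev/hd6
Major device number: 10
Minor device number: 2
Size: 63952384 bytes
Date/Time: Thu Jun 14 14:43:11 CST 2001
Dump status: 0
dump completed successfully
Dump copy filename: /var/adm/ras/vmcore.0
"""


def parse_sysdumpdev(text: str) -> dict:
    """Collect the 'Key: value' lines of sysdumpdev -L output into a dict."""
    info = {}
    for line in text.splitlines():
        # Split on the first colon only, so time values keep their colons.
        key, sep, value = line.partition(":")
        if sep and value.strip():
            info[key.strip()] = value.strip()
    return info


def dump_completed(info: dict) -> bool:
    """Treat 'Dump status: 0' as success, as in the sample output."""
    return info.get("Dump status") == "0"


if __name__ == "__main__":
    info = parse_sysdumpdev(SAMPLE_OUTPUT)
    print(info["Dump copy filename"], dump_completed(info))
```

Lines without a colon (the message number and the "dump completed successfully" text) are skipped on purpose.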

Run the crash command on AIX 4.3.3/4.2.1, or the kdb command on AIX 5, to get a basic idea of the possible reasons for the system dump.
The crash subcommands (trace -k, thread -r, status 0) give a hint about the origin of the problem:

# cd /var/adm/ras
# crash vmcore.0

Using /unix as the default namelist file.

> trace -k
STACK TRACE:
0x2ff3b400 (excpt=edffff54:40000000:00001004:edffff54:00000106) (intpri=0)
IAR: .remove_e_list+38 (00032888): tweqi r7,0x0
LR: .e_block_thread+40c (00034424)
2ff3b010: .e_sleep_thread+4c (0003497c)
2ff3b060: .[nspdd]+4144 (016ba4e4)
2ff3b100: .[nspdd]+2de4 (016b9184)
2ff3b170: .[nspdd]+7e8 (016b6b88)
2ff3b1f0: .rdevioctl+140 (001b4344)
2ff3b260: .vnop_ioctl+1c (001c01d4)
2ff3b2a0: .vno_ioctl+144 (001d81d8)
2ff3b360: .common_ioctl+b0 (001e7894)
2ff3b3c0: .sys_call_ret+0 (00003a90)
IAR not in kernel segment.

> status 0

CPU TID TSLOT PID PSLOT STOPPED PROC_NAME
0 700f 112 6db0 109 yes pltDc

> thread -r

SLT ST TID PID CPUID POLICY PRI CPU EVENT PROCNAME FLAGS
2 r 205 204 0 FIFO 7f 78 wait
t_flags: sig_avail funnel kthread
3 r 307 306 1 FIFO 7f 78 wait
t_flags: sig_avail funnel kthread
4 r 409 408 2 FIFO 7f 78 wait
t_flags: sig_avail funnel kthread
5 r 50b 50a 3 FIFO 7f 78 wait
t_flags: sig_avail funnel kthread
112 r 700f 6db0 0 RR 40 0 pltDc
t_flags: local cdefer funnel


> proc -r

SLT ST PID PPID PGRP UID EUID TCNT NAME
2 a 204 0 0 0 0 1 wait
FLAGS: swapped_in no_swap fixed_pri kproc
3 a 306 0 0 0 0 1 wait
FLAGS: swapped_in no_swap fixed_pri kproc
55 a 37b8 2282 2282 200 200 1 X
FLAGS: swapped_in execed
112 a 7054 571a 25c8 200 200 1 expose
FLAGS: swapped_in no_swap fixed_pri ppnocldstop execed
122 a 7a14 1 744c 200 200 1 plateExp_dlg35
FLAGS: swapped_in orphanpgrp ppnocldstop execed

>q ;quits the crash command
=================================================================
In this case trace -k shows a problem in nspdd, which is part of the TSP driver.
thread -r and status 0 both point to the application process pltDc as responsible for the system dump (it is the last process that ran).
"Long live king Moshiach !"
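To pick the suspect module out of a long trace without reading it by eye, something like the Python sketch below may help. It assumes that a frame printed in square brackets (such as `.[nspdd]+4144`) is an address inside a loaded kernel extension rather than a base kernel symbol; that reading, the regular expression and the function names are my assumptions, and the sample is a shortened copy of the trace above.

```python
import re

SAMPLE_TRACE = """\
STACK TRACE:
0x2ff3b400 (excpt=edffff54:40000000:00001004:edffff54:00000106) (intpri=0)
IAR: .remove_e_list+38 (00032888): tweqi r7,0x0
LR: .e_block_thread+40c (00034424)
2ff3b010: .e_sleep_thread+4c (0003497c)
2ff3b060: .[nspdd]+4144 (016ba4e4)
2ff3b100: .[nspdd]+2de4 (016b9184)
2ff3b170: .[nspdd]+7e8 (016b6b88)
2ff3b1f0: .rdevioctl+140 (001b4344)
"""

# Matches ".symbol+offset (address)", where symbol may be "[module]".
FRAME = re.compile(r"\.(\[?\w+\]?)\+([0-9a-f]+)\s+\(([0-9a-f]+)\)")


def parse_frames(trace: str) -> list:
    """Return (symbol, offset, address) for every frame line in the trace."""
    frames = []
    for line in trace.splitlines():
        match = FRAME.search(line)
        if match:
            symbol, offset, address = match.groups()
            frames.append((symbol, int(offset, 16), int(address, 16)))
    return frames


def extension_modules(frames: list) -> list:
    """List bracketed module names in the order they first appear."""
    modules = []
    for symbol, _offset, _address in frames:
        if symbol.startswith("[") and symbol.endswith("]"):
            name = symbol[1:-1]
            if name not in modules:
                modules.append(name)
    return modules


if __name__ == "__main__":
    print(extension_modules(parse_frames(SAMPLE_TRACE)))
```

A trace with no bracketed frames, such as one that stays entirely in base kernel routines, simply yields an empty list.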
 
Hi levw,

In our case, one dump seems to occur while java is writing a core file (so it may be a corrupted file system), but for another one IBM does not have a clue; they asked us to activate memory overlay detection.

> t -k
STACK TRACE:
0xf0008720 (excpt=40c20004:40000000:60038bdc:40c20004:00000106) (intpri=11)
IAR: .comfail+30 (001af23c): tweqi r0,0x0
LR: .itrunc+238 (00243200)
f06d0b80: .itrunc+238 (00243200)
f06d0ce0: .ip_open+120 (00212568)
f06d0d40: .jfs_open+140 (0021288c)
f06d0da0: .vnop_open+1c (001c1b60)
f06d0de0: .openpnp+36c (001d96e0)
f06d0ec0: .openpath+9c (001d9918)
f06d1330: .fp_open+a8 (001d9d80)
f06d1390: .open_corefile+260 (00091584)
f06d1490: .corex+1c4 (00091974)
f06d1bb0: .psig+2c4 (00029b68)
f06d1c10: .issig+c8 (00029d64)
f06d1c90: .sig_deliver+e0 (00029848)
IAR not in kernel segment.

>
> thread -r
SLT ST TID PID CPUID POLICY PRI CPU EVENT PROCNAME
2 r 205 204 0 FIFO 7f 78 wait
t_flags: sig_avail funnel kthread
3 r 307 306 1 FIFO 7f 78 wait
t_flags: sig_avail funnel kthread
16 r 1077 1258 unbound other 3c 0 errdemon
t_flags: wakeonsig
241 r f1e5 5228 unbound other 73 6e java
t_flags: cdefer sig_avail
714 r 2cacf 7e0a unbound other 78 78 logger
t_flags:
>
> proc -r
SLT ST PID PPID PGRP UID EUID TCNT NAME
2 a 204 0 0 0 0 1 wait
FLAGS: swapped_in no_swap fixed_pri kproc
3 a 306 0 0 0 0 1 wait
FLAGS: swapped_in no_swap fixed_pri kproc
18 a 1258 1 1258 0 0 1 errdemon
FLAGS: swapped_in orphanpgrp signochld
82 a 5228 5fd2 5228 30004 30004 63 java
FLAGS: swapped_in locks continued execed
126 a 7e0a 63f4 463e 0 0 1 logger
FLAGS: swapped_in orphanpgrp execed
>

> stat
sysname: AIX
nodename: xsagence1
release: 3
version: 4
machine: 005731AA4C00
time of crash: Thu Nov 21 18:19:21 2002
age of system: 1 day, 10 hr., 15 min.
xmalloc debug: disabled
dump code: 700
csa: 0xf0008720
exception struct:
0x00a00000 0x00000000 0x00000000 0x00000000 0x00000000
>

------------------------------------------------

 