Cause of system crashes

Status
Not open for further replies.

RonnyDonny

IS-IT--Management
Mar 28, 2002
5
US
We have experienced system crashes 3 days in a row. We are running AIX 4.1.5. I have dump files created by the snap command from the previous 2 crashes. Is there some way I can interpret the dump files to determine the cause of the crashes?

In errpt the label for the crash is NONE.
Any help would be appreciated.
Thanks,
Don
 
FTP the dumps to IBM; that is the only way I know to decipher them. Too bad 4.1.5 is no longer supported. Maybe someone else knows how to do it.

The old-fashioned way is to try to determine what may have changed to cause the crashes. Also try "lppchk -v", run diags, look for disk problems, etc.

FYI, don't run 4.3 diags on it, you might hose the box.

Good luck.
 
Hi,

1. Run diag:
=======================
AIX diag is a useful tool that can help analyze HW problems.
Here are some examples of how it can be used.
An error reported in errpt can be HW or SW related.
We can use diag to confirm or rule out a HW problem with the motherboard:
1.1 Diagnosing the errlog
Run: diag -> Diagnostics routines -> Problem determination
This selection tests the system and analyzes the error log if one is available.
The output may be:
------------------------------------------------------------------------
A PROBLEM WAS DETECTED ON Tue Jun 19 12:59:09 WET 2001 801014
The Service Request Number(s)/Probable Cause(s)
(causes are listed in descending order of probability):

A03-150: I/O Expansion Bus Connection Failure.
Error log information:
Sequence number: 30
n/a FRU: n/a P2
n/a FRU: n/a 1
n/a FRU: n/a P1
n/a FRU: n/a P1
------------------------------------------------------------------------
This would normally mean a HW motherboard problem.
1.2 Running diag via smit
We can go further and run a physical motherboard diag:
Run: diag -> Diagnostics routines -> System verification -> choose resource
This selection runs the system HW diag on the selected resources:
sysplanar0 00-00 System Planar
oppanel 00-00 Operator panel
mem0 00-00 Memory
proc0 00-00 Processor
L2cache0 00-00 L2 Cache
proc2 00-02 Processor.
And more…
1.3 Running the command manually
diag -d sysplanar0

TESTING COMPLETE on Mon Jul 16 10:49:40 WET 2001 801010
No trouble was found.
The resources tested were:

- sysplanar0 00-00 System Planar
========================================================
2. Analyze the system dump:
=====================
2.1 Analyzing system dump
If the customer complains that the system froze with 888 on the display, check errpt for an entry like this:
C0AA5338 0614145601 U S SYSDUMP SYSTEM DUMP

This means that a system dump occurred on 14 June at 14:56.
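The errpt summary timestamp is laid out as MMDDhhmmyy, which is how 0614145601 reads as 14 June, 14:56, year 01. As a rough sketch, a small shell function can split it up; decode_errpt_ts is a made-up helper name for illustration, not an AIX command.

```shell
# Decode an errpt summary TIMESTAMP (MMDDhhmmyy) into labelled fields.
# decode_errpt_ts is a hypothetical helper name, not part of AIX.
decode_errpt_ts() {
    ts=$1
    # Accept exactly ten digits, reject anything else.
    case $ts in
        [0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9]) ;;
        *) echo "invalid timestamp: $ts" >&2; return 1 ;;
    esac
    mm=$(printf '%s' "$ts" | cut -c1-2)    # month
    dd=$(printf '%s' "$ts" | cut -c3-4)    # day
    hh=$(printf '%s' "$ts" | cut -c5-6)    # hour
    mi=$(printf '%s' "$ts" | cut -c7-8)    # minute
    yy=$(printf '%s' "$ts" | cut -c9-10)   # two-digit year
    echo "month=$mm day=$dd time=$hh:$mi year=$yy"
}

# Timestamp taken from the errpt entry above.
decode_errpt_ts 0614145601
```

For the entry above this prints month=06 day=14 time=14:56 year=01.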

Run the following command to verify the status of the last system dump:

# sysdumpdev -L

0453-039

Device name: /dev/hd6
Major device number: 10
Minor device number: 2
Size: 63952384 bytes
Date/Time: Thu Jun 14 14:43:11 CST 2001
Dump status: 0
dump completed successfully
Dump copy filename: /var/adm/ras/vmcore.0
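
If you check this on several boxes, the two interesting lines (dump status and copy filename) can be pulled out with awk. This is only a sketch: dump_summary is a hypothetical helper name, and the sample input is a subset of the output shown above. On a live system you would pipe sysdumpdev -L into it instead.

```shell
# dump_summary is a hypothetical helper: it reads `sysdumpdev -L` output on
# stdin and prints the dump status and the dump copy filename on one line.
dump_summary() {
    awk -F': *' '
        /^Dump status:/        { status = $2 }
        /^Dump copy filename:/ { file = $2 }
        END {
            # No status line means the input was not sysdumpdev -L output.
            if (status == "") { print "no dump status found"; exit 1 }
            printf "status=%s file=%s\n", status, file
        }'
}

# Sample input copied from the output above.
# On AIX: sysdumpdev -L | dump_summary
dump_summary <<'EOF'
Device name: /dev/hd6
Size: 63952384 bytes
Date/Time: Thu Jun 14 14:43:11 CST 2001
Dump status: 0
dump completed successfully
Dump copy filename: /var/adm/ras/vmcore.0
EOF
```

With the sample input this prints status=0 file=/var/adm/ras/vmcore.0.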

Run the crash command to get a basic idea of the possible reasons for the system dump.
The crash subcommands (trace -k, thread -r, status 0) give hints about the origin of the problem:

# cd /var/adm/ras
# crash vmcore.0

Using /unix as the default namelist file.

> trace -k
STACK TRACE:
0x2ff3b400 (excpt=edffff54:40000000:00001004:edffff54:00000106) (intpri=0)
IAR: .remove_e_list+38 (00032888): tweqi r7,0x0
LR: .e_block_thread+40c (00034424)
2ff3b010: .e_sleep_thread+4c (0003497c)
2ff3b060: .[nspdd]+4144 (016ba4e4)
2ff3b100: .[nspdd]+2de4 (016b9184)
2ff3b170: .[nspdd]+7e8 (016b6b88)
2ff3b1f0: .rdevioctl+140 (001b4344)
2ff3b260: .vnop_ioctl+1c (001c01d4)
2ff3b2a0: .vno_ioctl+144 (001d81d8)
2ff3b360: .common_ioctl+b0 (001e7894)
2ff3b3c0: .sys_call_ret+0 (00003a90)
IAR not in kernel segment.

> status 0

CPU TID TSLOT PID PSLOT STOPPED PROC_NAME
0 700f 112 6db0 109 yes pltDc

> thread -r

SLT ST TID PID CPUID POLICY PRI CPU EVENT PROCNAME FLAGS
2 r 205 204 0 FIFO 7f 78 wait
t_flags: sig_avail funnel kthread
3 r 307 306 1 FIFO 7f 78 wait
t_flags: sig_avail funnel kthread
4 r 409 408 2 FIFO 7f 78 wait
t_flags: sig_avail funnel kthread
5 r 50b 50a 3 FIFO 7f 78 wait
t_flags: sig_avail funnel kthread
112 r 700f 6db0 0 RR 40 0 pltDc
t_flags: local cdefer funnel


> proc -r

SLT ST PID PPID PGRP UID EUID TCNT NAME
2 a 204 0 0 0 0 1 wait
FLAGS: swapped_in no_swap fixed_pri kproc
3 a 306 0 0 0 0 1 wait
FLAGS: swapped_in no_swap fixed_pri kproc
55 a 37b8 2282 2282 200 200 1 X
FLAGS: swapped_in execed
112 a 7054 571a 25c8 200 200 1 expose
FLAGS: swapped_in no_swap fixed_pri ppnocldstop execed
122 a 7a14 1 744c 200 200 1 plateExp_dlg35
FLAGS: swapped_in orphanpgrp ppnocldstop execed

> q     (quits the crash command)
=================================================================
In this case trace -k points to a problem in nspdd, which is part of the TSP driver.
thread -r and status 0 both point to the application process pltDc as responsible for the system dump (it is the last process that ran).
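
When the stack trace is long, it can help to list just the bracketed names such as [nspdd]. The sketch below assumes those bracketed names mark code outside the base kernel (drivers or kernel extensions), which is how the trace above was read. stack_modules is a hypothetical helper name, and the sample input is a few lines from the trace -k output above.

```shell
# stack_modules is a hypothetical helper: it reads `trace -k` output on stdin
# and prints each bracketed module name once, in order of first appearance.
stack_modules() {
    # Keep only lines with a ".[name]+offset" frame and print the name,
    # then drop repeated names while preserving order.
    sed -n 's/.*\.\[\([^]]*\)]+.*/\1/p' | awk '!seen[$0]++'
}

# Sample frames copied from the trace -k output above.
stack_modules <<'EOF'
2ff3b010: .e_sleep_thread+4c (0003497c)
2ff3b060: .[nspdd]+4144 (016ba4e4)
2ff3b100: .[nspdd]+2de4 (016b9184)
2ff3b1f0: .rdevioctl+140 (001b4344)
EOF
```

For these sample frames the only name printed is nspdd.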
"Long live king Moshiach !"
 
Hello,
Just to let you know, AIX 4.1.5 is out of support. Even if you are running the latest level of AIX 4.1.5 code, and even if you look at the dump and find out what caused the crash, IBM is not going to make any more code changes to fix the crash you are experiencing. It would be a good idea to upgrade AIX.
 
True, but normally HW is the cause of system dumps... "Long live king Moshiach !"
 