Unexpected System Halt 1

paulywog · Apr 8, 2002

I'll apologize up front for the amount of data, but my RS/6000 has done this a couple of times and I'm not sure where to start. Here is part of the errpt:

SYSDUMP_SYMP
3753A829
Resource Name: CMDCRASH

Unexpected System Halt

Perform problem determination procedures

Detail Data
Dump Status
LED:300
csa:002cceb0
[hd_pin_bot:hd_strategy] 98
[hd_pin_bot:hd_strategy] 7c
devstrat 128
v_pdtsio 598
v_pfend 268
iodone_off1 24
i_softmod 1ac
flih_603_patch C0

System data
Reportable
1
Internal Error
1
Symptom Code
PIDS/57656550 LVLS/420 PCSS/SPI1 MS/300 FLDS/[hd_pin_b VALU/906 5000C FLDS/devstrat VALU/128

Users told me they locked up and when I got to the server it was already rebooting. Came back up and seems to be running okay.

Any input on what's going on would be most appreciated!

Thanks.

MoshiachNow · Apr 8, 2002

Analyzing system dump
=============================
If the customer complains that the system froze with 888 on the display, check errpt for an entry like this:
C0AA5338 0614145601 U S SYSDUMP SYSTEM DUMP

This means that a system dump occurred on 14 June 2001 at 14:56 (the errpt timestamp is in MMDDhhmmYY format).
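
The timestamp column of that errpt line can be decoded mechanically. This is only a sketch: it assumes the MMDDhhmmYY layout and treats two-digit years as 2000-2099, and the function name decode_errpt_timestamp is made up for illustration.

```python
from datetime import datetime


def decode_errpt_timestamp(stamp: str, century: int = 2000) -> datetime:
    """Decode an errpt summary timestamp, assumed to be MMDDhhmmYY."""
    if len(stamp) != 10 or not stamp.isdigit():
        raise ValueError(f"expected 10 digits, got {stamp!r}")
    # Split the string into five two-digit fields.
    month, day, hour, minute, year = (
        int(stamp[i:i + 2]) for i in range(0, 10, 2)
    )
    # Assumption: two-digit years are offsets from `century` (2000 by default).
    return datetime(century + year, month, day, hour, minute)


if __name__ == "__main__":
    # Timestamp taken from the SYSDUMP entry quoted above.
    print(decode_errpt_timestamp("0614145601"))
```

A value that is not exactly ten digits raises ValueError rather than being guessed at.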

Run the following command to verify the status of the last system dump:

# sysdumpdev -L

0453-039

Device name: /dev/hd6
Major device number: 10
Minor device number: 2
Size: 63952384 bytes
Date/Time: Thu Jun 14 14:43:11 CST 2001
Dump status: 0
dump completed successfully
Dump copy filename: /var/adm/ras/vmcore.0
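
If you want to check this from a script, the output above is easy to pick apart. The following is a rough sketch, not an IBM tool: it assumes the "Key: value" layout shown in this sample, and the helper names parse_sysdumpdev and dump_succeeded are hypothetical.

```python
def parse_sysdumpdev(text: str) -> dict:
    """Collect 'Key: value' lines from `sysdumpdev -L` output into a dict."""
    fields = {}
    for line in text.splitlines():
        # Split on the first colon only, so times like 14:43:11 stay intact.
        key, sep, value = line.partition(":")
        if sep and key.strip():
            fields[key.strip()] = value.strip()
    return fields


def dump_succeeded(fields: dict) -> bool:
    # In the sample above, status 0 is paired with "dump completed successfully".
    return fields.get("Dump status") == "0"


# Sample text copied from the sysdumpdev -L output in this post.
SAMPLE = """\
Device name: /dev/hd6
Major device number: 10
Minor device number: 2
Size: 63952384 bytes
Date/Time: Thu Jun 14 14:43:11 CST 2001
Dump status: 0
dump completed successfully
Dump copy filename: /var/adm/ras/vmcore.0
"""

if __name__ == "__main__":
    info = parse_sysdumpdev(SAMPLE)
    print(info["Dump copy filename"], dump_succeeded(info))
```

Lines without a colon, such as the status message, are skipped.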

Run the crash command against the dump to get a basic idea of what caused it.
The crash subcommands trace -k, status 0 and thread -r give hints about where the problem originated:

# cd /var/adm/ras
# crash vmcore.0

Using /unix as the default namelist file.

> trace -k
STACK TRACE:
0x2ff3b400 (excpt=edffff54:40000000:00001004:edffff54:00000106) (intpri=0)
IAR: .remove_e_list+38 (00032888): tweqi r7,0x0
LR: .e_block_thread+40c (00034424)
2ff3b010: .e_sleep_thread+4c (0003497c)
2ff3b060: .[nspdd]+4144 (016ba4e4)
2ff3b100: .[nspdd]+2de4 (016b9184)
2ff3b170: .[nspdd]+7e8 (016b6b88)
2ff3b1f0: .rdevioctl+140 (001b4344)
2ff3b260: .vnop_ioctl+1c (001c01d4)
2ff3b2a0: .vno_ioctl+144 (001d81d8)
2ff3b360: .common_ioctl+b0 (001e7894)
2ff3b3c0: .sys_call_ret+0 (00003a90)
IAR not in kernel segment.

> status 0

CPU TID TSLOT PID PSLOT STOPPED PROC_NAME
0 700f 112 6db0 109 yes pltDc

> thread -r

SLT ST TID PID CPUID POLICY PRI CPU EVENT PROCNAME FLAGS
2 r 205 204 0 FIFO 7f 78 wait
t_flags: sig_avail funnel kthread
3 r 307 306 1 FIFO 7f 78 wait
t_flags: sig_avail funnel kthread
4 r 409 408 2 FIFO 7f 78 wait
t_flags: sig_avail funnel kthread
5 r 50b 50a 3 FIFO 7f 78 wait
t_flags: sig_avail funnel kthread
112 r 700f 6db0 0 RR 40 0 pltDc
t_flags: local cdefer funnel

> proc -r

SLT ST PID PPID PGRP UID EUID TCNT NAME
2 a 204 0 0 0 0 1 wait
FLAGS: swapped_in no_swap fixed_pri kproc
3 a 306 0 0 0 0 1 wait
FLAGS: swapped_in no_swap fixed_pri kproc
55 a 37b8 2282 2282 200 200 1 X
FLAGS: swapped_in execed
112 a 7054 571a 25c8 200 200 1 expose
FLAGS: swapped_in no_swap fixed_pri ppnocldstop execed
122 a 7a14 1 744c 200 200 1 plateExp_dlg35
FLAGS: swapped_in orphanpgrp ppnocldstop execed

>q ;quits the crash command
=================================================================
In this case trace -k shows the failing thread inside nspdd, which is part of the TSP driver.
thread -r and status 0 both point to the application process pltDc as responsible for the system dump (it was the last process that ran on CPU 0).
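
For long traces it can help to pull out the bracketed frames automatically. A minimal sketch, assuming the bracketed names (like [nspdd]) mark code outside the base kernel as the reading above suggests; the function name extension_frames and the regular expression are my own, not part of crash.

```python
import re

# Frames such as "2ff3b060: .[nspdd]+4144 (016ba4e4)" carry a name in brackets
# followed by a hexadecimal offset.
FRAME = re.compile(r"\.\[(?P<module>[^\]]+)\]\+(?P<offset>[0-9a-fA-F]+)")


def extension_frames(trace: str) -> list:
    """Return (module, offset) pairs for bracketed frames, in stack order."""
    return [(m["module"], int(m["offset"], 16)) for m in FRAME.finditer(trace)]


# Excerpt of the trace -k output shown above.
TRACE = """\
IAR: .remove_e_list+38 (00032888): tweqi r7,0x0
LR: .e_block_thread+40c (00034424)
2ff3b010: .e_sleep_thread+4c (0003497c)
2ff3b060: .[nspdd]+4144 (016ba4e4)
2ff3b100: .[nspdd]+2de4 (016b9184)
2ff3b170: .[nspdd]+7e8 (016b6b88)
2ff3b1f0: .rdevioctl+140 (001b4344)
"""

if __name__ == "__main__":
    for module, offset in extension_frames(TRACE):
        print(f"{module} +0x{offset:x}")
```

Plain kernel symbols such as .e_sleep_thread have no brackets and are ignored.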
"Long live king Moshiach !"

Yegolev · Apr 9, 2002

levw -- good info, thanks.

I seem to remember LED 300 being really bad, but I cannot remember if it is hardware or software. Call IBM and get their opinion.

MoshiachNow · Apr 9, 2002

As well:
================================================
1. Diagnosing the errlog
Run: diag -> Diagnostics routines -> Problem determination
This selection tests the system and analyzes the error log if one is available.
The output may be:
------------------------------------------------------------------------
A PROBLEM WAS DETECTED ON Tue Jun 19 12:59:09 WET 2001 801014
The Service Request Number(s)/Probable Cause(s)
(causes are listed in descending order of probability):

A03-150: I/O Expansion Bus Connection Failure.
Error log information:
Sequence number: 30
n/a FRU: n/a P2
n/a FRU: n/a 1
n/a FRU: n/a P1
n/a FRU: n/a P1
------------------------------------------------------------------------
This would normally mean a HW motherboard problem.

2. Running diag via smit
We can further run a physical motherboard diag:
Run: diag -> Diagnostics routines -> System verification -> choose resource
This selection runs the system HW diagnostics on the selected resources:
sysplanar0 00-00 System Planar
oppanel 00-00 Operator panel
mem0 00-00 Memory
proc0 00-00 Processor
L2cache0 00-00 L2 Cache
proc2 00-02 Processor.
And more…

3. Running manual command
diag -d sysplanar0

TESTING COMPLETE on Mon Jul 16 10:49:40 WET 2001 801010
No trouble was found.
The resources tested were:

- sysplanar0 00-00 System Planar

"Long live king Moshiach !"

avadh · Apr 16, 2002

Hello,
1) A dump with LED 300 is caused by memory errors.

If sysdumpdev -L shows a successful dump, call IBM at 1-800-call-aix and have them open a PMR for the system crash. They will tell you to run the snap command, rename the output to the PMR number, and put it on the ftp site testcase.software.ibm.com/aix/toibm.

mrn · Apr 16, 2002

Looks like a DSI (data storage interrupt) to me. Was there anything else on the LED display? What model server is it?
--
| Mike Nixon
| Unix Admin

http://www.instantflashshop.com
