
Crazy 8's on mirrored hd6 system 1

Status
Not open for further replies.

LinuAIX

MIS
Jun 5, 2003
53
US
Hi All,

Our M80 running AIX 433ML9 went:

888 102 300 0c0
888 102 605 0c5

After the reboot I found sysplanar errors, hdisk errors, and LVM stale PP errors. The system is on hdisk0, mirrored to hdisk1. The paging space was mirrored. IBM says that the system was paging out, hit a bad block on hdisk0, and crashed.

The question is why the mirror on hdisk1 did not keep this crash from happening.

Thanks in advance

Larry Bennett
Technowennie
 
What's the lsvg rootvg output look like?

What's the errpt -a output look like?

Run crash on the kernel and post the output.
 
Here is the `errpt -a` output. The dump is on a tape I'll need to transfer to the system for the crash analysis portion.


---------------------------------------------------------------------------
LABEL: LVM_SA_STALEPP
IDENTIFIER: EAA3D429

Date/Time: Wed Oct 29 09:04:50
Sequence Number: 764
Machine Id: 0004B4FF4C00
Node Id: unixetl
Class: S
Type: UNKN
Resource Name: LVDD

Description
PHYSICAL PARTITION MARKED STALE

Detail Data
PHYSICAL VOLUME DEVICE MAJOR/MINOR
000E 0002
PHYSICAL PARTITION NUMBER (DECIMAL)
222
LOGICAL VOLUME DEVICE MAJOR/MINOR
000A 0006
SENSE DATA
0004 B4FF DAD8 57D4 0000 0000 0000 0000 0004 B4FF A2AF A53A 0000 0000 0000 0000
---------------------------------------------------------------------------
LABEL: LVM_IO_FAIL
IDENTIFIER: 613E5F38

Date/Time: Wed Oct 29 09:04:50
Sequence Number: 763
Machine Id: 0004B4FF4C00
Node Id: unixetl
Class: H
Type: PERM
Resource Name: LVDD
Resource Class: NONE
Resource Type: NONE
Location: NONE

Description
I/O ERROR DETECTED BY LVM

Probable Causes
POWER, DRIVE, ADAPTER, OR CABLE FAILURE

Recommended Actions
RUN DIAGNOSTICS AGAINST THE FAILING DEVICE

Detail Data
PHYSICAL VOLUME DEVICE MAJOR/MINOR
000E 0002
ERROR CODE AS DEFINED IN sys/errno.h
5
BLOCK NUMBER
14509615
LOGICAL VOLUME DEVICE MAJOR/MINOR
000A 0006
PHYSICAL BUFFER TRANSACTION TIME
0
SENSE DATA
0000 DD66 0004 B4FF DAD8 57D4 0000 0000 0000 0000 0004 B4FF DAEB 1874 0000 0000
0000 0000
---------------------------------------------------------------------------
LABEL: DISK_ERR2
IDENTIFIER: A668F553

Date/Time: Wed Oct 29 09:04:50
Sequence Number: 762
Machine Id: 0004B4FF4C00
Node Id: unixetl
Class: H
Type: PERM
Resource Name: hdisk0
Resource Class: disk
Resource Type: scsd
Location: 40-60-00-4,0
VPD:
Manufacturer................IBM
Machine Type and Model......DDYS-T18350N
FRU Number..................07N3776
ROS Level and ID............53395241
Serial Number...............4EG3E146
EC Level....................F79924
Part Number.................07N3811
Device Specific.(Z0)........000003029F00013A
Device Specific.(Z1)........07N4921
Device Specific.(Z2)........0933
Device Specific.(Z3)........00347
Device Specific.(Z4)........0001
Device Specific.(Z5)........22
Device Specific.(Z6)........F79924

Description
DISK OPERATION ERROR

Probable Causes
DASD DEVICE

Failure Causes
DISK DRIVE
DISK DRIVE ELECTRONICS

Recommended Actions
PERFORM PROBLEM DETERMINATION PROCEDURES

Detail Data
SENSE DATA
0A04 0000 2A00 00DD 662F 0000 0800 0000 0102 0000 7000 0B00 0000 0018 0000 0000
4700 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 000A D50A 0008 D780
---------------------------------------------------------------------------
LABEL: DISK_ERR4
IDENTIFIER: 1581762B

Date/Time: Wed Oct 29 09:04:50
Sequence Number: 761
Machine Id: 0004B4FF4C00
Node Id: unixetl
Class: H
Type: TEMP
Resource Name: hdisk0
Resource Class: disk
Resource Type: scsd
Location: 40-60-00-4,0
VPD:
Manufacturer................IBM
Machine Type and Model......DDYS-T18350N
FRU Number..................07N3776
ROS Level and ID............53395241
Serial Number...............4EG3E146
EC Level....................F79924
Part Number.................07N3811
Device Specific.(Z0)........000003029F00013A
Device Specific.(Z1)........07N4921
Device Specific.(Z2)........0933
Device Specific.(Z3)........00347
Device Specific.(Z4)........0001
Device Specific.(Z5)........22
Device Specific.(Z6)........F79924

Description
DISK OPERATION ERROR

Probable Causes
MEDIA
DASD DEVICE

User Causes
MEDIA DEFECTIVE

Recommended Actions
FOR REMOVABLE MEDIA, CHANGE MEDIA AND RETRY
PERFORM PROBLEM DETERMINATION PROCEDURES

Failure Causes
MEDIA
DISK DRIVE

Recommended Actions
FOR REMOVABLE MEDIA, CHANGE MEDIA AND RETRY
PERFORM PROBLEM DETERMINATION PROCEDURES

Detail Data
SENSE DATA
0A04 0000 2A00 00DD 662F 0000 0800 0000 0102 0000 7000 0B00 0000 0018 0000 0000
4700 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 000A D50A 0008 D780
---------------------------------------------------------------------------
LABEL: LVM_SA_STALEPP
IDENTIFIER: EAA3D429

Date/Time: Wed Oct 29 09:04:49
Sequence Number: 760
Machine Id: 0004B4FF4C00
Node Id: unixetl
Class: S
Type: UNKN
Resource Name: LVDD

Description
PHYSICAL PARTITION MARKED STALE

Detail Data
PHYSICAL VOLUME DEVICE MAJOR/MINOR
000E 0002
PHYSICAL PARTITION NUMBER (DECIMAL)
218
LOGICAL VOLUME DEVICE MAJOR/MINOR
000A 0003
SENSE DATA
0004 B4FF DAD8 57D4 0000 0000 0000 0000 0004 B4FF A2AF A53A 0000 0000 0000 0000
---------------------------------------------------------------------------
LABEL: LVM_IO_FAIL
IDENTIFIER: 613E5F38

Date/Time: Wed Oct 29 09:04:49
Sequence Number: 759
Machine Id: 0004B4FF4C00
Node Id: unixetl
Class: H
Type: PERM
Resource Name: LVDD
Resource Class: NONE
Resource Type: NONE
Location: NONE

Description
I/O ERROR DETECTED BY LVM

Probable Causes
POWER, DRIVE, ADAPTER, OR CABLE FAILURE

Recommended Actions
RUN DIAGNOSTICS AGAINST THE FAILING DEVICE

Detail Data
PHYSICAL VOLUME DEVICE MAJOR/MINOR
000E 0002
ERROR CODE AS DEFINED IN sys/errno.h
5
BLOCK NUMBER
14283672
LOGICAL VOLUME DEVICE MAJOR/MINOR
000A 0003
PHYSICAL BUFFER TRANSACTION TIME
0
SENSE DATA
0000 D9F3 0004 B4FF DAD8 57D4 0000 0000 0000 0000 0004 B4FF DAEB 1874 0000 0000
0000 0000
---------------------------------------------------------------------------
LABEL: DISK_ERR2
IDENTIFIER: A668F553

Date/Time: Wed Oct 29 09:04:49
Sequence Number: 758
Machine Id: 0004B4FF4C00
Node Id: unixetl
Class: H
Type: PERM
Resource Name: hdisk0
Resource Class: disk
Resource Type: scsd
Location: 40-60-00-4,0
VPD:
Manufacturer................IBM
Machine Type and Model......DDYS-T18350N
FRU Number..................07N3776
ROS Level and ID............53395241
Serial Number...............4EG3E146
EC Level....................F79924
Part Number.................07N3811
Device Specific.(Z0)........000003029F00013A
Device Specific.(Z1)........07N4921
Device Specific.(Z2)........0933
Device Specific.(Z3)........00347
Device Specific.(Z4)........0001
Device Specific.(Z5)........22
Device Specific.(Z6)........F79924

Description
DISK OPERATION ERROR

Probable Causes
DASD DEVICE

Failure Causes
DISK DRIVE
DISK DRIVE ELECTRONICS

Recommended Actions
PERFORM PROBLEM DETERMINATION PROCEDURES

Detail Data
SENSE DATA
0A04 0000 2A00 00D9 F398 0000 0800 0000 0102 0000 7000 0B00 0000 0018 0000 0000
4700 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 000A D50A 0008 D780
---------------------------------------------------------------------------
LABEL: DISK_ERR4
IDENTIFIER: 1581762B

Date/Time: Wed Oct 29 09:03:49
Sequence Number: 757
Machine Id: 0004B4FF4C00
Node Id: unixetl
Class: H
Type: TEMP
Resource Name: hdisk0
Resource Class: disk
Resource Type: scsd
Location: 40-60-00-4,0
VPD:
Manufacturer................IBM
Machine Type and Model......DDYS-T18350N
FRU Number..................07N3776
ROS Level and ID............53395241
Serial Number...............4EG3E146
EC Level....................F79924
Part Number.................07N3811
Device Specific.(Z0)........000003029F00013A
Device Specific.(Z1)........07N4921
Device Specific.(Z2)........0933
Device Specific.(Z3)........00347
Device Specific.(Z4)........0001
Device Specific.(Z5)........22
Device Specific.(Z6)........F79924

Description
DISK OPERATION ERROR

Probable Causes
MEDIA
DASD DEVICE

User Causes
MEDIA DEFECTIVE

Recommended Actions
FOR REMOVABLE MEDIA, CHANGE MEDIA AND RETRY
PERFORM PROBLEM DETERMINATION PROCEDURES

Failure Causes
MEDIA
DISK DRIVE

Recommended Actions
FOR REMOVABLE MEDIA, CHANGE MEDIA AND RETRY
PERFORM PROBLEM DETERMINATION PROCEDURES

Detail Data
SENSE DATA
0A04 0000 2A00 00DD 5100 0000 0800 0000 0102 0000 7000 0B00 0000 0018 0000 0000
4700 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 000A D50A 0008 2780
 
When you run crash, use the following subcommands. Also post the lsvg rootvg output. And what are the devices with major/minor numbers 10,6 and 10,3?

crash /dev/lv00 (or whatever)
stat
status
cpu
pslot -e [on pslot value returned from status and selected cpu]
trace -m
trace -k
trhead -r
proc -r
errpt
symptom
od prog_log 8
od vmmerrlog 9 a
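
The major/minor pairs in the errpt Detail Data above are printed as two hexadecimal words, so "000A 0006" is the 10,6 asked about here. A minimal sketch of that conversion follows; the function name `parse_major_minor` is made up for illustration, and the sample strings are copied from the errpt output in this thread. The resulting decimal pairs can then be compared against the major, minor columns that `ls -l /dev` prints for device files.

```python
from typing import Tuple


def parse_major_minor(field: str) -> Tuple[int, int]:
    """Convert an errpt 'MAJOR/MINOR' field such as '000A 0006' to decimal."""
    major_hex, minor_hex = field.split()
    return int(major_hex, 16), int(minor_hex, 16)


# Values copied from the LVM_IO_FAIL / LVM_SA_STALEPP entries in this thread.
samples = {
    "physical volume": "000E 0002",
    "logical volume (seq. 763/764)": "000A 0006",
    "logical volume (seq. 759/760)": "000A 0003",
}

for label, field in samples.items():
    major, minor = parse_major_minor(field)
    print(f"{label}: major={major}, minor={minor}")
```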
 
Actually, it looks to me like IBM is right in this case.

Mirroring is good for data, or if the system disk dies totally: then you reboot and the mirror disk does the job.

However, if the system runs into a bad block on the system disk during paging (read/write; in this case the error was 4700, which means parity error), it will never survive.
It will not know to switch over to the mirrored system disk in the middle of paging, which is one of the basic kernel operations.
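
For anyone who wants to check the "4700" reading against the DISK_ERR2 sense data, here is a rough sketch that pulls the fields out of the hex dump. The byte layout is an assumption, not taken from IBM documentation: 4 header bytes, a 12-byte SCSI command area, 4 more bytes, then standard fixed-format SCSI sense data. The function name `decode_sense` is hypothetical. Under that assumption the command is a WRITE(10) (opcode 0x2A), the sense key is 0x0B (aborted command), the additional sense code is 0x47 (SCSI parity error), and the logical block address decodes to the same number that LVM_IO_FAIL reports as BLOCK NUMBER.

```python
def decode_sense(hex_words: str) -> dict:
    """Decode a DISK_ERRx SENSE DATA dump under the assumed layout above."""
    raw = bytes.fromhex("".join(hex_words.split()))
    cdb = raw[4:16]    # assumed: the SCSI command that failed
    sense = raw[20:]   # assumed: fixed-format sense data starts here
    return {
        "opcode": cdb[0],                               # 0x2A = WRITE(10)
        "lba": int.from_bytes(cdb[2:6], "big"),         # logical block address
        "blocks": int.from_bytes(cdb[7:9], "big"),      # transfer length
        "sense_key": sense[2] & 0x0F,                   # 0x0B = ABORTED COMMAND
        "asc": sense[12],                               # 0x47 = SCSI parity error
        "ascq": sense[13],
    }


# First two lines of SENSE DATA from the DISK_ERR2 entry, sequence number 762.
SAMPLE = """
0A04 0000 2A00 00DD 662F 0000 0800 0000 0102 0000 7000 0B00 0000 0018 0000 0000
4700 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000
"""

fields = decode_sense(SAMPLE)
print(fields["lba"])  # 14509615, the BLOCK NUMBER in LVM_IO_FAIL sequence 763
print(hex(fields["opcode"]), hex(fields["sense_key"]), hex(fields["asc"]))
```

The same decoding applied to sequence number 758 gives block 14283672, which again matches the LVM_IO_FAIL entry logged right after it.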

"Long live king Moshiach !"
 
Hey Levw,

Thanks for confirming what I thought! I have been having heated discussions with fellow AIXers about mirroring paging space, and we all needed to wait for a good crash to prove one right and the other wrong.

LinuAIX
 
That is an incorrect statement. If the disk is mirrored, then the bit is set and the partition is marked stale, as shown in the error report. At that point the mirrored disk would take over activity. Why hdisk1 did not continue to operate, however, is a question for IBM to answer.
 
levw,

"Mirroring is good for data, or if the system disk dies totally: then you reboot and the mirror disk does the job."

Are you saying that if you have rootvg on hdisk0, mirrored completely to hdisk1, and hdisk0 fails, you have to reboot so that hdisk1 takes over?
 
No, not always.

But I do mean that it depends on the damage to hdisk0. In some cases the kernel will not manage to recover by switching automatically to hdisk1, and it crashes with a system dump.

"Long live king Moshiach !"
 