Hdisk Errors 1

Status
Not open for further replies.

Mag0007

Hi All,

I have a system hooked up to the SAN, and I keep getting "Disk Operation Errors"

LABEL: SC_DISK_ERR2
IDENTIFIER: 79B0DF89
Type: PERM
Resource Type: CLAR_FC_LUNZ
Description
DISK OPERATION ERROR

Any idea what's going on with this?

TIA

 
Somewhere I have a doc on how to decode these. Please post the sense data (all the numbers at the bottom of the errpt entry) and I'll go look for the doc.
See you shortly....
 
Is that doc available on the web? I too get these errors and have not had much luck in pinpointing the cause.

 
If you head on over to the pSeries Documentation site you will find in the left pane AIX Message Center. This will drop down and let you select "Error identifiers" which you can put your info in.

Site: IBM Docs

Ethan
 
I can't find anything of use on the web either.
Can't remember where I got it but it was probably from an IBM tech on a course or whilst we were working on a problem.
I'd rather not post it in case it gets traced back to the guy that gave it to me because if it is not on the web it probably isn't a public doc and so he probably wasn't meant to let me have a copy and it doesn't seem fair to drop him in it.
It is not an easy doc to understand either so although I'm happy to try and help I think it might give people false hope if I just post it up and they try to make their own conclusions from it.
You might see SC_DISK_ERR entries for all sorts of reasons that may or may not be meaningful. You'll need to check whether there are other associated errors that are the real cause, in which case the sc_disk errors are the symptom, not the cause.
For instance, you might have an adapter go down and then see lots of sc_disk errors.
If you decode the sc_disk errors they will tell you the disks are offline, so you go hassle the SAN guys, but the cause was that your HBA failed.
Check your errpt for the time the problem started. If you have HBA errors for the adapter (fcs) or the interface (fscsi), then work on them. If you get sc_disk errors out of the blue (pun intended), then it is probably a SAN or storage device problem.
One day only offer, the wife is away, I have plenty of cider - so the quality of my replies is likely to decline with time / cider - make the most of it until I get bored / too pissed :-D
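
The triage order described in this post (look for fcs/fscsi adapter errors at or before the time the disk errors started, and only then suspect the SAN or storage) can be sketched in Python. This is a rough sketch, not an IBM tool. It assumes the one-line-per-entry summary that plain `errpt` prints (identifier, MMDDhhmmYY timestamp, type, class, resource name, description). The sample lines, the `AAAAAAAA` identifier, and the function names are made up for illustration.

```python
from datetime import datetime

# Hypothetical sample in the summary layout printed by plain `errpt`:
# IDENTIFIER TIMESTAMP(MMDDhhmmYY) T C RESOURCE_NAME DESCRIPTION
# The fscsi0 line and its identifier are invented for this example.
SAMPLE = """\
IDENTIFIER TIMESTAMP  T C RESOURCE_NAME  DESCRIPTION
79B0DF89   1006080006 P H hdisk32        DISK OPERATION ERROR
79B0DF89   1006080006 P H hdisk33        DISK OPERATION ERROR
AAAAAAAA   1006075806 T H fscsi0         MADE-UP ADAPTER ERROR
"""

def parse_summary(text):
    """Yield (timestamp, resource_name, description) for each summary line."""
    for line in text.splitlines():
        fields = line.split(None, 5)
        if len(fields) < 6 or fields[0] == "IDENTIFIER":
            continue  # skip the header and anything that is not an entry
        stamp = datetime.strptime(fields[1], "%m%d%H%M%y")
        yield stamp, fields[4], fields[5]

def triage(text):
    """Report whether adapter errors came at or before the first disk error."""
    entries = list(parse_summary(text))
    disk_times = [t for t, res, _ in entries if res.startswith("hdisk")]
    adapters = [(t, res) for t, res, _ in entries
                if res.startswith(("fcs", "fscsi"))]
    if not disk_times:
        return "no disk errors"
    first_disk = min(disk_times)
    earlier = sorted({res for t, res in adapters if t <= first_disk})
    if earlier:
        return "check adapter first: " + ", ".join(earlier)
    return "no adapter errors before the disk errors: suspect SAN or storage"

print(triage(SAMPLE))
```

On a real box you would feed it the saved output of `errpt` instead of `SAMPLE`. The summary timestamps only have minute resolution, so "at or before" is deliberately inclusive.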
 
That's the east coast gone then and time is sweeping across the nation.
I'm off to watch TV.
If I'm still upright I'll check in later.....
One day only.....
 
Is this error at the same time each day? If so, the ECC scan may be the problem; it has been a known issue.
 
DukeSSD:

Something like this?

Any help is appreciated

Code:
LABEL:          SC_DISK_ERR2
IDENTIFIER:     79B0DF89

Date/Time:       Fri Oct  6 08:00:40 2006
Sequence Number: 10856
Machine Id:      000C40FF4C00
Node Id:         node11t
Class:           H
Type:            PERM
Resource Name:   hdisk32
Resource Class:  disk
Resource Type:   CLAR_FC_LUNZ
Location:        1n-08-01
VPD:
        Manufacturer................DGC
        Machine Type and Model......LUNZ
        ROS Level and ID............0219
        Serial Number...............APM00050204229
        Device Specific.(SI)........CX700
        Device Specific.(PQ)........00
        Device Specific.(VS)........0000000000CL
        Device Specific.(UI)........500601609060115B500601609060115B
        Device Specific.(FL)........FFFF
        Device Specific.(Z0)........10
        Device Specific.(Z1)........10

Description
DISK OPERATION ERROR

Probable Causes
DASD DEVICE

Failure Causes
DISK DRIVE
DISK DRIVE ELECTRONICS
        Recommended Actions
        PERFORM PROBLEM DETERMINATION PROCEDURES

Detail Data
SENSE DATA
0600 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0102 0000 7000 0200
0000 000A 0000 0000 0403 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000
0000 0000
 
I'm having the same issue after zoning my AIX host to a CX700 array. In the EMC knowledgebase, the problem has been identified as diagela recognizing the CLARiiON's passive path as an issue.

On EMC's site, they recommend disabling diagela either for the whole box (since PowerPath monitors path failures) or for the specific devices. Since we have a mix of local disk and SAN disk, I don't want to shut it down completely. I'd like to specify just the CLARiiON devices.

The question is how do you do that? The diagela command exists under /usr/lpp/diagnostics/bin. If you use a -h flag, it gives you:

diagela {ENABLE | DISABLE [-t|-R <device> [ <device> ]]}

I can't find what the -t or the -R flags are for anywhere.

If anyone knows a little more about this, I'd definitely appreciate the help.
 
Plamb, jlymer,
I think your problems are similar to each other's but different from this one. Please start a new thread, guys.
Mag0007,
From what I can tell, the host can see and configure the device hdisk32, but when it tries to send it SCSI commands, the commands fail with a check condition saying the device is not ready to process them.
So it seems to be a problem in the CLARiiON, rather than the host or the SAN.
I think you need to talk to EMC about this rather than the SAN guys.
Tell them you can configure the disk OK but when you try to use it it returns a scsi status of "device not ready". (rather than the issue Plamb and jlymer have which is that another path or system has control of the disk - I suspect they have 0118 near the end of the first line where you have 0102)
From experience they will take up to a week to sort this out for you....
Did you have data on the drive?
Do you have a backup?
Can you get them to configure you a new disk and then restore to that while they sort this LUN out?
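
For anyone who wants to check that reading against the posted sense data, here is a minimal Python sketch. The status and sense-key tables are standard SCSI values. The two offsets are an assumption inferred from this one dump (the 0102 word sits at bytes 24-25 and the 0x70 response code at byte 28), not taken from IBM documentation, and the names `decode`, `STATUS_OFFSET` and `SENSE_OFFSET` are made up here.

```python
# Sense data words copied from the errpt entry earlier in the thread
# (first two rows only; the remaining rows are all zero).
SENSE = """
0600 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0102 0000 7000 0200
0000 000A 0000 0000 0403 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000
"""

# Standard SCSI status bytes and sense keys (partial tables).
SCSI_STATUS = {0x00: "GOOD", 0x02: "CHECK CONDITION", 0x08: "BUSY",
               0x18: "RESERVATION CONFLICT"}
SENSE_KEYS = {0x0: "NO SENSE", 0x1: "RECOVERED ERROR", 0x2: "NOT READY",
              0x3: "MEDIUM ERROR", 0x4: "HARDWARE ERROR",
              0x5: "ILLEGAL REQUEST", 0x6: "UNIT ATTENTION"}

# ASSUMPTION, inferred from this dump only: byte 25 holds the SCSI status
# and the fixed-format sense data starts at byte 28.
STATUS_OFFSET = 25
SENSE_OFFSET = 28

def decode(dump):
    """Pull SCSI status, sense key, ASC and ASCQ out of an errpt hex dump."""
    raw = bytes.fromhex("".join(dump.split()))
    sense = raw[SENSE_OFFSET:]
    if sense[0] & 0x7F not in (0x70, 0x71):
        raise ValueError("no fixed-format sense data at the assumed offset")
    return {
        "scsi_status": raw[STATUS_OFFSET],
        "sense_key": sense[2] & 0x0F,  # low nibble of sense byte 2
        "asc": sense[12],              # additional sense code
        "ascq": sense[13],             # additional sense code qualifier
    }

result = decode(SENSE)
print(SCSI_STATUS.get(result["scsi_status"], "unknown status"))
print(SENSE_KEYS.get(result["sense_key"], "unknown sense key"))
print("ASC/ASCQ %02X/%02X" % (result["asc"], result["ascq"]))
```

For this dump the status byte is 0x02 (check condition) and the sense key is 0x2 (not ready). A dump with 0118 in place of 0102 would instead give status 0x18, which is the standard SCSI reservation conflict status and fits the "another path or system has control of the disk" case.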
 
You may first want to have the SAN guys check that the LUN is still being properly presented to your AIX host. Perhaps it is behind an adapter on the EMC box that someone has disconnected, unconfigured, ... Or has any zoning info been modified lately?

Is it just this one LUN which is giving you trouble?


HTH,

p5wizard
 
DukeSSD & P5wizard:

Unfortunately, there are other hdisks that show the same exact error.

I do have a backup but this is a large production box. Each LUN is about 256Gig, and I don't think I have the option to remove and restore. I wish I did :)

I will ask the SAN guys to see if there is a LUN problem, and if they zoned everything today, and if there are any zoning changes.
 
Can you still get at the data? I.e. are other paths to these LUNs still available? I don't know the powerpath commands, but on MPIO disks, I do "lspath" or "pcmpath query device" and on IBM SDD vpaths I do "datapath query device" to check the multipath device access.

Perhaps somewhere in the SAN there is a failed ISL or failed EMC controller adapter which is giving you this trouble.


HTH,

p5wizard
 
p5wizard:

It's Saturday here, so I will ask them on Monday. What is an "ISL"?

I am using powerpath, most likely I have to do 'powermt display'

TIA
 
ISL: Inter-Switch Link

I'd ask the SAN guys to check out all the SAN switches and the links between them and also the links to the EMC boxes.


HTH,

p5wizard
 
0600 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0102 0000 7000 0200
0000 000A 0000 0000 0403 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000
Root Cause: Any command or application trying to access a device that is in a NOT READY state to the host can potentially see these errors.
For example the CLARiiON array is an active/passive device and can only respond on the active path.
Fix: Below is a list of the different types of “0102 errors with ASC of 0403” seen on AIX hosts with CLARiiON or Symmetrix attached.
An ASC of 0403 correlates to LOGICAL UNIT NOT READY, MANUAL INTERVENTION REQUIRED.
These errors do not cause host or application failures but are a nuisance to customers as they fill up error logs and often generate failure tickets which must be processed.

1). AIX host connected to CLARiiON logs one SC_DISK_ERR2 (0102 with ASC 0403) error per passive hdisk device about every 2 hours.
FIXED - This was caused by a bug in ECC 5.1 and is fixed in ECC hotfix 1404. This fix is included in ECC 5.1 SP2 or ECC 5.2 SP1 which is now GA.

2). AIX host connected to CLARiiON logs one SC_DISK_ERR2 (0102 with ASC 0403) error per passive hdisk device about every 24 hours which is when the Tivoli utility “/usr/tivoli/tsm/tdp_r3/ora/ProLE -p tdpr3ora” starts. When ProLE from TDP starts, the errors are reported. The process is "fired" every 24 to 25 hours, depending on SAP R3.
NO FIX to date. Have customer open case with IBM Tivoli support.

3). Both CLARiiON and Symmetrix customers have reported seeing one SC_DISK_ERR2 (0102 with ASC 0403) error on any hdisk in a “Not Ready” state when they run the “importvg” command in AIX 5.1. A “Not Ready” device would include any BCV or Clone device in an established state, or devices locked "varied on" by another host in a shared disk environment. The errors are not reported on ‘Passive’ CLARiiON devices. This cannot be reproduced in AIX 4.3.3! It happens with or without PowerPath!
A workaround is to use the “-F” option as follows: “importvg -y <vgname> hdiskpower<#> -F”

4). AIX host connected to CLARiiON logs one SC_DISK_ERR2 (0102 with ASC 0403) error per passive hdisk device about every 7 days which is when a scheduled script containing the "snap -L" command is run. The error is reported back by the diagnostic "diagela" utility.
Workarounds: do not run the "snap -L" command or disable "diagela" for the CLARiiON devices (see solution emc73167 for more details of this occurrence).


NOTE: See solution emc65499 for more information.
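
Since three of the four cases above differ mainly in how often the error recurs (about every 2 hours, every 24 to 25 hours, every 7 days), the gap between errors on one hdisk is a quick first filter. The Python sketch below is only an illustration of that idea: the function name, the 15% tolerance, and the sample timestamps after the first one are assumptions, not part of the EMC solution.

```python
from datetime import datetime, timedelta
from statistics import median

# Recurrence periods taken from the cases listed above.
KNOWN_PERIODS = [
    (timedelta(hours=2), "case 1: ECC 5.1 bug (hotfix 1404)"),
    (timedelta(hours=24), "case 2: Tivoli TDP ProLE start"),
    (timedelta(days=7), "case 4: scheduled snap -L / diagela"),
]

def likely_cause(timestamps, tolerance=0.15):
    """Match the median gap between errors on one hdisk to a known period.

    `tolerance` is an arbitrary fraction of the period; 15% of 24 hours
    still covers the "24 to 25 hours" wording of case 2.
    """
    stamps = sorted(timestamps)
    if len(stamps) < 2:
        return "single occurrence: compare with case 3 (importvg)"
    gaps = [(b - a).total_seconds() for a, b in zip(stamps, stamps[1:])]
    typical = median(gaps)
    for period, label in KNOWN_PERIODS:
        target = period.total_seconds()
        if abs(typical - target) <= tolerance * target:
            return label
    return "no known period matched"

# First timestamp is from the errpt entry in this thread; the later ones
# are hypothetical, added only to show a daily pattern.
first = datetime(2006, 10, 6, 8, 0, 40)
daily = [first + timedelta(hours=24 * i) for i in range(4)]
print(likely_cause(daily))
```

A match only narrows things down. You would still confirm by checking what actually runs on the host at those times (ECC agent, ProLE, a cron job calling snap).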


 
dukessd, didn't realize i hijacked the thread. sorry. i searched on the error code i had and got this thread with the same error.

thanks, plamb. that's exactly the problem i'm having... and it's exactly the problem that mag0007 is having.

So does anyone know how to disable diagela for specific devices? I looked on IBM's site before I came here.
 
p5wizard:

To answer your questions: yes, I am able to get to my disks (hdiskpowerXX). All the errors in my errpt show hdiskXX, not hdiskpowerXX (if they showed hdiskpowerXX, I don't think I could access the data).

My question is: what is hdiskXX on my server? Is it a virtual path?

Also, I will be speaking with my SAN team later today :)


 
IMHO hdiskpowerxx is a collection of hdiskyy + hdiskzz + hdisknn + hdiskmm which represent the different paths (via different SAN fabrics) to the same LUNs. AIX native sees hdisk devices (different numbers for the same LUNs). The powerpath driver joins these hdisks for the same LUN into a hdiskpower pseudo device.

If the EMC box has active/passive controllers, as has been stated here before, then it is normal for hdiskxx disks to show up in the error log. You'd also see inactive hdisks if you run the powermt display command. Then follow the other posts in this thread on how to disable diagela for these inactive paths.

I'm comparing to IBM SDD driver (collection of hdisks into a vpath pseudo disk device) which I assume works similarly to powerpath.
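
To make the hdisk versus hdiskpower relationship concrete, here is a small Python sketch of the lookup described above. Everything in the table is hypothetical: the device names, the grouping, and the "active"/"passive" labels are typed in by hand for illustration and are not parsed from real `powermt display` output.

```python
# Hypothetical mapping of one PowerPath pseudo device to its native hdisk
# paths. "passive" here just marks a path to the controller that does not
# currently own the LUN; names and states are invented for this example.
PATHS = {
    "hdiskpower5": {
        "hdisk12": "active",
        "hdisk32": "passive",
        "hdisk52": "active",
        "hdisk72": "passive",
    },
}

def explain(resource, paths=PATHS):
    """Say how an errpt resource name relates to the pseudo devices."""
    for pseudo, natives in paths.items():
        if resource == pseudo:
            return f"{resource}: error on the pseudo device itself, investigate"
        if resource in natives:
            if natives[resource] == "passive":
                return f"{resource}: passive path of {pseudo}, error is expected noise"
            return f"{resource}: active path of {pseudo}, investigate"
    return f"{resource}: not a known PowerPath path"

print(explain("hdisk32"))
```

The point is only that an error logged against a native hdisk on a passive path says nothing about whether the hdiskpower device on top of it is usable.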



HTH,

p5wizard
 