errpt -a?

Status
Not open for further replies.

khalidaaa

Technical User
Jan 19, 2006
2,323
BH
Hi all,

I came in this morning and ran errpt -a. Here is the output:

Code:
---------------------------------------------------------------------------
LABEL:          FCP_ARRAY_ERR4
IDENTIFIER:     D5385D18

Date/Time:       Thu Jun  8 20:15:43 SAUST 2006
Sequence Number: 2212
Machine Id:      00CF359F4C00
Node Id:         s1cdbp
Class:           H
Type:            TEMP
Resource Name:   hdisk3          
Resource Class:  disk
Resource Type:   array
Location:        U7879.001.DQDGAFT-P1-C2-T1-W200200A0B817BAC1-L1000000000000

Description
ARRAY OPERATION ERROR

Probable Causes
ARRAY DASD MEDIA
ARRAY DASD DEVICE

Failure Causes
DASD MEDIA
DISK DRIVE

        Recommended Actions
        PERFORM PROBLEM DETERMINATION PROCEDURES

Detail Data
SENSE DATA
0A00 2A00 00AE 3588 0000 0804 0000 0000 0000 0000 0000 0127 0200 0300 0000 0000 
0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 
0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 
0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 
0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 
0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 
0000 0000 36FB 6000 F705 3207 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 
0000 0000 
---------------------------------------------------------------------------
LABEL:          FCP_ARRAY_ERR4
IDENTIFIER:     D5385D18

Date/Time:       Thu Jun  8 20:15:43 SAUST 2006
Sequence Number: 2211
Machine Id:      00CF359F4C00
Node Id:         s1cdbp
Class:           H
Type:            TEMP
Resource Name:   hdisk2          
Resource Class:  disk
Resource Type:   array
Location:        U7879.001.DQDGAFT-P1-C2-T1-W200200A0B817BAC1-L0

Description
ARRAY OPERATION ERROR

Probable Causes
ARRAY DASD MEDIA
ARRAY DASD DEVICE

Failure Causes
DASD MEDIA
DISK DRIVE

        Recommended Actions
        PERFORM PROBLEM DETERMINATION PROCEDURES

Detail Data
SENSE DATA
0A00 2A00 0007 2410 0000 1004 0000 0000 0000 0000 0000 0A1A 0200 0300 0000 0000 
0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 
0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 
0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 
0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 
0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 
0000 0000 36FA B000 F705 3207 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 
0000 0000 
---------------------------------------------------------------------------
LABEL:          TS_NIM_ERROR_STUCK_
IDENTIFIER:     864D2CE3

Date/Time:       Thu Jun  8 20:15:43 SAUST 2006
Sequence Number: 2210
Machine Id:      00CF359F4C00
Node Id:         s1cdbp
Class:           S
Type:            PERM
Resource Name:   topsvcs         

Description
NIM thread blocked

Probable Causes
A thread in a Topology Services Network Interface Module (NIM) process
was blocked
Topology Services NIM process cannot get timely access to CPU
User Causes
Excessive memory consumption is causing high memory contention
Excessive disk I/O is causing high memory contention

        Recommended Actions
        Examine I/O and memory activity on the system
        Reduce load on the system
        Tune virtual memory parameters
        Call IBM Service if problem persists

Failure Causes
Excessive virtual memory activity prevents NIM from making progress
Excessive disk I/O traffic is interfering with paging I/O

        Recommended Actions
        Examine I/O and memory activity on the system
        Reduce load on the system
        Tune virtual memory parameters
        Call IBM Service if problem persists

Detail Data
DETECTING MODULE
rsct,nim_control.C,1.39.1.2,5492              
ERROR ID 
6XnGH40zg3W2/N0A/K4U1/0...................
REFERENCE CODE
                                          
Thread which was blocked
receive thread
Interval in seconds during which process was blocked
          37
Interface name
rhdisk2

Does anyone know what's wrong?

Regards
Khalid
 
Just a guess....
Ask the SAN guys if they were zoning a switch at this time.
As the disk errors are "Type: TEMP", AIX recovered, but for a moment there was no response from hdisk2 and hdisk3, which are obviously SAN-attached disks on a POWER5 machine.
You see this sort of thing when a switch has a bit of a flutter because its zoning is updating (check for fibre fileset and firmware updates, and ask the SAN chaps to look into switch firmware updates).
If they were not playing with the switch, ask them to look at the switch logs, or ask storage to look at the storage device logs for the time your p5 posted these messages.
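
To see at a glance which entries are TEMP and which are PERM, and which resources they hit, saved errpt -a output can be summarized with a few lines of script. This is only a minimal sketch: it assumes the report was saved as text and that entries are separated by dashed lines as in the post above. The names `parse_errpt` and `summarize` are made up for illustration and are not part of any AIX tool, and the sample is a shortened copy of the posted entries.

```python
import re
from collections import Counter

# Header fields to pick out of each entry, spelled as errpt prints them.
FIELDS = ("LABEL", "Date/Time", "Class", "Type", "Resource Name")


def parse_errpt(text):
    """Split saved `errpt -a` text into one dict per entry."""
    entries = []
    # Assumption: entries are separated by long dashed lines.
    for chunk in re.split(r"^-{20,}\s*$", text, flags=re.MULTILINE):
        entry = {}
        for line in chunk.splitlines():
            key, sep, value = line.partition(":")
            if sep and key.strip() in FIELDS:
                # Keep the first occurrence of each field.
                entry.setdefault(key.strip(), value.strip())
        if "LABEL" in entry:
            entries.append(entry)
    return entries


def summarize(entries):
    """Count entries per (Type, LABEL, Resource Name)."""
    return Counter(
        (e.get("Type", "?"), e["LABEL"], e.get("Resource Name", "?"))
        for e in entries
    )


# Shortened copy of the three entries from the original post.
SAMPLE = """\
---------------------------------------------------------------------------
LABEL:          FCP_ARRAY_ERR4
Date/Time:       Thu Jun  8 20:15:43 SAUST 2006
Class:           H
Type:            TEMP
Resource Name:   hdisk3
---------------------------------------------------------------------------
LABEL:          FCP_ARRAY_ERR4
Date/Time:       Thu Jun  8 20:15:43 SAUST 2006
Class:           H
Type:            TEMP
Resource Name:   hdisk2
---------------------------------------------------------------------------
LABEL:          TS_NIM_ERROR_STUCK_
Date/Time:       Thu Jun  8 20:15:43 SAUST 2006
Class:           S
Type:            PERM
Resource Name:   topsvcs
"""

for key, count in sorted(summarize(parse_errpt(SAMPLE)).items()):
    print(count, *key)
```

All three entries carry the same timestamp, which fits the idea of a single short interruption on the path to the disks rather than two independent disk faults.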
 
Thanks DukeSSd

I believe this is the second time I have received these types of errors, but as far as I know no one was working on the SAN at the time. I'm the admin of the SAN (though I don't know much about it, as it was installed by the IBM engineer),

so I expect the SAN to be reasonably stable.

Yes, you are right: it's a p5 570 machine attached to the SAN, and this LPAR is clustered with a second LPAR located in another p5 570.

I believe the switch was zoned by the IBM engineer, and no changes have been made to it since.

I will try to log in to the switch and see.

Any other thoughts?

Thanks

Regards,
Khalid
 
While I was trying to find a solution to my problem (I expected more people than just DukeSSD to help, though I'm very thankful to him :))

I found this link, which discusses the last error in my post:


But I still couldn't work out why this "trying to take over" is happening.

Your help is appreciated folks.

Thanks

Regards,
Khalid
 
Is your node part of an HA cluster? If not, then there is no takeover in the HA sense, because there is no HA installed.
If you don't use HA, I think the APAR in your link points to another problem with the same symptoms.
Since it was only TEMP and you have to rely on the IBM technician for SAN matters, check that all cables are correctly attached (maybe the cleaning team got a bit too harsh with their tools, hehe) and check your array's logs, if there are any. Maybe you can see a trespassed LUN or something else that looks suspect.


laters
zaxxon
 
Zaxxon,

Thank you very much for replying to this post; I had been waiting a long time.

Yes, I'm using HACMP, and yes, this node is clustered with another one.

I don't think it's the cleaning team, because they don't reach that far :)

Regards
Khalid
 
As for the APAR description from IBM that corresponds to the last errpt entry you posted: it just says that the NIM (the Topology Services Network Interface Module named in that entry) had no access to the disks, and it guesses that there was a lot of traffic, which might have interrupted communication with the disks/array.
You should definitely check your HACMP logs/traces to see whether a takeover or something similar happened at the time you got the errpt entries for the disk problems.
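
That timestamp check could be scripted roughly like this. It is a sketch only: it assumes an English locale for the day and month names, it drops the timezone token (SAUST) instead of interpreting it, and the takeover time is something you would read out of your HACMP logs by hand. The 20:16:30 value below is invented purely for the example, and `parse_errpt_time` and `within_window` are made-up names.

```python
from datetime import datetime, timedelta


def parse_errpt_time(stamp):
    """Parse an errpt Date/Time value like 'Thu Jun  8 20:15:43 SAUST 2006'."""
    # Tokens: weekday, month, day, clock, zone, year.
    weekday, month, day, clock, _zone, year = stamp.split()
    # The zone name is dropped; both times are treated as local time.
    return datetime.strptime(
        f"{weekday} {month} {day} {clock} {year}", "%a %b %d %H:%M:%S %Y"
    )


def within_window(stamp, event_time, seconds=120):
    """True if the errpt entry lies within `seconds` of event_time."""
    delta = abs(parse_errpt_time(stamp) - event_time)
    return delta <= timedelta(seconds=seconds)


# Hypothetical takeover time taken from the cluster logs (invented value).
takeover = datetime(2006, 6, 8, 20, 16, 30)

for stamp in ("Thu Jun  8 20:15:43 SAUST 2006",
              "Wed Jun 14 23:25:17 SAUST 2006"):
    print(stamp, within_window(stamp, takeover))
```

The two-minute default window is arbitrary; widen it if the clocks on the cluster nodes are not kept in sync.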

We had something similar in the last few days, where some disks were not accessible for several minutes. The machine itself looked fine, but the errpt entries showed it losing the paths to the disks while no one was working on the SAN switch or array.

Maybe your SAN guy has some tool to log the switch, or can at least write down the error counters every 4 hours or so, so that if the problem occurs again you can check whether they have grown significantly.
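
Comparing two such snapshots is easy to script once the counters are written down. A minimal sketch, assuming you record them per port as plain numbers; the port names and counter names below (`crc_err`, `link_fail`) and all the values are invented for illustration and are not taken from any particular switch.

```python
def counter_growth(before, after, threshold=0):
    """Return {(port, counter): delta} for counters that grew by more than threshold."""
    grown = {}
    for port, counters in after.items():
        old = before.get(port, {})
        for name, value in counters.items():
            # A counter missing from the older snapshot is treated as 0.
            delta = value - old.get(name, 0)
            if delta > threshold:
                grown[(port, name)] = delta
    return grown


# Two hypothetical snapshots taken 4 hours apart (made-up values).
snapshot_0800 = {"port3": {"crc_err": 2, "link_fail": 0},
                 "port7": {"crc_err": 0, "link_fail": 1}}
snapshot_1200 = {"port3": {"crc_err": 57, "link_fail": 3},
                 "port7": {"crc_err": 0, "link_fail": 1}}

for (port, name), delta in sorted(counter_growth(snapshot_0800, snapshot_1200).items()):
    print(f"{port} {name} grew by {delta}")
```

Counters that went down (for example after a switch reboot reset them) are simply ignored here, so a reset between snapshots will hide growth.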

Check your array for errors, or whether any LUNs are trespassed or something like that.

A firmware update of the SAN switch or the array might help if it occurs again and no one can find a culprit.
It is kind of a helpless situation, I know it myself ;)

laters
zaxxon
 
Thanks zaxxon for your help again :)

I'm looking after the SAN as well (so there are no SAN guys :))

To be honest, I forgot how to log in to the switch and check the logs :p It was set up by a consultant a long time ago and I haven't had to do it since, but I will try to google it :)

I got the following error yesterday as well on this machine :(

Code:
LABEL:          TS_LOC_DOWN_ST
IDENTIFIER:     173C787F

Date/Time:       Wed Jun 14 23:25:17 SAUST 2006
Sequence Number: 2216
Machine Id:      00CF359F4C00
Node Id:         s1cdbp
Class:           S
Type:            INFO
Resource Name:   topsvcs         

Description
Possible malfunction on local adapter

Probable Causes
Local adapter mal-functioned
Local adapter lost connection to network
Local adapter mis-configured

Failure Causes
Local adapter mal-functioned
Local adapter lost connection to network
Local adapter mis-configured

        Recommended Actions
        Verify adapter configuration
        Verify network connectivity

Detail Data
DETECTING MODULE
rsct,nim_control.C,1.39.1.2,4822              
ERROR ID 
6zV5DL.h05Y2/Gol1K4U1/0...................
REFERENCE CODE
                                          
Adapter interface name
en0
Adapter offset
           0
Adapter IP address
10.1.1.150

I don't know why this happened, but it lasted for some time and then went back to normal.

I haven't taken a course in HACMP, but I'm supposed to support it.

Regards,
Khalid
 
Ah yes, the NFS thingy from the other thread, I see. For this, check the other thread; I wrote an answer already.

laters
zaxxon
 
I've already checked that :)

Thanks

regards,
Khalid
 