brianlhunt
IS-IT--Management
- May 20, 2005
- 6
PROBLEM
The CPQRAID driver is failing the array controller
DETAILS:
1. A change in controller status occurs. The message we have seen is:
CPQRAID: The Controller in slot 0 has been failed by the driver and requires
corrective action.
2. The Insight Manager console receives a trap for a controller status change event. When the event is viewed from the Insight Manager console, the array controller is shown in powered off state.
3. At this point, one of two scenarios occurs, depending on the hardware configuration.
3a. For hard disk storage attached to the affected controller, all NSS pools proceed to deactivate and all volumes dismount. Additionally, the C: drive also becomes inaccessible.
3b. For tape drive hardware attached to the affected controller, the tape drive becomes inaccessible and the backup software finds it can no longer communicate with the tape drive.
4. Depending on scenario 3a or 3b above, the stability of the server will vary.
4a. With all the volumes dismounted, the server typically becomes unstable and unresponsive, and network communications also fail. No core dump can be gathered locally or remotely at this point, because the server can communicate with neither local storage nor the network. No log information is written either, since the volumes have dismounted. The server typically has to be power cycled to recover, but there have been instances where it was possible to gracefully reset it.
4b. With the tape drive inaccessible, the server continues to function normally, but cannot be backed up until it is reset.
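For anyone who wants to watch for the message from item 1 in captured console text or forwarded alert text, here is a minimal Python sketch. The pattern is built only from the wording quoted in this post; the exact format on a given server, and where the text would be captured from, are assumptions. In scenario 3a nothing is logged locally, so this would only help with output captured off the server or in the tape-only scenario 3b.

```python
import re

# Pattern taken from the message wording quoted in this post.
# The real console/log layout may differ; treat this as an assumption.
FAILED_CONTROLLER = re.compile(
    r"CPQRAID:\s*The Controller in slot (\d+) has been failed by the driver",
    re.IGNORECASE,
)

def failed_slots(text):
    """Return the slot number from every controller-failed message in text."""
    # Collapse all whitespace so a message wrapped across lines still matches.
    flattened = " ".join(text.split())
    return [int(slot) for slot in FAILED_CONTROLLER.findall(flattened)]

# The message as quoted above, wrapped across three lines.
sample = """CPQRAID:
The Controller in slot 0 has been failed by the driver and requires
corrective action"""

print(failed_slots(sample))
```

The function returns an empty list when no such message is present, so it can be run over any block of captured text.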
Currently no one has been able to duplicate the problem in the lab.
Novell and HP are engaged, but they are no further along in identifying a solution than I am.
The biggest hurdle in troubleshooting the issue, other than the inability to gather server state information when the problem occurs, is how infrequently the events happen. On most servers the problem occurs once and has not recurred. A number of servers have had multiple occurrences, though, and they average about 90 days between events.
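To make the "days between events" figure concrete, here is a small Python sketch of how the average gap per server could be computed from a list of event dates. The server names and dates are hypothetical placeholders chosen for illustration, not data from the affected servers.

```python
from datetime import date

def mean_days_between(event_dates):
    """Average gap in days between consecutive events; None if fewer than two."""
    ordered = sorted(event_dates)
    if len(ordered) < 2:
        return None  # a one-and-done server has no interval to measure
    gaps = [(later - earlier).days for earlier, later in zip(ordered, ordered[1:])]
    return sum(gaps) / len(gaps)

# Hypothetical example data: one repeat offender, one one-and-done server.
events = {
    "SERVER-A": [date(2005, 1, 1), date(2005, 4, 1), date(2005, 6, 30)],
    "SERVER-B": [date(2005, 2, 10)],
}

for server, dates in events.items():
    print(server, mean_days_between(dates))
```

Servers with a single event return None, which keeps them from skewing the average for the servers that do repeat.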
KEY FACTORS:
HP server hardware
> ProLiant DL380 G2, G3, ML530 G1, G2
Smart Array controllers
> 5i, 6i, 5300, 6400
Novell NetWare 6
> problem started with SP3 and continues with SP5
Post SP5 NSS Modules for NetWare 6
> NW6NSS5b
ProLiant Support Pack 7.10 (currently, but had problem with 6.10)
> CPQRAID 2.11 (currently, but also had problem with 2.05, 2.08, and 2.09)
UNLIKELY CAUSES:
- NSS
- Server Load
LIKELY CAUSES:
#1 CPQRAID
#2 Insight Management Agents
POSSIBLE WORKAROUND:
Unload Insight Management Agents
WHAT ACTIONS HAVE BEEN TAKEN
1. Insight Management Agents have been unloaded and the problem effectively went away, BUT they may not have been unloaded for a long enough period to make that a solid determination. We recently had a one (1) month period where the problem did not occur once on any server.
2. Have removed select Insight Management Agents from loading and determined that with even CPQDASA, CPQSCSA, and CPQSSSA unloaded, the problem still occurs. I was counting on CPQDASA being the problem at one time, but this proved me wrong.
3. On a server where the 5i was failed by the driver, we were able to:
3a. Unload ARCserve
3b. Unload CPQRAID
3c. Reload CPQRAID
3d. Mount All
3e. Load ARCserve successfully and have it find the controller and tape drive.
(Insight Manager console also reported a status change to OK)
NEXT STEPS
1. Attempt the same unload and reload of CPQRAID on a server that had the volumes dismount, but do not execute a mount all. Instead, just see if we can communicate with the C: drive. If we can, attempt to get a core dump.
MY CALL FOR HELP
1. Has anyone else run into the type of problem I have detailed here?
2. Does anyone have any insights regarding the problem and a potential fix?
Thanks.
BRiAN HUNT