CPQRAID fails controller on NetWare 6

brianlhunt · May 20, 2005

PROBLEM

The CPQRAID driver is failing the array controller

DETAILS:

1. A change in controller status occurs (have seen this message - CPQRAID: The Controller in slot 0 has been failed by the driver and requires corrective action).
2. The Insight Manager console receives a trap for a controller status change event. When the event is viewed from the Insight Manager console, the array controller is shown in powered off state.
3. At this point, there are two different scenarios that may occur based on the hardware configuration.

3a. For hard disk storage attached to the affected controller, all NSS pools proceed to deactivate and all volumes dismount. Additionally, the C: drive also becomes inaccessible.

3b. For tape drive hardware attached to the affected controller, the tape drive becomes inaccessible and the backup software finds it can no longer communicate with the tape drive.

4. Depending on scenario 3a or 3b above, the stability of the server will vary.

4a. With all the volumes dismounted, the server typically becomes unstable and unresponsive. Network communications also fail. No core dump can be gathered locally or remotely at this point due to the inability to communicate with local storage or across the network, and no log information is written since the volumes have dismounted. The server typically has to be power cycled to recover, but there have been instances where it was possible to gracefully reset the server.

4b. With the tape drive inaccessible, the server continues to function normally, but cannot be backed up until it is reset.

Currently no one has been able to duplicate the problem in the lab.

Novell and HP are engaged and are no farther in identifying a solution than I am.

The biggest hurdle in troubleshooting the issue, other than the inability to gather server state information when the problem occurs, is the low frequency of the events. For the most part, the problem occurs once on a server and may not occur again. Lots of one-and-done servers, but there have been a number of servers with multiple occurrences, and they average about 90 days between events.

KEY FACTORS:

HP server hardware
> ProLiant DL380 G2, G3, ML530 G1, G2
Smart Array controllers
> 5i, 6i, 5300, 6400
Novell NetWare 6
> problem started with SP3 and continues with SP5
Post SP5 NSS Modules for NetWare 6
> NW6NSS5b
ProLiant Support Pack 7.10 (currently, but had problem with 6.10)
> CPQRAID 2.11 (currently, but also had problem with 2.05, 2.08, and 2.09)

UNLIKELY CAUSES:

- NSS
- Server Load

LIKELY CAUSES:

#1 CPQRAID
#2 Insight Management Agents

POSSIBLE WORKAROUND:

Unload Insight Management Agents

WHAT ACTIONS HAVE BEEN TAKEN

1. Insight Management Agents have been unloaded and the problem effectively went away, BUT they may not have been unloaded for a long enough period to make that solid determination. We actually had a one (1) month period recently where the problem did not occur once on any servers.
2. Have removed select Insight Management Agents from loading and determined that with even CPQDASA, CPQSCSA, and CPQSSSA unloaded, the problem still occurs. I was counting on CPQDASA being the problem at one time, but this proved me wrong.
3. On a server where the 5i was failed by the driver, we were able to:

3a. Unload ARCserve
3b. Unload CPQRAID
3c. Reload CPQRAID
3d. Mount All
3e. Load ARCserve successfully and have it find the controller and tape drive.
(Insight Manager console also reported a status change to OK)

NEXT STEPS

1. Attempt the same actions of unloading and reloading CPQRAID on a server that had the volumes dismount, but not execute a mount all and just see if we can communicate with the C: drive. If we can communicate, attempt to get a core dump.

MY CALL FOR HELP

1. Has anyone else run into this type of problem I have detailed here?
2. Does anyone have any insights regarding the problem and a potential fix?

Thanks.

BRiAN HUNT

terry712 · May 21, 2005

never seen it before

what version of the insight agents are you using. i would be inclined to backrev to say version 6.4 or something like that

marvhuffaker · May 21, 2005

Exactly how many servers are experiencing this?

I have not seen this specific problem before, but I have seen agents crash servers before. So I would suspect the insight agents themselves.

Maybe you could have everything loaded on one less-important server, and remove the suspect components from all of the remaining systems.. Run that way for a while and see how it goes. Try to narrow things down.

Another thought.. You've had this problem since NW6 SP3? There were some probs with SP2 that could cause corruption to the NSS pools. Patching the servers with Post SP2 patches and beyond fixes the code but not the pools themselves. Have you tried verifying the pools to make sure there are no problems?

Marvin Huffaker, MCNE

http://www.redjuju.com

brianlhunt · May 23, 2005

Terry,

Thanks for your reply.

Backreving to a previous version of Insight Management Agents is not an option. Version 6.10 had the problem. Don't know about version 6.40. 7.10 has the problem though. We are currently looking at version 7.30.

Additionally, the newer agents are critical to proper hardware reporting with regard to newer hardware (i.e. ProLiant DL380 G4).

Thanks again.

BRiAN L. HUNT

brianlhunt · May 23, 2005

Marvin,

Thanks for your reply.

We have tried not loading certain insight management agents but the problem has persisted. The only option I see with the agents is to unload them fully and see what happens. Unfortunately, based on the problem history, it could take as long as 90 days before we could possibly state the problem is with the agents.

I don't suspect NSS as the root of this problem. While I can see your line of thinking with regard to the NSS pools created with SP2, we have actually had new servers built on SP5 experience the problem shortly after going into production.

I am still targeting CPQRAID as the root cause. I just need to prove it to HP.

Thanks anyway.

BRiAN L. HUNT

marvhuffaker · May 23, 2005

Brian, I know that what you are dealing with is frustrating, and it's always nice when you have a quick solution. Unfortunately, sometimes you have to be patient and play the 'wait and see' game.

If there was a quick answer, it sounds like you would have already found it.

Marvin Huffaker, MCNE

http://www.redjuju.com

terry712 · May 24, 2005

do you have this problem on all servers or just a few?

i've never seen this at all - and all our stuff is compaq - to be truthful i have a mix of agents just now - mainly 7.2 and a few 6.4's but i have used most - the only prob i've seen was on 6.3 with netware 4.11 - if you loaded cpqdasa and went into edit the array froze

are all server clean builds or are they upgrades

built from a smartstart ?

LLindsay · May 24, 2005

I've seen something similar to this. We are a "Proliant" shop. Our problem occurred during the load of the CPQSNMP.NLM. A triggered module called CPQBSSA.NLM would not load and for some reason caused problems, especially with our DL380s and DL360s clusters attached to our IBM SAN (Shark) and MSA500s. The problem went away after running the current CPQDPLOY.NLM.

http://h18023.www1.hp.com/support/files/server/us/download/22273.html

brianlhunt · May 25, 2005

I am going to respond to a few of the reply posts in one post here.

1. We are having the problem on 150+ servers which makes up about 1/3 of our NetWare 6 environment at the present.
2. Problem has been occurring since June 2004. (Started almost one year ago, so therein lies my frustration with not having a root cause)
3. All server builds are clean builds.
4. We do not build from SmartStart.
5. Our Insight Management agents are deployed using CPQDPLOY.NLM. Version 7.10 was just deployed in the Dec/Jan timeframe and we are now preparing to deploy 7.30.

Thanks to all of you who have replied with your comments and suggestions.

BRiAN L. HUNT

marvhuffaker · May 25, 2005

If I was HP & Novell... with a customer as big as you, I'd have someone onsite trying to kick this one in the rear.

Marvin Huffaker, MCNE

http://www.redjuju.com

marvhuffaker · May 25, 2005

Also, I know you mentioned that you need the agents for reporting, etc.. But really, which one is more important? Reporting or a stable server?

Obviously you are frustrated and this has been eating at you, and having A.T. rag on you in the N forums surely doesn't help.. But if you step back and look at the big picture, I think you could slice this up and create a few different scenarios to test your theories with. No, it won't be a quick solution, we've already established that.

But what if you could remove the agents from 10 servers... If they are still running stable in 6 months, you can make a good case against the agents. If they still crash, well you can eliminate them as a suspect. Work down the line like that, it's the best you can do.
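A rough way to judge how convincing such a test would be is to estimate the chance that the chosen servers stay quiet by luck alone. The Python sketch below assumes events arrive independently at a constant per-server rate (a Poisson model), takes the fleet figures quoted earlier in the thread as a lower bound on that rate, and guesses an observation window of about 340 days; all constant names and the window length are illustrative assumptions, not measured values.

```python
import math

# Assumptions (illustrative only): events arrive independently at a constant
# per-server rate, and the fleet figures quoted in the thread give a lower
# bound on that rate.
AFFECTED_SERVERS = 150              # "150+ servers"
FLEET_SIZE = 3 * AFFECTED_SERVERS   # affected servers are about 1/3 of the fleet
OBSERVATION_DAYS = 340              # roughly June 2004 to May 2005 (assumed)

TEST_SERVERS = 10
TEST_DAYS = 180                     # about six months


def prob_no_events(rate_per_server_day, servers, days):
    """Poisson probability of seeing zero events in the test window."""
    return math.exp(-rate_per_server_day * servers * days)


# Lower-bound rate: at least one event per affected server, spread over the fleet.
fleet_rate = AFFECTED_SERVERS / (FLEET_SIZE * OBSERVATION_DAYS)
# Rate on servers with repeat events: about one event every 90 days.
repeat_rate = 1 / 90

print(f"randomly chosen servers: {prob_no_events(fleet_rate, TEST_SERVERS, TEST_DAYS):.3f}")
print(f"repeat-event servers:    {prob_no_events(repeat_rate, TEST_SERVERS, TEST_DAYS):.1e}")
```

Under these assumptions, ten randomly chosen servers would stay quiet for six months roughly one time in six even if the agents had nothing to do with the problem, so the test is far more telling if the ten servers are picked from those with a history of repeat events.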

I'd also bet money that Compaq is already onto the problem, but they aren't fessing up to anything. Maybe a new agent release will come out that magically fixes it. They'll admit no blame and your servers will be stable again.

BTW, you said you can't seem to get coredumps or abend logs, but can you get a Compaq system event log? You should at least get SOMETHING in there..

I also recently had a problematic environment where a few of the main DS servers would abend at least once a day. It took a full year for Novell to identify the bug in their licensing module and provide me with a fix. This doesn't help you, but I do understand your pain. Downtime sucks.

Marvin Huffaker, MCNE

http://www.redjuju.com

brianlhunt · Sep 27, 2005

!UPDATE!

Breakthrough Test

While doing some research, some technical documentation was reviewed that indicated drives would deactivate when utilities scan or perform management on the drives. For example, LOAD CONFIG, LIST DEVICES, and SCAN FOR NEW DEVICES are utilities/commands that can scan or perform management on the drives.

I used this information and accelerated the regular server processes on a lab server to perform the same daily processes in less time and stress the server. My first iteration triggered an event in approximately 16 hours (in normal time we were seeing spans of 90 days between events) and through changes to the timing of the processes (executing a script called TRIGGER.NCF every minute), I was able to trigger an event in less than two (2) hours.

The TRIGGER.NCF script is as shown here:

LIST DEVICES
?Y # Waiting for 10 seconds...
SCAN FOR NEW DEVICES
?Y # Waiting for 10 seconds...
LOAD CONFIG /merna0vbl
?Y # Waiting for 10 seconds...
LIST DEVICES
?Y # Waiting for 10 seconds...
SCAN FOR NEW DEVICES
?Y # Waiting for 10 seconds...
LOAD CONFIG /merna0vbl

This test was a significant step forward in identifying the root cause of the problem. This information and script were provided to HP for them to accelerate their testing with an easily repeatable process to trigger the event.

Cause

HP has determined the cause of this event which we have named the All Drive Partitions Unavailable problem. Through analysis and troubleshooting, HP has isolated the condition to the following events.

1. A scan command from the driver to the firmware gets stuck.
2. The array controller driver tries to abort the command and times out trying.
3. The driver fails the controller as a result of the time-out.
4. The array is taken off-line when the driver fails the controller.
5. The NSS pools deactivate and volumes dismount as a result of the array being taken off-line.
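The sequence above can be pictured as a simple chain of states. The following Python sketch is only a toy model of that chain, not CPQRAID's actual logic: the state names, the `failure_chain` function, and the assumption that a successful abort returns the controller to a healthy state are all illustrative.

```python
from enum import Enum, auto


class State(Enum):
    HEALTHY = auto()
    SCAN_STUCK = auto()
    ABORT_TIMED_OUT = auto()
    CONTROLLER_FAILED = auto()
    ARRAY_OFFLINE = auto()
    VOLUMES_DISMOUNTED = auto()


def failure_chain(scan_gets_stuck: bool, abort_times_out: bool) -> list:
    """Return the states visited, following the five steps HP described."""
    visited = [State.HEALTHY]
    if not scan_gets_stuck:
        return visited                          # scan completed normally
    visited.append(State.SCAN_STUCK)            # step 1: scan command hangs
    if not abort_times_out:
        visited.append(State.HEALTHY)           # assumed: a clean abort recovers
        return visited
    visited.append(State.ABORT_TIMED_OUT)       # step 2: abort attempt times out
    visited.append(State.CONTROLLER_FAILED)     # step 3: driver fails the controller
    visited.append(State.ARRAY_OFFLINE)         # step 4: array goes off-line
    visited.append(State.VOLUMES_DISMOUNTED)    # step 5: pools deactivate, volumes dismount
    return visited


for stuck, timed_out in [(False, False), (True, False), (True, True)]:
    path = " -> ".join(s.name for s in failure_chain(stuck, timed_out))
    print(f"stuck={stuck}, abort_times_out={timed_out}: {path}")
```

The point of the model is that the outage needs two things to go wrong in a row: the scan has to hang, and the abort has to time out. That may be part of why the event is so rare under normal scan activity.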

Fix

Currently, there is no fix for the All Drive Partitions Unavailable problem.

HP engineers are working with their developers to make a change to either the array controller driver or the array controller firmware to address and resolve the problem.

Workaround

The All Drive Partitions Unavailable condition looks to be directly related to any utilities (e.g. CONFIG, LIST DEVICES, and/or SCAN FOR NEW DEVICES) that scan or perform management on a drive or drives. Until a solution is provided, I have recommended (for my environment) that the number of times per day a CONFIG report is executed be reduced to minimize the activity that contributes to the problem.

The current recommendation is to reduce the number of CONFIG reports from seven (7) per day (six (6) text format, one (1) XML format) to two (2) per day (one (1) text format, one (1) XML format).

This is not a solution for the problem, nor does it prevent it from occurring. It simply reduces the likelihood of it occurring and so far has proven to be effective.
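As a back-of-the-envelope check on what the reduced schedule might buy, the numbers in this post can be run through a simple model. The Python sketch below assumes every scan-type command carries the same small, independent chance of triggering the event, that the CONFIG reports were the only such activity in production, and that each of the six commands in TRIGGER.NCF counts the same as one CONFIG report. None of that has been confirmed by HP, and the names used are illustrative.

```python
# Back-of-the-envelope estimate of what the reduced CONFIG schedule buys.
# Assumption (unconfirmed): each scan-type command carries the same small,
# independent chance of triggering the event.


def mean_days_between_events(runs_per_day: float, p_per_run: float) -> float:
    """Mean of a geometric distribution, converted from runs to days."""
    return 1.0 / (p_per_run * runs_per_day)


OLD_RUNS_PER_DAY = 7        # six text reports plus one XML report
NEW_RUNS_PER_DAY = 2        # one text report plus one XML report
OBSERVED_MEAN_DAYS = 90     # average gap on servers with repeat events
LAB_RUNS_PER_DAY = 6 * 24 * 60   # six commands in TRIGGER.NCF, run every minute

# Per-run probability implied by the old schedule and the observed gap.
p_per_run = 1.0 / (OLD_RUNS_PER_DAY * OBSERVED_MEAN_DAYS)

new_mean_days = mean_days_between_events(NEW_RUNS_PER_DAY, p_per_run)
lab_mean_hours = 24 * mean_days_between_events(LAB_RUNS_PER_DAY, p_per_run)

print(f"implied chance per run:      1 in {round(1 / p_per_run)}")
print(f"expected gap at 2 runs/day:  {new_mean_days:.0f} days")
print(f"expected gap in the lab:     {lab_mean_hours:.2f} hours")
```

Under these assumptions the expected gap stretches from about 90 days to about 315 days, and the same model predicts an event in the lab after roughly 1.75 hours, which is in line with the "less than two hours" seen with TRIGGER.NCF. That agreement is suggestive only; it does not show that CONFIG is the sole trigger.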

Additional Information

Additional testing was performed to investigate whether the event could be triggered on other hardware platforms or with Novell NetWare 6.5.

The All Drive Partitions Unavailable problem could not be triggered on a HP ProLiant DL580 G2 with Novell NetWare 6 SP5, nor a Dell PowerEdge 2650 with Novell NetWare 6 SP5.

The All Drive Partitions Unavailable problem was triggered on a HP ProLiant ML530 G2 with Novell NetWare 6.5 SP3 and on a HP ProLiant ML530 G2 with Novell NetWare 6.5 SP3 and the Post SP3 updated SERVER.EXE. It did, though, take significantly longer to trigger the event on NetWare 6.5 than on NetWare 6.0.

BRiAN L. HUNT

marvhuffaker · Sep 27, 2005

That's cool that you're on top of it. I find it odd that this hasn't been seen elsewhere. But at the same time, why do you need to run CONFIG so much? Wouldn't it be safe to say that your server config last week is pretty much the same as it is this week?? Why is it even necessary to run daily? Are you using it to report on drive space usage? That's the only thing I can think of.. Or is the CONFIG process one of those Routines that your company has been running for years, nobody knows why, and nobody dares to stop doing it? Not trying to downplay, but that just strikes me as odd.

I remember a client a few years back that had a company wide mandatory procedure to Reboot the Novell servers once a week, run an Unattended Full DS Repair, etc.. Even though it's not necessary, someone at one time thought it was and they just kept doing it. Since it was a government deal, trying to change it was harder than just going with the flow I guess.

Marvin Huffaker, MCNE

http://www.redjuju.com

brianlhunt · Sep 28, 2005

I was definitely excited to find a method to trigger the event. I too find it odd it has been mainly isolated to where I work. I did find another person with the same issue, but they were a very small server infrastructure and not much help (I am actually helping them more than they have me).

I think you are probably correct about the CONFIG processes being something that has been running for years. The CONFIG reports are collected daily for metrics gathering (i.e. server compliance checks), but running CONFIG six times a day did not really benefit us.

I hope once I have the fix from HP, production support will not revert back to running CONFIG six times a day.

BRiAN L. HUNT

marvhuffaker · Sep 28, 2005

I guess I don't understand what could change so dramatically in 1 day that you'd need to generate a new config file. And any change is going to be invoked manually - like, new hardware, added storage, changed nic, patches applied.. And those will generally require a reboot.. So if the server is just running for 6-8 months at a time, it really shouldn't change at all.

But politics obviously plays a part here, and in that case, even the best logic doesn't matter.

Good luck.
