
Netfinity 5000 Crash - ServeRAID, backplane or drive problem?


Ritosh (Technical User)
I have three IBM Netfinity servers peer-to-peer networked, all running Windows 2000 Pro. Recently (June 26) my Netfinity 5000 (8659-22Y) crashed. I have three 18.2 GB drives (all IBM DNES-318) in an array at RAID-5. There are two additional 18.2 GB drives slotted in (both IBM DNES-318 as well), but they're not part of the array; they've just been sitting there at Ready, as I bought them later and haven't gotten around to working them in yet.

On the surface, indications point to the three drives in the array, as all are now defunct (so the logical drive is Offline), with all three front lights now amber. But I think they've been set to DDD by the server as a precaution because something else has failed (besides, the odds of all three dying at the same time seem a slight stretch). What exactly the source problem is I don't know, although I suspect it's the backplane, or possibly the ServeRAID controller (3L Ultra II, BIOS 6.11.07), although elements of both seem functional.

The scenario is confusing because the system error log and the ServeRAID dump log are unclear to me about the source of the problem. I don't know how to read the error codes in the logs, can't find anywhere on the web that explains them, and IBM support is too expensive to consider. I haven't backed up the server in a while, so I'm eager to rebuild the array if possible, but I'm trying to be methodical about this since I don't want to lose any data.

Up until the crash everything seemed to be working properly -- nothing out of the ordinary to report. The issues that stand out from the logs are 1) that at some point one of the two drives at Ready (the fifth one, at ID 4) was flagged PFA, although I never noticed any error message; 2) according to the system diagnostics log, three days before the crash there were two entries that indicated a ServeRAID controller failure, or "internal error":

Entry Number: 22
Date/Time: 2005/06/23 22:04:40
DMI Type: 08
Source: DIAGS
Error Code: 035-260-499-20050623-57-RAID Interface: Failed
Error Code: CPPRHTS1&2
Error Data: (Adapter in slot 4; internal error)
Error Data:


Entry Number: 23
Date/Time: 2005/06/23 22:03:41
DMI Type: 08
Source: DIAGS
Error Code: 035-260-499-20050623-57-RAID Interface: Failed
Error Code: CPPRHTS1&1
Error Data: (Adapter in slot 4; internal error)
Error Data:

3) on the day of the crash I unplugged and rebooted the system, after which the system log recorded that the hard drive backplane couldn't be found, although I don't know whether that means it's actually dead. The entry seems to say the LED for it failed, but I can't tell whether that points to a problem with the LED or with the backplane itself (maybe someone can tell the difference from the entry):

Entry Number: 1
Date/Time: 2005/06/26 16:07:15
DMI Type: 08
Source: DIAGS
Error Code: 180-357-000-20050626-94-Real-time Status Displays:
Error Code: Failed LED&1
Error Data: (Hard Drive backplane not found)
Error Data:

From what I've combed out of IBM's documentation, if I try to rebuild the array when the source of the problem is actually something else, like the backplane, then I could lose all my data. (But of course the documentation suggests sending the logs in to IBM to decode. Ack!) If anyone can read the error codes, or give me some wisdom on what steps to take, it would be greatly appreciated. Thanks very much.
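
For reference, here is a minimal sketch (in Python, my choice; any scripting language would do) for splitting one of these diagnostics entries into its fields, assuming the entries keep the exact "Key: Value" layout pasted above. It only pulls the fields apart so they're easier to compare across entries; it does not decode the error-code strings themselves, which is the part that apparently needs IBM:

# Minimal sketch: split a pasted diagnostics log entry into its fields.
# Assumes the "Key: Value" layout shown in the entries above; repeated
# keys such as "Error Code" and "Error Data" are collected into lists.
from collections import defaultdict

def parse_entry(text):
    fields = defaultdict(list)
    for line in text.splitlines():
        if ":" not in line:
            continue                      # skip blank/unstructured lines
        key, _, value = line.partition(":")
        value = value.strip()
        if value:                         # drop the empty trailing "Error Data:" line
            fields[key.strip()].append(value)
    return dict(fields)

sample = """Entry Number: 22
Date/Time: 2005/06/23 22:04:40
DMI Type: 08
Source: DIAGS
Error Code: 035-260-499-20050623-57-RAID Interface: Failed
Error Code: CPPRHTS1&2
Error Data: (Adapter in slot 4; internal error)
Error Data:"""

entry = parse_entry(sample)
print(entry["Error Code"])
# ['035-260-499-20050623-57-RAID Interface: Failed', 'CPPRHTS1&2']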
 
catorze replied:

My best guess is that it's not the ServeRAID controller, the backplane, or the HDDs. The HDDs correctly indicate that they are confused (hence the amber lights), and that status was properly communicated from the controller (which doesn't know what happened to the HDDs) through the backplane. Backplanes don't typically fail catastrophically; usually one or more slots will act funny or not function at all. Where does this lead? The SCSI cable. When everything pukes at the exact same time, it's usually the connection between the pieces. Hard parts (boards and backplanes) don't generally fail partway; they fail completely, and if the HDDs are still getting power and lights, the backplane probably isn't it. Besides, the cable is the only flexible part, and consequently the most vulnerable (cheapest too). With the server off and disconnected from power (to stop the ASM monitoring), check that both ends are firmly connected (read: remove and re-attach) and look the cable over for any "new" crimps or cuts while you're in there.
 
Ritosh (OP) replied:

Thanks catorze, your input's much appreciated. I moved this inquiry to the IBM Disk & RAID Solutions thread as I wasn't getting any responses here. It turns out the problem was the ServeRAID card. I'd considered the possibility that the problem was the cable or a connection, but they checked out OK. I replaced the controller and everything fell into place: no data loss, all else checked out fine. Thanks again.
 