
IBM x235 ServeRAID 6i RAID 1 array failure - Replacement Procedure?


Miescha (IS-IT--Management) - Mar 21, 2011
IBM x235 Server
ServeRAID 6i Controller

6 x 146GB physical drives installed
Set up as 3 logical drives / 3 arrays with RAID 1 mirroring

BIOS 7.12.13
FIRMWARE 7.12.13
DEVICE DRIVER 7.12.02 (?????)

User received i9990301 DISK FAILURE OR DISK RESET FAILED error
User pulled physical drive 0 (bottom-most drive) and rebooted.
Physical drive 0 is first drive in logical disk 1 / array A.

User states prompted to mark pulled drive defunct and did so.
User got i9990301 error.
User replaced drive 0 and pulled drive 1 (other drive in logical disk 1 / array A).
User states prompted to mark pulled drive defunct and did so.
User got i9990301 error.

User called me - I booted to the IBM ServeRAID Manager CD and discovered mismatched BIOS/firmware/device driver versions (though that didn't seem to be causing a problem previously). I figure I will resolve that later by bringing everything up to version 7.12.14.

ServeRAID Manager showed array A offline and physical drives 0 and 1 both defunct/offline. I powered down, pulled both drives, rebooted, rescanned, inserted drive 1 and set it online, then inserted drive 0 - and nothing.

I repeated the process but inserted drive 0 first and set it online, then inserted drive 1, and the rebuild started almost immediately.

If memory serves me, I cannot reboot until the rebuild is complete - correct?

Also, even if the rebuild is not successful (i.e., the drive is bad), I'm thinking the system should still boot and run, just with that drive defunct/offline, so array A will show as critical or degraded until I get a replacement drive in place (probably tomorrow) and a good rebuild - correct?

Is there something else I'm missing that could be causing the i9990301 error and boot failure?

Is this the proper way to replace a bad/critical drive given that the user had already marked both drives as defunct?

What is the best practice for replacing a critical drive before it is marked as defunct?

Enough questions - thanks for any assistance.

-Miescha
 
"If memory serves me, I cannot reboot until the rebuild is complete - correct?"

From what I recall - this is NOT correct. You can reboot and it will continue where it left off.


"Is there something else I'm missing that could be causing the i9990301 error and boot failure?"

I would update the firmware on all the hard drives to the latest level. Bad firmware can cause strange behavior and it can kill drives in extreme cases. I would also update the firmware on the controller, as this can fix many issues. Soon after that, or perhaps concurrently (read the instructions for your O.S.), update the driver to go with that new firmware.

"Is this the proper way to replace a bad/critical drive given that the user had already marked both drives as defunct?"

If one drive in a mirror group is bad, then yes, that's what you would need to do. If both are bad in the same array, you are out of luck. See page 139 of this document for the exact procedure.

"What is the best practice for replacing a critical drive before it is marked as defunct?"

I think you're confusing CRITICAL, which relates to LOGICAL drives, with DEFUNCT, which relates to PHYSICAL drives. A defunct PHYSICAL drive will render a RAID 1 LOGICAL drive in CRITICAL condition.

Do some reading, and I'd advise caution about having other people change the states of drives/arrays or pull drives for you.
 
Thanks for the advice, Goombawaho - much appreciated!

I *thought* the process for updating all firmware was done using the ServeRAID Manager boot CD. How can I determine the firmware version on the drives, and how can I determine the device driver version? After a long night of testing hardware, I'm suspecting my problem is connected with these mismatched versions.

ServeRAID manager shows the controller and all drives online and ready, but I still get the i9990301 boot error.

I did mis-state the critical/defunct question - thanks.
 
Look here for updates:


Under "hard drives" for one thing I mentioned.

That's all pretty old stuff. Search around and see if there is a similar NEWER page with newer stuff.

The basics of what it is and what it does:
ftp://ftp.software.ibm.com/systems/support/system_x/ibm_fw_scsi_hdd_v119_anyos_i386.txt
 
Everything I find is definitely pretty old - but that's because the hardware is old ;-) But I guess if it is still working (prior to this) then no one is too eager to spend money in the current economy.

My current question - is there a way I can view the individual drive data for each drive in the array?

 
It's been so long since I fiddled with it that it's not in my head. But somewhere, either in ServeRAID Manager or on the bootable CD, you can right-click on a drive and get that information.
 
Version 1.19b is the latest hard drive firmware for that system. Also, there have been no functional changes to the Windows driver since 6.18; they just altered the .inf to reflect the latest level, though updating it will stop the nags about a mismatch. ipssend is a diagnostic tool to pull logs and get the status of the RAID system if ServeRAID Manager will not see the card or arrays.
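In case anyone wants to script that kind of check later, here's a minimal Python sketch that just shells out to ipssend and dumps what it reports. The getconfig/getstatus subcommands and the controller number are assumptions from memory of the old ServeRAID CLI - verify them against ipssend's own help output before relying on this.

# Rough sketch: dump ServeRAID config/status via ipssend (subcommands assumed, verify first).
import subprocess

CONTROLLER = "1"  # first (and only) ServeRAID 6i controller in the x235

def run(cmd):
    """Run a command and return whatever text it printed."""
    result = subprocess.run(cmd, capture_output=True, text=True)
    return result.stdout

if __name__ == "__main__":
    # Controller, array and physical drive configuration (syntax as I recall it).
    print(run(["ipssend", "getconfig", CONTROLLER]))
    # Current status of logical drives / rebuilds.
    print(run(["ipssend", "getstatus", CONTROLLER]))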

There are several ways to update the ServeRAID firmware: you can use an update from within the OS, or the ServeRAID support CD, which also carries ServeRAID Manager on a Linux load. The third way, which few know about, is BoMC (Bootable Media Creator). If you are familiar with IBM you will have heard of UpdateXpress. This tool lets you create a bootable CD or USB drive to update an IBM server with all of the firmware for that system. It will even download tape drive firmware, but when it tries to apply it and the system doesn't have that option, it will just give an error saying it wasn't able to update. It can be found here.

Also, as a note, just because the array says online doesn't mean that it is bootable or that the data is intact. You could try booting a Windows install CD and see if you can do a repair; I would think that marking 2 drives as DDD and pulling them at different times wiped out the arrays. I would also test the hard drives - this can be done in the mini-config by pressing Ctrl-I when prompted, going to Disks, and running the media test. If they test OK after the firmware update, either attempt to repair the OS install or lay it down again.
 
"I would think that marking 2 drives as DDD and pulling them at different times wiped out the arrays."

Exactly, but it was kind of confusing to understand the drive/array setup from the description. It depends on:

A. Were the drives pulled from the same mirror pair? (If they came from different pairs, then no big deal - each affected array would just be "degraded".)
B. Was the operating system on them, or just data (bootable or not)?

By the way, now is the time to ask for a new server. Power supply failure must be imminent, or worse.
 
"User received i9990301 DISK FAILURE OR DISK RESET FAILED error
User pulled physical drive 0 (bottom-most drive) and rebooted.
Physical drive 0 is first drive in logical disk 1 / array A.

User states prompted to mark pulled drive defunct and did so.
User got i9990301 error.
User replaced drive 0 and pulled drive 1 (other drive in logical disk 1 / array A).
User states prompted to mark pulled drive defunct and did so.
User got i9990301 error."

The fact that it is RAID 1, that there are 2 drives in array A, and that both were set to DDD and then pulled without a drive being set back online in that array, tells me array A is most likely gone. We don't know if the OS resided on that array, but most likely yes.
 
You know what comes to my mind?? Why multiple arrays using RAID 1? In other words:

2 mirrored drives/Array 1
2 mirrored drives/Array 2
2 mirrored drives/Array 3

A RAID 5 setup would have used fewer disks, given more redundancy (with a hot spare) and more disk capacity too.

3 mirrored pairs of 146GB drives = approx 450GB usable

RAID 5 with 6 disks (a five-drive array with one drive's worth of parity, plus a hot spare) = approx 600GB usable
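For what it's worth, here's the back-of-the-envelope math behind those numbers as a quick Python sketch (assuming 146GB drives as in the original post; the "approx" figures above round up from these):

# Rough usable-capacity comparison for six 146GB drives.
DRIVE_GB = 146

# Current layout: three RAID 1 mirrored pairs -> one drive's worth of space per pair.
raid1_usable = 3 * DRIVE_GB          # 438 GB, i.e. "approx 450GB"

# Suggested layout: a five-drive RAID 5 array (one drive's worth of parity)
# plus one hot spare, using all six bays.
raid5_usable = (5 - 1) * DRIVE_GB    # 584 GB, i.e. "approx 600GB"

print(f"3 x RAID 1 pairs  : {raid1_usable} GB usable")
print(f"RAID 5 + hot spare: {raid5_usable} GB usable")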
 
I agree. I don't understand the thinking behind a lot of the setups I see. Like 6 drives, with 2 arrays spanning the same 6 drives in RAID 6 - why not do one RAID 6 array and then make 2 partitions? It would be simpler.
 
Here's the reason: We weren't there, so we don't know how it evolved. Maybe they got 2 drives to start with.
 
You guys are great - sorry I disappeared while repairing the system this week. After getting the drives back in place and working, I had a 'creation date mismatch' on the primary database used by the company (the database file in question resided on the rebuilt array).

Anyway, everything back and running after database restore.

As for the RAID questions I originally posed here . . .

The OS (win serv 2003) was NOT on the failed array - only data. :)

Both drives of the failed array were pulled before I arrived :-(

I determined that it was drive 1 that actually failed by inserting each drive separately and attempting to mark it online. This wouldn't work for drive 1 - only drive 0.

Once drive 0 was online, the array was online but degraded. I inserted drive 1 while in SR Manager and it promptly started a rebuild. Once complete, SR Manager showed all drives and all arrays online and in good shape.

HOWEVER, I was still getting the i9990301 error on boot.

I made sure all the drivers and BIOS versions were matched (they were not previously, but it was working). I made sure the server BIOS was up to 1.17 but still no dice.

I powered off, pulled the controller card, booted into BIOS, reset all to factory default, shut down, inserted the controller card, booted - and still the same error.

I shut down, pulled the controller card, booted to BIOS, flashed the BIOS to 1.17, shut down, installed the controller card, powered up into the BIOS, made sure all settings were correct, powered down, booted as normal . . . and . . . after all the startup sequences . . . Windows Server splash screen and normal bootup - no errors (until I found the database creation date mismatch later, but that's another forum :).

I can't really say why the drive 'failed' or why the rebuild now shows it as fine (I've informed them they need to at the very least purchase another drive ASAP - and yes, I recommended a whole new server).

I don't know why the boot was showing the i9990301 error or why the odd sequence of removing controller, flashing BIOS and reinstalling helped - but it did.

I don't know why they had different versions for the controller firmware, BIOS, and device drivers, or why it was working previously, but it can't hurt to have them all updated and matched.

And for all the final questions regarding why 3 arrays of RAID 1, I have no idea. I know they had 6 drives from the start b/c the original invoice is neatly folded and attached to the side of the server (sweet :). And they have four (yep 4) identical servers sitting in their rack, all set up with the same 3 arrays of RAID 1.

I do know the vendor for their proprietary database software did the setup: they have the OS on one array along with the primary database software and the dataset, an automated backup of the dataset on the second array, and yet a third copy of the dataset on the third array (the one that failed on me). All I can figure is that their software needs or wants that layout for some reason. It was originally set up that way in 2006; then the vendor came out and did a fresh setup in December 2009 (installing larger hard drives all around) but used the exact same layout of 3 RAID 1 arrays.

I hope this thread helps someone else in the future - thanks all for the help!
 
Thanks for responding. A NEW SERVER should be in their future, but if they're tight with money, at least have them keep some spare drives in stock. As old as those drives are, more are likely to fail SOON.

"The OS (win serv 2003) was NOT on the failed array - only data. smile"
I'm not sure this finding would make me smile

"Both drives of the failed array were pulled before I arrived sad"
You should tell them not to pull drives without explicit direction from a qualified source (you). That's very dangerous.
 
Goombawaho:

Should I be concerned that the OS was NOT on the failed array, and why?

On another note, I had an interesting conversation with one of the owners (the one who pulled the drives) and during the conversation he asked about installing SATA drives in this server as a way to get more data space at a lower price. I discussed the various reasons why the current SCSI disks were more reliable even if more costly and smaller than commonly available SATA drives. I discussed rpm speeds and data access.

However, I had to acknowledge (at least to myself) that it was not completely crazy to consider installing a PCI SATA card with a RAID controller and a SATA rack in the 5.25" bays. I've seen a 6-port PCI card (2 eSATA ports and 4 internal SATA headers), and I've seen several racks that hold 3 SATA drives and install in the space of two 5.25" bays (leaving the third bay for a CD/DVD optical drive). This would allow a RAID 5 setup on the three disks.

I haven't given it a lot of thought, but my initial impression is that this wouldn't be best as a complete replacement for the SCSI setup holding the OS and primary data.

However, this might be a good on-site backup option. I'm thinking the rack goes for about $100, plus another $30 for the PCI card, and three 2TB drives at under $100 each - that would give a pretty nice backup location to store multiple backup versions (using whatever flavor of backup software you prefer). While tape or off-site backup would still be needed for disaster protection, this SATA HDD approach would give a nice option for retrieving things like accidentally deleted or corrupted files.

Then again, maybe an external NAS with RAID 1 across two drives would be cheaper.
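Just to put rough numbers on those two options (a quick sketch using the ballpark prices from my post above - they're estimates, not quotes, and I've left the NAS price out since it varies so much):

# Ballpark cost/capacity for the two on-site backup options discussed above.
rack_cost = 100        # 3-drive SATA rack in two 5.25" bays (estimate)
card_cost = 30         # 6-port PCI SATA RAID card (estimate)
drive_cost = 100       # per 2TB SATA drive (estimate)
drive_tb = 2

# Option 1: PCI card + rack, three drives in RAID 5 (one drive's worth of parity).
sata_cost = rack_cost + card_cost + 3 * drive_cost      # ~$430
sata_usable_tb = (3 - 1) * drive_tb                     # ~4 TB usable

# Option 2: two-bay NAS with RAID 1 (mirror) -> one drive's worth of space.
nas_usable_tb = 1 * drive_tb                            # ~2 TB usable

print(f"PCI SATA + RAID 5 rack: ~${sata_cost}, ~{sata_usable_tb} TB usable")
print(f"2-bay NAS, RAID 1     : ~{nas_usable_tb} TB usable (enclosure cost varies)")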
 
"Should I be concerned that the OS was NOT on the failed array, and why?"

I was just thinking to myself - not sure that I would prefer the DATA drives to be lost vs. the operating system drives. In other words, if you DIDN'T have any backup, I'd rather have had the O.S. drives fail. Never mind - you're all past those issues. It was more an internal thought in my head.

My thoughts on changing configuration:
If I were you, I would encourage them to do it this way.
A. Move to some kind of new RAID setup (external or internal) to house their data only.

B. Then take the former data (internal) mirrored drives and make them all hot spares for the O.S. mirror, as a safety net.

OR (more work for you)

Do part A above, then move the O.S. to a RAID 5 using all the leftover drives.

Lots of possibilities. But stress to them that the server is getting old and a power supply failure could be right around the corner, etc.
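And just to show what those internal options mean in raw space (a rough sketch - it assumes the six internal 146GB drives from earlier in the thread, so treat the numbers as illustrative only):

# Usable space on the six internal 146GB drives under the two internal options.
DRIVE_GB = 146

# Option B: keep the O.S. on its RAID 1 mirror and turn the four freed-up
# data drives into hot spares -> one drive's worth of usable space.
option_b_usable = DRIVE_GB                  # 146 GB usable, 4 hot spares waiting

# Alternative: rebuild the O.S. onto a RAID 5 across all six internal drives
# (one drive's worth of parity, no hot spare in this particular sketch).
raid5_usable = (6 - 1) * DRIVE_GB           # 730 GB usable

print(f"Option B (mirror + 4 hot spares): {option_b_usable} GB usable")
print(f"RAID 5 across all six drives    : {raid5_usable} GB usable")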

 