RAID5 Disk Failure


wcuz (MIS)
I've got a PE1500SC with a 3-disk RAID 5 array. Disk 0:0 failed, so I purchased an identical replacement drive. The rebuild on the new drive failed after a few minutes, and the Array Manager event log indicated an error on drive 0:1. I'm running diagnostics on drive 0:1 and they've stalled for just over an hour at 1% complete, so I'm assuming it's bad. So now I've got a failed drive 0:0, and drive 0:1 is online but in an error state. The array is still up on drives 0:2 and 0:1, but I'm thinking drive 0:1 is on its way out. Does anyone know of a way to recover from this without having to rebuild the array and restore data? If not, can someone outline the steps to take to rebuild the array? The controller is a PERC 3/SC.

Thanks
 
If drive 0:1 is still working, I would image the server immediately. Then replace 0:0 and 0:1 and restore the image.

It's odd that you would see 2 drives fail so close to each other, but it can happen. 2 bad drives on a 3-disk RAID 5 is fatal; there is no way to recover.
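For anyone wondering why one lost drive is recoverable and two aren't: RAID 5 parity is just a byte-wise XOR across the stripe, so any single missing block can be recomputed from the survivors, but with two blocks gone there is nothing left to solve with. A toy Python sketch of the idea (illustrative only, not controller code):

# Toy 3-disk RAID 5 stripe: two data blocks plus one parity block (byte-wise XOR).
def xor_blocks(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))

data0 = b"hello, database "        # block on disk 0:0
data1 = b"page goes here.."        # block on disk 0:1
parity = xor_blocks(data0, data1)  # block on disk 0:2

# Lose ONE disk (0:0): its block is just the XOR of the survivors.
rebuilt0 = xor_blocks(parity, data1)
assert rebuilt0 == data0

# Lose TWO disks (0:0 and 0:1): only parity survives, which is one equation
# with two unknowns -- there is nothing left to reconstruct from.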



--
The stagehand's axiom: "Never lift what you can drag, never drag what you can roll, never roll what you can leave."
 
Unless you add a 4th drive and put it in as a hot spare---then you can lose 2 drives. I have seen a second drive fail like that during a rebuild---at times it has been firmware issues (mostly with Compaq DL servers and SCSI drives, but a few Dells too). It could also be the PERC controller failing---I would image that server immediately, like lawnboy says. There was a time when Dell went to third parties for the PERC 3Di and 4Di RAID controllers, and I think the Adaptec ones were junk...they failed drives when you hot-swapped them.

Burt
 
C'mon Burt, at least say that Dell's implementation of Adaptec chipsets was junk. I've had nothing but good fortune with Adaptec-manufactured SCSI hardware for many years; they're my first choice for run-of-the-mill setups.

But even those old PERCs are better than Promise...


--
The stagehand's axiom: "Never lift what you can drag, never drag what you can roll, never roll what you can leave."
 
Thanks for the comments. This is a database server running SQL 2000, so I'll probably copy the databases to an existing SQL 2005 box and attempt to rebuild the failed array. At least then I'll have the database apps back in production and won't risk data loss. Once the PE1500 is back online I can reassess.
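In case it helps anyone later, the usual way to move those is BACKUP DATABASE on the 2000 box, copy the .bak over, and RESTORE on the 2005 box (which upgrades the database in place). A rough sketch of the backup step driven from Python via pyodbc; the server name, database name, path, and ODBC driver string are placeholders, not my actual setup:

# Rough sketch: back up one SQL 2000 database to a file so it can be copied to
# and restored on the SQL 2005 box. Server, database, path, and driver names
# are placeholders -- adjust for your environment.
import pyodbc

conn = pyodbc.connect(
    "DRIVER={SQL Server};SERVER=PE1500SC;DATABASE=master;Trusted_Connection=yes;",
    autocommit=True,  # BACKUP DATABASE cannot run inside a transaction
)
conn.execute(
    "BACKUP DATABASE [MyAppDb] TO DISK = N'D:\\Backups\\MyAppDb.bak' WITH INIT"
)
conn.close()
# Copy MyAppDb.bak to the 2005 box and RESTORE DATABASE it there; restoring a
# SQL 2000 backup onto 2005 upgrades the database automatically.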

Thanks again
 
Lawn---you are absolutely correct! I have had a good history with Adaptec---it may not have been the Adaptec PERCs that were junk---I just know Adaptec was one of the companies Dell used, and I assumed it was the Adaptecs that were junk. It is only Dell's implementation of them---sorry for not clarifying.

Burt
 
I had a client's Exchange server go through this issue a couple of months back. One disk failed, and the rebuild failed when a bad block was detected on another disk. The server would boot but wasn't redundant. After much googling and conversations with Dell, I had to accept the fact that there was no choice but to back up, recreate the RAID array (with both dodgy disks replaced), and restore.

I did both an ARCserve and Windows backup just in case - ended up restoring using the Windows backup as it was less hassle.

Unless you can image the server, you're going to have to reinstall the OS as part of the restoration, so make sure you note down the config (IP settings and the like) before you wipe it :p
 
"It's odd that you would see 2 drives fail so close to each other but it can happen"
Not so uncommon, especially with older RAID adapters, which don't have routines that check the disk surface areas containing no data/free space (routines such as LSI's "Patrol Read"). Most multiple-drive failures in RAID arrays result from multiple errors found in the unused areas of the disks, too many for the controller to handle at one time, as during a rebuild.
I never understood why RAID manufacturers do not allow a rebuild to continue after multiple errors..there should be an option to accept the disk errors and continue in an emergency. RAID adapters fail an array once an infinitesimally small area of a disk surface fails..like letting a house burn down over a fire in an ashtray.
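If it is not obvious what a patrol read actually does: it is just a low-priority background pass that reads every block on every member disk, used or free, so latent bad sectors get found (and remapped) long before a rebuild trips over them. A toy sketch of the idea in Python (nothing like LSI's real firmware, and the device path is a placeholder):

# Toy patrol-read style scrub: walk a member disk end to end, used or free
# space alike, and record any offsets that cannot be read. Real controllers do
# this in firmware at low priority; this only illustrates the idea.
BLOCK = 64 * 1024  # bytes read per step

def scrub(device_path: str) -> list[int]:
    """Read a disk (or file) end to end; return offsets that failed to read."""
    bad_offsets = []
    with open(device_path, "rb", buffering=0) as dev:
        offset = 0
        while True:
            dev.seek(offset)
            try:
                chunk = dev.read(BLOCK)
            except OSError:
                bad_offsets.append(offset)  # latent bad sector found early,
                chunk = b"?"                # before a rebuild has to hit it
            if not chunk:
                break                       # end of device
            offset += BLOCK
    return bad_offsets

# e.g. scrub("/dev/sdb")  -- placeholder path for one member disk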

"Unless you add a 4th drive and put it in as a hot spare---then you can lose 2 drives."

You still cannot lose two drives unless you're running one of the newest RAID implementations such as RAID 6. With a hot spare, a drive can be lost and the hot spare replaces it, but until the array completes a rebuild onto the hot spare, a second drive cannot be lost. A hot spare only offers a (generally) smaller rebuild window and automation of the rebuild.
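Put another way: the spare only buys automation and a shorter exposure window, and whether the array survives a second failure comes down to whether the rebuild onto the spare finished first. A tiny sketch of that timeline logic (simplified, the hours are made up for illustration):

# Simplified timeline logic for a 3-disk RAID 5 with one hot spare: the array
# only tolerates a second failure if the rebuild onto the spare has already
# completed when that failure happens.
def array_survives(first_fail_h: float, rebuild_h: float, second_fail_h: float) -> bool:
    rebuild_done_at = first_fail_h + rebuild_h
    return second_fail_h >= rebuild_done_at

print(array_survives(0.0, 6.0, 8.0))  # True  -- spare finished rebuilding first
print(array_survives(0.0, 6.0, 2.0))  # False -- second drive died mid-rebuild, array lost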

Agree with Lawnboy: image the disk, but do not trust the image completely, as the disk errors may cause the image restore to fail...also back up as Nick suggests.


........................................
Chernobyl disaster..a must see pictorial
 
technome said:
I never understood why RAID manufacturers do not allow a rebuild to continue after multiple errors..

Maybe because if 1 drive is taking the overhead to remap a sector, it could fall out of sync?

Just guessing.

--
The stagehand's axiom: "Never lift what you can drag, never drag what you can roll, never roll what you can leave."
 
True about the hot spare, and with RAID 6 you can lose two drives simultaneously.
In his situation, the drives aren't trying to rebuild because the second drive has failed.

Burt
 
"I never understood why RAID manufacturers do not allow a rebuild to continue after multiple errors.."

A more refined error-correction mechanism needs to be implemented by the RAID industry. The original design of RAID assumed arrays in the low megabytes, say around 10 megabytes, not hundreds of gigabytes. I have to admit that before Patrol Reads I hit a number of multiple-disk failures due to disk surface errors; in the last few years I count myself lucky, as I have only had a couple of failures, which I chalk up to firmware on either the drives or the RAID adapter.

With the older RAID adapters, which are incapable of running a complete surface check on a regular basis, data is checked on the fly by the adapter, but only the data area; a consistency check can be run, but again this only checks the areas holding data. The insane part is that during a rebuild the entire surface of every disk is read, and there is a good chance the RAID adapter will find more errors than it can handle. We would be labeled blatantly incompetent if we performed so poorly.
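The back-of-the-envelope arithmetic behind that: take a quoted unrecoverable-read-error rate and multiply it across every bit the rebuild has to read from the surviving disks. A quick Python sketch (the 1-in-10^14-bits URE figure is a typical spec-sheet number, an assumption rather than a measurement):

# Rough odds of hitting at least one unrecoverable read error while a rebuild
# reads the surviving disks end to end. The URE rate of 1 in 10^14 bits is a
# typical spec-sheet figure -- an assumption, not a measurement.
URE_PER_BIT = 1e-14

def p_rebuild_hits_error(disk_gb: float, surviving_disks: int) -> float:
    bits_read = surviving_disks * disk_gb * 1e9 * 8
    # Chance every single bit reads cleanly, then take the complement.
    return 1 - (1 - URE_PER_BIT) ** bits_read

print(p_rebuild_hits_error(36, 2))    # ~0.6% for two surviving 36GB disks
print(p_rebuild_hits_error(500, 3))   # ~11% for three surviving 500GB disks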
As a note I have used many SCSI disk test utilities on failed array disks, hanging off standard SCSI adapters, and they rarely find any errors even after running continuously for weeks.

Patrol Reads....




........................................
Chernobyl disaster..a must see pictorial
 
I have also found that sometimes on Compaqs (DLs) I can run a firmware update CD, and disks that were down-revved and had failed, as well as the RAID controller (mostly the 5i), get updated and the disks "come back to life". This happened mainly with SCSI 72GB 10K U3 and U320 drives, as well as 18GB, 36GB, and occasionally 146GB drives of the same speed and interface. Mostly on the 72GB drives...

Burt
 
Good point...
I would guesstimate 10-20% of "failures" are the result of firmware glitches, either at the disk or adapter level.


........................................
Chernobyl disaster..a must see pictorial
 
In my experience it was more than half (about 75%, maybe) of the 72GB U3 and U320 drives, but for everything else 10-20% sounds right. I am not sure about other servers, like Dell, but I would like to find out. I work on a lot of different servers, including HP3000, HP9000, DEC, VAX, Alphas, Sun, IBM (mostly X-series), Compaqs, and Dells.

Burt
 