Losing Drive in RAID 5

Status
Not open for further replies.

John0616

Technical User
Dec 15, 2007
28
0
0
US
We have an HP ML350 G5 server with an HP E200i SATA RAID controller and a RAID 5 array of three 1TB drives. For the past year or so, about once per month, one of the drives (usually drive 1, occasionally drive 2) drops from the array with no indication of failure. When we look at the server, all of the LEDs on that drive are simply off. The server itself appears to freeze up, or maybe more accurately slows to a crawl. Generally we can fix the problem by simply pulling the drive and reseating it. The array then finds the drive and the machine continues, most times without even rebuilding the array, as if nothing happened.

We have replaced the controller and several drives, and the problem always seems to come back at some point. We have contacted HP support, which for the most part has not been very helpful. They suggested various firmware updates, which have helped some but have not fully resolved the problem, and for the most part these folks feel that because we can resolve the problem by resetting the drive, there is no problem. Of course, to me it is most annoying to have to come in on my days off, late at night, or at the crack of dawn to reset a drive.

We have had several servers that have run 24/7 for years on end without a single issue. Has anyone seen this type of behavior before, and does anyone have any idea how to correct it?
 
Repeated failures of downstream drives might lead me to believe that the 0 drive was the culprit, particularly since you say the array doesn't rebuild once the drive is reinserted. That suggests to me that the controller is waiting for drive 0 to signal that an operation was completed so that commands can then be executed on the next drive up in the command queue. If these are the only drives on that particular controller bus, we could implicate cable, backplane or controller issues. But if other drives share the bus (beyond the RAID 5 group), it's most likely the 0 drive.
 
A couple of thoughts:

1. Power conditioning. Is the server just plugged into a wall outlet with no power conditioning or do you have something to "clean" up the power?

2. Do you have the latest versions of firmware installed on everything? Array controller, drives, BIOS, etc.

3. Have you had to reboot the server with one of the drives not responding? I'm curious to see what, if any, error messages you get on POST. How about SIM (HP Systems Insight Manager)? Do you have that running anywhere? If so, what does it tell you?



Light travels faster than sound. That's why some people appear bright until you hear them speak.
 
This server is connected to an APC Smart-UPS 1000.

I have downloaded and run the latest firmware update CD distributed by HP, which updated all firmware, including the controller and the BIOS.

The drives are all Seagate model ST31000340AS with firmware revision SD1A.

I have never HAD to reboot the server when any of the drives dropped, although I chose to do so on two occasions just to see what effect this would have.

I shut down the machine and unplugged it for 2 minutes prior to restarting. The first time, the machine started and the drive that had dropped came back online. There were no errors reported in any of the Windows system logs or in the HP ProLiant Integrated Log Viewer.

The second time I tried this procedure, the array failed the drive that had been dropped. I replaced the drive. The array rebuilt. The system ran fine for about a month until one morning I discovered that the machine had once again frozen up: the brand new drive had dropped from the array. I reset the drive by removing and reseating it. Everything was fine...a week later drive 2 dropped from the array...then a few weeks later drive 1 dropped...and so on...

In the meantime I have taken the drive that had previously failed and run every test on it that SeaTools has to offer. The drive has passed every test. I went ahead and reformatted that drive, upgrading the firmware at the same time, and used it to replace drive #3, which did legitimately fail recently (SMART was activated and it failed several read/write tests). It has been a few weeks now and so far there have been no issues with that drive.

Most of my research on this issue seems to point to firmware issues that many Seagate drives had experienced, which from my understanding were supposed to have been fixed by the most recent firmware updates that were installed on these drives. For me, however, I had a drive drop as recently as Sunday, a fine happy-4th-of-July gift, which leaves me less satisfied that this is in fact the only culprit, unless of course the updated firmware is still not exactly right. At the same time it strikes me as odd that it is always the first drive in the array where the problem starts. That particular drive, the hardware itself, has now been replaced twice, and it still exhibits the same behavior. The backplane on the array, which I am pretty certain is also the controller for this model, was also replaced by HP the very first time we had this issue.

Sorry for the long story...hope this helps...
 
Agree with Maultier..
If you have repeated dropouts on a particular drive, even after its replacement, it is very likely that one of the remaining drives in the array, one which has not failed, is causing the dropout (considering the firmware has been updated). I have had this happen a number of times with a higher number of drives per array. It is extremely frustrating.
This happened with arrays which would randomly drop different drives or a particular drive (actually slot). At least it appears to affect a particular drive in your case, and you only have a few drives. In almost all my scenarios, I believe the controller electronics on one drive were the cause: an intermittent issue such as a component going off-spec. It could be a cable, slot, backplane or RAID card, but in all the cases I had, finding and replacing the offending drive resulted in a stable array.
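
One way to tell "drops follow the slot" from "drops follow the drive" is to keep a simple log of every dropout and tally it by bay and by drive serial number. Below is a minimal Python sketch of that idea, assuming the events are recorded by hand. The dates, serial labels, function names and the 60% threshold are all invented for illustration and are not taken from this thread or from any HP tool.

```python
from collections import Counter

# Hypothetical dropout log: (date, bay, drive_serial).
# These values are made up for illustration only.
drop_events = [
    ("2009-03-02", 1, "SN-A"),
    ("2009-04-05", 1, "SN-A"),
    ("2009-05-10", 2, "SN-B"),
    ("2009-06-01", 1, "SN-D"),  # bay 1 again, after the drive was swapped
    ("2009-07-05", 1, "SN-D"),
]


def tally(events):
    """Count dropouts per bay and per physical drive (serial)."""
    by_bay = Counter(bay for _, bay, _ in events)
    by_serial = Counter(serial for _, _, serial in events)
    return by_bay, by_serial


def follows_slot(events, threshold=0.6):
    """Return True when one bay accounts for at least `threshold` of all
    dropouts AND those dropouts involved more than one physical drive.
    The threshold is an arbitrary assumption, not a vendor figure."""
    by_bay, _ = tally(events)
    if not by_bay:
        return False
    bay, count = by_bay.most_common(1)[0]
    serials_in_bay = {serial for _, b, serial in events if b == bay}
    return count / len(events) >= threshold and len(serials_in_bay) > 1


by_bay, by_serial = tally(drop_events)
print("Dropouts per bay:", dict(by_bay))
print("Dropouts per drive:", dict(by_serial))
print("Follows the slot:", follows_slot(drop_events))
```

A True result only says the dropouts stay with the bay when the drive in it changes. It does not say whether the backplane, the cabling, or another drive in the array is behind it.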

"In the meantime I have taken the drive that had previously failed and run every test on the drive that Sea Tools has to offer."
In over 20 years of using utilities to check out drives, I have yet to find one which will reliably find array drive errors; they will pick up obvious issues.
I have had many drives which failed in arrays, were tested continuously for weeks without ever failing (hanging off a SCSI HBA), only to fail once back in the array. The same drives placed on a standard SCSI controller would give years of reliable service, go figure.
PS: this is why I never buy refurb or recert drives, as they are very likely failed/returned array drives which have been tested and sent out again. Good luck


........................................
Chernobyl disaster..a must see pictorial
 