Array testing

Status
Not open for further replies.

teqmod

Technical User
Sep 13, 2004
US
Over the weekend we lost an array twice. Here are the specs of the system:

Dell PE 1750 w/onboard Perc and Raid 1 136GB drives
Powervault 220 Array enclosure
7 x 136GB HDD drives
Adaptec 2120S RAID controller

This machine is an Exchange server.

In the enclosure it showed that we lost 4 drives within the same 30-second period, and we lost the array. It showed that IDs 9, 12, 13 and 14 were lost. We shut down the system and reseated all drives. It came up with drive 12 failed and the array degraded. According to the log, this was the first drive that failed when all the drives went down. We reseated drive 12 again, and the array rebuilt and was up and running like normal.

2 hours after the rebuild we lost drive 5, and when we went to the logs only drive 5 showed as failed. This drive has since been replaced and the array is currently rebuilding. Since this was multiple hardware failures, we grabbed another temporary enclosure, attached it to the onboard RAID controller, configured an array, pulled the Exchange DBs off the problematic array and moved them to the temp array.

The problem now is that I do not know where to go to resolve the issue. I am not convinced it is a simple drive failure, since it has reported different drives each time. Since none of the DBs are on the array in question anymore, it seems fine, but it is also just sitting there not doing anything. Does anyone know of any test software to do read/write tests on this array? Has anyone seen an issue like this before?

 
After using multiple diag programs on RAID arrays over the last 20 years, I have yet to find one which works reliably on RAID arrays, even if the individual drives are placed on standard drive interfaces for testing. Some diags may pick up bad sectors and will pick up very obvious drive failures, but you would be surprised how many drives will pass all tests, with no errors, hanging off a standard disk interface in constant testing for weeks, only to fail once placed back into an array. The only reliable testing is with a drive testing hardware device. That said, it could be any one of the drives which have not been replaced.

This situation could be caused by a bug in the RAID adapter firmware, so it should be the most up-to-date version. Less likely is the hard drive firmware, unless the drives are certain Seagate models with known issues. Mixed firmware revisions across the drives do not help the situation either.

Reseat all cables.

Look at the RAID management software logs; any drive which has soft/hard errors is more likely to be the offending drive than a drive which shows no errors.

Pull the drives out with the power off. Make sure you know which slot each drive comes from; number them with a magic marker. Do any chips on the drive PCBs show abnormal hot spots? Examine each drive's PCB carefully.



........................................
Chernobyl disaster..a must see pictorial
 
I agree in general with technome, but I'll go on to say that the failure mode makes a controller or enclosure related problem far more likely than a problem with one or more of your disks.
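To make that reasoning concrete, here is a minimal Python sketch that groups drive-failure events into bursts by time. The drive IDs and the 30-second window come from the original post; the timestamps are placeholders (the post gives no exact times), and `cluster_failures` is a hypothetical helper, not part of any RAID tool.

```python
def cluster_failures(events, window_seconds=30):
    """Group (timestamp_seconds, drive_id) events into bursts.

    An event joins the current burst if it occurs within
    window_seconds of the first event in that burst.
    """
    bursts = []
    for ts, drive in sorted(events):
        if bursts and ts - bursts[-1][0][0] <= window_seconds:
            bursts[-1].append((ts, drive))
        else:
            bursts.append([(ts, drive)])
    return bursts


# Placeholder timestamps in seconds: only the ordering (drive 12 first,
# four drives inside 30 seconds, drive 5 much later) reflects the post.
events = [(0, 12), (8, 9), (15, 13), (27, 14), (20000, 5)]

for burst in cluster_failures(events):
    ids = [drive for _, drive in burst]
    kind = "correlated dropout" if len(ids) > 1 else "isolated failure"
    print(kind, ids)
```

Several drives dropping in one burst is what you would expect from a shared component (controller, cable, backplane, power), whereas a lone event is at least consistent with a single bad drive.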
 
I am replacing the one drive just because. I am just not very confident in the array at this time and want to test it before I put it back in production. I was hoping there would be some test software, or even disk burn-in software, that I could run through the array to see if any errors come up. I am also considering just moving the array off the Adaptec card and onto the onboard PERC card. This will at least take one of the possible components out of the picture.
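For anyone wanting a rough idea of what a read/write test through the array looks like, below is a minimal Python sketch of a write-then-verify pass. It is an illustration under stated assumptions, not a substitute for vendor diagnostics: all function names are made up for this example, the directory is whatever folder you choose on the volume under test, and unless the test file is larger than system RAM the read-back may be served from the OS cache rather than the disks. A clean pass also does not prove the array is healthy.

```python
import hashlib
import os
import tempfile

CHUNK = 1024 * 1024  # 1 MiB per write/read call


def make_block(counter, size):
    """Return `size` deterministic bytes derived from the block counter."""
    seed = hashlib.sha256(str(counter).encode()).digest()  # 32 bytes
    return (seed * (size // len(seed) + 1))[:size]


def write_pattern(path, total_bytes):
    """Write the test pattern to `path` and return its SHA-256 hex digest."""
    digest = hashlib.sha256()
    written = 0
    counter = 0
    with open(path, "wb") as f:
        while written < total_bytes:
            block = make_block(counter, min(CHUNK, total_bytes - written))
            f.write(block)
            digest.update(block)
            written += len(block)
            counter += 1
        f.flush()
        os.fsync(f.fileno())  # ask the OS to push the data to the device
    return digest.hexdigest()


def read_digest(path):
    """Read `path` back in chunks and return its SHA-256 hex digest."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(CHUNK), b""):
            digest.update(block)
    return digest.hexdigest()


def verify_pass(directory, total_bytes):
    """One write/read-back pass; True if the data read matches what was written."""
    path = os.path.join(directory, "burnin.tmp")
    try:
        expected = write_pattern(path, total_bytes)
        return read_digest(path) == expected
    finally:
        if os.path.exists(path):
            os.remove(path)


if __name__ == "__main__":
    # Demo against a temporary directory; point `directory` at a folder
    # on the array under test for a real run.
    with tempfile.TemporaryDirectory() as d:
        print("pass ok:", verify_pass(d, 4 * CHUNK + 123))
```

A mismatch or an I/O error during a pass, together with whatever the controller logs at that moment, is the interesting result here.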
 
The onboard PERC will not accept the array from the Adaptec; you may cause more problems by trying.
Dell has diags you can download from its support site or run from the disks you received upon purchase.

 
Currently there is no data on the array. Since it failed, all data was moved to a separate array that was temporarily attached. I was thinking of attaching the drive enclosure to the onboard controller, creating a new array and migrating the data to it.

Not sure if the Dell diagnostics will work through the current Adaptec. Getting it out of the picture might be a good idea just from this perspective.
 
Sounds like a good move, literally. It might be good to create the array, let the initialization take place, and repeat the process a number of times; this will stress the drives, heating up the electronics. At least if it is going to fail again, you will have spent very little time finding out.
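The create-and-initialize steps themselves happen in the controller's own setup utility, so they are not scripted here. If the stress step is something you can drive from the OS (for example a long write/read pass over the new volume), the "repeat and stop at the first failure" idea can be sketched as below. `run_cycles` and its `step` callable are hypothetical names for this example only.

```python
import time


def run_cycles(step, cycles, pause_seconds=0.0):
    """Call step() up to `cycles` times, stopping at the first failure.

    step() should return a truthy value on success; a falsy result or
    an OSError counts as a failure. Returns a list of
    (cycle_number, ok, elapsed_seconds) tuples.
    """
    results = []
    for n in range(1, cycles + 1):
        start = time.monotonic()
        try:
            ok = bool(step())
        except OSError:
            ok = False
        results.append((n, ok, time.monotonic() - start))
        if not ok:
            break  # no point continuing once a cycle has failed
        time.sleep(pause_seconds)
    return results


if __name__ == "__main__":
    # Stand-in step that always succeeds; replace with a real stress pass.
    for n, ok, secs in run_cycles(lambda: True, cycles=3):
        print(f"cycle {n}: {'ok' if ok else 'FAILED'} in {secs:.3f}s")
```

Recording which cycle failed and how long it ran gives you something to line up against the controller's event log.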

 