DELL PE1600SC Raid 5 failed - now what?

Status
Not open for further replies.

Curler

IS-IT--Management
Oct 27, 2002
CA
One of the drives went offline 3 days after a motherboard replacement. The motherboard was swapped to replace the on-board SCSI controller, because the PowerVault 110T had been replaced three times since December.

The controller rebuilt the drive with Dell phone help.

Some corrupted files were deleted while we were trying to bring up Windows 2000 Server, but we copied them back to the SYSTEM folder from the DLLCACHE folder and got Windows running for the accounting application that runs there. But, you guessed it, Backup Exec won't run because the Net Logon service won't run. Dell doesn't support this software side of things.

My issue right now is I have no tape backup and the Task Scheduler won't fire at night to take a snap of my accounting data. I run that batch daily then copy the snap files to another system. I'm at the point now where I suspect either registry or security file issues and support folks are nervous about touching this for fear I'll lose what I have.

I propose to build another temporary server with the Win 2000 CD, configure it like the failed unit and copy/install the SQLBASE and data files to it and see if I can get the accounting running there. When that is done, I'll gen the Dell box from scratch and repeat this process.

Does anyone see another alternative?
 
How about a repair install? It generally fixes a multitude of sins, with no damage. The only time I have had problems with an over-the-top repair is when there was major corruption to the system, as in when the server won't even start. I would definitely get the data across to another machine. Will the SQL maintenance back up the databases?

........................................
Chernobyl disaster..a must see pictorial
 
SQLBASE does provide tools to take a snapshot of the databases in use and I am able to run that batch to copy all of the data to another server. This is the same batch that the Task Scheduler is supposed to run every night to allow the tape backup to copy all of the accounting data.
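For anyone following along, the "copy the snap files to another system" step could be sketched roughly like this. It is only an illustration in Python, not the actual batch file: the function names and the example paths are made up, and it assumes the SQLBASE snapshot has already been written to a local folder. The one thing it adds over a plain copy is a checksum comparison, so a bad copy is noticed the same night.

```python
import hashlib
import shutil
from pathlib import Path


def file_digest(path: Path) -> str:
    """Return the SHA-256 hex digest of a file, read in 1 MB chunks."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1024 * 1024), b""):
            h.update(chunk)
    return h.hexdigest()


def copy_snapshot(src_dir: Path, dst_dir: Path) -> list[str]:
    """Copy each file in src_dir to dst_dir; return names whose copy did not verify."""
    dst_dir.mkdir(parents=True, exist_ok=True)
    failed = []
    for src in sorted(src_dir.iterdir()):
        if not src.is_file():
            continue  # subfolders are skipped in this sketch
        dst = dst_dir / src.name
        shutil.copy2(src, dst)  # copy2 also preserves timestamps
        if file_digest(src) != file_digest(dst):
            failed.append(src.name)
    return failed


# Hypothetical usage (both paths are placeholders, not from this thread):
# bad = copy_snapshot(Path(r"D:\snap"), Path(r"\\otherserver\snap"))
# if bad:
#     print("Copy failed verification for:", bad)
```

Run by hand, something like this can stand in for the nightly job until the Task Scheduler is working again.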

I did run the SFC System File Checker utility already. I haven't done a 'Repair Install' before but I suspect it won't touch the registry or any security related files. Fortunately, this server is not the domain controller so at least I don't have that problem. As an update to my earlier idea I'm now thinking of installing SQLBASE on the domain controller and pushing the data to it. This will be a faster way to get the system running but means I'll have to change the V: drive mapping on all the workstations.

I'll give you another issue that is related to this. The system is supposed to have Internet access (Windows updates and McAfee updates) but that access isn't available until someone logs in and uses the data inside the SQLBASE files. After that login, the Internet access is available. The only message is an Event about a bad packet, but that seems to trigger something that starts a required service (possibly part of Net Logon, which won't run, or Workstation, which might not be complete).

As an aside, I've heard more about Raid 5 failures in the last couple of days than I wanted to. I'm questioning the use of Raid 5 because too often it seems to be accompanied by a Windows abort. At least Raid 1 gives me two chances to recover data from a drive - service bureau for the dead drive or the second drive. Raid 5 seems to be either recovery by the controller in the array or nothing AND the likelihood of failure increases with the number of drives. What are your thoughts on this? I have discussed this with three other small business veterans. What is the point of Raid 5 if you have the tape backup anyway? I know mirroring isn't perfect either.
 
Raid 5's safety is coming into question with larger raid arrays. It is not so much the number of drives as the number of blocks; the safety of raid 5 comes into question not only with increasing drive numbers but also with the capacity of the drives involved. More blocks, more chances of an array offlining or failing, especially once an array becomes degraded.
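A rough way to put numbers on the "more blocks, more chances" point is below. This is a back-of-the-envelope sketch, not a measurement: it assumes read errors are independent and uses an unrecoverable read error rate of 1 in 10^14 bits, a figure commonly quoted on drive spec sheets but an assumption here. The function name is made up for illustration.

```python
import math


def rebuild_read_error_probability(drives: int, drive_gb: float,
                                   bit_error_rate: float = 1e-14) -> float:
    """Chance of at least one unrecoverable read error while rebuilding a degraded raid 5.

    A rebuild has to read every block on the (drives - 1) surviving disks.
    bit_error_rate is an assumed spec-sheet value, not a measured one.
    """
    bits_read = (drives - 1) * drive_gb * 1e9 * 8
    # 1 - (1 - p)^n, written with log1p/expm1 so it stays accurate for tiny p
    return -math.expm1(bits_read * math.log1p(-bit_error_rate))


# Same drive count, very different exposure:
small = rebuild_read_error_probability(drives=4, drive_gb=36)    # under 1%
large = rebuild_read_error_probability(drives=4, drive_gb=1000)  # roughly 21%
```

Under those assumptions, the same four-drive array goes from a rebuild risk of under 1% to roughly one in five purely because of drive capacity.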


Looks like raid 6 or an equivalent will become a standard, especially since raid adapter coprocessor speeds are increasing nicely. Personally, if the client has the resources, I am really pushing raid 10 with a global hotspare, especially when databases are involved (for safety and the speed increase).
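The trade-off between these levels comes down to how much capacity is given up for how much redundancy. A minimal sketch, using the standard definitions of raid 5, 6 and 10; the function name and the 6 x 73 GB example are hypothetical, and a hotspare would be an extra drive on top of the counts used here.

```python
def raid_summary(level: str, drives: int, drive_gb: float) -> tuple[float, int]:
    """Return (usable capacity in GB, drive failures the array is guaranteed to survive)."""
    if level == "5":
        if drives < 3:
            raise ValueError("raid 5 needs at least 3 drives")
        return (drives - 1) * drive_gb, 1   # one drive's worth of parity
    if level == "6":
        if drives < 4:
            raise ValueError("raid 6 needs at least 4 drives")
        return (drives - 2) * drive_gb, 2   # two drives' worth of parity
    if level == "10":
        if drives < 4 or drives % 2:
            raise ValueError("raid 10 needs an even number of drives, at least 4")
        # Guaranteed to survive one failure; it can survive more as long as
        # no two failed drives belong to the same mirrored pair.
        return (drives // 2) * drive_gb, 1
    raise ValueError(f"unsupported raid level: {level}")


# Hypothetical 6 x 73 GB array:
for level in ("5", "6", "10"):
    usable, tolerated = raid_summary(level, drives=6, drive_gb=73)
    print(f"raid {level}: {usable} GB usable, survives {tolerated} failure(s) guaranteed")
```

Raid 10 gives up the most capacity, which is why it is usually a "if the client has the resources" recommendation.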

To defend raid 5, I have an extremely low failure rate, but do not have scsi arrays beyond 600Gig in size.

My clients' arrays are maintained with plenty of air flow. Generally I maintain the drives a few degrees above room temperature (most server rooms are at 75 to 80 F), and servers are dusted out regularly. Almost all the arrays have a hotspare.

Where some problems begin...

I am really starting to question Dell's reliability. With larger installations I generally build servers from scratch: Supermicro motherboard, drive chassis, Seagate drives, Lsilogic cards, dual power supplies, etc. No array failures, and some are >7 years old (12 servers); obviously a few drive failures and offlined drives.

With no hotspares, chances are an average array will be in degraded mode at least 24 hours; even a 4 hour service contract is too long.
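To see why the length of the degraded window matters, here is a small sketch of the chance that a second drive dies before redundancy is restored. It assumes independent drives with a constant failure rate, and the 3% annual failure rate is a placeholder rather than a figure from any of these arrays; drives in the same array often fail in correlated ways, so treat the result as optimistic.

```python
import math

HOURS_PER_YEAR = 8766  # 365.25 days


def second_failure_probability(surviving_drives: int, degraded_hours: float,
                               annual_failure_rate: float = 0.03) -> float:
    """Chance that at least one surviving drive fails while the array is degraded.

    annual_failure_rate is an assumed per-drive value; drives are treated as
    independent with a constant failure rate.
    """
    # Convert the annual failure probability into a constant hourly hazard rate.
    hourly_rate = -math.log1p(-annual_failure_rate) / HOURS_PER_YEAR
    return -math.expm1(-hourly_rate * surviving_drives * degraded_hours)


# Four surviving drives: a rebuild that starts at once on a hotspare (say 4 hours)
# versus waiting a day for a replacement drive.
with_hotspare = second_failure_probability(4, degraded_hours=4)
without_hotspare = second_failure_probability(4, degraded_hours=24)
```

In this model the risk scales almost linearly with the hours spent degraded, so a 24-hour wait carries about six times the exposure of a 4-hour rebuild.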

Most raid users do not have much experience resurrecting arrays which can be brought back; panicky techs do things which sign the death warrant.

Firmware is not updated, or as in Dell's case, a faulty firmware is not corrected; I flash Perc adapters to Lsilogic firmware when they come in. About 5 years back Dell had a problem with their modified firmware which was not corrected for at least two upgrades.

Perhaps it is luck, but my standard retail purchased drives seem to have a lower failure rate than OEM drives, same make/model Seagates.

What I am seeing more of in posts lately: arrays which fail due to a disk which has a problem, but which the raid adapter is not failing or offlining. Instead, the problem disk causes other disks in the array to fail or go offline. At some point the problem disk fails another disk, then the raid adapter finally finds a bad block on the problem disk, and the array fails completely. This is one of the nastiest problems, as raid adapter diagnostics and even standard scsi adapter testing of the drives find no problems. When the disk causing the problem is found and replaced, the array stabilizes; I have found no utility which can find these nasty offenders, a major nightmare. To compound the problem, these disks are returned under warranty, recertified, then sent out to customers as replacement disks.


 