Failing DISK in RAID5 array or bad controller, bay?

Status
Not open for further replies.

relikwie

IS-IT--Management
Mar 15, 2002
12
NL
Hello dear list,

I have a problem with a Compaq Proliant server.
It has a RAID5 configuration with 3 18.2GB disks, where one disk is failing all the time. Meaning:

The HDD gets disconnected and the Array utility then warns that a disk has lost power or its cable connection, and asks me to check this. But in fact it has power and is firmly mounted. When I remove the disk (hot swap) and re-mount it, the utility sees the disk and starts rebuilding. The disk is healthy again, but only lasts for a short period before it is not seen anymore.

I just want to find out if this is a faulty disk, RAID5 controller or drive bay, so that I can replace the failing component.

I have to mention that we lost power a week ago and the server had no UPS. The problems started from this point. Had a lot of STOP errors on the NT4 machine.
Defragging the swapfile did help a lot, but I just had another STOP error yesterday and I think it has to do with the failing disk.

Well, really hope someone can help me out.

Kind Regards,
relikwie
 
Been doing raid since '92, one failure since then, due to poor solder connection on a Mylex board in '95, unlikely the raid adapter. Backplane problems.. Dell had some a few years back; backplanes have few electronic parts, very little to go wrong with them.
Any time a problem occurs, ensure your CPU BIOS, raid and backplane firmware are up to date.

When a disk offlines multiple times, REPLACE the disk. Run a consistency check. If you were to test this disk with scsi disk utilities, odds are it will test out as healthy; generally raid adapters pick up disk problems which regular scsi adapters will not (simplified explanation).

Think the key words are "lost connection". As the drive goes offline or is failed, I believe the lost connection refers to drive communication, not power.

Best to get a UPS, as power utility companies do power switching in the wee hours of the morning and on weekends; sags, surges, and other power anomalies occur at these times. If a CPU power supply is not properly designed I have seen raid adapters affected by power problems.


........................................
Chernobyl disaster..a must see pictorial
 
what model

i've seen this before on a few older compaqs
mainly 3000's

i did however see this on an old 370 which i use for testing and it was just buttoned the other week and had this same problem

i'm pretty sure it will be the backplane
 
Hi,

thanks for your time to reply.

I just saw that I made a post on these forums in the past with the exact same problem. I remember that it was resolved by replacing the disk.

technome,
tried to follow you. I can see that RAID controllers wouldn't recognize some disks as healthy and a SCSI controller will. The firmware should be OK, since we have never had a problem. The machine runs a PDC based on NT4 SP6a, and has been doing this for 5 years. You say that a backplane can be ruled out and even the RAID5 controller. I already have a new drive coming from microdrive (a company that clones HP/Compaq disks). When this arrives I'll be sure if it is the disk..

Terry712,

It's an ML350 generation T01. It's an old one (PIII 600, 128MB). Is there a way to be sure the backplane is at fault? Can I rule this out in some way?

Thanks,
relikwie
 
if it's the backplane - then as a general rule
the disk that has failed - if you remove it and then pop it right back in - it will work again for the same amount of time
 
See terry712 chimed in with a possible backplane issue.

As you can see you can not rule anything out.

I know of no better way to rule out the backplane.

For future reference, as to backplanes... (do not do this with your array!!!) with some new raid cards you can move a drive to a different slot on the same raid channel with no problems (drive roaming), which would be a good test. With older technology cards, this would cause immediate, irreversible array failure on startup.


 
Hi guys,

while waiting for the third disk to arrive, another one in the array crashed/disconnected. So I lost all data (I tried to recover, but the array kept saying the volume is not okay or the order of the disks has changed after putting in the newly arrived one).

Well, have been busy recovering. Had just a PDC and no BDC and had to fetch the SAM and security hives from tape and do tricks with the recovery disk etc. to restore. We've had exactly one day of downtime but everything is ok now, recovered almost 90%.

This server is out, don't trust it.

I want to say thanks to you for trying to help.

Now I am going to make a post here about what hardware/software to buy to create a trusty fileserver.
Was thinking about Windows 2000 / 2003 and going with a RAID5 setup again.

Thanks again.
 
Sounds like you hit one of the scariest, but fairly rare raid problems.

One of the worst raid problems I have had is when a disk with a problem is not offlined or failed by an adapter, and never shows errors on testing, either via the raid diagnostic or while connected to a standard scsi adapter for testing. The offending disk proceeds to cause other disks in the array to fail. The two times this has happened to me, the offending disk would intermittently fail a particular disk in the disk group. With the latest one, eventually the offending disk offlined (after months of causing another disk to fail/offline). Once replaced, the raids were OK, still running. A friend at the Minasi website had this happen this last week... only the offending disk failed a particular disk, and then errored out with a block error.. he lost the array. I was lucky both times, as the offending disk did not fail a disk and error out at the same time. I also maintain hotspares on arrays, which keeps the degraded mode down to a couple of hours, which might have saved my butt.

Another server...
Would recommend 2003, with a raid 5 with 5 or 6 disks for performance. Three disk raid 5 arrays are VERY slow; a 5 or 6 disk u320 array will give you almost double the performance, properly set up. For safety I would recommend a hotspare setup, as degraded arrays are susceptible to failure due to a bad block showing up while degraded. 5 or 6 disks is the maximum number before a u320 scsi bus becomes saturated, thus no more performance gains. As a note I worked on a Dell with a PERC4E DI (equal to an lsilogic u320-2 PCI Express) adapter, with 3 disks (not my choice), very nice. The adapter has a coprocessor running at 600MHz. I compared it to my Lsilogic u320-2x PCI-X, 5 disk array..
my array beat it in raid 5, but only by 20%. If the client's array had 5 disks, it would have crushed my setup. Raid 1 on the client's machine was almost twice as fast as my raid 1; the client has Fujitsu 15k disks, which beat Seagate's 15k disks, as the technology is more recent.

I advocate just using one large raid 5 array for the OS and data, no raid 1 for the OS, if the array has 5 or 6 drives.. with two partitions, one for the OS and one for data. I do NOT recommend it but if you only use a 3 drive raid 5 for data then I would advocate a raid 1 for the OS.

 
Hi technome,

yes, I must admit that I do not know the ins&outs of server technology and just hanging on bits and pieces of knowledge.

I was going with a Dell Poweredge 1800, and want to ask you a few questions about how to configure this. You say that 5 or 6 SCSI HDDs in a RAID5 config is the best way to go. I like that, redundancy and performance. But how should I understand this. Like 4 disks making a volume and 1 for parity? (want to go with 5 73GB-10000RPM disks, as I want it to be in budget). Also I can choose between different RAID controllers:

1) PERC4/DC RAID Controller 128MB Cache
2) PERC4/SC RAID Controller 64MB Cache
3) PERC4e/DC RAID Controller 128MB Cache (channels: 1xINT,1xEXT)

I would go for the second one, because of the pricing; the third one is also possible (just 200 euros more). But I just don't understand the differences. Could you tell me?

Then back-up software: we have a license for Backup-Exec 8.0 and I think this will do just fine. Don't know how Dell's "TapeWare" is doing. If it's equal to Veritas's I'll consider ordering it too.

Then there is Dell's management solution via hardware, DRAC4.
Is this useful/handy? If it includes just things that I can also do locally without DRAC4 then I guess I don't need it.

And then the Windows OS, which I also don't know exactly what to do about. These issues trouble my mind atm:

1) have chosen (forced to, because 32-bit CPUs are not listed) a 64bit Xeon CPU. Now Windows 2003 64bit will run ok on it. But can there be trouble with drivers/software etc?

2) looked up the differences between 2003 standard and enterprise and saw that enterprise supports Xeon processors, where in the table of standard version this is not to be found. Seems a bit strange to me.

3) I have 40 CALs for NT4, is it possible to upgrade these to windows 2003 server? :) Guess not, but if you happen to know :)

These are quite a bunch of Q's, but hope someone can clue me in about these issues. I also have to start looking at what AD is all about, since this is new for me. Have successfully administered NT4 for 5 years. Didn't learn much though about other things. Since the rest of the servers run here are unix/linux and have put my time completely in understanding this.

Kind regards,
relikwie
 
I see the 1800 does not have the option for the Perc4ei.
As usual it is difficult to find the specs at the website.. it is a dual channel OEM card from LSIlogic equal to the Lsilogic u320-2E (could possibly be the single channel version), except the card is embedded and uses the motherboard SCSI interfaces. Other than the lack of SCSI on the controller, it is the same as the Perc4E DC, but a lot cheaper. The PowerEdge 2800 offers it. You will also want the 2x4 split backplane, so you can divide the drives as evenly as possible across both channels. Believe the Perc4EI comes with 256 MB cache as a default. The Perc4E DC is a dual channel add-in card, the SC is single channel. Using a single channel, 5 drives will not saturate the scsi bus, but 6 drives will.. basically with the single channel you are at the limit of the controller; any more drives or added array sets will not add throughput, and your arrays will not get the best performance. With a dual channel you could have up to 10 drives before saturation is an issue, versus 5 drives with the SC.
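
As a rough illustration of the saturation limits quoted in this post, here is a small Python sketch. The 320 MB/s figure is the Ultra320 bus rate per channel. The 60 MB/s sustained rate per drive is an assumed, illustrative number that does not come from the thread; it was picked because it reproduces the 5-drive and 10-drive limits mentioned here. Real drives and workloads will differ.

```python
# Rough estimate of how many drives fit on a SCSI controller before the bus,
# rather than the disks, becomes the bottleneck.

U320_BUS_MB_S = 320  # Ultra320 SCSI: 320 MB/s per channel


def drives_before_saturation(per_drive_mb_s, channels=1):
    """Largest drive count whose combined sequential rate fits on the bus."""
    return int(channels * U320_BUS_MB_S // per_drive_mb_s)


# ASSUMPTION: about 60 MB/s sustained per drive (illustrative only).
single = drives_before_saturation(60)      # one channel, e.g. a single channel card
dual = drives_before_saturation(60, 2)     # two channels, e.g. a dual channel card
print(single, dual)
```

With a faster assumed drive the limit drops, which is why the number of useful drives per channel depends on the disks actually fitted.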

With the embedded adapter you will need a separate addin scsi adapter to run the tape drive, the tape cannot be attached to the onboard scsi interfaces.

If you get the addin card, Perc4E DC, order it with 64 Meg, purchase a larger RECOMMENDED memory stick from 3rd party. The memory MUST be from the recommended list or you can have big problems.

Veritas 8.0 is fine, why waste money if you have the license and the experience with Veritas.

DRAC, again no spec, but it is probably another OEM LSI. Heard mixed feelings on using DRACs. If nothing else, being able to reboot a stalled server remotely is great. If remote rebooting is not critical, the 2 free Terminal server connections included with W2K and W2K3 are a great tool; create a remote connection and you can do just about anything to a server. The remote console is just like being seated in front of a server; you can reboot a server, but not if it is truly frozen.

Yes you may have a problem with 64 bit, until all software is rewritten, don't hold your breath. Could you clarify this statement # 1). Major gain to 64 bit is memory access, if you're planning to have tons of memory; if you're on a budget I don't think this will happen.

Standard, Enterprise, EM64T both versions support Xeons

Not sure about the CALs, but I believe not, check.

For Active Directory, and server setup get Mark Minasi's Mastering Windows 2003 by Sybex ( great book, crappy binding). If you did NT4, and have worked on XP machines, you can handle Win2003 with some learning..you will like it, stable, better performance, plenty of new toys to play with.

 
Forgot to answer this...


"I like that, redundancy and performance. But how should I understand this. Like 4 disks making a volume and 1 for parity?"

Raid 5 distributes the parity evenly across all the disks in an array set, so no single disk is "the parity disk"; you still lose one disk's worth of capacity to parity. A couple of raid types do have dedicated disk(s) for parity; I never ran into those types.
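
To make the distributed parity idea concrete, here is a minimal Python sketch. It is an illustration only: the rotation order in `parity_disk` is one common layout and is an assumption, since real controllers may place parity differently, and the four-byte "blocks" are made up for the example. It shows that parity is the XOR of the data blocks in a stripe, that any one lost block can be rebuilt from the rest, and that 5 disks of 73GB leave 4 disks' worth of usable space.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR equal-length byte strings together, byte position by byte position."""
    return bytes(reduce(lambda a, b: a ^ b, col) for col in zip(*blocks))


def parity_disk(stripe, n_disks):
    """Disk index holding parity for a stripe (assumed rotation order)."""
    return (n_disks - 1 - stripe) % n_disks


def usable_gb(n_disks, disk_gb):
    """RAID 5 gives up one disk's worth of space to parity."""
    return (n_disks - 1) * disk_gb


# One stripe on a 5-disk set: four data blocks plus their parity block.
data = [b"AAAA", b"BBBB", b"CCCC", b"DDDD"]
parity = xor_blocks(data)

# Lose any one data block: XOR of the survivors and the parity rebuilds it.
rebuilt = xor_blocks([data[0], data[1], data[3], parity])

# Parity lands on a different disk for each successive stripe.
layout = [parity_disk(s, 5) for s in range(5)]
print(rebuilt == data[2], layout, usable_gb(5, 73))
```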

"Didn't learn much though about other things. Since the rest of the servers run here are unix/linux and have put my time completely in understanding this."
We are all on the same ship, never enough time to keep up, and we all must divide our time..as is I know absolutely NOTHING about Unix, wish I had some experience.

The Perc4 DC I believe is the LSIlogic u320-2x, the same as my raid adapter. The adapter runs optimally on a PCI-X bus at 133 MHz, and has a 400 MHz coprocessor. (the Perc4, a couple years back, was the Lsilogic u320-2 64MHz PCI)

This is repeated above
Perc4E, runs on PCI-Express bus, with a 600MHz coprocessor. The motherboard bus still limits both adapters to a max of a bit over 1 Gig throughput, but the extra 200 MHz coprocessor speed of the Perc4E does speed up raid 5 parity creation, thus the array. If possible go for the PCI express version, either the Perc4E DC or Perc4E DI.

Again the Perc4 DC and DI are basically the same, technically the Perc4 DC should be faster as the controller has SCSI interfaces onboard the card, the data which flows between the raid chips and the inboard scsi interface does not travel on the motherboard. Does it make a difference, maybe a very small difference. The Perc4E DC also has both internal and external scsi connectors, the Perc4e DI only has internal connectors..no big deal, especially since it is a few hundred dollars less.


Lastly, give the 2800 model a good look, I saw a report which gave the 2800 a high rating,( benchmarks). The client's Dell with the Perc4E DI is a 2800, it is a bit noisy, well built. As far as the processor speed, I would not get the absolute newest CPU speed, Dell is a rip, on the newest CPU pricing, go with the next lowest, my client's 2800 has a 3.2 Xeon and it is bloody fast.

Any more questions or clarifications, fire away. Excuse me, as I sometimes have a hard time explaining certain aspects.


 
Hi technome,

thanks for your time again.

Have been busy today so couldn't react sooner.

I have chosen to go with the 2800, including a 5-disk raid5 and a raid1 based on two disks for the OS. The RAID controller is "PERC4e/DC U320 RAID Controller (128MB cache) (channels:2xINT/0xEXT)" and there is also an extra SCSI controller for the tape unit. To achieve much greater redundancy, I have thought of also getting a simple PowerEdge (800 or so) and having it replicate with the 2800. Don't know if this is a wise idea or not.

Windows 2003 server has an option, someone told me. To replicate realtime with a second server.

I have an idea of how AD works. It should make managing XP workstations really easy. Especially pushing out packages and manipulating the registry. There is a lot to read and learn there for me. But I'm having fun with it already.

Hope the server will arrive soon, so I can start setting it up. I plan to migrate a user at a time. And to leave the others connected to the temporary solution (an *old* workstation that is PDC now).

Well, heading home sweet home to enjoy a hot-weathered weekend. Same to you.


-relikwie
 
An 800..
Not a bad idea, a cheap server as part of a backup plan, you can get away with two scsi or two SATA drives mirrored, low amount of ram. I use the same system at client sites, kick ass server for the FSMO, and a cheapy as a secondary DC. As a secondary DC which the FSMO replicates AD info to, I mirror, as a DC is important and a rebuild is more involved than a member server, not nearly as involved as an FSMO though.
You probably know already but a tape backup on an FSMO is critical, FSMO problems are involved.

"Windows 2003 server has an option, someone told me. To replicate realtime with a second server"
At what system cost! File replication at present is at the file level, for the most part: bandwidth intensive, a joke! But wait, MS is releasing distributed file system replication (DFSR) soon, block level replication. Only changes within a file are replicated, the way it should be.
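
To show why replicating only the changed parts of a file saves so much bandwidth compared with file-level replication, here is a simplified Python sketch. It is not the DFSR algorithm: it compares hashes of fixed-size blocks at fixed offsets, which is a swapped-in simplification, and `BLOCK_SIZE` is a hypothetical value chosen for the illustration. An insertion that shifts the rest of the file would defeat this naive comparison.

```python
import hashlib

BLOCK_SIZE = 4096  # hypothetical block size, chosen for the illustration


def block_hashes(data):
    """Hash each fixed-size block of a file's contents."""
    return [
        hashlib.sha256(data[i:i + BLOCK_SIZE]).digest()
        for i in range(0, len(data), BLOCK_SIZE)
    ]


def changed_blocks(old, new):
    """Indexes of blocks in `new` that differ from, or extend past, `old`."""
    old_h, new_h = block_hashes(old), block_hashes(new)
    return [
        i for i, h in enumerate(new_h)
        if i >= len(old_h) or h != old_h[i]
    ]


old = bytes(10 * BLOCK_SIZE)          # a 40 KB file of zeros
edited = bytearray(old)
edited[5 * BLOCK_SIZE + 7] = 0xFF     # one byte edited inside block 5
new = bytes(edited)

to_send = changed_blocks(old, new)
# File-level replication resends len(new) bytes; block-level resends far less.
print(to_send, len(to_send) * BLOCK_SIZE, "bytes instead of", len(new))
```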

Look into Executive software's "Undelete", I consider a MUST have, "system restore" is limited. Diskeeper for automated defrag.

With the setup you are getting you will not believe the speed difference!!!!!!!!!!!!!!
Good you went for the Perc4E DC, would be my choice, but I thought you were on a tighter budget.
The 18 Gig drives were miserably slow in raid setups.. on a couple of arrays, just going to the 36 Gig drive range more than doubled the speed. New server rough speed increase.. something is seriously wrong if you do not get >5X the speed on the array setup. Once you get it, I can give you the OS/raid parameters for optimum speed if you want. One of the main things you want to do is turn off SMB signing, if it is within your network security parameters, and play with "flow control" on the network interface. On the client's 2800 server, the network performance was beyond terrible until flow control was disabled; it depends on your network equipment.

Same, hot in N.Y.C

 
Hi technome,

hope you are still around.
I've been very busy with helpdesking, patching and solving issues. While the new server sat unpacked for weeks, I finally sat down with it. Meanwhile reading up on AD etc. First, the server came with all HDDs as one RAID0 volume, while I ordered it as RAID1+5. Well, doesn't matter, gives me the opp. to do it myself and know how it is done. Ctrl+M came in handy and I made 2 HDDs RAID1, 4 HDDs RAID5 and one HS. The gui needs some ATTN. but works. Then reinstalled w2k3, dcpromo, DNS setup. All fine. Then, not wanting to use ordinary shares, I set up DFS.
It works (As you mentioned above, there is FRS that does "file-incremental"? replication - must say in the unix realm there has always been the famous rsync protocol that one can use on windows). Haven't had any problems setting up AD, OUs etc. Still am brand new to this, but catching up steadily. One big problem I have is with the server and OS being 64bit. Lack of printer drivers and one important tool that doesn't work on x64, gpmc. Blame MS for this, because I don't know how to edit GPOs that are linked in OUs.

Do you know of a tool (commercial/free) that can replace gpmc?
Also, DFS should eliminate the use of "\\servername", but this won't work on joined clients. On the server I can map drives with "\\domainname\..". But clients can't. I think I have misconfigured something or do not understand things.

Well, have set it up and it is live for a couple of users, not all. If I need to change something I have to do it now.
Still have issues about how to do drive mappings (vbscript, kix, batch).


Any tips would be great.

relikwie
 
You're doing well

You are correct, DFS does block level replication. Sorry, have not set it up in a while

Did you run DcDiag and NetDiag commands with the /V switch, and correct any errors ?

DNS.. in your forwarders list, add a DNS server which does not belong to your ISP, just in case, your ISP changes IPs or their servers go down..happened to me a couple of times
Check off "Do not use recursion" in the forwarder tab, keeps DNS a little more secure.

With the raid array sets, use the Amcli command to schedule weekly consistency checks through the windows scheduler, off hours if possible. This is one of the best protections for arrays, as it checks all the blocks on an array set for errors (and corrects them), which can build up over time, leading to array failure if multiple read failures occur in a short period of time.
Amcli.exe is within the Dell OpenManage directory.
The command line amcli.exe /c1/e0/v should work for the first volume on an array; each volume changes the command line. Amcli /? gives help.
Amcli.exe should reside in C:\Program Files\Dell\SysMgt\Array Manager
Try to run multiple consistency checks on the volumes before you go into production. You might want to run Dell disk diags too, if possible.
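
As a sketch of the weekly scheduling described above, the Python snippet below only builds a `schtasks /create` command line and prints it; it does not run anything. Several things are assumptions: the task name is made up, the amcli arguments are copied unverified from this post, and the `HH:MM:SS` start time format is what Windows XP/2003 era `schtasks` expects (later Windows versions take `HH:mm`). Check `schtasks /create /?` and `amcli /?` on the actual server before using it.

```python
import subprocess

# Path as given in the post; verify it on your own install.
AMCLI = r"C:\Program Files\Dell\SysMgt\Array Manager\amcli.exe"


def weekly_check_command(task_name, amcli_args, day="SUN", start="02:00:00"):
    """Build a schtasks argument list that runs amcli once a week."""
    task_run = f'"{AMCLI}" {amcli_args}'   # quote the path: it contains spaces
    return [
        "schtasks", "/create",
        "/tn", task_name,      # hypothetical task name
        "/tr", task_run,
        "/sc", "weekly",
        "/d", day,
        "/st", start,
    ]


# ASSUMPTION: "/c1/e0/v" is the argument string quoted in the post.
cmd = weekly_check_command("RAID consistency check", "/c1/e0/v")
print(subprocess.list2cmdline(cmd))   # paste into a command prompt on the server
```

One task per volume would be needed, since each volume changes the amcli command line.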

Set up disk cleanup and defrag to run every day from the scheduler, better yet get Diskeeper or Perfect disk for the defrag, as the windows defrag is not that great, the ability to do a boot time defrag comes with the two defrag programs.

gpmc substitute, do not know of any, but there must be one, at a price. With gp setup take it slow and document. The GP editor could be a lot better, as it is not easy to find all the parameters which are changeable without digging down, nor does GP cover all the settings which affect servers and workstations.

If you have another low end server, think about installing WSUS, a great patch updating tool. I would not put it on a fast server, because IIS eats memory and complicates the server. Patch your servers manually, one at a time, with testing, as patching has become Patch and Pray in the last few months.

I install Spybot, Adaware, and Spywareblaster on servers.. I add the spybot host entries to the server hosts file.

Install Terminal services for remote administration, great tool, simple setup.. though I recommend access thru VPN

There are some great people on the Minasi.com forum for AD DFS questions etc.



 
Hi,
hmm. This forum needs a proper reply option, quoting text etc. Well, moved my attention to other issues at work and left the new server as it is. Have a lot of work to do; because we are moving to another building I want to take this opp. to change the network/server config. Just done 11Mbit WL coverage over the whole warehouse, want to extend this to the office too - saves patching and eating cat-cable. Hope boss does ACK on this. Also ordered a new low-budget dell server (3xSATA raid5) which will be BDC and maybe WSUS.

Just did start-run dcdiag & netdiag yesterday in a hurry. Seems there are no such tools on the system. Have to take a better look at it tomorrow. But you are right, forgot about checking disks and raid functionality, which is important. Know we had 2 AIX boxes where the SSA disks were checked every hour for errors.

Technome, I've had issues too with failing DNS machines at home. My ISP "@home" had these failing. After a couple of times, I asked an IRC chat friend, who is a service provider, if I could use his DNS machines. Since then configured ip settings statically and never had trouble with it, until they changed my ip :). On the server I have set up 2 forward DNS's, which are geographically spread on our corp. network. And also added recursive zone, with secure dyn adding of client ip's. Still don't grasp the security here.
Meaning only clients on "DOMAIN" get added? I have used, and still use, 2 DHCPD servers that do fail-over and 2 BIND servers slave/master. Also with recursive zone and dynamic host update for the recursive zone. The security here is handled through SSL, DHCPD server contacts DNS server with a key. My wish is to keep this dhcp/dns config, which runs on linux boxes, and make AD work with this. Seems possible, needs work and understanding. An issue for later.

Thanks for the amcli tip, will do that. Had to wipe Dell's pre-installation. So lost Dell openmanage tools. And, the disks shipped with the Dell server are not supported on x64.. Yawn, need to call tech supp. on this. You've mentioned the HW openmanage thingy on Dell servers and I ordered one without it. Does this mean I can do this by software? Couldn't install the tools afterwards and have no idea of what openmanage offers and is. First tech-support.

Many times I've read or heard that ntfs doesn't need defrag. Still don't know if it's true or not? Do I really need it scheduled every day? Speed thingy?

gpmc is a must have, need to get it to work. Seems a path issue, because gp works and gpmc works. Only the execution of gp from within the gpmc mmc spawns an error msg. Found a workaround on MS site. But the paths they mention do not match and as I predicted the workaround doesn't work. Posted a question on a gpo site.

I have MS-SUS still running (don't laugh: on a pentium 2 workstation along with EPolicy from McAfee) Damn slow, but works :) And will move to WSUS, must have. Will take care of patch testing too :)

Would I need anti-spy tools on a system that wouldn't connect to the net?

I'll go and figure some things out these weeks.

Thanks for your time, Technome.
For a moment I thought you were Dutch, cause I saw a post of yours mentioning tweakers.net.


rel
 
Did not mean to condescend in any way in the posts above.. on the forums you do not know how much experience posters have.

BDCs are good insurance, I use the "cheapy" server route also.

DcDiag and NetDiag are in the "support tools", which are installed separately from the Windows 2003 CD.

Dns security with "Do not use recursion"
This forces queries to only go to the DNS servers listed in the Forwarder tab. If the check mark is not used, your DNS server, upon a failed query at your ISP's DNS server, will be able to recursively query any DNS server on the Internet. Preferably you do not want this, as an untrusted DNS server can introduce a virus, intentionally or not; rare but possible. Worry not, a decent sized ISP's DNS servers do not fail queries.
Windows will live gracefully with BIND, though I have not run into any clients using it in years, I work on smaller networks, for the most part.

"Had to wipe Dell's pre-installation."... good riddance, I prefer to wipe the Dell pre-install out.. you end up with a much cleaner install with fewer problems. I reinstall only the absolutely necessary Dell garbage, I mean programs. I use LSIlogic's utilities, Winrc or better still the global array manager GAM; Dell emulates this, the command line executables are different though.

Nothing wrong with SUS except for the limitations, I have only updated to WSUS on one server, no time to do other clients yet.

Antispy on servers which do not connect to the Internet.... Who knows what sinister malware/virus is out there that could jump to a server; also trusted sites do get infected.. I do use the server to download drivers and updates only. Mainly it is a precaution because some of my clients have accessed sites from the servers, some think they can admin a network... gets dangerous.

Dutch, no, but FemmeT at the tweakers.net site has done some phenomenal raid testing, and has loads of test graphs.

 