
PE 600SC crashing/rebooting after power problems


shaferbus (MIS)
We have a Dell PowerEdge 600SC running Small Business Server 2000 SP4, with a CERC ATA100/4ch RAID controller and four attached drives (RAID 5 plus a hot spare).

After almost a week of spontaneous lockups, shutdowns, and pulling out of hair, we determined that we had a defective UPS that apparently wasn't delivering sufficient voltage to the server. It eventually got to the point that I couldn't get a problem-free startup!

I won't go into detail about the troubleshooting process that led us there (unless someone thinks it relevant), but I had lots of duplicate hardware, and now the only things that have not been replaced are the system RAM (I'm rotating out sticks each time it crashes to see if that has any effect) and the drives in the RAID array.

After plugging into a different UPS, the server starts normally and runs for hours (3 to 12), but eventually crashes abruptly... of course, when no one is there to read any blue-screen messages. According to the Dr. Watson log, there is always an Access Violation (c0000005), but NOT always in the same process.

Does anyone have any insight on this, or suggestions for my next step?
 
OK, more info...
After wading through the Dr. Watson log, it turns out that I was overly optimistic. Only in one instance does the Access Violation error coincide with the time of the "unexpected shutdown" listed in the system log! So, I don't even have that little bit of information.
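
For anyone who wants to automate that comparison instead of wading through the log by hand, here's a minimal sketch (Python). It assumes the default drwtsn32.log location and a "When: M/D/YYYY @ HH:MM:SS" line format, and the shutdown times are typed in by hand from Event Viewer - check your own log, since the path and format may differ.

# Minimal sketch: pull exception timestamps out of drwtsn32.log and flag the
# ones that fall near an "unexpected shutdown" time copied from the System
# event log. Path and "When:" format are assumptions -- check your own log.
import re
from datetime import datetime, timedelta

LOG_PATH = r"C:\Documents and Settings\All Users\Documents\DrWatson\drwtsn32.log"

# Shutdown times copied by hand from Event Viewer (hypothetical values).
shutdown_times = [
    datetime(2004, 1, 12, 3, 15),
    datetime(2004, 1, 13, 22, 40),
]

when_line = re.compile(r"When:\s+(\d{1,2}/\d{1,2}/\d{4})\s+@\s+(\d{1,2}:\d{2}:\d{2})")

exception_times = []
with open(LOG_PATH, "r", errors="replace") as log:
    for line in log:
        match = when_line.search(line)
        if match:
            exception_times.append(
                datetime.strptime(match.group(1) + " " + match.group(2),
                                  "%m/%d/%Y %H:%M:%S")
            )

# Anything within ten minutes of a logged shutdown is worth a closer look.
window = timedelta(minutes=10)
for crash in shutdown_times:
    near = [t for t in exception_times if abs(t - crash) <= window]
    print(crash, "->", near if near else "no matching Dr. Watson entry")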

It seems to me that this must be a hardware problem, because there isn't anything relevant in the event logs prior to the crash.

My fear is that the bad UPS ruined my spare hardware too, since I did all of the hardware replacements before we thought of the UPS. :( My first instinct was that even though the UPS was replaced, it's still a power problem, so I put a brand-new PSU in, but no change.
 
Further...

Fortunately, I was at last able to get BSOD info from a couple of crashes. In each case, it was:

STOP: 0x00000077 (0xc0000185, 0xc0000185, 0x00000001, 0x0009d000)
KERNEL_STACK_INPAGE_ERROR

Although the BSOD said it was starting a dump of physical memory, there is no .dmp file to be found :(
The paging file IS large enough to handle the physical memory, but still no dump.
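
One thing worth double-checking before chasing the missing dump further is the dump configuration itself - and, if memory serves, the full dump gets written through the paging file on the boot volume, so only the C:\ pagefile matters for it. Here's a minimal sketch of which registry values to look at (modern Python/winreg, so treat it as an illustration rather than a tool for the SBS 2000 box itself); the CrashDumpEnabled meanings shown are the usual ones, but verify them against Microsoft's documentation.

# Minimal sketch: read the crash-dump settings that decide whether MEMORY.DMP
# gets written at all. These are the same values the Startup and Recovery
# dialog edits. CrashDumpEnabled: 0 = none, 1 = complete, 2 = kernel, 3 = small.
import winreg

KEY_PATH = r"SYSTEM\CurrentControlSet\Control\CrashControl"

with winreg.OpenKey(winreg.HKEY_LOCAL_MACHINE, KEY_PATH) as key:
    for name in ("CrashDumpEnabled", "DumpFile", "MinidumpDir", "Overwrite"):
        try:
            value, _ = winreg.QueryValueEx(key, name)
            print(name, "=", value)
        except FileNotFoundError:
            print(name, "not set")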

According to Microsoft KB article 228753, the second parameter in the STOP message is the "I/O Status Code", which translates to:

"0xC0000185, or STATUS_IO_DEVICE_ERROR: improper termination or defective cabling of SCSI-based devices, or two devices attempting to use the same IRQ."

Remember, this is an IDE RAID controller, which just APPEARS to Windows to be SCSI, so there is no cable termination to check, but I replaced all of the drive cables anyway.

Going through the troubleshooting steps outlined in the MS document:
I scanned for boot-sector viruses with up-to-date McAfee.
I reseated all components (except the processor).
I've run chkdsk /f /r on both virtual drives.
I had one paging file on each virtual drive and have removed the one on D:\. Tonight I'll run another chkdsk on D:\, then reverse things and do the same on C:\. That should resolve any bad blocks in a paging file, shouldn't it? (See the sketch after this list for confirming which volumes still hold a paging file.)
I've got one more memory stick to swap out - so far no improvement.
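
And the sketch referred to in the paging-file step above - a quick way to confirm which volumes still hold a paging file after shuffling them around, by reading the same multi-string value the Virtual Memory dialog edits (again modern Python/winreg, run with admin rights; an illustration of where to look rather than something for the 2000 box itself):

# Minimal sketch: list the configured paging files ("path initial-size max-size"
# strings) so you can confirm the D:\ pagefile really is gone before the next chkdsk.
import winreg

KEY_PATH = r"SYSTEM\CurrentControlSet\Control\Session Manager\Memory Management"

with winreg.OpenKey(winreg.HKEY_LOCAL_MACHINE, KEY_PATH) as key:
    paging_files, value_type = winreg.QueryValueEx(key, "PagingFiles")
    # REG_MULTI_SZ comes back as a Python list of strings.
    for entry in paging_files:
        print(entry)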

I have yet to try disabling system caching in the BIOS, but I don't hold out much hope, because why would a BIOS setting suddenly start causing errors if nothing else is defective?

So, according to the MS document, that brings us to a bad Mobo or drive controller in BOTH machines (assuming the memory and paging file stuff works out).

Does this seem likely? Does anyone have a best guess on which component to start hunting down?

Thanks...
 
Well isn't this nifty!

Just started the chkdsk /f /r of the D:\ volume as described above, to see if there are any bad blocks where the pagefile was located... and the RAID controller starts beeping away - failed drive!

Fortunately I have a hot spare in the array, and it's rebuilding, but that doesn't make it any less stressful with chkdsk AND a drive rebuild running at the same time!

I'll post again with results...
 
It's Wednesday morning now, and so far, so good! For the first time in a week the server has been up for more than 24 hours without crashing.

The chkdsk results for the D:\ drive said:

"Correcting errors in the Volume Bitmap. Windows has made corrections to the file system", although it showed 0 kb in bad sectors.

The RAID controller has finished rebuilding the failed drive and shows the array as healthy. We'll see what happens.

What puzzles me now is why the controller didn't fail the drive before this, and why it did so during the chkdsk scan. Does chkdsk skip the paging files like Disk Defragmenter does?
 
When a drive's electronics start to fail, you can get situations where the RAID adapter doesn't fail the drive at all, or only fails it after a period of time, unlike disk surface errors, which the adapter catches more reliably. A RAID adapter will fail a drive with electronics problems under certain conditions, but with intermittent failures or components drifting out of spec, it may not.
The worst scenario with drive component failures is that the offending drive can cause false failures of other drives in the array.


