Mystery server "crash"

Status
Not open for further replies.

ILW

MIS
Jul 1, 2003
72
GB
Hi all. Got a puzzle for you - not much to work on I'm afraid.

NW5.1 SP6 running on Compaq ML370 (cutting edge stuff!)

Server has been stable for the last 5 years (no s/w updates for at least 2 years). Build is identical to 3 others on same site & 80+ others worldwide.

3 times in the last 3 days (different times of day) any drives mapped to the server volumes have disconnected & it's not been possible to reconnect until after a reboot.

The server doesn't immediately drop from the network - typically ping and even Rconj work for 15-20 minutes after drives are disconnected, then they too fail. Nothing in the logs or from CIM.

We've run Compaq diags overnight - completely clean - and also reseated all cards/RAM.

May not be related, but we were getting the server hanging during backups of the Sys4 volume - excluded various folders from the backup without finding any particular issue. Eventually switched that volume to Differential-only backups (we kept the last 2 successful Fulls very safe) and the problems stopped. As I say, may be unconnected.

So, any ideas what I try next? I'm trying to get the server written off but that's 6-8 weeks away at best. Thanks in advance.

Ian
 
A ML370 (G1 I presume) is more than enough to run NetWare 5.1.

Have you considered that the disconnections may be network related and not server related? It might be worth investigating items from the network card, cabling, switch port(s) etc... just to eliminate them from the issue.

When you say the server hangs, do you mean that it freezes and requires a reboot to resolve? Can you elaborate on this more?

Apart from that, and as a last resort, you could swap the processor and memory (not at the same time) with another of your servers to see if the problems move to the other server. Compaq diags might not pick anything up, but that may only mean the fault does not occur while the diags are running.

--------------------------------------
"Insert funny comment in here!"
--------------------------------------
 
I had a very similar problem in the past and it was hardware related; in my case it was a damaged SCSI cable, if I remember correctly.

I know you said you have run diags, but I would still think of a hardware problem initially.

Paul
MCSE


"Two things are infinite: the universe and human stupidity; and I'm not sure about the universe."
Albert Einstein
 
fwiw it's actually an ML370 G2. As far as the hanging, as I said first anyone with a drive mapped to the server loses the connection & is unable to reconnect. The server still pings, and Rconj still works.

Then, as much as 15 or 20 minutes later (if we leave it that long) ping fails. At that time, if you type at the console, it never comes back to the prompt & needs a power reset.

I'm also thinking hardware, hence reseating the cards. Client SLA doesn't allow me to poach parts from other servers, but I'm trying to persuade The Management that we need spares if we're going to diagnose/fix this.

Thanks for replies!
 
when you rcon it,
what's the processing etc. like?
is it getting crippled or just doing nothing?
 
You need to load up the Compaq agents (HP Proliant Support Pack) and make sure they are running. Then you may be able to get some diagnostic information when it hangs from some of the system messages (CPQIML.NLM)

It's also possible that if you are running very old agents, that could contribute. You should be up to date on these anyway. You also need the latest service pack for NW5.1. SP6 had its share of problems and I'd have no problem going to SP8.

It really sounds like your processor on the server is pegging out, which is why client connections fail but pings still work. Pings are very simple packets while client connections are pretty complex.

Marvin Huffaker, MCNE
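
One way to pin down the gap described above (pings still answered while client connections fail) is to log both checks side by side with timestamps. The following is only a minimal sketch, not something from this thread: it assumes clients reach the server via NCP over TCP/IP on port 524, the address 192.0.2.10 is a placeholder, and the names tcp_probe, classify and watch are hypothetical. A successful TCP connect only shows the listener is accepting connections, not that a drive mapping would work, so treat the result as a rough indicator.

```python
import socket
import time
from datetime import datetime

HOST = "192.0.2.10"   # placeholder address (documentation range); replace with your server
NCP_PORT = 524        # assumption: clients use NCP over TCP/IP, not IPX only
INTERVAL = 30         # seconds between probes (arbitrary choice)


def tcp_probe(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port opens within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # covers refused connections, timeouts and unreachable hosts
        return False


def classify(ping_ok, service_ok):
    """Turn the two probe results into a short state label for the log."""
    if ping_ok and service_ok:
        return "healthy"
    if ping_ok:
        return "service down, host still answering"
    if service_ok:
        return "ping lost or filtered, service up"
    return "host unreachable"


def watch(host, ping_check, port=NCP_PORT, interval=INTERVAL, rounds=None):
    """Print one timestamped line per probe.

    ping_check is any callable taking the host and returning True/False,
    for example a wrapper around the operating system's ping command.
    rounds=None means run until interrupted.
    """
    done = 0
    while rounds is None or done < rounds:
        state = classify(ping_check(host), tcp_probe(host, port))
        print(f"{datetime.now().isoformat(timespec='seconds')} {host} {state}")
        done += 1
        if rounds is None or done < rounds:
            time.sleep(interval)
```

Left running from a workstation on the same segment, the log would show when the state moves from "healthy" to "service down, host still answering" and then to "host unreachable", which can be lined up against backup schedules or agent activity.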
 
CPU does not go high - Monitor shows nothing out of the ordinary 5-10%. The agents are vintage Smartstart 5.50 - we try to keep them the same globally, and not touch unless broken! Marv - I hear your comments about SP8 & I'm pushing for it with no success (Yet). I will try CPQIML - not something I've used before. fwiw - CIM reports nothing during these fails.

Thanks again to all.
Ian
 
i would look at the agents first
i have seen this kind of thing before on some of the Compaq agents - especially cpqdasa etc.

if it is disappearing constantly like this and you are local - you could try remming the cpqhealth and cpqsnmp out or unloading them in the short term (as long as you monitor carefully - ie failed disks etc - no alerts)

you say you don't want to change much - which is fair enough - but the reality is - IT IS BROKEN

 
Terry, I know that, you know that, even The Management are starting to work it out ;-)
 