Server goes unresponsive, kybd dead

ITsmyfault · Nov 2, 2004

I have a situation going on where various servers at various times (but always during backup) will go unresponsive. They don't abend; they sit at a console screen with the cursor blinking, but they can't be pinged, cluster services sees them as failed, and I lose the keyboard on them (can't toggle Num Lock, Caps Lock, etc.), so I cannot take a coredump via the debugger. I have to power them off to clear this condition.

Q is: has anyone seen a situation like this and would your guess be software or hardware as the possible culprit? I've been chasing software, but I am starting to wonder.
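
One way to pin down exactly when a server drops off the network, so the time can be lined up against the backup job log, is to record reachability transitions from another machine. What follows is a minimal Python sketch and not something used in this thread: the names `watch` and `tcp_probe`, the host names, and the port are all hypothetical, and a TCP connect is used as a stand-in for ping.

```python
import socket
import time


def tcp_probe(host, port=524, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds.

    Port 524 (NCP) is an assumption; use any TCP port the server
    normally answers on.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def watch(hosts, probe, rounds, interval=0.0, clock=time.time):
    """Poll each host `rounds` times and return state transitions.

    Each event is a (timestamp, host, state) tuple, recorded only when
    a host's state differs from the previous poll. The first poll of
    each host is always recorded as its initial state.
    """
    last = {}
    events = []
    for i in range(rounds):
        for host in hosts:
            up = probe(host)
            if last.get(host) != up:
                events.append((clock(), host, "up" if up else "DOWN"))
                last[host] = up
        if interval and i < rounds - 1:
            time.sleep(interval)
    return events


# Example (hypothetical host names): poll every 30 s for about 8 hours.
# events = watch(["fs-node1", "fs-node2"], tcp_probe, rounds=960, interval=30)
```

The probe is passed in as a function so it can be swapped for whatever check fits the environment. The timestamp of the first "DOWN" event for a node is the value to compare against the backup log.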

TheLad · Nov 3, 2004

I would go for hardware as well, possibly memory, NIC or processor. Might be worth starting with reseating as strangely this has worked for me in the past as some dust or other crap has settled itself on the components and caused this sort of issue.

Also, what version including Service Pack are you running and what is your Backup software?

-----------------------------------------------------
"It's true, its damn true!"
-----------------------------------------------------

Lou0686 · Nov 3, 2004

Hi,

I would agree with TheLad if only one server was affected, but since multiple servers are affected and it happens during backup, I would first look at the backup agents running on the servers.

Same question as above: what backup software?

Lou

summoner · Nov 3, 2004

Are you using a UPS on these systems? Dirty power can also cause these symptoms. If these systems are on UPS, you might want to verify that the UPS is outputting clean stable power.

ITsmyfault · Nov 3, 2004

Thanks all for the responses. To answer your questions:
Backup app is Syncsort Backup Express 2.1.5D
Server OS is NetWare 6.5 SP2 with a group of subsequent patches (8732 DS patch, WSock6e, tp657ha)
Servers are IBM x345: one 3-node cluster and one 2-node cluster.
Switches are McData. Fabric is 2 Gb Fibre Channel.
Storage is IBM FAStT600 (LSI).
Tape library is IBM 3582-LTO (rebranded ADIC library), which is on the fabric.

Things run great, but then blow up somewhere between 1 and 3 weeks in. Sometimes the unresponsive server only takes itself out, sometimes it takes the cluster out.

Started out working on memory fragmentation, and this seems to be solved: fragmented memory is 0-1% of total whenever I check it. We've also been looking a lot at cluster communications and have done a lot of work on the NICs, ECB settings, TCP patches, etc. Also looked at Fibre Channel zoning (which appears to be fine).

It's been hard to track down exactly what is going on. The backup is LAN-free and all the servers share the tape library, so it's not a clear-cut case where the server backing up the resource is always the one that goes. There are 14 cluster resources on the file & print cluster and 6 on the other (GroupWise) cluster. But it is starting to look like it happens more often to a server that has the tape drive mounted and is actively backing up a volume, although this has not always been the case (I think). The biggest issue for me is not being able to get a coredump.
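
To turn the "I think" about tape-drive ownership into a count, each hang could be logged as a pair (server that hung, server that held the tape drive at the time) and tallied. A minimal Python sketch under that assumption follows; the function name `tally_hangs`, the category labels, and the node names in the usage note are hypothetical, not data from these clusters.

```python
from collections import Counter


def tally_hangs(incidents):
    """Count hangs by whether the hung server held the tape drive.

    `incidents` is an iterable of (hung_server, tape_holder) pairs.
    Use None for tape_holder when it was not recorded.
    """
    counts = Counter()
    for hung, holder in incidents:
        if holder is None:
            counts["unknown"] += 1
        elif hung == holder:
            counts["held_tape"] += 1
        else:
            counts["no_tape"] += 1
    return counts
```

For example, `tally_hangs([("node1", "node1"), ("node2", "node1"), ("node3", None)])` puts one incident in each of the three categories. If "held_tape" dominates after a few more incidents, that points at the tape path or the backup module driving it rather than at the cluster in general.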

Thanks again for the ideas and direction. Much appreciated.

marvhuffaker · Nov 5, 2004

I've seen hardware AND software issues cause this type of problem. Although my first hunch is hardware, I can't really remember the specific issues I've seen; it's not something you see very often. What about the temperature of the system? Are all the fans in the system operational? I have seen fluctuations in CPU temp cause problems.

Marvin Huffaker MCNE, CNE
Marvin Huffaker Consulting

http://www.redjuju.com

ITsmyfault · Nov 5, 2004

Temps are good; the room is a steady 68 deg (or less, usually).
We're looking for an NMI board. Know of anyone who still sells them? So far Google is not helping. In the old days you used 'Periscope' boards to generate a non-maskable interrupt (NMI) to free up the CPU from whatever it was doing. This often gives you the console back so you can get to the debugger (by abending the OS). In the really old days you could short the right two pins together on the CPU and get the same effect.

P4's are not backwards compatible in this way afaik.
We're also looking at zoning the SAN, even though CV says you don't need to zone in NetWare: separating the tape from the storage and keeping the HBAs from seeing each other. There is some concern that if one HBA asks for a LIP reset (for example) it could affect the other HBAs in unpredictable ways. I dunno.
Tonight I'm flashing the server/diag/ISM firmware, then picking a Friday night to flash all the SAN components (HBAs, FAStT, drives, etc.) and see where that gets us.

marvhuffaker · Dec 15, 2004

Itsmyfault, did you resolve this?

I ran into a similar case last week with NW6SP5 and the CE1000b.LAN driver. The server would come up and, once the NIC driver had loaded, it seemed fine. But as soon as you tried to do anything from the server (monitor, inetcfg, etc.) it would lock up tight and be totally unresponsive. Had to power off.

Backrev'd the driver and that fixed it.

Marvin Huffaker MCNE, CNE
Marvin Huffaker Consulting

http://www.redjuju.com

ITsmyfault · Dec 16, 2004

Hi Marv:

We *think* we've found it. Still waiting to see how we do. The old problem would happen inside a ~24-day window, so it will be a few weeks before we really know. Right now it appears that it may have been an issue with one module of the backup app we're using and the way we're using it. If this turns out to be the case, we'll be one of 2 clients who experienced it. Yeah, I feel special.

The app uses shared memory in a particular way, and if that goes bad, the server will hang big time. We even tried to generate a software NMI (multi-CPU NetWare servers can do this via ACPI), but even that would not work (!).
We tried a *lot* of different things. We've been working on this since April!! We even re-zoned the SAN to single-initiator (it was one big zone) and separated the tape HBAs from the disk HBAs.
Did all the SAN firmware & drivers, played with lots of SET parameters on the NetWare side, and also switched from IBM MPIO to NetWare MPIO. Roughest issue I've ever seen, honestly.
