HACMP NIM thread blocked

Status
Not open for further replies.
Jul 28, 2004
Hi all,

We've got the following problem in our cluster:

The error report on node A keeps filling up, at about 5 entries every 4 minutes, with the following errors:
3C81E43F 0124092205 P U topsvcs Late in sending heartbeat
4FDB3BA1 0124092205 I S topsvcs DeadMan Switch (DMS) close to trigger
864D2CE3 0124092205 P S topsvcs NIM thread blocked
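
For anyone reading these entries later: the second column of the short errpt listing is a timestamp in MMDDhhmmYY form, so 0124092205 is 24 January, 09:22, year 05. A minimal sketch of a decoder follows; the function name decode_errpt_ts is made up for illustration, and the "20" century prefix is an assumption that only suits entries from 2000 onwards.

```shell
#!/bin/sh
# Hypothetical helper: turn an errpt short-format timestamp (MMDDhhmmYY)
# into YYYY-MM-DD hh:mm. The "20" century prefix is an assumption.
decode_errpt_ts() {
    ts=$1
    # Accept exactly ten digits, reject anything else.
    case $ts in
        [0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9]) ;;
        *) echo "not an errpt timestamp: $ts" >&2; return 1 ;;
    esac
    mm=$(printf '%s' "$ts" | cut -c1-2)   # month
    dd=$(printf '%s' "$ts" | cut -c3-4)   # day
    hh=$(printf '%s' "$ts" | cut -c5-6)   # hour
    mi=$(printf '%s' "$ts" | cut -c7-8)   # minute
    yy=$(printf '%s' "$ts" | cut -c9-10)  # two-digit year
    printf '20%s-%s-%s %s:%s\n' "$yy" "$mm" "$dd" "$hh" "$mi"
}

# Timestamp taken from the entries above.
decode_errpt_ts 0124092205
```

All three entries carry the same timestamp, which fits the description of them arriving together in bursts.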

When I run lssrc -ls topsvcs, this is what I get for tmssa:

NIM's PID: 520390
tmssa_0 [ 2] 2 2 S 255.255.2.0 255.255.2.1
tmssa_0 [ 2] ssa2 0x81f3500c 0x81f35021
HB Interval = 2 secs. Sensitivity = 5 missed beats
Missed HBs: Total: 33 Current group: 18
Packets sent : 4185480 ICMP 0 Errors: 5060 No mbuf: 0
Packets received: 4120068 ICMP 0 Dropped: 0
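
To watch whether these counters really stay frozen, it can help to boil the stanza down to one line and compare it over time. The sketch below is only an illustration: the function name hb_summary is made up, it assumes the field layout shown above, and it expects a single network's stanza on standard input (with several networks in the output it would report only the last one).

```shell
#!/bin/sh
# Hypothetical helper: summarise one network stanza of "lssrc -ls topsvcs".
# Assumes the layout shown in this thread; reads the stanza from stdin.
hb_summary() {
    awk '
        /NIM.s PID:/  { pid = $NF }
        /Missed HBs:/ { total = $4; current = $7 }
        /Packets sent/ {
            # The send error count follows the "Errors:" label.
            for (i = 1; i < NF; i++) if ($i == "Errors:") errors = $(i + 1)
        }
        END {
            printf "pid=%s missed_total=%s missed_current=%s send_errors=%s\n",
                   pid, total, current, errors
        }
    '
}

# Sample input copied from the node A output above.
hb_summary <<'EOF'
NIM's PID: 520390
tmssa_0 [ 2] 2 2 S 255.255.2.0 255.255.2.1
tmssa_0 [ 2] ssa2 0x81f3500c 0x81f35021
HB Interval = 2 secs. Sensitivity = 5 missed beats
Missed HBs: Total: 33 Current group: 18
Packets sent : 4185480 ICMP 0 Errors: 5060 No mbuf: 0
Packets received: 4120068 ICMP 0 Dropped: 0
EOF
```

Running it every few minutes and comparing the lines would show whether the missed-heartbeat and error counters move while the error report keeps growing.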


The number of errors strangely stays the same; it hasn't changed in hours, even though the error report keeps filling up. On node B of the cluster, I get the following:

NIM's PID: 340218
tmssa_0 [ 2] 2 2 S 255.255.2.1 255.255.2.1
tmssa_0 [ 2] ssa1 0x81f3501b 0x81f35021
HB Interval = 2 secs. Sensitivity = 5 missed beats
Missed HBs: Total: 0 Current group: 0
Packets sent : 66959 ICMP 0 Errors: 0 No mbuf: 0
Packets received: 69394 ICMP 0 Dropped: 0


Here there are no errors. We haven't got any open links, and the load on the system is no different from any other day.
The machines are running HACMP 5.1 on AIX 5.2.
Perhaps I should mention the following:

Yesterday node B was moved over to node A for maintenance and reacquired afterwards, which went without any problems. The entries on node A started appearing from that moment on.

Any suggestions for this are welcome

thx in advance,

greetz,

RMGBelgium
 
I was told to check my tuning parameters:

NOTE: This should be done on all nodes in the cluster and they will need to be rebooted for the change to take effect.
1. smitty chgsys
Set the high water mark to 33.
Set the low water mark to 24.

2. vi /sbin/rc.boot
Look for the following line:
nohup /usr/sbin/syncd 60 > /dev/null 2>&1 &
Change it to:
nohup /usr/sbin/syncd 10 > /dev/null 2>&1 &

3. smitty hacmp
Cluster Configuration
Cluster Topology
Configure Network Modules
Change/Show Network Modules (which network type do they use: Ethernet, Token Ring?)
Change the failure detection rate to Slow.

4. Comment out the pmd line in /etc/inittab.
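
If you would rather preview the syncd change from step 2 than edit /sbin/rc.boot by hand, something along these lines could work. It is only a sketch: the function name set_syncd_interval is made up, it works as a filter on standard input so nothing is modified in place, and you would still have to review the result and copy it back yourself.

```shell
#!/bin/sh
# Hypothetical helper: rewrite the syncd interval in rc.boot-style text.
# Reads from stdin, writes to stdout; the input file is never touched.
set_syncd_interval() {
    secs=$1
    # Only accept a plain positive number of seconds.
    case $secs in
        ''|*[!0-9]*) echo "interval must be a number" >&2; return 1 ;;
    esac
    sed "s|\(/usr/sbin/syncd\) [0-9][0-9]*|\1 $secs|"
}

# Demonstration on the line quoted in step 2.
echo 'nohup /usr/sbin/syncd 60 > /dev/null 2>&1 &' | set_syncd_interval 10
```

On a real node the usage would be along the lines of set_syncd_interval 10 < /sbin/rc.boot > /tmp/rc.boot.new, followed by a diff against the original before replacing anything. Keep a backup copy of rc.boot first.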
 
Hi,
You can also check the following.
If you are missing heartbeats, it sounds as if there may be a problem with the SSA heartbeat.

How are your SSA disks connected from one server to the other?

Check that the SSA disks don't show any flashing orange lights.

If possible (i.e. if you can bring your cluster down), try using the /dev/tmssa?.tm and .im devices: cat to the device on one server and read from it on the other.

HTH
 
Hi all,

I've received the answer from the IBM lab in Mainz:

It seems to be a bug... once again. I had to update the rsct.* filesets to fix it.

Thanks for the responses anyway!

greetz

R.
 