powerHA Dead Man Switch

svnhunix · Jun 22, 2010

Hello

After an upgrade from HACMP 5.4 to 5.5 I ran some cluster acceptance tests.
Doing so, I ran into a problem.
One test is to run "halt -q" on one node of a two-node cluster.
Normally the second node should take over the resources. Instead, it crashed with an 888 error code.

In the error log (errpt) I find this:
F48137AC 0619073010 U O minidump  COMPRESSED MINIMAL DUMP
225E3B63 0619073010 T S PANIC     SOFTWARE PROGRAM ABNORMALLY TERMINATED
9DBCFDEE 0619073110 T O errdemon  ERROR LOGGING TURNED ON
AB59ABFF 0619071010 U U LIBLVM    Remote node Concurrent Volume Group fail
90EDB0A5 0619071010 P S topsvcs   Dead Man Switch being allowed to expire.
173C787F 0619071010 I S topsvcs   Possible malfunction on local adapter
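
As an aid to reading the listing: errpt prints its timestamps as MMDDhhmmYY and lists the newest entries first. A minimal Python sketch that decodes the entries above and puts them in chronological order could look like this. The function name and the dictionary keys are only illustrative, and the century is assumed to be 20xx.

```python
from datetime import datetime

# errpt summary lines copied from the listing above.
ERRPT_LINES = [
    "F48137AC 0619073010 U O minidump COMPRESSED MINIMAL DUMP",
    "225E3B63 0619073010 T S PANIC SOFTWARE PROGRAM ABNORMALLY TERMINATED",
    "9DBCFDEE 0619073110 T O errdemon ERROR LOGGING TURNED ON",
    "AB59ABFF 0619071010 U U LIBLVM Remote node Concurrent Volume Group fail",
    "90EDB0A5 0619071010 P S topsvcs Dead Man Switch being allowed to expire.",
    "173C787F 0619071010 I S topsvcs Possible malfunction on local adapter",
]


def parse_errpt_line(line):
    """Split one errpt summary line into its six columns."""
    ident, stamp, etype, eclass, resource, description = line.split(None, 5)
    # Timestamp layout is MMDDhhmmYY; assume the year is 20YY.
    when = datetime(
        2000 + int(stamp[8:10]),  # YY
        int(stamp[0:2]),          # MM
        int(stamp[2:4]),          # DD
        int(stamp[4:6]),          # hh
        int(stamp[6:8]),          # mm
    )
    return {
        "id": ident,
        "when": when,
        "type": etype,
        "class": eclass,
        "resource": resource,
        "description": description,
    }


# sorted() is stable, so entries with the same minute keep their listed order.
events = sorted((parse_errpt_line(l) for l in ERRPT_LINES), key=lambda e: e["when"])

for e in events:
    print(e["when"].strftime("%Y-%m-%d %H:%M"), e["resource"], "-", e["description"])
```

Read this way, the topsvcs and LIBLVM entries come first and the PANIC and minidump entries follow.
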

So the node went down because the dead man switch timer expired.

While trying to solve this, I found some recommendations to set the syncd frequency to 10 instead of 60.
It seems the upgrade reset the frequency back to 60.

I can't test this, but can anyone explain to me how this works? Why would the syncd frequency cause problems when I halt another node? What is the link between the two?

best regards
Steven

sjm2 · Jun 25, 2010

If the PowerHA daemons on a node cannot get CPU time to send heartbeat packets, the other nodes will conclude that this node is dead and will begin recovery. For the recovery to proceed as intended, this node must be taken down. Hence the dead man switch (DMS). If the PowerHA daemons cannot get CPU time, they cannot send heartbeats, and they also cannot reset the DMS timer, so the timer expires and the node is brought down.
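
To make the mechanism concrete, here is a small Python simulation of a dead man switch. It is only a sketch of the idea, not how PowerHA/RSCT implements it: the class, the tick-based clock, and all the numbers are made up for illustration.

```python
class DeadManSwitch:
    """Toy watchdog: fires if it is not reset within `timeout` ticks."""

    def __init__(self, timeout):
        self.timeout = timeout
        self.last_reset = 0

    def reset(self, now):
        self.last_reset = now

    def expired(self, now):
        return now - self.last_reset > self.timeout


def simulate(timeout, heartbeat_interval, stall_start, stall_length, duration):
    """Return the tick at which the switch fires, or None if it never does.

    During the stall window the daemon gets no CPU, so it neither sends
    a heartbeat nor resets the timer.
    """
    dms = DeadManSwitch(timeout)
    for now in range(duration + 1):
        stalled = stall_start <= now < stall_start + stall_length
        if not stalled and now % heartbeat_interval == 0:
            # Daemon was scheduled: heartbeat goes out and the timer is reset.
            dms.reset(now)
        if dms.expired(now):
            return now
    return None


# Short stall: the daemon is back before the timer runs out.
print(simulate(timeout=5, heartbeat_interval=1, stall_start=10, stall_length=3, duration=30))
# Long stall: the timer expires and the node would be taken down.
print(simulate(timeout=5, heartbeat_interval=1, stall_start=10, stall_length=10, duration=30))
```

The point of the design is that the same starvation that stops the heartbeats also stops the timer resets, so a node that looks dead to its peers really does go down.
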

The syncd frequency is one possible reason that the PowerHA daemons might not get CPU time before the DMS times out. If syncd has a lot of I/O to do, for example because a lot of file pages are marked dirty, then that disk I/O will take place at a higher priority than the PowerHA daemons and so may "starve" the daemons of CPU. Decreasing the time between syncd runs allows less I/O to build up and therefore decreases the likelihood that the DMS will time out.
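
A rough back-of-the-envelope model of that effect, in Python. It assumes pages are dirtied at a steady rate and that each syncd run writes out everything accumulated since the previous run; the rates used (2 MB/s dirtied, 40 MB/s disk throughput) are invented for illustration and are not measurements from any system.

```python
def flush_burst(dirty_mb_per_s, syncd_interval_s, disk_mb_per_s):
    """Return (MB queued per syncd run, seconds of disk time to write it)."""
    queued_mb = dirty_mb_per_s * syncd_interval_s
    return queued_mb, queued_mb / disk_mb_per_s


# Compare the two syncd intervals mentioned in this thread.
for interval in (60, 10):
    queued, busy = flush_burst(dirty_mb_per_s=2, syncd_interval_s=interval, disk_mb_per_s=40)
    print(f"syncd every {interval}s: {queued} MB per flush, about {busy} s of disk I/O")
```

The total amount written over time is the same in both cases. What changes is the size of each burst, and it is the length of a single burst that matters when comparing against the DMS timeout.
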

DukeSSD · Jun 25, 2010

You need to get the whole 888 code, 888-10x-0cx.
If the dump failed you need to sort that out so the node can dump.
If the dump was good, call IBM and get it analysed.
If the dump was bad, sort out the dump failure then do the test again so you have a good dump for analysis.
Then take a snap -e from either node and a snap -ac from the failed node and send them off to IBM.
If you want to try to understand the problem yourself, then you need to check the HACMP logs; the errpt will not let you know what the cause was.

svnhunix · Jun 29, 2010

IBM found the solution:

IZ66768: SERIAL NETWORK THREAD MONITORING ERROR CAN CAUSE DMS TIMEOUT DURING HACMP FAILOVER

So I need to update the RSCT version that came with AIX 5.3 TL11.

Thanks for the help
Steven
