Hi,
We've got some HACMP clusters, and we're seeing a lot of these errors in errpt, along with a lot of Missed HBs:
Code:
IDENTIFIER TIMESTAMP T C RESOURCE_NAME DESCRIPTION
3D32B80D 0919220308 P S topsvcs NIM thread blocked
The clusters run on LPARs on two p5 550 servers.
I've searched the internet but haven't found a solution. I've found some APARs, but they relate to AIX 5.1 or 5.2, and we're on AIX 5.3.
When these errors started to appear, there was no heartbeat over disk defined. We read that this could be the cause, so we defined a non-IP network for diskhb on hdisk4 (used only for diskhb), but it hasn't solved the problem.
Has anyone run into this and found a solution?
The configuration is AIX 5.3, HACMP 5.3, TCP/IP heartbeat and diskhb heartbeat.
Code:
# oslevel -s
5300-06-04-0748
# lslpp -L cluster*
Fileset Level State Type Description (Uninstaller)
----------------------------------------------------------------------------
cluster.adt.es.client.include
5.3.0.0 C F ES Client Include Files
cluster.adt.es.client.samples.clinfo
5.3.0.0 C F ES Client CLINFO Samples
cluster.adt.es.client.samples.clstat
5.3.0.0 C F ES Client Clstat Samples
cluster.adt.es.client.samples.libcl
5.3.0.0 C F ES Client LIBCL Samples
cluster.adt.es.java.demo.monitor
5.3.0.0 C F ES Web Based Monitor Demo
cluster.es.client.lib 5.3.0.2 C F ES Client Libraries
cluster.es.client.rte 5.3.0.3 C F ES Client Runtime
cluster.es.client.utils 5.3.0.1 C F ES Client Utilities
cluster.es.client.wsm 5.3.0.2 C F Web based Smit
cluster.es.clvm.rte 5.3.0.0 C F ES for AIX Concurrent Access
cluster.es.cspoc.cmds 5.3.0.3 C F ES CSPOC Commands
cluster.es.cspoc.dsh 5.3.0.0 C F ES CSPOC dsh
cluster.es.cspoc.rte 5.3.0.3 C F ES CSPOC Runtime Commands
cluster.es.plugins.dhcp 5.3.0.0 C F ES Plugins - dhcp
cluster.es.plugins.dns 5.3.0.0 C F ES Plugins - Name Server
cluster.es.plugins.printserver
5.3.0.0 C F ES Plugins - Print Server
cluster.es.server.cfgast 5.3.0.0 C F ES Two-Node Configuration
Assistant
cluster.es.server.diag 5.3.0.3 C F ES Server Diags
cluster.es.server.events 5.3.0.3 C F ES Server Events
cluster.es.server.rte 5.3.0.3 C F ES Base Server Runtime
cluster.es.server.testtool
5.3.0.2 C F ES Cluster Test Tool
cluster.es.server.utils 5.3.0.4 C F ES Server Utilities
cluster.es.worksheets 5.3.0.1 C F Online Planning Worksheets
cluster.license 5.3.0.0 C F HACMP Electronic License
cluster.man.en_US.es.data 5.3.0.1 C F ES Man Pages - U.S. English
cluster.msg.en_US.cspoc 5.3.0.0 C F HACMP CSPOC Messages - U.S.
English
cluster.msg.en_US.es.client
5.3.0.0 C F ES Client Messages - U.S.
English
cluster.msg.en_US.es.server
5.3.0.1 C F ES Recovery Driver Messages -
U.S. English
State codes:
A -- Applied.
B -- Broken.
C -- Committed.
E -- EFIX Locked.
O -- Obsolete. (partially migrated to newer version)
? -- Inconsistent State...Run lppchk -v.
Type codes:
F -- Installp Fileset
P -- Product
C -- Component
T -- Feature
R -- RPM Package
# lssrc -ls topsvcs
Subsystem Group PID Status
topsvcs topsvcs 463096 active
Network Name Indx Defd Mbrs St Adapter ID Group ID
net_ether_01_0 [ 0] 2 2 S 3.3.3.1 3.3.3.2
net_ether_01_0 [ 0] en1 0x401247c9 0x40124949
HB Interval = 2.000 secs. Sensitivity = 12 missed beats
Missed HBs: Total: 95 Current group: 39
Packets sent : 6801168 ICMP 69 Errors: 0 No mbuf: 0
Packets received: 12324475 ICMP 139 Dropped: 44
NIM's PID: 344068
net_ether_01_1 [ 1] 2 2 S 3.3.4.1 3.3.4.2
net_ether_01_1 [ 1] en0 0x401247ca 0x40124949
HB Interval = 2.000 secs. Sensitivity = 12 missed beats
Missed HBs: Total: 81 Current group: 33
Packets sent : 6801176 ICMP 53 Errors: 0 No mbuf: 0
Packets received: 12324472 ICMP 72 Dropped: 23
NIM's PID: 520434
diskhb_0 [ 2] 2 2 S 255.255.10.0 255.255.10.1
diskhb_0 [ 2] rhdisk4 0x801247c8 0x80cec110
HB Interval = 2.000 secs. Sensitivity = 4 missed beats
Missed HBs: Total: 7545 Current group: 597
Packets sent : 6470498 ICMP 0 Errors: 0 No mbuf: 0
Packets received: 6811739 ICMP 0 Dropped: 0
NIM's PID: 561184
2 locally connected Clients with PIDs:
haemd(409704) hagsd(446508)
Dead Man Switch Enabled:
reset interval = 1 seconds
trip interval = 48 seconds
Client Heartbeating Disabled.
Configuration Instance = 149
Daemon employs no security
Segments pinned: Text Data.
Text segment size: 809 KB. Static data segment size: 1520 KB.
Dynamic data segment size: 3969. Number of outstanding malloc: 218
User time 487 sec. System time 507 sec.
Number of page faults: 590. Process swapped out 0 times.
Number of nodes up: 2. Number of nodes down: 0.
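To compare the missed-heartbeat counters between networks without reading the whole listing each time, a small filter along these lines might help. It is only a sketch: the function name `summarize_missed_hbs` is made up for illustration, and it assumes the `lssrc -ls topsvcs` output keeps the layout shown above (a network line, an adapter line, then a "Missed HBs:" line).

```shell
# Hypothetical helper: reads `lssrc -ls topsvcs` output on stdin and prints
# one line per heartbeat network with its interface and Missed HBs counters.
summarize_missed_hbs() {
  awk '
    /\[ *[0-9]+\]/ {
      # Drop the "[ n]" index so the remaining fields split cleanly.
      line = $0
      sub(/\[ *[0-9]+\]/, "", line)
      n = split(line, f, " ")
      net = f[1]
      # Adapter line: name, interface, adapter id, group id.
      if (n == 4) iface = f[2]
    }
    /Missed HBs:/ {
      # Layout assumed: "Missed HBs: Total: 95 Current group: 39"
      printf "%s %s total=%s current=%s\n", net, iface, $4, $7
    }
  '
}
```

Usage would be `lssrc -ls topsvcs | summarize_missed_hbs`, run on each node and compared over time.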
The errors look like this:
Code:
LABEL: TS_NIM_ERROR_STUCK_
IDENTIFIER: 3D32B80D
Date/Time: Fri Sep 19 22:03:36 DFT 2008
Sequence Number: 35487
Machine Id: 000AAA8CD600
Node Id: cl1_node1
Class: S
Type: PERM
Resource Name: topsvcs
Description
NIM thread blocked
Probable Causes
A thread in a Topology Services Network Interface Module (NIM) process
was blocked
Topology Services NIM process cannot get timely access to CPU
User Causes
Excessive memory consumption is causing high memory contention
Excessive disk I/O is causing high memory contention
Recommended Actions
Examine I/O and memory activity on the system
Reduce load on the system
Tune virtual memory parameters
Call IBM Service if problem persists
Failure Causes
Excessive virtual memory activity prevents NIM from making progress
Excessive disk I/O traffic is interfering with paging I/O
Recommended Actions
Examine I/O and memory activity on the system
Reduce load on the system
Tune virtual memory parameters
Call IBM Service if problem persists
Detail Data
DETECTING MODULE
rsct,nim_control.C,1.39.1.18,5919
ERROR ID
6BUfAx.MK.p6/c7X0I4.e.1...................
REFERENCE CODE
Thread which was blocked
send thread
Interval in seconds during which process was blocked
8
Interface name
rhdisk4
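To put numbers on how often each interface shows up in these entries, the detailed report can be tallied with something like the following. This is a rough sketch: `count_blocked_by_interface` is a made-up name, and it assumes each entry prints the "Interface name" label on one line and the value on the next, as in the entry above.

```shell
# Hypothetical helper: reads `errpt -a` output on stdin and counts how many
# entries name each interface on the line following "Interface name".
count_blocked_by_interface() {
  awk '
    grab {
      # This is the value line that follows the label; trim and count it.
      sub(/^[ \t]+/, ""); sub(/[ \t]+$/, "")
      count[$0]++
      grab = 0
      next
    }
    /^[ \t]*Interface name/ { grab = 1 }
    END { for (i in count) printf "%s %d\n", i, count[i] }
  ' | sort
}
```

Usage would be `errpt -a -j 3D32B80D | count_blocked_by_interface`, where 3D32B80D is the identifier of the "NIM thread blocked" entries shown above.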
The Interface name is almost always rhdisk4 (used for diskhb), but sometimes it's en0 or en1.
In this cluster the nodes have dedicated internal disks, dedicated HBAs and dedicated Ethernet adapters. We're seeing the same errors on another cluster (on the same p5 550 servers) with virtualized disks and Ethernet (through a VIO Server).
On one of the nodes we have also seen some "Late in sending heartbeat" errors in errpt:
Code:
LABEL: TS_LATEHB_PE
IDENTIFIER: 3C81E43F
Date/Time: Fri Sep 26 07:09:13 CDT 2008
Sequence Number: 61055
Machine Id: 000AAA62D600
Node Id: cl2_node1
Class: U
Type: PERF
Resource Name: topsvcs
Resource Class: NONE
Resource Type: NONE
Location:
VPD:
Description
Late in sending heartbeat
Probable Causes
Heavy CPU load
Severe physical memory shortage
Heavy I/O activities
Failure Causes
Daemon can not get required system resource
Recommended Actions
Reduce the system load
Detail Data
DETECTING MODULE
rsct,bootstrp.C,1.190,4576
ERROR ID
6zESUw.d1Br6/3JJ.M4.e.1...................
REFERENCE CODE
A heartbeat is late by the following number of seconds
40
One of the clusters runs Oracle, and the other runs WAS. The load is not too heavy and paging space usage is low (about 25% of 8704MB for Oracle, and about 15% of 1536MB for WAS).
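For reference, the paging figures above can be turned into absolute numbers with a filter like this one. It is a sketch only: `paging_used_mb` is a made-up name, and it assumes `lsps -s` prints a header line followed by a line with the total in MB and the percentage used.

```shell
# Hypothetical helper: reads `lsps -s` output on stdin and prints the paging
# space in use in MB (truncated to a whole number).
paging_used_mb() {
  awk 'NR > 1 && NF >= 2 {
    total = $1; pct = $2
    sub(/MB$/, "", total)   # "8704MB" -> "8704"
    sub(/%$/, "", pct)      # "25%"    -> "25"
    printf "%d MB used of %d MB (%d%%)\n", total * pct / 100, total, pct
  }'
}
```

Usage would be `lsps -s | paging_used_mb` on each node.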
Does anyone know how to troubleshoot this, or know a solution?
Thanks.