Errpt: NIM thread blocked

MoreFeo · Sep 30, 2008

Hi,

We've got some HACMP clusters, and we're seeing a lot of these errors in errpt:

Code:

IDENTIFIER TIMESTAMP  T C RESOURCE_NAME  DESCRIPTION
3D32B80D   0919220308 P S topsvcs        NIM thread blocked

and we're seeing a lot of Missed HBs.

The clusters are on LPARs on 2 p5 550.

I've searched the internet but haven't found a solution. I've found some APARs, but they relate to AIX 5.1 or 5.2 (

http://www-01.ibm.com/support/docview.wss?uid=isg1IY38496

or

http://www-01.ibm.com/support/docview.wss?uid=isg1IY56353),

and we're on AIX 5.3.

When these errors started to appear, no heartbeat over disk was defined. We read that this could be the cause, so we defined a non-IP network for diskhb on hdisk4 (used only for diskhb), but it hasn't solved the problem.

Has anyone been through this and found a solution?

The configuration is AIX 5.3, HACMP 5.3, TCPIP heartbeat and diskhb heartbeat.

Code:

# oslevel -s
5300-06-04-0748
# lslpp -L cluster*
  Fileset                      Level  State  Type  Description (Uninstaller)
  ----------------------------------------------------------------------------
  cluster.adt.es.client.include
                             5.3.0.0    C     F    ES Client Include Files
  cluster.adt.es.client.samples.clinfo
                             5.3.0.0    C     F    ES Client CLINFO Samples
  cluster.adt.es.client.samples.clstat
                             5.3.0.0    C     F    ES Client Clstat Samples
  cluster.adt.es.client.samples.libcl
                             5.3.0.0    C     F    ES Client LIBCL Samples
  cluster.adt.es.java.demo.monitor
                             5.3.0.0    C     F    ES Web Based Monitor Demo
  cluster.es.client.lib      5.3.0.2    C     F    ES Client Libraries
  cluster.es.client.rte      5.3.0.3    C     F    ES Client Runtime
  cluster.es.client.utils    5.3.0.1    C     F    ES Client Utilities
  cluster.es.client.wsm      5.3.0.2    C     F    Web based Smit
  cluster.es.clvm.rte        5.3.0.0    C     F    ES for AIX Concurrent Access
  cluster.es.cspoc.cmds      5.3.0.3    C     F    ES CSPOC Commands
  cluster.es.cspoc.dsh       5.3.0.0    C     F    ES CSPOC dsh
  cluster.es.cspoc.rte       5.3.0.3    C     F    ES CSPOC Runtime Commands
  cluster.es.plugins.dhcp    5.3.0.0    C     F    ES Plugins - dhcp
  cluster.es.plugins.dns     5.3.0.0    C     F    ES Plugins - Name Server
  cluster.es.plugins.printserver
                             5.3.0.0    C     F    ES Plugins - Print Server
  cluster.es.server.cfgast   5.3.0.0    C     F    ES Two-Node Configuration
                                                   Assistant
  cluster.es.server.diag     5.3.0.3    C     F    ES Server Diags
  cluster.es.server.events   5.3.0.3    C     F    ES Server Events
  cluster.es.server.rte      5.3.0.3    C     F    ES Base Server Runtime
  cluster.es.server.testtool
                             5.3.0.2    C     F    ES Cluster Test Tool
  cluster.es.server.utils    5.3.0.4    C     F    ES Server Utilities
  cluster.es.worksheets      5.3.0.1    C     F    Online Planning Worksheets
  cluster.license            5.3.0.0    C     F    HACMP Electronic License
  cluster.man.en_US.es.data  5.3.0.1    C     F    ES Man Pages - U.S. English
  cluster.msg.en_US.cspoc    5.3.0.0    C     F    HACMP CSPOC Messages - U.S.
                                                   English
  cluster.msg.en_US.es.client
                             5.3.0.0    C     F    ES Client Messages - U.S.
                                                   English
  cluster.msg.en_US.es.server
                             5.3.0.1    C     F    ES Recovery Driver Messages -
                                                   U.S. English


State codes:
 A -- Applied.
 B -- Broken.
 C -- Committed.
 E -- EFIX Locked.
 O -- Obsolete.  (partially migrated to newer version)
 ? -- Inconsistent State...Run lppchk -v.

Type codes:
 F -- Installp Fileset
 P -- Product
 C -- Component
 T -- Feature
 R -- RPM Package
# lssrc -ls topsvcs
Subsystem         Group            PID     Status
 topsvcs          topsvcs          463096  active
Network Name   Indx Defd  Mbrs  St   Adapter ID      Group ID
net_ether_01_0 [ 0] 2     2     S    3.3.3.1         3.3.3.2
net_ether_01_0 [ 0] en1              0x401247c9      0x40124949
HB Interval = 2.000 secs. Sensitivity = 12 missed beats
Missed HBs: Total: 95 Current group: 39
Packets sent    : 6801168 ICMP 69 Errors: 0 No mbuf: 0
Packets received: 12324475 ICMP 139 Dropped: 44
NIM's PID: 344068
net_ether_01_1 [ 1] 2     2     S    3.3.4.1         3.3.4.2
net_ether_01_1 [ 1] en0              0x401247ca      0x40124949
HB Interval = 2.000 secs. Sensitivity = 12 missed beats
Missed HBs: Total: 81 Current group: 33
Packets sent    : 6801176 ICMP 53 Errors: 0 No mbuf: 0
Packets received: 12324472 ICMP 72 Dropped: 23
NIM's PID: 520434
diskhb_0       [ 2] 2     2     S    255.255.10.0    255.255.10.1
diskhb_0       [ 2] rhdisk4          0x801247c8      0x80cec110
HB Interval = 2.000 secs. Sensitivity = 4 missed beats
Missed HBs: Total: 7545 Current group: 597
Packets sent    : 6470498 ICMP 0 Errors: 0 No mbuf: 0
Packets received: 6811739 ICMP 0 Dropped: 0
NIM's PID: 561184
  2 locally connected Clients with PIDs:
haemd(409704) hagsd(446508)
  Dead Man Switch Enabled:
     reset interval = 1 seconds
     trip  interval = 48 seconds
  Client Heartbeating Disabled.
  Configuration Instance = 149
  Daemon employs no security
  Segments pinned: Text Data.
  Text segment size: 809 KB. Static data segment size: 1520 KB.
  Dynamic data segment size: 3969. Number of outstanding malloc: 218
  User time 487 sec. System time 507 sec.
  Number of page faults: 590. Process swapped out 0 times.
  Number of nodes up: 2. Number of nodes down: 0.
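
To compare the networks side by side, here is a minimal Python sketch that pulls the heartbeat settings and missed-beat counters out of saved `lssrc -ls topsvcs` output. The function names `summarize` and `detection_time` are made up for illustration, and the detection-time formula (interval x sensitivity x 2) is the rule of thumb from the HACMP documentation as I understand it, so treat it as an assumption. The sample text is an excerpt of the output above.

```python
import re

# Excerpt of the `lssrc -ls topsvcs` output quoted in this post.
SAMPLE = """\
net_ether_01_0 [ 0] 2     2     S    3.3.3.1         3.3.3.2
net_ether_01_0 [ 0] en1              0x401247c9      0x40124949
HB Interval = 2.000 secs. Sensitivity = 12 missed beats
Missed HBs: Total: 95 Current group: 39
diskhb_0       [ 2] 2     2     S    255.255.10.0    255.255.10.1
diskhb_0       [ 2] rhdisk4          0x801247c8      0x80cec110
HB Interval = 2.000 secs. Sensitivity = 4 missed beats
Missed HBs: Total: 7545 Current group: 597
"""

NET_RE = re.compile(r"^(\S+)\s+\[\s*\d+\]\s+(\S+)")
HB_RE = re.compile(r"HB Interval = ([\d.]+) secs\. Sensitivity = (\d+) missed beats")
MISSED_RE = re.compile(r"Missed HBs: Total: (\d+) Current group: (\d+)")


def summarize(text):
    """Return {network_name: {device, interval, sensitivity, missed_*}}."""
    networks = {}
    current = None
    for line in text.splitlines():
        net = NET_RE.match(line)
        hb = HB_RE.search(line)
        missed = MISSED_RE.search(line)
        if net:
            current = networks.setdefault(net.group(1), {})
            # The second line per network carries the adapter name
            # (en1, rhdisk4) where the first line carries a number.
            if not net.group(2).isdigit():
                current["device"] = net.group(2)
        elif hb and current is not None:
            current["interval"] = float(hb.group(1))
            current["sensitivity"] = int(hb.group(2))
        elif missed and current is not None:
            current["missed_total"] = int(missed.group(1))
            current["missed_current"] = int(missed.group(2))
    return networks


def detection_time(net):
    """Assumed rule of thumb: interval * sensitivity * 2, in seconds."""
    return net["interval"] * net["sensitivity"] * 2


if __name__ == "__main__":
    for name, net in summarize(SAMPLE).items():
        print(name, net["device"], net["missed_total"], detection_time(net))
```

Under that assumption the Ethernet networks come out at 48 seconds, which matches the dead man switch trip interval shown above, and diskhb at 16 seconds. The diskhb network also has by far the most missed beats (7545 against 95 and 81).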

The errors are like:

Code:

LABEL:          TS_NIM_ERROR_STUCK_
IDENTIFIER:     3D32B80D

Date/Time:       Fri Sep 19 22:03:36 DFT 2008
Sequence Number: 35487
Machine Id:      000AAA8CD600
Node Id:         cl1_node1
Class:           S
Type:            PERM
Resource Name:   topsvcs

Description
NIM thread blocked

Probable Causes
A thread in a Topology Services Network Interface Module (NIM) process
was blocked
Topology Services NIM process cannot get timely access to CPU

User Causes
Excessive memory consumption is causing high memory contention
Excessive disk I/O is causing high memory contention

        Recommended Actions
        Examine I/O and memory activity on the system
        Reduce load on the system
        Tune virtual memory parameters
        Call IBM Service if problem persists

Failure Causes
Excessive virtual memory activity prevents NIM from making progress
Excessive disk I/O traffic is interfering with paging I/O

        Recommended Actions
        Examine I/O and memory activity on the system
        Reduce load on the system
        Tune virtual memory parameters
        Call IBM Service if problem persists

Detail Data
DETECTING MODULE
rsct,nim_control.C,1.39.1.18,5919
ERROR ID
6BUfAx.MK.p6/c7X0I4.e.1...................
REFERENCE CODE

Thread which was blocked
send thread
Interval in seconds during which process was blocked
           8
Interface name
rhdisk4

The Interface name is almost always rhdisk4 (used for diskhb), but sometimes it's en0 or en1.
In this cluster the nodes have dedicated internal disks, HBAs and Ethernet adapters. We're seeing the same errors on another cluster (on the same p5 550 servers) that uses virtualized disks and Ethernet (through VIO Server).
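
To put numbers on "almost always rhdisk4", here is a minimal Python sketch that tallies the Interface name and the longest blocked interval per interface from saved detailed error reports. It assumes the report was captured with something like `errpt -a -j 3D32B80D > /tmp/nim_blocked.txt` (the file path is just an example), and `tally_blocked` is a made-up helper name. The sample is the Detail Data section of the entry quoted above.

```python
from collections import Counter

# Detail Data section of the entry quoted in this post.
SAMPLE = """\
Detail Data
DETECTING MODULE
rsct,nim_control.C,1.39.1.18,5919
ERROR ID
6BUfAx.MK.p6/c7X0I4.e.1...................
REFERENCE CODE

Thread which was blocked
send thread
Interval in seconds during which process was blocked
           8
Interface name
rhdisk4
"""


def tally_blocked(report):
    """Count entries per interface and track the longest blocked interval.

    In `errpt -a` detail data each label line is followed by its value
    on the next line, so the report is walked as (label, value) pairs.
    """
    counts = Counter()
    longest = {}
    interval = None
    lines = [line.strip() for line in report.splitlines()]
    for label, value in zip(lines, lines[1:]):
        if label.startswith("Interval in seconds during which") and value.isdigit():
            interval = int(value)
        elif label == "Interface name" and value:
            counts[value] += 1
            if interval is not None:
                longest[value] = max(longest.get(value, 0), interval)
            interval = None
    return counts, longest


if __name__ == "__main__":
    counts, longest = tally_blocked(SAMPLE)
    for name, count in counts.most_common():
        print(name, count, longest.get(name))
```

Running it over the full report from each node would show whether en0 and en1 are blocked at the same times as rhdisk4 or independently.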

In one of the nodes we have also seen some "Late in sending heartbeat" errors in errpt:

Code:

LABEL:          TS_LATEHB_PE
IDENTIFIER:     3C81E43F

Date/Time:       Fri Sep 26 07:09:13 CDT 2008
Sequence Number: 61055
Machine Id:      000AAA62D600
Node Id:         cl2_node1
Class:           U
Type:            PERF
Resource Name:   topsvcs
Resource Class:  NONE
Resource Type:   NONE
Location:
VPD:

Description
Late in sending heartbeat

Probable Causes
Heavy CPU load
Severe physical memory shortage
Heavy I/O activities

Failure Causes
Daemon can not get required system resource

        Recommended Actions
        Reduce the system load

Detail Data
DETECTING MODULE
rsct,bootstrp.C,1.190,4576
ERROR ID
6zESUw.d1Br6/3JJ.M4.e.1...................
REFERENCE CODE

A heartbeat is late by the following number of seconds
          40

One cluster runs Oracle and the other runs WAS. The load is not too heavy and paging space usage is low (about 25% of 8704MB for Oracle, and about 15% of 1536MB for WAS).

Does anyone know how to troubleshoot this, or know a solution?

Thanks.

MoreFeo · Sep 30, 2008

I've seen something more on this topic.

I've seen that xntpd is running on our servers, but their clocks are not synchronized.

So on one server I stopped the xntpd service and manually ran ntpdate against the time server (a Windows server).

The server synchronized its date and time, and some new "NIM thread blocked" errors appeared in errpt right away:

Code:

# date
Tue Sep 30 15:37:52 DFT 2008
# errpt
# stopsrc -s xntpd
0513-044 The /usr/sbin/xntpd Subsystem was requested to stop.
# ntpdate ntp_server_ip
# 30 Sep 15:40:45 ntpdate[753858]: step time server 192.168.1.224 offset 107.380770
# date
Tue Sep 30 15:40:47 DFT 2008
# startsrc -s xntpd
0513-059 The xntpd Subsystem has been started. Subsystem PID is 598076.
# errpt
IDENTIFIER TIMESTAMP  T C RESOURCE_NAME  DESCRIPTION
864D2CE3   0930154008 P S topsvcs        NIM thread blocked
864D2CE3   0930154008 P S topsvcs        NIM thread blocked
864D2CE3   0930154008 P S topsvcs        NIM thread blocked
864D2CE3   0930154008 P S topsvcs        NIM thread blocked
864D2CE3   0930154008 P S topsvcs        NIM thread blocked
864D2CE3   0930154008 P S topsvcs        NIM thread blocked
864D2CE3   0930154008 P S topsvcs        NIM thread blocked
864D2CE3   0930154008 P S topsvcs        NIM thread blocked
864D2CE3   0930154008 P S topsvcs        NIM thread blocked
864D2CE3   0930154008 P S topsvcs        NIM thread blocked
864D2CE3   0930154008 P S topsvcs        NIM thread blocked
864D2CE3   0930154008 P S topsvcs        NIM thread blocked
3C81E43F   0930154008 P U topsvcs        Late in sending heartbeat
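
The TIMESTAMP column in the errpt summary is in MMDDhhmmYY form, so 0930154008 is Sep 30 2008 at 15:40, the same minute as the ntpdate step logged at 15:40:45. Here is a minimal Python sketch for decoding that column and counting entries per minute, which makes it easier to line errors up against clock changes. The helper names are made up for illustration, and the sample reproduces the summary lines above.

```python
from collections import Counter
from datetime import datetime

# The summary lines quoted in this post: 12 identical NIM entries plus one late heartbeat.
NIM_LINE = "864D2CE3   0930154008 P S topsvcs        NIM thread blocked\n"
SAMPLE = (
    "IDENTIFIER TIMESTAMP  T C RESOURCE_NAME  DESCRIPTION\n"
    + NIM_LINE * 12
    + "3C81E43F   0930154008 P U topsvcs        Late in sending heartbeat\n"
)


def parse_errpt_summary(text):
    """Return a list of (datetime, identifier, description) tuples."""
    entries = []
    for line in text.splitlines():
        fields = line.split(None, 5)
        # Skip the header and anything without a 10-digit timestamp.
        if len(fields) < 6 or len(fields[1]) != 10 or not fields[1].isdigit():
            continue
        when = datetime.strptime(fields[1], "%m%d%H%M%y")
        entries.append((when, fields[0], fields[5]))
    return entries


def count_by_minute(entries):
    """Count entries per (minute, description)."""
    return Counter((when, desc) for when, _ident, desc in entries)


if __name__ == "__main__":
    for (when, desc), count in sorted(count_by_minute(parse_errpt_summary(SAMPLE)).items()):
        print(when.strftime("%Y-%m-%d %H:%M"), count, desc)
```

The summary only has minute resolution, so matching to the second needs the Date/Time field from `errpt -a`.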

This is the /etc/ntp.conf file:

Code:

# @(#)48        1.2  src/tcpip/etc/ntp.conf, ntp, tcpip530 2/16/96 10:16:34
# IBM_PROLOG_BEGIN_TAG
# This is an automatically generated prolog.
#
# tcpip530 src/tcpip/etc/ntp.conf 1.2
#
# Licensed Materials - Property of IBM
#
# Restricted Materials of IBM
#
# (C) COPYRIGHT International Business Machines Corp. 1996
# All Rights Reserved
#
# US Government Users Restricted Rights - Use, duplication or
# disclosure restricted by GSA ADP Schedule Contract with IBM Corp.
#
# IBM_PROLOG_END_TAG
#
#   COMPONENT_NAME: ntp
#
#   FUNCTIONS: none
#
#   ORIGINS: 27,176
#
#
#   (C) COPYRIGHT International Business Machines Corp. 1996
#   All Rights Reserved
#   Licensed Materials - Property of IBM
#   US Government Users Restricted Rights - Use, duplication or
#   disclosure restricted by GSA ADP Schedule Contract with IBM Corp.
#
#
#
#
# Default NTP configuration file.
#
#   Broadcast client, no authentication.
#
Servername ntp_server_ip
broadcastclient
driftfile /etc/ntp.drift
tracefile /etc/ntp.trace

Has anyone seen this kind of error with ntp?

Thanks.

ogniemi · Oct 1, 2008

I had a similar issue (not as many errors as you have, but several per day) and got this answer from IBM:

"...looks like a performance problem, something is running that is not allowing the nim threads to run, either CPU load or heavy paging"

Please look for a performance problem on your system...

unixfreak · Oct 2, 2008

When I've seen this in the past it has always looked like a performance problem, but it has also always been extremely unclear, and I've never managed to resolve the issue or met anyone who has. On one occasion IBM claimed it was a bug, but updating the software didn't fix it.

The annoying thing is that it never seems to have anything to do with the load on the machine, but that would be the first thing to monitor.

yisi · Oct 14, 2008

http://www-01.ibm.com/support/docview.wss?uid=isg1IZ02759

This may be it.

MoreFeo · Oct 14, 2008

Thanks. It refers to AIX 5.2 and we're on 5.3, but I'll take a look at it anyway.
