
HACMP 5.2 - strange problem


ogniemi (Technical User)
Hi,
I have a problem starting cluster services on the second node when the first node has already activated HACMP.
When I stop HACMP on the first node, I am able to start the cluster on the second node, and it acquires all
resource groups as it should. Has anyone seen this behaviour in HACMP 5.2?
Resource/topology synchronization completes successfully. There are two service addresses and two resource
groups in the cluster configuration. In normal cluster operation each resource group should be active
on its own node, but because I can currently start HACMP on only one node at a time,
both resource groups end up running on a single node.



The error below occurs when trying to start HACMP on the second node:


Starting Cluster Services on node: clnode02
This may take a few minutes. Please wait...
clnode02: start_cluster: Starting HACMP
clnode02: Changes detected on Node .
clnode02: Oct 11 2004 11:55:15 Starting execution of /usr/es/sbin/cluster/etc/rc.cluster
clnode02: with parameters: -boot -N -b
clnode02:
clnode02: odmget: Cannot open class HACMPcluster
clnode02: odmget: Cannot open class HACMPcluster
clnode02: odmget: Cannot open class HACMPtopsvcs
clnode02: Oct 11 2004 11:55:16
clnode02: rc.cluster: Error: Changes have been made to the Cluster Topology or Resource
clnode02: configuration. The Cluster Configuration must be synchronized before
clnode02: starting Cluster Services.
clnode02: cl_rsh had exit code = 1, see cspoc.log and/or clcomd.log for more information


It is not true that "Changes have been made to the Cluster Topology or Resource": no
changes have been made since the last synchronization, and each additional sync completes successfully
without solving the problem.


It is not true that "odmget: Cannot open class" because running "odmget HACMPcluster", "odmget HACMPtopsvcs"
completes with 0.

# odmget HACMPtopsvcs

HACMPtopsvcs:
hbInterval = 1
fibrillateCount = 4
runFixedPri = 1
fixedPriLevel = 38
tsLogLength = 5000
gsLogLength = 5000
instanceNum = 102
# echo $?
0
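
The cl_rsh failure at the end of the start log suggests rc.cluster also runs these odmget checks against the peer node, so a local test passing does not prove the remote path works. A minimal cross-check, assuming the standard /usr/es utilities path, run from the node that fails to start:

# /usr/es/sbin/cluster/utilities/cl_rsh clnode01 /bin/odmget HACMPtopsvcs
# echo $?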


Currently HACMP is started on one node only, and the resource status is:

# clfindres
-----------------------------------------------------------------------------
Group Name      Type             State      Location
-----------------------------------------------------------------------------
cl_group02      non-concurrent   OFFLINE    clnode02
                                 ONLINE     clnode01

cl_group01      non-concurrent   ONLINE     clnode01
                                 OFFLINE    clnode02


The policies for both RGs are:

Resource Group Name                           cl_group01
Participating Nodes (Default Node Priority)   clnode01 clnode02

Startup Policy    Online On First Available Node
Fallover Policy   Fallover To Next Priority Node In The List
Fallback Policy   Fallback To Higher Priority Node In The List


Resource Group Name                           cl_group02
Participating Nodes (Default Node Priority)   clnode02 clnode01

Startup Policy    Online On First Available Node
Fallover Policy   Fallover To Next Priority Node In The List
Fallback Policy   Fallback To Higher Priority Node In The List


Does anyone have an idea how to solve it?

brgrds,
r,m.
 
Is the rhosts file configured correctly?
From memory, HACMP relies on rhosts/rsh to communicate between the nodes.
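
For reference, a minimal sketch of what that file usually contains, one cluster IP address or label per line for every node interface (the path and addresses here are illustrative assumptions, not a verified configuration):

# cat /usr/es/sbin/cluster/etc/rhosts
192.168.32.240
192.168.32.241
192.168.32.242
192.168.32.243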
 
According to your clfindres output, both RGs are on clnode01. I guess that is because you have a startup policy of "Online On First Available Node". Have you tried moving cl_group02 to clnode02 manually, via the C-SPOC menu?
 
I would recommend you shut down HACMP on both nodes,
check that rlogin to both nodes works OK, then reboot both nodes, resync the topology & resources, and then start HACMP services.
Probably the two nodes are not able to communicate with each other, so check whether rlogin and rcp are working as expected; a quick test is sketched below.
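
For example, from clnode01 (repeat the mirror test from clnode02; the target file name is just a scratch example):

# rsh clnode02 date
# rcp /etc/hosts clnode02:/tmp/hosts.copytest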



Here comes polani once again!!!

P690 Certified Specialist
HACMP & AIX Certified Specialist
AIX & HACMP Instructor
 
Hi,
thx for the replies.

ggitlin: The rhosts file is created properly (/usr/sbin/cluster/etc/rhosts). I had also created the /.rhosts file (not necessary in HACMP 5.2), but it doesn't solve the problem (rsh commands complete successfully in both directions between the cluster nodes).

polani: there is no problem with communication between the cluster nodes.

Before HACMP started on node1:

clnode01 # rsh clnode02 date
Wed Oct 20 08:04:33 CUT 2004

clnode02 # rsh clnode01 date
Wed Oct 20 08:04:54 CUT 2004



After HACMP is started on node1 (both resource groups/service addresses are configured properly on node1):

clnode01 # rsh clnode02 date
Wed Oct 20 08:08:43 CUT 2004

(the standby on clnode02 is not accessible from clnode01 now because there is no standby subnet on the active node; both Ethernet adapters on clnode01 are configured with service IPs using IPAT via IP replacement)

clnode02 # rsh clnode01_serv date
Wed Oct 20 08:06:51 CUT 2004

clnode02 # rsh clnode02_serv date
Wed Oct 20 08:07:39 CUT 2004


The only communication paths between the cluster nodes when resources are online on clnode01 are:

clnode01: clnode01_serv, clnode02_serv
clnode02: clnode02_boot

...and TMSSA should also serve as a communication path, but it does not, because "network_down -1 tgmssa_net" occurs every time the cluster resources are being configured.
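
One place to see what Topology Services thinks of that network is the daemon status report; a sketch (the output layout varies with the RSCT level), run on the active node:

# lssrc -ls topsvcs

Look for the tgmssa_net entry there and the state of its tmssa adapter.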

=======================================================
LOG START
=======================================================
+ exit 0
HACMP Event Summary
Event: /usr/es/sbin/cluster/events/check_for_site_up_complete clnode01
Start time: Tue Oct 19 16:01:16 2004

End time: Tue Oct 19 16:01:16 2004

Action: Resource: Script Name:
----------------------------------------------------------------------------
No resources changed as a result of this event
----------------------------------------------------------------------------

Oct 19 16:01:18 EVENT START: network_down -1 tgmssa_net

:network_down[62] [[ high = high ]]
:network_down[62] version=1.22
:network_down[63] :network_down[63] cl_get_path
HA_DIR=es
:network_down[65] [ 2 -ne 2 ]
:network_down[77] :network_down[77] cl_rrmethods2call net_cleanup
:cl_rrmethods2call[49] [[ high = high ]]
:cl_rrmethods2call[49] version=1.10
:cl_rrmethods2call[50] :cl_rrmethods2call[50] cl_get_path
HA_DIR=es
:cl_rrmethods2call[63] :cl_rrmethods2call[63] odmget -qname=tgmssa_net HACMPnetwork
:cl_rrmethods2call[63] egrep nimname
:cl_rrmethods2call[63] awk {print $3}
:cl_rrmethods2call[63] sed s/"//g
RRNET=tmssa
:cl_rrmethods2call[63] [[ tmssa = Geo_Primary ]]
:cl_rrmethods2call[70] :cl_rrmethods2call[70] odmget -qtype=2 HACMPrresmethods
:cl_rrmethods2call[70] egrep net_cleanup =
:cl_rrmethods2call[70] sed s/"//g
:cl_rrmethods2call[70] awk {print $3}
RRMETHODS=
:cl_rrmethods2call[72] echo
:cl_rrmethods2call[73] exit 0
METHODS=
:network_down[91] set -u
:network_down[104] exit 0
Oct 19 16:01:18 EVENT COMPLETED: network_down -1 tgmssa_net

HACMP Event Summary
Event: network_down -1 tgmssa_net
Start time: Tue Oct 19 16:01:18 2004

=======================================================
LOG STOP
=======================================================

Does anyone have an idea why the tmssa network goes down? When HACMP is stopped, the TMSSA tests are successful, but after HACMP is started on clnode01, tmssa is busy:

# ls -al /dev/tm*
c-wx------ 1 root system 42, 0 Oct 19 15:56 /dev/tmssa101.im
cr-x------ 1 root system 42, 1 Oct 19 15:56 /dev/tmssa101.tm
cr--r--r-- 1 root system 42,65535 Oct 20 07:45 /dev/tmssako
# cat </dev/tmssa102.tm
The requested resource is busy.
ksh: /dev/tmssa102.tm: 0403-016 Cannot find or open the file.
# echo testtesttest >/dev/tmssa102.im
The requested resource is busy.
ksh: /dev/tmssa102.im: 0403-005 Cannot create the specified file.


When I run it on clnode02, I get:

# cat </dev/tmssa101.tm
LLLLLLLLLLLL


Any idea why tgmssa goes "network_down"?


During synchronization I also get the following warning, which I can't resolve because I don't understand why netmon.cf is linked with TMSSA in this warning:

=================================
WARNING: File 'netmon.cf' is missing or empty on node clnode02. This file is needed for a cluster with the
single-adapter network tgmssa_net. Please create the 'netmon.cf' file on node clnode02 as described in the
'HACMP Planning and Installation Guide'.
=================================
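
For what it's worth, netmon.cf is just a list of extra addresses that Topology Services can ping when deciding whether a single-adapter network is still up; a minimal sketch (the address below is an assumed gateway, not taken from this cluster):

# cat /usr/es/sbin/cluster/netmon.cf
192.168.32.1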


And here is my network configuration:

# cllsif
Adapter         Type      Network      Net Type   Attribute   Node       IP Address       Interface Name   Netmask

clnode01_boot   boot      ether_net    ether      public      clnode01   192.168.32.240   en0              255.255.255.0
clnode01_serv   service   ether_net    ether      public      clnode01   192.168.32.241                    255.255.255.0
clnode01_stby   standby   ether_net    ether      public      clnode01   172.16.32.240    en1              255.255.255.0
tmssa01         service   tgmssa_net   tmssa      serial      clnode01   /dev/tmssa102
clnode02_boot   boot      ether_net    ether      public      clnode02   192.168.32.242   en0              255.255.255.0
clnode02_serv   service   ether_net    ether      public      clnode02   192.168.32.243                    255.255.255.0
clnode02_stby   standby   ether_net    ether      public      clnode02   172.16.32.242    en1              255.255.255.0
tmssa02         service   tgmssa_net   tmssa      serial      clnode02   /dev/tmssa101



Your help is very much appreciated,

r,m.
 
...and here is the configuration of communication interfaces:

Node / Network
    Interface/Device   IP Label/Device Path   IP Address

clnode01 / ether_net
    en0                clnode01_boot          192.168.32.240
    en1                clnode01_stby          172.16.32.240

clnode01 / tgmssa_net
    (none)             tmssa01                /dev/tmssa102

clnode02 / ether_net
    en0                clnode02_boot          192.168.32.242
    en1                clnode02_stby          172.16.32.242

clnode02 / tgmssa_net
    (none)             tmssa02                /dev/tmssa101


r,m.
 
Hi,
If I understand you correctly: when started, target mode SSA will show up as busy because it is heartbeating over the SSA disks as well as the network cards. When you stop HACMP, you can do the tests manually, i.e. the cat SSA tests succeed. This is normal.
Also, after the network_down occurs, does it also report tgmssa_net network up?

Regarding your other problem, it seems as if clcomdES, which is used to talk to the other server when configuring and discovering the resources, has problems, even though you can rlogin to the other server on all interfaces.

When you set up the server and ran a Discovery, were any errors reported, i.e. something like "cannot communicate with second node"? Even though sync returns a success, are any warnings listed?

Is the clcomdES daemon running on both nodes?

You can try running clcomdES with tracing on while you sync.
First remove the old log files in /var/hacmp/clcomd, then run:

Stop:
/usr/es/sbin/cluster/utilities/clcomd_ctrl -n

Turn on tracing:
/usr/es/sbin/cluster/utilities/clcomd_ctrl -t

Start:
/usr/es/sbin/cluster/utilities/clcomd_ctrl -S

Then do a sync and check the log files.
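
To inspect the result, something like the following should work (the file names are the usual HACMP 5.x ones, so treat them as an assumption):

# ls -l /var/hacmp/clcomd
# tail /var/hacmp/clcomd/clcomd.log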

HTH
 
DSMARWAY, thx for your reply.

Here is some info:

1. clcomdES

root@clnode01 /dev/pts/1 /
# lssrc -s clcomdES
Subsystem Group PID Status
clcomdES clcomdES 98480 active


root@clnode02 /dev/pts/0 /
# lssrc -s clcomdES
Subsystem Group PID Status
clcomdES clcomdES 69776 active


2. Cluster started on clnode01: status

root@clnode01 /dev/pts/1 /
# clfindres
-----------------------------------------------------------------------------
Group Name      Type             State      Location
-----------------------------------------------------------------------------
APPL2           non-concurrent   OFFLINE    clnode02
                                 ONLINE     clnode01

APPL1           non-concurrent   ONLINE     clnode01
                                 OFFLINE    clnode02


3. /tmp/cm.log on clnode01

# cat /tmp/cm.log
Oct 21 08:44:16 EVENT START: node_up clnode01
Oct 21 08:44:18 EVENT START: node_up_local
Oct 21 08:44:18 EVENT START: acquire_service_addr clnode01_serv
Oct 21 08:44:26 EVENT START: acquire_aconn_service en0 ether_net
Oct 21 08:44:26 EVENT START: swap_aconn_protocols en0 en1
Oct 21 08:44:26 EVENT COMPLETED: swap_aconn_protocols en0 en1
Oct 21 08:44:26 EVENT COMPLETED: acquire_aconn_service en0 ether_net
Oct 21 08:45:12 EVENT COMPLETED: acquire_service_addr clnode01_serv
Oct 21 08:45:12 EVENT START: get_disk_vg_fs
Oct 21 08:45:12 EVENT COMPLETED: get_disk_vg_fs
Oct 21 08:45:12 EVENT COMPLETED: node_up_local
Oct 21 08:45:12 EVENT START: node_up_local
Oct 21 08:45:13 EVENT START: acquire_takeover_addr clnode02_serv
Oct 21 08:45:21 EVENT COMPLETED: acquire_takeover_addr clnode02_serv
Oct 21 08:45:21 EVENT START: get_disk_vg_fs
Oct 21 08:45:21 EVENT COMPLETED: get_disk_vg_fs
Oct 21 08:45:21 EVENT COMPLETED: node_up_local
Oct 21 08:45:21 EVENT COMPLETED: node_up clnode01
Oct 21 08:45:21 EVENT START: node_up_complete clnode01
Oct 21 08:45:22 EVENT START: node_up_local_complete
Oct 21 08:45:22 EVENT START: start_server FCDB
Oct 21 08:45:23 EVENT COMPLETED: start_server FCDB
Oct 21 08:45:23 EVENT COMPLETED: node_up_local_complete
Oct 21 08:45:23 EVENT START: node_up_local_complete
Oct 21 08:45:23 EVENT START: start_server APP1
Oct 21 08:45:24 EVENT COMPLETED: start_server APP1
Oct 21 08:45:24 EVENT COMPLETED: node_up_local_complete
Oct 21 08:45:24 EVENT COMPLETED: node_up_complete clnode01
Oct 21 08:45:29 EVENT START: network_down -1 tgmssa_net
Oct 21 08:45:29 EVENT COMPLETED: network_down -1 tgmssa_net
Oct 21 08:45:29 EVENT START: network_down_complete -1 tgmssa_net
Oct 21 08:45:30 EVENT COMPLETED: network_down_complete -1 tgmssa_net


4. No errors were reported during the initial discovery.
During synchronization, the only warning reported is as follows:

=================================
WARNING: File 'netmon.cf' is missing or empty on node clnode02. This file is needed for a cluster with the
single-adapter network tgmssa_net. Please create the 'netmon.cf' file on node clnode02 as described in the
'HACMP Planning and Installation Guide'.
=================================


Any ideas?

thx in advance,
r,m.
 
There is an error reported in the system error log, linked to the topsvcs subsystem. It lists:

Failure Causes:
Lack of 'mbufs'
Network is down
I/O errors while accessing heartbeating device
Remote side is not present

But the manual TMSSA communication test with the cluster stopped completes successfully:

root@clnode01 /dev/pts/2 /tmp
# ls -al /dev/tm*
c-wx------ 1 root system 43, 0 Oct 21 10:49 /dev/tmssa102.im
cr-x------ 1 root system 43, 1 Oct 20 15:43 /dev/tmssa102.tm
cr-------- 1 root system 43,65535 Oct 20 15:43 /dev/tmssako
root@clnode01 /dev/pts/2 /tmp
# echo just a simple test >/dev/tmssa102.im
root@clnode01 /dev/pts/2 /tmp
#

root@clnode02 /dev/pts/0 /
# ls -al /dev/tm*
c-wx------ 1 root system 42, 0 Oct 21 10:06 /dev/tmssa101.im
cr-x------ 1 root system 42, 1 Oct 21 10:06 /dev/tmssa101.tm
cr-------- 1 root system 42,65535 Oct 21 10:06 /dev/tmssako
root@clnode02 /dev/pts/0 /
# cat </dev/tmssa101.tm
just a simple test



# errpt -a|more
---------------------------------------------------------------------------
LABEL: TS_NIM_ERROR_RDWR_E
IDENTIFIER: 90D3329C

Date/Time: Thu Oct 21 10:39:32 CUT
Sequence Number: 480
Machine Id: 0001355F3D00
Node Id: clnode01
Class: S
Type: PERM
Resource Name: topsvcs

Description
NIM read/write error

Probable Causes
Topology Services Network Interface Module (NIM) error:
Read error while trying to retrieve packets
Write error while trying to send packets

Failure Causes
Lack of 'mbufs'
Network is down
I/O errors while accessing heartbeating device
Remote side is not present

Recommended Actions
Correct device or network problem
Call IBM Service if problem persists

Detail Data
DETECTING MODULE
rsct,nim_control.C,1.37,5460
ERROR ID
6Q8noE0Y5tR//UYf/k10e.1...................
REFERENCE CODE

1: read operation 0: write operation
0
Error detailed information
value1: error count value2: errno
Error data 1
200
Error data 2
69
Interface name
ssa102
---------------------------------------------------------------------------
LABEL: TS_NIM_ERROR_RDWR_E
IDENTIFIER: 90D3329C

Date/Time: Thu Oct 21 10:36:26 CUT
Sequence Number: 479
Machine Id: 0001355F3D00
Node Id: clnode01
Class: S
Type: PERM
Resource Name: topsvcs

Description
NIM read/write error

Probable Causes
Topology Services Network Interface Module (NIM) error:
Read error while trying to retrieve packets
Write error while trying to send packets

Failure Causes
Lack of 'mbufs'
Network is down
I/O errors while accessing heartbeating device
Remote side is not present

Recommended Actions
Correct device or network problem
Call IBM Service if problem persists

Detail Data
DETECTING MODULE
rsct,nim_control.C,1.37,5460
ERROR ID
6Q8noE0e2tR//zg70k10e.1...................
REFERENCE CODE

1: read operation 0: write operation
0
Error detailed information
value1: error count value2: errno
Error data 1
50
Error data 2
69
Interface name
ssa102
---------------------------------------------------------------------------




root@clnode01 /dev/pts/2 /tmp
# lssrc -Ss topsvcs
#subsysname:synonym:cmdargs:path:uid:auditid:standin:standout:standerr:action:multi:contact:svrkey:svrmtype:priority:signorm:sigforce:display:waittime:grpname:
topsvcs:::/usr/sbin/rsct/bin/topsvcs:0:0:/dev/console:/var/ha/log/topsvcs.default:/var/ha/log/topsvcs.default:-O:-Q:-K:0:0:20:0:0:-d:30:topsvcs:
root@clnode01 /dev/pts/2 /tmp
# tail -20 /var/ha/log/topsvcs.default
---- /usr/es/sbin/cluster/utilities/clrsctinfo -p cllsnim -c ----
#name:desc:addrtype:path:para:grace:hbrate:cycle:custom_hbrate:gratarp:entry_type:next_generic_type:next_generic_name:src_routing
ether:Ethernet Protocol:0:/usr/sbin/rsct/bin/hats_nim::60:2:10:1000000:1:adapter_type:transport:Generic_UDP:1
token:Token Ring Protocol:0:/usr/sbin/rsct/bin/hats_nim::90:2:10:1000000:1:adapter_type:transport:Generic_UDP:1
fddi:Fiber Data Optical Protocol:0:/usr/sbin/rsct/bin/hats_nim::60:2:10:1000000:1:adapter_type:transport:Generic_UDP:1
hps:High Performance Switch:0:/usr/sbin/rsct/bin/hats_nim::60:2:10:1000000:1:adapter_type:transport:Generic_UDP:1
atm:Asynchronous Transfer Mode Protocol:0:/usr/sbin/rsct/bin/hats_nim::90:2:10:1000000:0:adapter_type:transport:Generic_UDP:1
rs232:RS232 Serial Protocol:1:/usr/sbin/rsct/bin/hats_rs232_nim::60:2:5:2000000:0:adapter_type:transport::0
tmscsi:TMSCSI Serial protocol:1:/usr/sbin/rsct/bin/hats_scsi_nim::60:2:5:2000000:0:adapter_type:transport::0
tmssa:TMSSA Serial protocol:1:/usr/sbin/rsct/bin/hats_ssa_nim::60:2:5:2000000:0:adapter_type:transport::0
diskhb:Disk Heartbeating Protocol:1:/usr/sbin/rsct/bin/hats_diskhb_nim::30:0:4:2000000:0:adapter_type:transport::0

---- /usr/es/sbin/cluster/utilities/clrsctinfo -p clhandle -ac ----
3:clnode01
4:clnode02

---- /usr/es/sbin/cluster/utilities/clrsctinfo -p cllsnw -Sc ----
ether_net:public:disable::clnode01:clnode01_serv:clnode01_stby:::clnode02:clnode02_serv:clnode02_stby::
tgmssa_net:serial:unknown::clnode01:tmssa01:::clnode02:tmssa02::
 
Hi!

I had the same problem.

In the script /usr/es/sbin/cluster/utilities/cl_rsh
I changed the call to /usr/es/sbin/cluster/utilities/clrsh into a call to /usr/bin/rsh, and the problem was gone.

It seems that clrsh doesn't allow some commands.
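
A cautious sketch of the change (keep a backup; the exact line differs between HACMP levels, so this is an outline rather than a verified patch):

# cp /usr/es/sbin/cluster/utilities/cl_rsh /usr/es/sbin/cluster/utilities/cl_rsh.orig
# vi /usr/es/sbin/cluster/utilities/cl_rsh
    (change the line that invokes /usr/es/sbin/cluster/utilities/clrsh
     so that it invokes /usr/bin/rsh with the same arguments)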

best regards
Michael
 
Hi mreich,
the problem with clrsh lies in the user it uses for remote login, which is "nobody".
Take a look at /usr/es/sbin/cluster/cspoc/cdsh, which is perfect for running something on a cluster node.
 
Hi,

Just wanted to know whether your resource groups are associated with the right service IPs. You have two resource groups and, according to you, the node priority is defined, but did you associate each resource group with the service IP on which you want the resources to be available? Please check this; a quick way to list it is sketched below.
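
A quick way to list that association (the clshowres output field names may differ slightly between versions, so treat the grep pattern as an assumption):

# /usr/es/sbin/cluster/utilities/clshowres | egrep "Resource Group Name|Service IP Label"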

Surender dutt
 
The problem has already been solved. I got a tip from IBM: they detected that the command below fails when executed on clnode02...

/usr/es/sbin/cluster/utilities/cl_rsh clnode01 /bin/odmget HACMPtopsvcs

The mistake I made was installing HACMP on clnode01 before the user accounts were migrated from clnode02. (Before the migration, the "hacmp" account had different UIDs and GIDs on the two nodes; after the migration, the "hacmp" files on clnode01 no longer belonged to the "hacmp" user, i.e. they belonged to no one.)
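
For anyone who hits the same thing, a hedged way to spot it: compare the account on both nodes and look for orphaned files, e.g.

# lsuser -a id pgrp hacmp
    (run on both nodes and compare the values)
# find /usr/es/sbin/cluster -nouser -ls
    (lists files that no longer belong to any known user)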

r, m.


 