HA error

kitokato · Sep 5, 2008

For the past few weeks we have been getting a weird HA error.

There are errors on the Events tab:
Ha agent on XXXESX1 in cluster XXX in cluster XXX in datacenterXXX has an error.

After that there are messages such as:
Insufficient resources to satisfy HA failover on cluster XXX in datacenterXXX
Ha agent on XXXESX2 in cluster XXX in cluster XXX in datacenterXXX has an error.
Unable to contact a primary HA agent in cluster XXX in datacenterXXX

After this the virtual machines are disconnected (they keep running, though).

An SSH console connection to the ESX host gives an error:
resource temporary unavailable

It also seems that this happens at fairly regular intervals.
At least the times at which the first ESX server has reported the error seem to be roughly the same (around 4:50 am).
There are roughly 4-5 days between recurrences.

Currently the only fix we have is to reboot both ESX servers.

We have a setup of one VirtualCenter server (version 2.5) and two ESX servers (version 3.5).
We have only a few virtual machines currently running.

I'm happy to provide any additional information.
Thanks for all your help.

TheLad · Sep 5, 2008

Have you got the required entries in the hosts files on both ESX servers? Also, if you disable and re-enable HA, what happens?

--------------------------------------
"Insert funny comment in here!"
--------------------------------------

kitokato · Sep 5, 2008

Host entries are as follows:
ESX1 has:
127.0.0.1 localhost.localdomain localhost
192.168.118.46 esx1.domain.com esx1

ESX2 has basically the same, but with ESX2 information.

We have tried to reconfigure HA on the host. It gives the following events:
1. Ha is being disabled on ESX2 in cluster XXX in datacenter XXX.
2. Error detected on ESX2 in XXX : cmd remove failed
3. Unable to contact a primary HA agent in cluster XXX in XXX

nhidalgo · Sep 5, 2008

Stop HA on both ESX servers, then re-enable it on both. I had a similar problem. Also make sure each host can ping the other by name.
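For anyone wanting to script the "by name" part of that check, here is a minimal Python sketch. It only tests whether each name resolves to an address; it does not send a ping, so reachability still needs to be checked separately. The function name `check_names` and the injectable `resolver` parameter are my own illustration, not anything shipped with ESX.

```python
import socket


def check_names(names, resolver=socket.gethostbyname):
    """Return {name: ip_or_None}; None means the name did not resolve."""
    results = {}
    for name in names:
        try:
            results[name] = resolver(name)
        except OSError:
            # socket.gaierror (lookup failure) is a subclass of OSError.
            results[name] = None
    return results
```

Calling something like `check_names(["esx1", "esx1.domain.com"])` on each host, with the other host's names added, gives a quick view of which short and long names fail to resolve.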

kitokato · Sep 5, 2008

Just a follow-up...
We have called VMware support and it seems the problem was caused by Pegasus.
According to VMware, Pegasus is used for the ESX Health Status feature. It seems that Pegasus kept launching new processes on the ESX servers. Stopping the Pegasus service and disabling it from autostarting allowed us to restart HA, and it seems to be working OK for the moment.
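A rough way to spot this kind of process build-up is to count processes per command name. The Python sketch below parses text captured with `ps -e -o comm=` and reports any command at or above a threshold. The sample listing and the threshold are made up for illustration, and the process name `cimserver` for Pegasus is an assumption on my part, so check what the processes are actually called on your hosts.

```python
from collections import Counter


def count_commands(ps_output):
    """Count processes per command name from `ps -e -o comm=` output."""
    return Counter(
        line.strip() for line in ps_output.splitlines() if line.strip()
    )


def suspicious(ps_output, threshold):
    """Return {command: count} for commands seen at least `threshold` times."""
    counts = count_commands(ps_output)
    return {cmd: n for cmd, n in counts.items() if n >= threshold}


# Hypothetical sample, not real output from the hosts in this thread.
SAMPLE = "init\nsshd\ncimserver\ncimserver\ncimserver\n"
print(suspicious(SAMPLE, threshold=3))
```

A host running out of process slots would also be one plausible explanation for the "resource temporary unavailable" message on SSH login, though the thread does not confirm that.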

The reason for the odd Pegasus behavior is currently unknown and they are looking into it. If and when it gets resolved, I'll post the results here.

TheLad · Sep 5, 2008

Can your ESX servers ping each other by short name and long name (i.e. by esx1 and by esx1.domain.com as well)?

VMware recommends that if you are using the hosts file for resolution, you include both the short name and the long name (as you have done), with entries for all ESX servers.
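As a sketch of that check, the Python below parses hosts-file text and lists any short or long names that have no entry. The sample text is the ESX1 hosts file quoted earlier in this thread; the function names and the assumption that the second host is called esx2 in the same domain are mine, for illustration only.

```python
def parse_hosts(text):
    """Map each hostname or alias in hosts-file text to its IP address."""
    mapping = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and blank lines
        if not line:
            continue
        fields = line.split()
        ip, names = fields[0], fields[1:]
        for name in names:
            mapping.setdefault(name.lower(), ip)
    return mapping


def missing_entries(text, short_names, domain):
    """List the short and long names that have no entry in the hosts text."""
    known = parse_hosts(text)
    missing = []
    for short in short_names:
        for name in (short, f"{short}.{domain}"):
            if name.lower() not in known:
                missing.append(name)
    return missing


# The ESX1 hosts file as posted above.
ESX1_HOSTS = """\
127.0.0.1 localhost.localdomain localhost
192.168.118.46 esx1.domain.com esx1
"""

# "esx2" is an assumed name for the second host.
print(missing_entries(ESX1_HOSTS, ["esx1", "esx2"], "domain.com"))
# prints ['esx2', 'esx2.domain.com']
```

An empty list on every host would mean each hosts file carries both name forms for all ESX servers in the cluster.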

--------------------------------------
"Insert funny comment in here!"
--------------------------------------

thalligan · Sep 6, 2008

If you are running HP hardware and agents with ESX 3.5 Update 2, you might want to check this out:

http://communities.vmware.com/thread/159728?tstart=0
