
HA error

Status
Not open for further replies.

kitokato

Technical User
Aug 31, 2004
5
FI
For the past few weeks we have been getting a weird HA error.

The Events tab shows this error:
Ha agent on XXXESX1 in cluster XXX in cluster XXX in datacenterXXX has an error.

After that, the following messages appear:
Insufficient resources to satisfy HA failover on cluster XXX in datacenterXXX
Ha agent on XXXESX2 in cluster XXX in cluster XXX in datacenterXXX has an error.
Unable to contact a primary HA agent in cluster XXX in datacenterXXX

After this, the virtual machines show as disconnected (they keep running, though).

An SSH console connection to the ESX host gives an error:
resource temporary unavailable

It also seems that this happens at fairly regular intervals.
At least, the time at which the first ESX server reports the error is roughly the same each time (about 4:50 am).
There are roughly 4-5 days between occurrences.

So far, the only fix we have found is to reboot both ESX servers.

Our setup is 1 VirtualCenter server (ver 2.5) and 2 ESX servers (ver 3.5).
We have only a few virtual machines currently running.

I'm happy to provide any additional information.
Thanks for all your help.
 
Have you got the required entries in the hosts files on both ESX servers? Also, what happens if you disable and re-enable HA?

--------------------------------------
"Insert funny comment in here!"
--------------------------------------
 
Host entries are as follows:
ESX1 has:
127.0.0.1 localhost.localdomain localhost
192.168.118.46 esx1.domain.com esx1

ESX2 has basically the same, but with ESX2 information.

We have tried to reconfigure HA on the host. It produces the following events:
1. Ha is being disabled on ESX2 in cluster XXX in datacenter XXX.
2. Error detected on ESX2 in XXX : cmd remove failed
3. Unable to contact a primary HA agent in cluster XXX in XXX



 
Stop HA on both ESX servers, then re-enable it on both. I had a similar problem. Also make sure each server can ping the other by name.
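As a rough illustration of the "by name" check, here is a minimal Python sketch that tests whether a list of host names resolves. It only checks name resolution, not reachability, so it complements ping rather than replacing it. The function name `resolve_all` and the injectable `resolver` parameter are assumptions made for this example, and the script is assumed to run on a machine with Python 3 that uses the same name resolution as the ESX hosts.

```python
import socket

def resolve_all(names, resolver=socket.gethostbyname):
    """Return a dict mapping each name to its IPv4 address, or None if it does not resolve."""
    results = {}
    for name in names:
        try:
            results[name] = resolver(name)
        except OSError:
            # socket.gaierror (lookup failure) is a subclass of OSError
            results[name] = None
    return results
```

Calling `resolve_all(["esx1", "esx1.domain.com"])` with your own short and long names on each server should return an address for every entry; any `None` value points at a name that server cannot resolve.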
 
Just a follow-up...
We called VMware support, and it seems the problem was caused by Pegasus.
According to VMware, Pegasus is used for the ESX Health Status feature. It seems that Pegasus kept launching new processes on the ESX servers. Stopping the Pegasus service and disabling it from autostarting allowed us to restart HA, and it seems to be working OK for the moment.
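For anyone wanting to spot this kind of process build-up before the host stops accepting logins, here is a minimal Python sketch that counts processes per command name. It assumes the text comes from `ps -e -o comm=` (one command name per line) captured on the host and analysed wherever Python 3 is available. The function names and the threshold of 50 are arbitrary choices for illustration, and the thread does not say which process names actually appeared on the affected servers.

```python
from collections import Counter

def count_by_command(ps_output):
    """Count processes per command name from `ps -e -o comm=` style output."""
    return Counter(
        line.strip() for line in ps_output.splitlines() if line.strip()
    )

def suspicious(counts, threshold=50):
    """Return (command, count) pairs with at least `threshold` instances, largest first."""
    return [(cmd, n) for cmd, n in counts.most_common() if n >= threshold]
```

Comparing the counts from two captures taken some hours apart would show whether any single command keeps growing.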

The reason for Pegasus's odd behavior is currently unknown, and they are looking into it. If and when it gets resolved, I'll post the results here.
 
Can your ESX servers ping each other by short name and long name (i.e. by esx1 and by esx1.domain.com as well)?

VMware recommends that if you are using the hosts file for resolution, you include both the short name and the long name (as you have done), and that each file contains entries for all ESX servers.
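A minimal Python sketch of that check follows: given the text of a hosts file, it lists which short and long names are missing. The function names `parse_hosts` and `missing_names` are made up for this example, and it assumes the usual hosts layout of an IP address followed by one or more names, with `#` starting a comment.

```python
def parse_hosts(text):
    """Return the set of all host names (lowercased) defined in hosts-file text."""
    names = set()
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line:
            continue
        fields = line.split()
        names.update(n.lower() for n in fields[1:])  # fields[0] is the IP address
    return names

def missing_names(text, short_names, domain):
    """List the short and long names that the hosts text does not define."""
    present = parse_hosts(text)
    wanted = []
    for short in short_names:
        wanted += [short.lower(), f"{short}.{domain}".lower()]
    return [name for name in wanted if name not in present]
```

Run against the ESX1 file as posted, `missing_names(text, ["esx1", "esx2"], "domain.com")` would report `esx2` and `esx2.domain.com` as missing, which is the gap described above.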

--------------------------------------
"Insert funny comment in here!"
--------------------------------------
 