BrianCharlton
IS-IT--Management
I know this isn't Microsoft clustering, but I thought some of you guys may have had experience of using Legato Co-Standby Server 2000 (formerly Vinca) on Windows 2000.
We have a 2-node cluster armed to fail over if it loses either of its network connections (one onto the production LAN and one onto a dedicated VLAN) or its mirrored data drive.
Over the last few days there have been telecoms-related "blips" on the VLAN.
Although these outages have been short-lived - up to 50 secs max - the clustering software has recognised the failure and initiated a failover, which only seems to succeed in leaving the whole cluster inoperable once the service is restored.
Basically what happens is:
Primary server detects that it has lost network connectivity onto the VLAN or production LAN.
Primary server checks it can still contact its partner via the dedicated link.
If it can contact the secondary, and the secondary is able to ping a majority of its ping list, it deems the primary node to have failed (on some level) and initiates failover.
The server being failed over to attempts to bring its network interfaces up with the cluster address attached, only to find that the network has relearnt all its routes and this address is already in use by what is the failing primary.
As the IP address is already in use, the server being failed over to then disables its network connection, leaving the cluster running fully on neither server.
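To make the sequence above easier to follow, here is a rough Python model of it. This is only my reading of the behaviour we observe, not anything taken from Co-Standby Server itself: the `Node` class, `replay_failover` and all field names are hypothetical, and I am assuming the primary stops serving the cluster as soon as failover starts.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    # Hypothetical model of one cluster node; not a Co-Standby Server structure.
    name: str
    has_cluster_ip: bool = False
    runs_services: bool = False
    nic_enabled: bool = True

    def serving(self) -> bool:
        # A node only serves clients if it holds the address, runs the
        # services and still has a working network connection.
        return self.has_cluster_ip and self.runs_services and self.nic_enabled


def replay_failover(link_restored_before_takeover: bool) -> Optional[str]:
    """Replay the observed sequence; return the serving node's name, or None."""
    primary = Node("primary", has_cluster_ip=True, runs_services=True)
    secondary = Node("secondary")

    # The primary is deemed failed and failover starts
    # (assumption: services move to the secondary at this point).
    primary.runs_services = False
    secondary.runs_services = True

    # If the blip has ended, the network has relearnt its routes and the
    # cluster address is visible again on the failing primary.
    address_in_use = link_restored_before_takeover

    if address_in_use:
        # Address conflict: the secondary disables its network connection.
        secondary.nic_enabled = False
    else:
        # Genuine outage: the secondary takes over the cluster address.
        primary.has_cluster_ip = False
        secondary.has_cluster_ip = True

    for node in (primary, secondary):
        if node.serving():
            return node.name
    return None
```

In this model `replay_failover(True)` (a short blip) ends with neither node serving, which matches what we see, while `replay_failover(False)` (a real, lasting outage) ends with the secondary in charge.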
As you can see, this leaves the cluster in a mess, so we have temporarily had to disable automatic failover until the network issue is resolved.
Ideally, however, we would like to look at the config of the cluster itself to try to prevent a temporary blip causing this type of issue. Nothing appears to be configurable within the GUI. The registry does have a heartbeat value, which can be pushed out from 5 secs, but this doesn't really get to the crux of the issue, which is about ensuring the cluster waits a short period before initiating a failover and then confirms the fault is still present. Adding a timer merely delays the failover - it doesn't check to see if service has been restored.
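To illustrate the difference between a plain delay and what we are after, here is a minimal Python sketch of a grace period that keeps re-checking the link and cancels the failover if service comes back. It is purely illustrative: `should_fail_over`, `link_is_up` and the 60/5 second defaults are my own hypothetical names and values, and I don't know of any such setting in Co-Standby Server.

```python
import time
from typing import Callable


def should_fail_over(
    link_is_up: Callable[[], bool],
    grace_seconds: float = 60.0,   # assumed value, a little above our 50 sec blips
    poll_seconds: float = 5.0,     # assumed value, matching the 5 sec heartbeat
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Return True only if the link stays down for the whole grace period."""
    deadline = clock() + grace_seconds
    while clock() < deadline:
        if link_is_up():
            # Service restored during the grace period: cancel the failover.
            return False
        sleep(poll_seconds)
    # Final re-check so a recovery during the last sleep is not missed.
    return not link_is_up()
```

With a check like this, a 50 sec blip would be ridden out, whereas a simple timer would still fail over once it expired, even though the link had already recovered.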
Any help would be very gratefully received.