Well, it depends on how much you want to spend eliminating single points of failure (SPOFs)!
As grepper said above, yes it is better, but it also depends on how your switch is configured to operate. On some switches, for example, a failed port can be replaced on the fly by another port without the server connected to it even noticing, so that takes care of port failures. But what if the whole switch fails? Then you have a communication problem. If you have two switches (with two ethernet adapters configured, one standing by for the other), then if one switch fails the other ethernet adapter will still work (via an adapter swap).
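On AIX, that standby-adapter arrangement is usually built as an EtherChannel in Network Interface Backup mode: one adapter active, one on standby, and an address to ping so that a dead switch is detected end to end. A rough sketch, where the adapter names and the ping address are just example values:

    # ent0 active, ent1 standby; netaddr is an address to ping
    # (the default gateway is a common choice)
    mkdev -c adapter -s pseudo -t ibm_ech \
          -a adapter_names=ent0 \
          -a backup_adapter=ent1 \
          -a netaddr=10.1.1.1

    # Put the IP on the resulting EtherChannel interface (e.g. en2),
    # not on ent0/ent1 directly.

Cable ent0 to one switch and ent1 to the other, and a switch failure just becomes an adapter swap.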
But even then, what if the power feeding both switches fails? Then you are back to square one.
So it all comes down to how much you want to spend on the best design your budget allows.
I don't know what you are using for heartbeating at your site (most configurations use disk heartbeating).
Yes, you can use an RS232 cable for extra heartbeating, but I don't think you need it if you already have disk heartbeating in place,
because you will have heartbeating over the ethernet networks anyway; the disk and the serial cable are just additional heartbeat paths to clear the SPOF.
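If you want to see which heartbeat rings you already have, Topology Services will list them all, IP and non-IP:

    # Show the Topology Services status, including one entry
    # per network that is carrying heartbeats
    lssrc -ls topsvcs

The exact output format varies with the HACMP/RSCT level, but every heartbeat network should show up there.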
The main reason for having a second type of heartbeating (disk or serial) is to avoid something called a partitioned cluster. This occurs when the ethernet adapters on the cluster nodes fail to communicate with each other for some reason (for example, a failed switch connecting the nodes) and there is no other heartbeat path (disk or serial): each node then thinks the other node has failed, so each tries to take over the resources. Since each node is in fact still alive, the applications end up running simultaneously on all nodes of the cluster. If the shared disks are also online to both nodes, the result can be a quite massive data corruption problem!
So it is always a good idea to have what they call a non-IP network (disk or serial) carrying heartbeats as well, to avoid ending up with a partitioned cluster.
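By the way, once a disk heartbeat is defined you can test the path with dhb_read before relying on it; start receive mode on one node, then transmit from the other (hdisk3 here is just an example device name):

    # On node1: listen on the heartbeat disk
    /usr/sbin/rsct/bin/dhb_read -p hdisk3 -r

    # On node2: transmit across the same disk
    /usr/sbin/rsct/bin/dhb_read -p hdisk3 -t

If the non-IP path is good, both ends should report that the link is operating normally.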
Basically, I have 3 heartbeats, all IP based, going to 2 different switches. Even though my switch has redundancy built in, when something happens to it there is a minor outage (which takes me off the network for about 1 second), and in the errpt I see all three interfaces go into 'recovery mode'.
Therefore, I was thinking of adding a serial-based heartbeat. What are the implications of adding a disk heartbeat instead? Is it tough to manage? Do I need an extra LUN on the SAN side (my shared storage) for it? For some reason I have the idea that HACMP is easier to maintain with a serial heartbeat.
As Khalidaaa said, the purpose of serial-based and disk-based heartbeats is to avoid a partitioned cluster, that is, both nodes are up but each node thinks the other is down, so each node tries to acquire the disks and the IPs (hope you never suffer that).
But neither of these heartbeats increases IP availability, so you will continue to have those "minor outages".
Personally, I prefer a disk heartbeat over a serial heartbeat. You don't need an extra LUN, it's easy to create and maintain, and it uses the same fibre cables as the shared storage, so that's one cable fewer (the serial one).
In addition, in an LPAR environment, LPARs can't use the integrated serial adapter, so you would need an extra adapter to use a serial heartbeat.
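The setup itself is small: an enhanced concurrent volume group on one of the shared disks, then a diskhb network defined over it in the topology. Roughly, assuming hdisk3 is a shared disk seen by both nodes (the names are examples, and the smit screens differ a bit between HACMP levels):

    # On node1: create an enhanced concurrent-capable VG on the shared disk
    # (needs the bos.clvm fileset installed)
    mkvg -n -C -y diskhbvg hdisk3
    varyoffvg diskhbvg

    # On node2: import the same VG (check the hdisk name there with lspv)
    importvg -y diskhbvg hdisk3
    varyoffvg diskhbvg

    # Then define the diskhb network and its two devices in the
    # extended topology configuration (smitty hacmp)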
I am not too worried about IP availability (that's a separate team). I only worry about HACMP and the application running on the server.
Recently I did have a partitioned cluster. Node1 was in the process of failing over to Node2, the IP came online, and Node2 hit the dead man switch ('split brain') and issued a halt -q. Even though not all of Node1's resources moved to Node2, the application was still running, but it caused an outage during the day.
So I was thinking: if I had had a serial heartbeat in that situation, my system would have kept responding, correct?
If Node2 performed a halt -q, that usually means the definitions and configuration are not in sync between the two nodes, so the node that detects the mismatch halts itself.
In that case it's possible a non-IP heartbeat wouldn't have helped much; it depends on what caused Node1 to fail over to Node2. If Node1 fails over because it has lost its disks or its connectivity, and Node2 halts itself, Node1 would be aware of it but might still be unable to reacquire the resource group. Without a non-IP heartbeat, though, Node1 would never even know that Node2 had halted.
In any case, it's always better to have a non-IP heartbeat; I normally tell my clients that it's required, not just desirable.
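A quick way to check whether a cluster already has one is to list the topology; on HACMP 5.x something like:

    # List the cluster networks; you want to see a non-IP network
    # (diskhb or rs232) in addition to the ether ones
    /usr/es/sbin/cluster/utilities/cltopinfo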
Yeah Mag007, what happened to you is really a split-brain problem, and a recovery program halted the second node; you can find the halt -q in the clexit.rc recovery script.
When you have a partitioned cluster, the nodes on each side of the partition detect this and run a node_down for the nodes on the opposite side. If, while running this or after communication is restored, the two sides of the partition do not agree on which nodes are still members of the cluster, a decision is made as to which partition should remain up, and the other partition is halted by a Group Services merge from nodes in the surviving partition, or by a node sending a GS merge itself (the partition with the lowest node number in the cluster remains, which is generally the first node in alphabetical order).
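If you want to see which node holds the lowest number in your own cluster (i.e. which side should survive a merge), the node ids live in the HACMP ODM, and Group Services has its own view of the membership. As a sketch, with class names from HACMP 5.x, so check your level:

    # Node definitions, including the node id/handle
    odmget HACMPnode

    # Group Services' view of the current membership
    lssrc -ls grpsvcs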
As morefeo said, it is easier to do disk heartbeating than serial; you don't need anything beyond the existing links to your disks.
I don't have access to my workplace at the moment to get on the servers and write up the steps, but I will do so once I get in there.