HACMP --theory question 1

Status
Not open for further replies.

mag007

IS-IT--Management
Nov 8, 2006
99
US
I have a HACMP cluster, 2 nodes.

It seems both of the nodes share the same network switch. Is it a good idea to have 2 separate network switches, in case of failure?

 
Ideally yes. It gives you another point of redundancy.
 
Well, it depends on how much you want to spend on eliminating single points of failure (SPOFs)!

As grepper said above, yes, it is better, but it also depends on how your switch is configured. On some switches, for example, a failed port can be replaced on the fly by another port without the attached server even noticing, so that takes care of port failures. But what if the whole switch fails? Then you have a communication problem. With two switches (and two Ethernet adapters configured, one as standby for the other), if one switch fails the other adapter keeps working through an adapter swap.

But even then, what if the power feeding both switches fails? Then you are back to square one :)

So it all depends on how much you want to spend to get the best design within your budget.
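
A rough way to reason about this, as a minimal Python sketch rather than anything HACMP itself provides: list the components each network path between the two nodes depends on, and treat a component as a SPOF if every path depends on it. The component names (`switchA`, `power1`, and so on) are made up for illustration.

```python
def single_points_of_failure(paths):
    """Return the components that every path depends on.

    Each path is the set of components (switches, power feeds, ...)
    that one network route between the cluster nodes needs in order to work.
    """
    if not paths:
        return []
    common = set.intersection(*(set(p) for p in paths))
    return sorted(common)


# Both adapters go through the same switch: switch and power are SPOFs.
one_switch = [{"switchA", "power1"}, {"switchA", "power1"}]

# Two switches on the same power feed: only the power is still a SPOF.
two_switches_one_feed = [{"switchA", "power1"}, {"switchB", "power1"}]

# Two switches on separate power feeds: no shared component left.
two_switches_two_feeds = [{"switchA", "power1"}, {"switchB", "power2"}]

print(single_points_of_failure(one_switch))
print(single_points_of_failure(two_switches_one_feed))
print(single_points_of_failure(two_switches_two_feeds))
```

Each extra layer of redundancy removes one entry from the list, which is the budget trade-off described above.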

Regards,
Khalid
 
How about adding an RS-232 cable as a heartbeat to the mix? Would that beat the SPOF?
 
I don't know what you are using for heartbeating at your site (most configurations use disk heartbeating).

Yes, you can use an RS-232 cable for extra heartbeating, but I don't think you need it if you already have disk heartbeating in place.

You will have heartbeating over the Ethernet adapters anyway, so the disk or the serial cable is just another heartbeat path to remove the SPOF.

The main reason for a second type of heartbeat (disk or serial) is to avoid what is called a partitioned cluster. This occurs when the Ethernet adapters on the cluster nodes can no longer communicate with each other for some reason (for example, the switch connecting the nodes fails) and there is no other heartbeat path (disk or serial). Each node then thinks the other has failed, so each tries to take over the resources. Since every node is in fact still alive, the applications end up running simultaneously on all nodes of the cluster. If the shared disks are also online to both nodes, the result can be massive data corruption.

So it is always a good idea to have what they call a non-IP network (disk or serial) carrying heartbeats as well, to avoid getting into a partitioned cluster.
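
To make the condition concrete, here is a toy Python model (an illustration only, not how HACMP implements it): the cluster is partitioned when both nodes are alive but no heartbeat path between them works. The path names are hypothetical.

```python
def is_partitioned(both_nodes_alive, paths_up):
    """True when both nodes are alive yet no heartbeat path is working.

    paths_up maps a heartbeat path name to True (working) or False (failed).
    """
    return both_nodes_alive and not any(paths_up.values())


# The shared switch has failed, so every IP heartbeat path is down.
ip_only = {"ethernet_a": False, "ethernet_b": False}

# Same switch failure, but a non-IP disk heartbeat is still working.
ip_plus_disk = {"ethernet_a": False, "ethernet_b": False, "disk_hb": True}

print(is_partitioned(True, ip_only))       # True: each node assumes the other is dead
print(is_partitioned(True, ip_plus_disk))  # False: the nodes still hear each other
```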

Regards,
Khalid
 
Star deserved.

Basically, I have 3 heartbeats, and they are all IP based, going to 2 different switches. Even though my switch has redundancy built in, when something happens to it there is a minor outage (which actually takes me off the network for 1 second), and in the errpt I see all three interfaces go into 'recovery mode'.

Therefore, I was thinking of adding a serial-based heartbeat. What are the implications of adding a disk heartbeat? Is it tough to manage? Do I need an extra LUN on the SAN side (my shared storage) for it? For some reason, I have the idea that it's easier to maintain HACMP with a serial heartbeat.

TIA

 
As Khalidaaa said, the purpose of serial and disk heartbeats is to avoid a partitioned cluster, that is, both nodes are up and each thinks the other is down, so each node tries to acquire the disks and the IPs (I hope you never suffer that).
But neither of these heartbeats increases IP availability, so you will continue to have those "minor outages".

Personally, I prefer disk heartbeat to serial heartbeat. You don't need an extra LUN, it's easy to create and maintain, and it uses the same fiber cables as the shared storage, so there is one cable fewer (the serial one).
In addition, in an LPAR environment, LPARs can't use the integrated serial adapter, so you'll need an extra adapter to use serial heartbeat.

HTH, and sorry for my poor English
 
Thanks for the advice.

I am not too worried about IP availability (that's a separate team). I only worry about HACMP and the application running on the server.

Recently, I did have a 'partitioned cluster'. Node1 was in the process of failing over to Node2, the IP came online, and then Node2 hit a 'dead man switch' or 'split brain' and performed a halt -q. Even though not all of Node1's resources had moved to Node2, the application was still running, but it caused an outage during the day.

So I was thinking, if I had had a serial heartbeat in that situation, my system would still have been responding, correct?
 
If node2 performed a halt -q, this usually means that the definitions and configuration are not synchronized between the two nodes, so the node that detects the mismatch kills itself.
In this case a non-IP heartbeat might not have helped much; it depends on what made node1 fail over to node2. If node1 fails over because it has lost disks or connectivity and node2 kills itself, node1 would be aware of it, but it might not be able to reacquire the RG. Without a non-IP heartbeat, node1 would never be aware that node2 has halted.
In any case, it's always better to have a non-IP heartbeat. I normally tell my clients that it's required, not just desired.
 
Yeah Mag007, what happened really is a split-brain problem, and there was a recovery program that halted the second node. You can find the halt -q in the clexit.rc recovery script.

When you have a partitioned cluster, the nodes on each side of the partition detect this and run a node_down for the node on the opposite side of the partition. If, while running this or after communication is restored, the two sides of the partition do not agree on which nodes are still members of the cluster, a decision is made as to which partition should remain up, and the other partition is halted by a Group Services merge from nodes in the other partition or by a node sending a GS merge itself. (The partition with the lowest node number in the cluster remains, generally the first in alphabetical order.)
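
As a toy illustration of that last rule only (this is a sketch of the tie-break as described in this post, not of Group Services itself): given the two sides of a partition, the side containing the node that sorts first stays up and the other side is halted. The node names are hypothetical, and plain alphabetical order is assumed to stand in for "lowest node number".

```python
def surviving_partition(partitions):
    """Return the partition holding the node that sorts first alphabetically.

    partitions is a list of non-empty collections of node names. Alphabetical
    order is an assumption standing in for "lowest node number".
    """
    winner = min(partitions, key=lambda part: min(part))
    return sorted(winner)


# Two-node cluster split down the middle: node1's side remains up.
print(surviving_partition([{"node2"}, {"node1"}]))

# Four-node cluster split 1/3: the side containing "alpha" remains up.
print(surviving_partition([{"delta"}, {"bravo", "alpha", "charlie"}]))
```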

As morefeo said, disk heartbeating is easier to set up than serial. You don't need anything beyond the existing link to your disks.

I can't reach the servers from outside my workplace to give you the exact steps, but I will do so once I get in there.

I hope these links help.



Regards,
Khalid
 