Learning HACMP 1

Mag0007 · Jul 29, 2005

I am in the process of learning HACMP on AIX. Can someone please explain the concepts? I have been looking at the Redbooks, but there is just TOO much info in them.

I have been looking at this page

http://www-1.ibm.com/servers/eserver/clusters/whitepapers/hacmp_bestpractices.html

itsp1965 · Jul 29, 2005

If I am not mistaken, I remember seeing a Redbook called HACMP for AIX v5 Certification guide. Being a certification guide and not overly detailed, it should cover everything you would need to know to set it up.

ogniemi · Jul 30, 2005

http://www-1.ibm.com/servers/eserver/pseries/library/hacmp_docs.html

CrystalWizard · Jul 31, 2005

-Can someone please explain the concepts

basic concept:

Your users never notice when your machine is down, because another machine takes over processing their data.

Basically, you have one machine running your application, and if it crashes or has to be taken down, you have a second machine standing by. HACMP takes care of shutting down the machine that's crashing, then changing IPs on network cards, varying on VGs and starting the application on the standby machine, so that you have no real downtime, just a little bit of lag from the users' point of view.

What other concepts are you wanting to know about?

oliverbeat · Jul 31, 2005

Hi,
Is it possible to have 3 nodes using HACMP failover instead of having 2 nodes?
Thanks.
Regards

CrystalWizard · Jul 31, 2005

Yes. You can set them up so that nodes 1 and 2 both fail over to node 3, and node 3 fails over to either 1 or 2 (whichever you prefer) or doesn't fail over at all, just sits there and waits for one of the others to fail.

You can do that with 4, 5 and so on nodes. However, at some point you're really taking a risk. If you have more than one node set up to fail over to the same machine, you are gambling that you'll never have more than 1 node fail at the same time. If you had 4 nodes all set up to fail over to the 5th node and 3 of them did at the same time, the 5th node had better be really powerful or it won't be able to handle the sudden load.
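To put rough numbers on that gamble, here is a minimal sketch in Python. The node names, load figures and capacity are made up for illustration, and "load" is just an abstract unit; none of this comes from HACMP itself.

```python
def worst_case_load(node_loads, failures):
    """Largest combined load the standby inherits if `failures` nodes fail together."""
    heaviest_first = sorted(node_loads.values(), reverse=True)
    return sum(heaviest_first[:failures])


def standby_can_absorb(node_loads, standby_capacity, failures):
    """True if the standby still has room in the worst case."""
    return worst_case_load(node_loads, failures) <= standby_capacity


# Hypothetical cluster: four active nodes, all failing over to a fifth.
loads = {"node1": 40, "node2": 35, "node3": 30, "node4": 25}
standby_capacity = 100

for failures in range(1, len(loads) + 1):
    ok = standby_can_absorb(loads, standby_capacity, failures)
    print(failures, "simultaneous failure(s):", "fits" if ok else "overloaded")
```

With these made-up figures the standby copes with one or two failures but not three, which is the situation described above.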

Mag0007 · Aug 1, 2005

CrystalWizard:
Amazing post! That's exactly what I was looking for. Some questions:
Delay Part: Does the application have to support HACMP, or is this done all at the OS level? Also, how does the application know where the system went down? How does it pick up from that?
Shutdown/DASD: When you issue a shutdown from your primary node, do you just run "shutdown" or is there an HACMP shutdown you run? Regarding the DASD/volume groups, how does it automatically vary on the volume groups? Is there a heartbeat from the RS232 cable that says, "hey, system going down...varyon here"?

Thanks for your post again! Much, much cleaner than the Redbooks.

oliverbeat · Aug 1, 2005

Thank you, CrystalWizard!
I need to find the documentation for that, as the only document I have covers configuring 2 nodes only.

CrystalWizard · Aug 2, 2005

-Delay Part: Does the application have to support HACMP, or is this done all at the OS level? Also, how does the application know where the system went down? How does it pick up from that?

HACMP is a set of scripts that runs on top of the OS and pretends to be you. What happens is this: when your machine goes down, HACMP signals the backup machine that it's going down and to prepare to take over. Then it runs your application shutdown script (user-written, by you), shuts down the application, unmounts the filesystems, and varies off the VG on the crashing machine.

It then swaps IPs, varies on the VG, mounts the filesystems and starts up the application (with the application startup script, user-written, by you) on the standby machine. The application has nothing to do with any of this, any more than it would if you decided to shut it down on one box and start it up on the other.
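The order of those steps is the important part, so here it is as a small Python sketch. It only models the sequence described above; the step names and node names are illustrative labels, not HACMP event names, and nothing here touches a real system.

```python
# Steps run on the node that is giving up the application (in this order).
PRIMARY_RELEASE = [
    "run application stop script",
    "unmount filesystems",
    "vary off volume group",
]

# Steps run on the node that is taking over (in this order).
STANDBY_ACQUIRE = [
    "take over service IP",
    "vary on volume group",
    "mount filesystems",
    "run application start script",
]


def failover_plan(primary, standby):
    """Return the ordered list of (node, step) pairs for one failover."""
    plan = [(primary, step) for step in PRIMARY_RELEASE]
    plan += [(standby, step) for step in STANDBY_ACQUIRE]
    return plan


for node, step in failover_plan("nodeA", "nodeB"):
    print(f"{node}: {step}")
```

Note that the volume group is released on the primary before the standby varies it on, and the application is the first thing stopped and the last thing started.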

However, there are lots of application-related things that might prevent a good failover. Most of those are related to how the startup and shutdown scripts for the application are written. A lot of times, filesystems won't unmount (for example) because something's still running from inside one of them instead of cd'ing out of it. fuser on AIX will NOT show you everything holding open a filesystem. You need to get a copy of lsof and use that.
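As an illustration of the "still sitting inside the filesystem" problem, here is a minimal Python sketch that picks out which processes would block an unmount because of their working directory. The PID-to-directory table is made-up sample data standing in for what you would collect with a tool such as lsof, and the working directory is only one of the ways a process can hold a filesystem open.

```python
import os


def inside(mount_point, path):
    """True if `path` is the mount point itself or lies underneath it."""
    mount_point = os.path.normpath(mount_point)
    path = os.path.normpath(path)
    return os.path.commonpath([mount_point, path]) == mount_point


def blockers(mount_point, process_cwds):
    """PIDs whose current working directory is inside `mount_point`."""
    return sorted(pid for pid, cwd in process_cwds.items() if inside(mount_point, cwd))


# Hypothetical sample data: PID -> current working directory.
sample = {
    101: "/app/data/logs",   # inside /app/data, would block the unmount
    102: "/home/oracle",     # unrelated
    103: "/app/database",    # similar name, but a different directory
}

print(blockers("/app/data", sample))
```

A stop script that does a `cd /` before it launches or waits on anything avoids becoming one of those blockers itself.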

So good, careful planning in your implementation will go a long way toward making everything work.

-Shutdown/DASD: When you issue a shutdown from your primary node, do you just run "shutdown" or is there a HACMP shutdown you run?

You tell HACMP what shutdown scripts to use and it runs them. After the application is down, HACMP runs its own shutdown procedure and then AIX finishes going down.
Remember that AIX doesn't just turn off when a machine crashes, unless you have an extremely old version. It does all sorts of things, including writing out a dump file and syncing the drives. Testing failover by yanking the power cord out of the wall not only does NOT simulate a normal crash (unless you lose power to the data center, and in that case you don't have any machines), it can also corrupt your filesystems on the machine you unplugged.

-Regarding the DASD/volume groups, how does it automatically varyon the volumegroups? is there a heartbeat from the rs232 cable, that says, "hey system going down...varyon here"

You have the same VG imported on both machines, the primary and the standby, but varied off on the standby. When failover occurs, the HACMP daemon runs a number of scripts, one of which reads through the resource groups you have configured, makes a note of each VG listed in the resource groups, and then runs the varyonvg command on each one of them.
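That "read the resource groups, collect the VGs, vary each one on" pass can be sketched in a few lines of Python. The dictionary layout and the group and VG names are invented for the example and are not HACMP's real configuration format; the sketch only builds the varyonvg command lines and does not run them.

```python
def volume_groups(resource_groups):
    """Collect each volume group once, in the order it is first listed."""
    seen = []
    for group in resource_groups.values():
        for vg in group.get("volume_groups", []):
            if vg not in seen:
                seen.append(vg)
    return seen


def varyon_commands(resource_groups):
    """Build (but do not run) one varyonvg command per volume group."""
    return [["varyonvg", vg] for vg in volume_groups(resource_groups)]


# Hypothetical resource group definitions.
resource_groups = {
    "rg_app": {"volume_groups": ["appvg", "datavg"]},
    "rg_db": {"volume_groups": ["datavg", "dbvg"]},
}

for command in varyon_commands(resource_groups):
    print(" ".join(command))
```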

You can set heartbeat up several ways, and if the primary machine stops sending heartbeat, eventually the secondary machine will, yes, decide the primary is dead and start running its startup routines. However, in the case of a crash, the HACMP daemon on the primary tells the HACMP daemon on the secondary "I'm going down, take over", and the swap begins immediately without waiting for heartbeat.
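The two triggers (missed heartbeats versus an explicit "take over" message) can be modelled like this in Python. It is a toy model of the decision only: the class name, the interval and the missed-heartbeat limit are assumptions made for the example, not HACMP settings.

```python
class PeerMonitor:
    """Standby-side view of the primary: take over on silence or on request."""

    def __init__(self, interval, missed_limit):
        self.interval = interval          # seconds between expected heartbeats
        self.missed_limit = missed_limit  # missed heartbeats before declaring the peer dead
        self.last_seen = 0.0
        self.takeover_requested = False

    def heartbeat(self, now):
        """Record a heartbeat received at time `now` (seconds)."""
        self.last_seen = now

    def peer_says_going_down(self):
        """The primary announced it is going down: no need to wait for silence."""
        self.takeover_requested = True

    def should_take_over(self, now):
        if self.takeover_requested:
            return True
        missed = (now - self.last_seen) / self.interval
        return missed >= self.missed_limit


monitor = PeerMonitor(interval=1.0, missed_limit=5)
monitor.heartbeat(10.0)
print(monitor.should_take_over(12.0))  # only 2 intervals of silence so far
print(monitor.should_take_over(15.0))  # 5 intervals of silence
```

The explicit message path is why a crash that lets the daemon speak fails over faster than a pulled cable, which has to wait out the full silence period.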

-Thanks for your post again! Much Much cleaner then Redbooks

You're welcome

Mag0007 · Aug 2, 2005

Crystal:
I think you should put this in HACMP faq or write your own guide on the net

Thanks again!!!!

CrystalWizard · Aug 2, 2005

It's essential to have good scripting knowledge, but NOT for HACMP itself. The HACMP scripts and other files are written by IBM and maintained by IBM. However, you as admin need to write the scripts to start up, shut down and do other things to your application and database.

All HACMP does is call your scripts IF you tell it to. Without the scripting knowledge, you can't write good scripts and may wind up in trouble.

Good scripts do NOT mean complicated scripts. In fact, a lot of small scripts that might only contain one command each, all called by a single master script, just might be a lot better and a lot more efficient than a single, complicated, spaghetti script.
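Here is a minimal sketch of that master-script idea, written in Python just to show the shape (real start and stop scripts for HACMP are normally shell scripts). The step names are invented, and each "small script" is replaced by a tiny stand-in command so the example runs anywhere; you would put your own script paths there.

```python
import subprocess
import sys


def run_steps(steps):
    """Run each (name, command) in order and stop at the first failure.

    Returns (names_that_succeeded, name_of_failed_step_or_None).
    """
    completed = []
    for name, command in steps:
        result = subprocess.run(command)
        if result.returncode != 0:
            return completed, name
        completed.append(name)
    return completed, None


def stand_in(exit_code):
    """Stand-in for a one-command script: a process that exits with `exit_code`."""
    return [sys.executable, "-c", f"import sys; sys.exit({exit_code})"]


# Hypothetical stop sequence; the second step is made to fail on purpose.
stop_steps = [
    ("stop_listener", stand_in(0)),
    ("stop_database", stand_in(3)),
    ("remove_lockfiles", stand_in(0)),
]

done, failed = run_steps(stop_steps)
print("completed:", done)
print("failed at:", failed)
```

Because every step is its own small unit, the master script can say exactly which one failed, which is a lot easier to debug than one long script.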

-I think you should put this in HACMP faq or write your own guide on the net

Thanks for the compliment. You're welcome to copy anything I've written and put out a FAQ if you wish.

Mag0007 · Aug 2, 2005

"Testing failover by yanking the power cord out of the wall not only does NOT simulate a normal crash (unless you lose power to the data center, and in that case you don't have any machines), it can also corrupt your filesystems on the machine you unplugged."

Isn't this the whole point of HACMP? Or fault tolerance?
How are you supposed to test HACMP then?

CrystalWizard · Aug 2, 2005

The only supported method of testing HACMP is to do a graceful shutdown with failover.

Yes, the point of HACMP is fault tolerance, but realistically, it's very rare that a machine would suddenly lose power as it does when you yank the cord from the wall. Even if your entire data center were to lose power, you'd still hopefully have the machine on a UPS that would allow you a bit of time to shut down. However, if the entire data center lost power, your secondary wouldn't be taking over anyway, as it would also be without power.

In other situations, such as a crash with a core dump, AIX does a lot of things as it goes down, even if it looks to you as admin like the machine just powered off. One of those things is the HACMP daemon on the primary sending a signal to the secondary to take over.

IF you pull the cord from the wall or IF you pull the ethernet cable from the adapter, understand that you are NOT testing the ability of the first machine to do anything. All you are testing is the secondary: whether it will detect that the heartbeat is missing and that the primary has stopped responding, and then eventually come online.

The risk of damage to your filesystems is, in my opinion, way too high to try something like this. Since you have to have HACMP up and running to do this, and that means the filesystems have to be mounted, pulling the cord out of the wall and not allowing AIX to shut down as it is supposed to, even under a crash, is a very dangerous thing.

Mag0007 · Aug 3, 2005

again...VERY good post! keep up the good work!
Thanks for closing this issue for me.

hirschaj · Aug 3, 2005

Another way to simulate a system crash would be to use "reboot -q". Running this command can cause filesystem corruption due to a sync not being performed. It is safer to run this type of command while your processes are up but no application activity is happening.

A safer but less accurate simulation of a system crash would just be using the reboot command with no flags. This is what I use and have never had a problem with it.

Jim Hirschauer

http://www.aixexpert.com

CrystalWizard · Aug 3, 2005

-run this type of command while your processes are up but no application activity is happening.

Which would make it impossible to run it on a system with a normal HACMP setup, because you need the filesystems to be mounted. You need that because you need to see if HACMP is going to be able to unmount them as part of the failover.

-just be using the reboot command with no flags

Much safer indeed

Mag0007 · Aug 3, 2005

noted...

thanks for the feedback.
