
Node powered off - No failover


gqma0 (Technical User) - Nov 29, 2002
Hi everybody.

We have an active/passive cluster setup under W2K SP3 for testing products and failover scenarios.

The systems are set up with dual-port HBAs on a single JBOD.

Failover occurs properly when unplugging the network cable or the FC link, and on a clean node shutdown. Moving the group works fine as well.
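For reference, here's how I run those tests, as a minimal sketch: the node and group names are examples, and it assumes the standard cluster.exe command-line tool that ships with W2K clustering.

    import subprocess

    # Hypothetical names -- substitute your own node and group names.
    NODES = ["NODE1", "NODE2"]
    GROUP = "Cluster Group"

    def cluster(*args):
        """Run a cluster.exe command and return its text output."""
        result = subprocess.run(["cluster"] + list(args),
                                capture_output=True, text=True)
        return result.stdout

    # Report the group's current owner, move it, then re-check both nodes.
    print(cluster("group", GROUP, "/status"))
    print(cluster("group", GROUP, "/move"))      # fail the group over
    for node in NODES:
        print(cluster("node", node, "/status"))  # Up / Down / Paused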

However, failover doesn't work when you physically pull the power plug from the node. Cluster Administrator freezes on the other node, and both nodes have to be restarted.

Should this happen or not? HBA firmware and drivers are up to date, and so are the NICs used for the heartbeat.

Any ideas to help troubleshoot this issue are more than welcome.

Thanks in advance
Regards
Gaetan

 
This sounds exactly like the problem we are having. We're running a cluster on a 300G NAS device with dual-channel QLogic fibre HBAs. When we pull both fibre connections, the cluster dies.
 
I had this originally on my cluster.

Make sure that you installed the heartbeat exactly to Microsoft's recommendations. On the first cluster node, the heartbeat MUST be able to contact the second node's heartbeat connection while you configure it. Microsoft suggest powering on the second node with a blank floppy disk in the drive, so that it has power (which enables the heartbeat NIC) but won't load any OS.
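A quick sanity check before forming the cluster is to confirm the first node can actually reach the second node's heartbeat NIC. A minimal sketch (the 10.0.0.2 private address is made up; use whatever your crossover subnet is):

    import subprocess

    # Hypothetical heartbeat address of the other node on the private
    # (crossover) network.
    HEARTBEAT_PEER = "10.0.0.2"

    # Send one ICMP echo; a non-zero return code means the peer's
    # heartbeat NIC is unreachable.
    rc = subprocess.call(["ping", "-n", "1", HEARTBEAT_PEER])
    print("heartbeat reachable" if rc == 0 else "heartbeat UNREACHABLE")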

Also, once you have booted both nodes, you should be able to see the cluster disks on both machines under My Computer, but only access them from the first node. If you can't see the cluster disks on the second node, then even though Cluster Administrator says it's working, it ain't, believe me.
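To check the visibility half of that quickly from either node, a minimal sketch (the Q: and S: drive letters are assumptions; use whatever your shared disks are mapped to):

    import os

    # Hypothetical cluster drive letters.  On the passive node these
    # should be visible under My Computer but must NOT be written to.
    for letter in ("Q:\\", "S:\\"):
        print(letter, "visible" if os.path.exists(letter) else "NOT visible")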

I'd also suggest booting everything in order: disk pack first, then the first node, then the second node.

Hope this helps
 
Hi.

I installed the cluster using the step-by-step recommendations from MS. I set up the heartbeat using a crossover cable and I also disabled the registry setting as recommended. I use the same bandwidth setting on the NICs, and the systems have the same NICs, same drivers, same HBAs, same firmware, same everything.
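The registry setting here is (I believe, an assumption on my part) the TCP/IP media-sense value from Microsoft's heartbeat guidance (KB 258750). A minimal sketch to verify it is actually set on both nodes:

    import winreg

    # Per Microsoft's heartbeat guidance, DisableDHCPMediaSense = 1 stops
    # TCP/IP from tearing down the stack when a link drops.
    KEY = r"SYSTEM\CurrentControlSet\Services\Tcpip\Parameters"
    with winreg.OpenKey(winreg.HKEY_LOCAL_MACHINE, KEY) as key:
        try:
            value, _ = winreg.QueryValueEx(key, "DisableDHCPMediaSense")
            print("DisableDHCPMediaSense =", value)  # expect 1
        except FileNotFoundError:
            print("DisableDHCPMediaSense is not set")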

In my case we're not using the second port on the QLogic HBA, and the issue doesn't occur when pulling the link between the node and the storage.

It happens when I pull the POWER cable from the active node. I encountered the same issue in a customer's test environment, and it froze the whole thing.

This is a big issue, and I'm still trying to work out what I'm doing wrong.

Talismanuk, I'll try out your recommendation with the blank floppy, but there is a heartbeat as soon as I switch on the second node.

Thanks
Regards
 
We finally figured out what was going on with our setup. We had the 300Gs cabled to an IBM FastT 500 storage array. After two weeks of pain and suffering, we finally got IBM to say yes to a firmware upgrade on the FastT 500.

We upgraded the firmware, and the cluster was functioning and failing over like it should. IBM shipped the storage with firmware that does not fully support clustering. If this is your setup, feel free to email me at rmartoncik@compsat.com and I can give you all the details.
 
Thanks for the tip, but my issue is slightly different: a failed link to the storage doesn't stop the cluster from failing over. It only breaks when the node is powered off by pulling the power cable.

Thanks anyway
Gaetan
 
We have contacted IBM about much the same problem. Graceful failovers work perfectly, but as soon as we power a node off, the cluster dies. The reason is that the node that is still online can't reach the quorum disk.

We have already upgraded the SAN as well as all the drivers for the NAS 300G engines, but so far without any success.

Does anyone else recognize this problem?

Kind regards,
Oscar Weijer


email: oweijer@xs4all.nl
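If it helps anyone narrow this down: one way to see exactly when the surviving node loses the quorum disk is to run a small poll on that node while pulling the power on the other. A minimal sketch (the Q: quorum drive letter is an assumption):

    import os
    import time

    QUORUM = "Q:\\"  # hypothetical quorum drive letter

    # Run this on the surviving node, then pull the power on the active
    # node; the timestamps show whether and when the quorum disk drops.
    while True:
        state = "reachable" if os.path.exists(QUORUM) else "UNREACHABLE"
        print(time.strftime("%H:%M:%S"), "quorum", state)
        time.sleep(2)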
 
Oscar.

I never managed to sort this one out, and since we're only testing software behaviour during a failover, it's not a major issue for us. But I'd still be interested in solving the problem.

MSCS gurus, have you stress-tested your cluster by pulling the plug directly?

Regards
Gaetan
 
I informed IBM about this problem, and they told me SM release 8 is free of charge now. We're planning to upgrade to that release (firmware 5.x for the FastT) to solve our issue.

I will keep you informed if we get any results, but I'm fairly positive about the outcome, because Bob Martoncik had a similar problem and an upgrade fixed his. I hope it will do the same for us.

regards,
Oscar
 
Quote from gqma0:
MSCS gurus, have you stress-tested your cluster by pulling the plug directly?

I have two Dell 4600s with PERC 3 RAID cards attached to a Dell PowerVault 220S, running in active/active mode. I have pulled the plug (at different times) and have not run into any of the above problems. The quorum drive and all other drives transferred over immediately.
 
All, the problem we were having was directly related to IBM equipment running Microsoft Cluster. Our other clusters were failing over fine. This is IBM-specific, and the firmware upgrade to 8.x fixes the problem. They have noted this in their lab in Raleigh.
 