Node powered off - No failover

gqma0 · Jan 15, 2003

Hi everybody.

We have an active/passive cluster setup under W2K SP3 for testing products and failover scenarios.

The systems are setup with dual ports HBA on a single JBOD.

Failover occured properly when unplugging the network cable or FC link and when node shutdown occurs. moving the group is OK as well.

However, failover doesn't work when you physically remove the plug off the node. Cluster administrator freeze on the other node and requires both node to be restarted.

Should it happen or not ? HBA firmwares and drivers are up to date and so are the NICs for the heartbeat.

Any ideas to help troubleshoot the issue are more than welcome

Thanks in advance
Regards
Gaetan

mudgeman · Jan 23, 2003

This sounds exactly like the problem we are having. Running cluster on a 300G NAS device with dual channel qlogic fibres. When we pull both fibre connections, cluster dies.

talismanuk · Jan 25, 2003

I had this originally on my cluster.

Make sure that you installed the heartbeat absolutely to Microsoft'd recommendations - i.e. On the First Cluster Node, when configuring the heartbeat it MUST be able to contact the second heartbeat connection .... they suggest that the Second Node is powered on with a blank floppy disk in so that it has power (which will enable the heartbeat NIC) but it won't load any OS.

Also, when you have booted both nodes .... you should be able to see the Cluster disks on both machines under My computer ... but only access these from the First Node ... If you can't see the cluster disks on the Second Node ... even though the Cluster Administrator says its working ... It ain't believe me .

Suggest also booting everything in order .... Disk pack first, then First Node then Second Node .

Hope this helps

gqma0 · Feb 11, 2003

Hi.

I installed the cluster using the step by step recommendation from MS. I did setup the heartbeat using a crossover cable and I also disabled the registry setting as recommended. I also use the same bandwidth rate for the NICs and the systems have same NIC's, same drivers, same HBA, same firmware, same everything.

In my case, we're not using the second port on the qlogic HBA and the issue doesn't occur when pulling off the link between the node and storage.

It happens when I pull off the POWER cable from the active node. I encountered the same issue on a customer site on their test environment and it froze the whole thing.

This is a big issue and I'm still looking what I'm doing wrong.

Talismanuk, I'll try out your recommendation with the blank floppy but there is heartbeat as soon as I switch on the second node.

Thanks
Regards

mudgeman · Feb 11, 2003

We finally figured out what was going on with our setup. We had the 300Gs cabled to a FastT 500 storage array from IBM. After 2 weeks of pain and suffering, we finally got them to say yes to a firmware upgrade on the storage Fastt 500.

Upgraded the firmware and the cluster was functioning and failing like it should. IBM shipped the storage with a piece of firmware that does not support clustering totally. If this is your setup, feel free to email me at rmartoncik@compsat.com and I can give you all the details

gqma0 · Feb 11, 2003

Thanks for the tip but my issue was slightly different in the way that link failure to the storage wouldn't prevent the cluster to fail. It only fails when the Node is powered off by pulling off the power cable.

Thanks anyway
Gaetan

OscarW · May 27, 2003

We have contacted IBM about quite the same problem. Gracefull failovers work perfect but as soon we power off the cluster dies. The reason for it is that the node, that is still online, can't reach the quorum disk.

We already did an upgrade on the SAN aswell as all the drivers for the NAS 300G engines but so far without any succes.

Do more people recognize this problem?

kind regards,
Oscar Weijer

email: oweijer@xs4all.nl

gqma0 · May 27, 2003

Oscar.

Never managed to sort out this one and since we're only testing a software behaviour during a failover, this is not a major issue for us but I'd still be interested in solving this problem.

MSCS gurus, have you stress tested your cluster by directly pulling off the plug ??

Regards
Gaetan

OscarW · May 28, 2003

I informed IBM regarding this problem and they told me SM release 8 if free of charge now. We're planning to upgrade to that release (firmware 5.x for the FastT) in order to solve our issue.

I will keep you informed if we get any result, but I'm most positive getting result after the upgrade because Bob Martoncik had similar problem. An upgrade fixed his problemn, hope it will do for us.

regards,
Oscar

pgoss · Jun 9, 2003

Quote from gqma0
MSCS gurus, have you stress tested your cluster by directly pulling off the plug ??

I have 2 Dell 4600's with perc 3 raid cards attached to a Dell Powervault 220s running in an active/active mode. I have pulled the plug (at different times) and have not run into any of the above problems. The quorum drive and any other drives transferred over immediately.

mudgeman · Jun 9, 2003

All, the problem we were having was directly related to IBM equipment running Microsoft Cluster. Our other clusters were failing over fine. This is IBM specific and the firmware upgrade to 8.x fixes the problem. They have noted this in their lab in raleigh.

Tek-Tips is the largest IT community on the Internet today!

Members share and learn making Tek-Tips Forums the best source of peer-reviewed technical information on the Internet!

Node powered off - No failover

gqma0

Technical User

mudgeman

Technical User

talismanuk

IS-IT--Management

gqma0

Technical User

mudgeman

Technical User

gqma0

Technical User

OscarW

Technical User

gqma0

Technical User

OscarW

Technical User

pgoss

IS-IT--Management

mudgeman

Technical User

Similar threads

Part and Inventory Search

Sponsor