Cluster failure after loss of public network

Status
Not open for further replies.

ctangora

Technical User
Apr 28, 2004
4
US
We have two new Windows Server 2003 machines working as a cluster, and we had a fairly serious failure last week. We've been working with our vendor to figure out why, and one suggestion is that if both nodes of a cluster lose their public network, the cluster goes into an unstable or unusable state.

The servers have a public and private network, and the public network dropped, while the private network held tight. This happened at the exact same time on two separate servers, so I don't see it as a hardware issue.

Has anybody heard of a problem similar to this where a network outage caused a cluster to go unresponsive?

Thanks.


 
At my last place we had a network "event" where all the network switches stopped responding for a few seconds, and the clusters stayed online.

From what I understand of what happened to us, after a power outage all the devices outside of the data center started to come back online. The desktop guy had a little network switch in his lab that he used to fix and build machines. When this little switch came online it forced an election, which caused all the switches to stop handling traffic for a minute.

All the clustered servers stayed online during the election, however; we just couldn't talk to them.

Denny
MCSA (2003) / MCDBA (SQL 2000)
MCTS (SQL 2005 / Microsoft Windows SharePoint Services 3.0: Configuration / Microsoft Office SharePoint Server 2007: Configuration)
MCITP Database Administrator (SQL 2005) / Database Developer (SQL 2005)

--Anything is possible. All it takes is a little research. (Me)
[noevil]
 
In Windows 2000, media sense was enabled by default; in 2003 clustering it's not. Media sense detects when a network adapter loses its link and unbinds all the protocols from it. The upside is a speedy failover if a NIC goes down. The downside is that even a momentary outage, like rebooting a switch, will cause a cluster failover.

If you can't reach the cluster over the network, this is often misread as the cluster being "unresponsive". I think Mrdenny's post illustrates this. If that behavior is acceptable to you, it is already the default in 2003. If the public interface of each node is attached to a different switch, and you want the cluster to fail over when a given switch goes down, then you should consider enabling media sense. There are pros and cons either way. You need to decide which set you want to live with.






 
Thanks xmrse.

That seems to be the best answer I have found anywhere so far.

So in theory, if both nodes were connected to the same switch and that switch went down temporarily, would node 1 attempt to fail over but be unable to, because node 2 had the same dropped connection signal?
 
If you don't enable media sense, the network interface itself doesn't fail. You have to wait for the clustered service that depends on the network to fail, which can take a while. Once that service fails, it is retried as many times as you have retries configured before the group fails over.

If you do have media sense enabled, it'll fail right away. Of course, if the other node is connected to the same switch, it'll fail right back.
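
To make the two cases concrete, here is a rough sketch of the sequence described above, written as a toy Python model. It is not the cluster service's actual logic: the `Node` class, the `simulate` function, the `retries` parameter, and the event strings are all made up for illustration, and the model only covers what is stated in this thread (it says nothing about what happens after failover when media sense is off and the standby has also lost its link).

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    switch: str  # switch the node's public NIC is plugged into (hypothetical label)


def simulate(owner, standby, down_switches, media_sense, retries=3):
    """Return the sequence of events after a switch outage (toy model)."""
    if owner.switch not in down_switches:
        return [f"{owner.name}: public network up, group stays put"]

    events = []
    if media_sense:
        # Link loss is noticed at once and the interface is failed.
        events.append(f"{owner.name}: link down, interface fails immediately")
    else:
        # The interface is never marked failed; a dependent service has to
        # fail and use up its configured retries first.
        events.append(f"{owner.name}: interface stays up, dependent service fails")
        for attempt in range(1, retries + 1):
            events.append(f"{owner.name}: retry {attempt} failed")

    events.append(f"group fails over: {owner.name} -> {standby.name}")

    if media_sense and standby.switch in down_switches:
        # Same dead switch on the other side, so the group bounces back.
        events.append(f"group fails back: {standby.name} -> {owner.name}")
    return events


if __name__ == "__main__":
    node1 = Node("node1", "switchA")
    node2 = Node("node2", "switchA")  # both nodes on the same switch
    for line in simulate(node1, node2, {"switchA"}, media_sense=True):
        print(line)
```

Changing `node2` to sit on a different switch, or passing `media_sense=False`, shows the other combinations discussed in this thread.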



 