Split Brain Syndrome: DAG switching between nodes

by  woter324  Posted    (Edited  )
Intro

As Exchange 2010 is still quite new, there isn't a great deal of help around, so I thought I would write up what we did to get our cluster back.



Environment

Our environment consists of three HP Proliant DL360 G6 servers with SAS direct attached MSA 70. Two servers sit in our primary datacenter and the third sits in the secondary datacenter which we use for disaster recovery.



We are running MS Exchange 2010 RTM. All three servers run all roles. There is one DAG containing six databases. As our environment is split across AD sites, we have enabled [link http://technet.microsoft.com/en-us/library/dd979790.aspx]Datacenter Activation Coordination (DAC) mode[/link] to prevent split-brain syndrome. The databases are split thus: three are active on EXC1001 and three are active on EXC1002. The third server, EXC1003, sits in a different AD site and contains passive copies of all six databases.



Problem

Yesterday we bounced our three mail servers to try and sort out an issue with backups and VSS writer locks. When the servers came back we couldn't get the cluster back and therefore we could not mount any databases. The cluster was flip-flopping between nodes. It was never sure which one was the quorum owner.



Pinging EXC1001 and EXC1002 servers resulted in an average of 5 successful pings, followed by 5 dropped packets.
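
A pattern like this is easier to spot when the ping results are grouped into runs. The following is a minimal Python sketch, not something we ran at the time. It assumes the replies have already been collected as a list of booleans (True for a reply, False for a timeout), and the names `run_lengths` and `looks_like_flapping` are made up for illustration.

```python
from itertools import groupby


def run_lengths(results):
    """Collapse a sequence of ping outcomes into (outcome, count) runs."""
    return [(ok, len(list(group))) for ok, group in groupby(results)]


def looks_like_flapping(results, min_runs=4):
    """Heuristic: several alternating runs of replies and timeouts
    suggest a link that goes up and down rather than one that is dead."""
    return len(run_lengths(results)) >= min_runs


# Hypothetical sample: 5 replies, 5 timeouts, repeated twice.
sample = [True] * 5 + [False] * 5 + [True] * 5 + [False] * 5
print(run_lengths(sample))
print(looks_like_flapping(sample))
```

A steady link gives a single long run, whereas the behaviour described above gives a series of short alternating runs.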



Whatever we did, we could not get the cluster to connect to the EXC1003 in the secondary site. We could RDP and ping without issue.



The problem was most visible in Failover Cluster Manager, under Cluster Core Resources for DAG.somedomain.com. Here we could see the DAG status switching constantly between online and offline. Event IDs 177 and 178 were very common.



Issues

We identified several possible issues that may have caused the problem:

[ul]

[li]Restarting the servers in an 'incorrect' way,[/li]

[li]The failover cluster had a damaged network cable,[/li]

[li]The drivers for HP network tools were from 2009,[/li]

[li]The HP network teaming may have been corrupted,[/li]

[li]Possible corruption of server routing tables.[/li]

[/ul]



Fixes

We didn't want the databases failing over to other member servers, so we activated the databases on EXC1001, shut down EXC1003 and EXC1002, then restarted EXC1001. We left EXC1001 for a good 10 minutes before starting EXC1002, and did the same before starting EXC1003. Exchange 2010 does not like being shut down in its entirety. We should have done one server at a time, setting any mailbox databases on it to passive and suspending the copy.
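
The one-server-at-a-time approach can be written down as an ordered checklist. The Python sketch below only generates that checklist as text; it is an illustration of the ordering, not a tested procedure, and `rolling_restart_plan` is a hypothetical helper. In the Exchange Management Shell, the steps would most likely map onto cmdlets such as Move-ActiveMailboxDatabase, Suspend-MailboxDatabaseCopy and Resume-MailboxDatabaseCopy, but check the documentation for your build before relying on that.

```python
def rolling_restart_plan(servers):
    """Return ordered, human-readable steps for restarting DAG members
    one at a time, so the other members stay up throughout."""
    steps = []
    for server in servers:
        others = [s for s in servers if s != server]
        steps.append(f"Move active databases off {server} to one of: {', '.join(others)}")
        steps.append(f"Suspend the database copies on {server}")
        steps.append(f"Restart {server} and wait for it to rejoin the cluster")
        steps.append(f"Resume the database copies on {server} and check they are healthy")
    return steps


plan = rolling_restart_plan(["EXC1001", "EXC1002", "EXC1003"])
for number, step in enumerate(plan, start=1):
    print(f"{number}. {step}")
```

The point of the ordering is that no server is restarted until the previous one is back and its copies are healthy.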



As we run the DAG in DAC mode, EXC1001 & EXC1002 needed to know about EXC1003. The damaged network cable was missing its clip and had been knocked. The switch port link light indicated the connection was going up and down.

During the building of these servers, after a restart we had some odd network behaviour. This was resolved by dissolving the HP network team and then re-teaming. We tried this again, but for now we have left the NICs unteamed.



We noted that the drivers for the NICs and the HP Network Configuration Utility were versions built in 2009. Because the environment was fine before the restart, we chose not to upgrade these drivers. If the other steps had not resolved the issue, we would have done so.



For reasons beyond the scope of this post, our servers have persistent routing tables set up. Windows® is not a router! It is always best practice to use a proper router. Because of the odd network connectivity, we rebuilt the routing tables, after which pings succeeded.
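
For anyone checking their own persistent routes, it helps to remember that the most specific matching route wins. The Python sketch below shows that lookup in a simplified form. The networks and gateways are invented for the example (they are not our real addresses), `pick_gateway` is a hypothetical helper, and route metrics are ignored.

```python
import ipaddress


def pick_gateway(routes, destination):
    """Return the gateway of the most specific route that contains
    the destination, or None if no route matches.

    routes: list of (network_in_cidr, gateway) string pairs.
    """
    dest = ipaddress.ip_address(destination)
    matches = []
    for network, gateway in routes:
        net = ipaddress.ip_network(network)
        if dest in net:
            matches.append((net.prefixlen, gateway))
    if not matches:
        return None
    # Longest prefix (largest prefixlen) is the most specific route.
    return max(matches)[1]


# Invented example: a default route plus one route to a second site.
routes = [
    ("0.0.0.0/0", "10.1.0.1"),
    ("10.2.0.0/16", "10.1.0.254"),
]
print(pick_gateway(routes, "10.2.5.20"))
print(pick_gateway(routes, "192.0.2.10"))
```

If a route to the other datacenter is missing or wrong, traffic for it falls through to a less specific route, which is one way to end up with the kind of intermittent connectivity we saw.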



Solution

By this time, it was 10:30am and we (IT infrastructure) were getting hammered by the business. No doubt our fixes had helped, but we still couldn't mount any databases. It was evident our environment was suffering from split-brain syndrome, where Active Manager did not know which node was the quorum owner. Having taken some advice, we evicted EXC1003, the node in the secondary site, from the cluster. By the time we had removed references to EXC1003 from the EMC and gone to mount the databases manually, Exchange was already mounting them.
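
The voting arithmetic behind quorum can be sketched in a few lines of Python. This assumes the usual DAG behaviour, where an odd number of members uses Node Majority and an even number brings in the witness server as an extra vote. `quorum_votes` is a made-up helper, not an Exchange API.

```python
def quorum_votes(dag_members):
    """Return (total_voters, votes_needed) for a DAG with this many members.

    Assumption: with an even member count the witness server
    counts as one additional, tie-breaking voter.
    """
    voters = dag_members + (1 if dag_members % 2 == 0 else 0)
    return voters, voters // 2 + 1


for members in (3, 2):
    voters, needed = quorum_votes(members)
    print(f"{members} members: {voters} voters, {needed} needed for quorum")
```

Under these assumptions, both the original three-member DAG and the two-member DAG left after the eviction need two votes, so the two primary-site nodes can hold quorum as long as they can see each other (and, in the two-member case, the witness) reliably.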



Conclusion

We cannot say exactly why the routing tables and/or the HP teaming went awry, but it looks like the restart triggered the issue, which stopped the cluster members communicating successfully.



We don't think the incorrect method we used to shut down the servers had any bearing on the issue.



Clearly, the DAC environment needs visibility of all three members, and the networking issues between the primary and secondary datacenters were not helping.



Event logs

Here are some of the events we logged. Hopefully Google will pick these up:

Event ID 1069 Cluster resource 'IPv4 Static Address 1 (Cluster Group)' in clustered service or application 'Cluster Group' failed.



Event ID 1205 The Cluster service failed to bring clustered service or application 'Cluster Group' completely online or offline. One or more resources may be in a failed state. This may impact the availability of the clustered service or application.



Event ID 174 The cluster group is hosted on this server but the current role is Unknown. An attempt will be made to move the group.



Event ID 104. Source MSExchange Search Indexer



When trying to mount the database:



Event ID 4



Code:
Task Get-DatabaseAvailabilityGroup writing error when processing record of index 0. Error: Microsoft.Exchange.Cluster.Replay.DagNetworkRpcServerException: A server-side administrative operation has failed. 'GetDagNetworkConfig' failed on the server. Error: The NetworkManager has not yet been initialized. Check the event logs to determine the cause. ---> Microsoft.Exchange.Data.Storage.DagNetworkManagementException: The NetworkManager has not yet been initialized. Check the event logs to determine the cause.
   at Microsoft.Exchange.Cluster.Replay.NetworkManager.FetchInitializedMap()
   at Microsoft.Exchange.Cluster.Replay.NetworkManager.<>c__DisplayClass7.<GetDagNetworkConfig>b__6(Object , EventArgs )
   at Microsoft.Exchange.Cluster.Replay.NetworkManager.RunRpcOperation(String rpcName, EventHandler ev)
   --- End of inner exception stack trace (Microsoft.Exchange.Data.Storage.DagNetworkManagementException) ---
   at Microsoft.Exchange.Cluster.Replay.NetworkManager.RunRpcOperation(String rpcName, EventHandler ev)
   at Microsoft.Exchange.Cluster.Replay.NetworkManager.GetDagNetworkConfig()
   at Microsoft.Exchange.Cluster.ReplayService.ReplayRpcServer.<>c__DisplayClasse.<GetDagNetworkConfig>b__d()
   at Microsoft.Exchange.Data.Storage.Cluster.HaRpcExceptionWrapperBase`2.RunRpcServerOperation(String databaseName, RpcServerOperation rpcOperation)
   --- End of stack trace on server (EXC1003.somedomain.com) ---
   at Microsoft.Exchange.Data.Storage.Cluster.HaRpcExceptionWrapperBase`2.ClientRethrowIfFailed(String databaseName, String serverName, RpcErrorExceptionInfo errorInfo)
   at Microsoft.Exchange.Data.Storage.Cluster.HaRpcExceptionWrapperBase`2.ClientRethrowIfFailed(String serverName, RpcErrorExceptionInfo errorInfo)
   at Microsoft.Exchange.Data.Storage.Cluster.DagNetworkRpc.RunRpcOperation(String serverName, Nullable`1 timeoutMs, ReplayRpcClient& rpcClient, InternalRpcOperation rpcOperation)
   at Microsoft.Exchange.Data.Storage.Cluster.DagNetworkRpc.GetDagNetworkConfig(String serverName)
   at Microsoft.Exchange.Data.Storage.Cluster.DagNetworkRpc.GetDagNetworkConfig(DatabaseAvailabilityGroup dag)
   at Microsoft.Exchange.Management.SystemConfigurationTasks.GetDatabaseAvailabilityGroup.WriteResult(IConfigurable dataObject)
   at Microsoft.Exchange.Configuration.Tasks.GetTaskBase`1.WriteResult[T](IEnumerable`1 dataObjects)
   at Microsoft.Exchange.Configuration.Tasks.GetObjectWithIdentityTaskBase`2.InternalProcessRecord()
   at Microsoft.Exchange.Configuration.Tasks.Task.ProcessRecord()

I hope this is of use to some people.