I have a 2 node active/passive cluster on Windows 2003 SP1. We'll call the nodes A and B.
At any given point in time, the active node has no issues, but the cluster service cannot start on the passive node. What is interesting is this: suppose A is active and the cluster service cannot start on B. If I shut down node A, the cluster service on B magically starts, and node B becomes active. If I then turn node A back on, its cluster service cannot start. So in summary, either server is able to host the resources at any given point in time, but both nodes cannot be "ready" at the same time.
The errors logged when the cluster service won't start seem to be the most common ones: 7031, 1009, 118, 1209.
There is no indication that the private network heartbeat between the two servers is having any problems, though I'm not ruling it out either.
The cluster.log file shows that the non-active node is continuously attempting to access the quorum, but obviously it cannot, since the active node has it locked. One of the MS KB articles mentions that this happens when both nodes think they should be active, for example if they boot at the same time. Our servers aren't booting at the same time, but the symptoms are identical.
Also, I found another MS KB article that says if the CLUSDB file exceeds 10MB you will experience problems, because a certain timeout occurs before the whole 10MB file can be read. There is a hotfix for this issue, but it only applies to pre-SP1 systems, and it would not install on ours. Our CLUSDB file recently exceeded 10MB. Does anyone believe this file size may still be an issue?
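For anyone who wants to check the same thing on their own nodes, here is a minimal Python sketch that compares the CLUSDB size against the 10MB figure from the KB article. It makes several assumptions that are not from the article: that 10MB means 10 * 1024 * 1024 bytes, that the file lives under %SystemRoot%\Cluster\CLUSDB (adjust the path if your install differs), and the helper name clusdb_size_report is just something made up for illustration.

```python
import os

# Assumption: "10MB" in the KB article is read here as 10 * 1024 * 1024 bytes.
CLUSDB_LIMIT_BYTES = 10 * 1024 * 1024


def clusdb_size_report(path, limit_bytes=CLUSDB_LIMIT_BYTES):
    """Return (size_in_bytes, exceeds_limit) for the file at `path`.

    Hypothetical helper for illustration; it only reads the file size and
    does not open or modify the cluster database.
    """
    size = os.path.getsize(path)
    return size, size > limit_bytes


if __name__ == "__main__":
    # Assumed default location of the cluster database; change if needed.
    system_root = os.environ.get("SystemRoot", r"C:\WINDOWS")
    clusdb_path = os.path.join(system_root, "Cluster", "CLUSDB")

    if os.path.exists(clusdb_path):
        size, too_big = clusdb_size_report(clusdb_path)
        print("CLUSDB size: %d bytes, over limit: %s" % (size, too_big))
    else:
        print("CLUSDB not found at %s" % clusdb_path)
```

Running this on each node only confirms whether the file is over the threshold. It does not tell you whether the timeout described in the KB article is actually what stops the service from starting.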