Clustering Servers Causing System Errors

Status
Not open for further replies.

sandy3269

IS-IT--Management
Mar 23, 2004
18
US
Hi, I'm wondering if anyone has had this problem.

I recently had a server fail. After rebuilding it, I decided it was time to cluster our environment.

After setting the cluster name to CE_PROD and setting all servers in the running system to @CE_PROD, I brought the repaired server, now configured with the cluster name, back online.

Things seem to work for a while, but eventually users can no longer log in, and only a reboot will bring a server back online, which in turn causes the other server to fail. It seems only one server can be up at a time.

Any help on where to start troubleshooting would be appreciated.
 
Hi, are both clustered machines on the same subnet?
Both using the same OS..are all services running on both machines?
Is the CMS database on a separate server - what database is being used?

Why reboot..does the whole server PC stop responding or just one of the services?

Any errors in the Event log of the machine that failed?




To Paraphrase:"The Help you get is proportional to the Help you give.."
 
Will network support please stand up! Turns out I had 2 problems.

1) MDAC problems were causing connections to hang. Upgrading to the current service pack from Microsoft fixed that.

2) When I had technical services rebuild the operating system, they did not bring all the servers up to the same level as the other servers in the cluster. Once that was corrected, the problem was solved.
 
Hmmm, the problem seems to have reappeared.

Further details:

Servers are Windows 2000 with all current Windows patches and updates.

All data sources are stored on MS SQL 2000 clustered instances.

The server environment is a 4 server setup, with services distributed in a mirror configuration.

FRS storage is on a NAS share for failover.

CE10 is UNPATCHED. This is a short term requirement.

Any help is greatly appreciated!

 
Hi,
If it is just the 'logging in' that shows the problem then the issue probably only involves the CMS -
I am not sure using separate, even though clustered, Database servers is viable..I do not know enough about SqlServer's clustering but I know that the exact same data must be accessed by both cms servers; even a slight variance can cause issues..

What error message is produced when the problem rears its ugly head?






 
Hi.

My CE servers are clustered and we also use NAS for our FRS storage. We use SQL Server in a 'fail over' environment for CE. If our main SQL server goes down and fails over to the backup server, the CMS changes to stopping mode. The only way out seems to be a server reboot. Unfortunately I haven't found a way out of that situation yet.
 
Hi,
We have set up 2 separate Oracle schemas (on separate Oracle servers/instances, both in the same subnet, however) for testing our BOEXI upgrade using clustered CMS servers.

(In our clustered v10 setup, the 2 CMS servers use the same Oracle instance. It was designed to prevent communication loss with the CMS if one CMS service went down, not to have fail-over functionality for the CMS data itself.)

It appears that (much like the way 2 sets of input/output FRS servers are handled) the first CMS started is the active one and the second is enabled but inactive. If the first one goes down, the second one becomes active and no interruption in service happens.

The 'View Server Summary' tool will show both CMS servers as Enabled, but only 1 is shown as 'Alive'...
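
As a rough illustration of the active/standby behaviour described above, here is a minimal Python sketch. It is only a toy model of what I observed, not how the CMS is implemented; the `CmsNode` class, the `active_cms` helper and the node names are made up for the example.

```python
from dataclasses import dataclass


@dataclass
class CmsNode:
    """Toy stand-in for one CMS service (hypothetical, for illustration only)."""
    name: str
    enabled: bool = True
    running: bool = False


def active_cms(nodes_in_start_order):
    """Return the first node, in start order, that is enabled and running."""
    for node in nodes_in_start_order:
        if node.enabled and node.running:
            return node
    return None


# Two CMS nodes, both enabled and started; the first one started is active.
first = CmsNode("cms-a", running=True)
second = CmsNode("cms-b", running=True)
cluster = [first, second]

before = active_cms(cluster).name

# Simulate the first CMS going down; the second one takes over.
first.running = False
after = active_cms(cluster).name

print(before, after)
```

Both nodes stay "enabled" throughout; only the one returned by `active_cms` corresponds to what the tool reports as 'Alive'.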



 
Hi...

After some testing, I have found that the above scenario does not appear to work as expected..

If I import new objects into the XI system, it appears that only the CMS database that is active at the time has that data...If, perchance, you access the CMC from the server that has the 'other' CMS database (even when accessing by Cluster name) , the new objects do not appear...

I will look into it more, but, for now, we are reverting to 2 CMS servers using the same Oracle instance..

( It could be a time delay before resyncing, but I doubt it..)





 
By chance, do you have windows network load balancing enabled on these windows servers?

When you log into CMS do all servers show up?
 
Hi,
No ( load balancing)

Yes ( All Servers show)

We are exploring using a load balancing farm for the user access pages ( the 'plain' web servers - just WCA installed)



 
Well, I've narrowed down my issue and will be testing further today. Since I'm working in production, I of course have to be very careful with tests.

I have worked out these tests using VMware instances, but without load on them they produced no satisfying results.

I have methodically started each server on the box I'm rejoining to the cluster, one at a time, leaving time for users to report whether the login issue has returned.

I have now narrowed it down to my 2 Webconnectors (we use 1 for external access with no auth on, and the other for internal NT/AD auth).

I will start these, stop the others on the mirror box, and take a few other steps to validate what's going on.

For peach1240, I had the issue you are talking about and resolved it by setting the restart time on failure for the CMS at the Windows servers screen. This way you can have 3 restart attempts before it hangs. This worked well for our Sql Clustered environment during its failovers.
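
To make the "3 restart attempts" idea concrete, here is a minimal Python sketch of a bounded retry. It is not the Windows mechanism itself, just the same logic in miniature; `restart_with_limit` and the simulated `start_service` callable are hypothetical names for the example.

```python
import time


def restart_with_limit(start_service, max_attempts=3, delay_seconds=0.0):
    """Call start_service() until it returns True or attempts run out.

    Returns the number of attempts used on success, or None if every
    attempt failed (the point at which the service would stay down).
    """
    for attempt in range(1, max_attempts + 1):
        if start_service():
            return attempt
        if attempt < max_attempts:
            # Wait before retrying, e.g. to let a database failover finish.
            time.sleep(delay_seconds)
    return None


# Simulated failover: the database is unreachable for the first two
# attempts and back by the third.
outcomes = iter([False, False, True])
attempts_used = restart_with_limit(lambda: next(outcomes))
print(attempts_used)
```

The delay between attempts matters: it needs to be long enough for the SQL failover to complete, otherwise all three attempts are spent while the database is still unavailable.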
 
ISSUE RESOLVED. IIS AUTHENTICATION SETTING AND WEBCONNECTOR DOMAIN SETTINGS WERE DIFFERENT.

Clustered Server Tests

09:00:47 - Started all services except for the 2 webconnectors. This is part of a test to ascertain whether it is the CMS or just the Webconnectors that are causing the clustering issue.

10:28:39 - Ran tests from my desktop and had Donna also run from hers. No issues identified yet. Issue still points to Webconnectors.

16:37:02 - Ran tests again from my desktop and had Donna also run from hers. Same as above.

2005-11-09 - 13:44:21. Stopped WCS(1) on port 6401, on Bridlemile while starting WCS(1) on port 6411 on Hillside. Also added Hillside WCS 6411 to the webconnector setup on Bridlemile. Now these 2 servers are again in sync.

13:53:41 - Test result: failure. File not found: C:\Program Files\Crystal Decisions\Enterprise 10\Web Content\enterprise10\admin\en\default.csp, on Bridlemile mirror server. Turned off Hillside WCS(2) and system returned to stable functioning.

14:01:44 - Started WCS(1) on Hillside. All webconnectors except for WCS(2) on Hillside are running and the system is stable. I will now delete and recreate WCS(2). Issue still there, so not a corrupted service.

15:29:49 - Read some articles on Tek-Tips and BOB that make me want to go through all settings again. I took another look at the IIS security settings: NT Auth was checked on both servers, but Anonymous was also checked on Hillside. Unchecked it and restarted the WCS(2) on that server. Testing.........

2005-11-09 - 15:41:56. Issue seems to be resolved. I will verify tomorrow.

2005-11-10 - 08:51:29. Logging this issue as closed.
 
2005-11-14 - 08:54:51. This issue was reopened because the errors returned. I have shut off WCS(2) on Hillside and removed the reference from Bridlemile WC to stop it from logging error messages. System functioning fine.
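
For anyone chasing a similar mismatch between mirrored servers, a minimal Python sketch of the comparison I ended up doing by hand at 15:29:49. The setting names and values below are illustrative only (typed in by hand, not read from IIS), and `diff_settings` is a made-up helper.

```python
def diff_settings(a, b):
    """Return {key: (a_value, b_value)} for every key whose values differ.

    A key missing on one side shows up with None for that side.
    """
    keys = sorted(set(a) | set(b))
    return {k: (a.get(k), b.get(k)) for k in keys if a.get(k) != b.get(k)}


# Illustrative values: NT Auth checked on both, Anonymous also checked
# on Hillside only.
bridlemile = {"nt_auth": True, "anonymous": False}
hillside = {"nt_auth": True, "anonymous": True}

mismatches = diff_settings(bridlemile, hillside)
print(mismatches)
```

Writing the settings of each box down side by side like this, rather than eyeballing two dialogs, makes a single stray checkbox much harder to miss.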
 