Redundancy hot-standby failed in HA config rls 7.5 1

mmxmaverix · Jul 11, 2011

Hi Avaya/Nortel experts,
We have an issue with the HA configuration on a CS1000 system. The status in LD 135 is:
cp 0 22 PASS -- ENBL
   SYSTEM STATE = REDUNDANT
   DISK STATE = REDUNDANT
   HEALTH = 14
cp 1 22 PASS -- STDBY
   SYSTEM STATE = REDUNDANT
   DISK STATE = REDUNDANT
   HEALTH = 14

We were testing hot standby, but after we disconnected (or powered off) CPU 0, which was active, the second CPU did not take control of the system and instead went into a sysload state. I think that normally it should take control within a few seconds without any impact on the system. Could you please help me with this issue?

Thank you.

M1spezi · Jul 12, 2011

As a test, you can unplug the ELAN from the active Call Server. The standby CPU will become active and work without interruption.

When you power down the active CS, the inactive CPU becomes active after an INI.

Meridian, Callpilot, CCM (Symposium) from Germany

mmxmaverix · Jul 13, 2011

M1spezi,
If I disconnect the ELAN from the active CPU, the standby CPU takes control of the system in 5-10 seconds without any problem.

But we're testing the situation where the whole cabinet (with the active CPU) suffers a power outage.

NortelGuy1979 · Jul 13, 2011

If the active CPU goes offline (via power in the cabinet, for example), the offline CPU will lose the heartbeat (via HSP) from the online CPU, will sysload, then come into service.

Your heartbeat should not be on the ELAN unless the HSP isn't functioning properly.

Sysloading is normal on the inactive side for it to come into service if the online core connection is completely lost. If it goes into a loop, you have a software problem on that side.

Matthew - Technical Support Engineer Sr.

http://www.linkedin.com/in/matty79

NortelGuy1979 · Jul 13, 2011

Sorry; I was slightly off - here's a description of the switchover states:

Graceful Switchover
In normal operation the health count of each CPU should be equal. If the active CPU detects that the redundant CPU has better health, a graceful switchover is invoked. In this process, almost the entire memory image of the active CPU is copied to the memory of the redundant CPU. The redundant CPU resumes operation where the active CPU left off after going through a post-switchover procedure. This procedure includes sending a gratuitous ARP message to tell the IP network where the active ELAN IP address is now located. This CPU becomes the active side.
The previously active side invokes a warm start after the copying operation is completed. After the warm start, it becomes the redundant side.
During a graceful switchover, there is usually no impact on calls already in progress. New calls are not allowed for a brief period, roughly 6-8 seconds depending on the configuration.
Graceful switchover may be invoked manually using the SCPU command in overlay 135.
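
To make the sequence above concrete, here is a minimal Python sketch of the graceful switchover decision and steps. It is only an illustration of the description, not Call Server code: the `Core` class and both function names are hypothetical, and the real health comparison may involve more than a single number.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Core:
    side: int     # 0 or 1, as in "cp 0" / "cp 1"
    health: int   # health count, as reported by the status in LD 135
    active: bool


def should_switch_gracefully(active: Core, standby: Core) -> bool:
    """Assumption: a graceful switchover is triggered only when the
    redundant side reports strictly better health than the active side."""
    return standby.health > active.health


def graceful_switchover(active: Core, standby: Core) -> List[str]:
    """Return the steps from the description in order and swap the roles."""
    steps = [
        f"copy memory image from core {active.side} to core {standby.side}",
        f"core {standby.side} runs post-switchover (gratuitous ARP for ELAN IP)",
        f"core {active.side} warm starts and becomes the redundant side",
    ]
    active.active, standby.active = False, True
    return steps
```

With equal health counts (for example 14 and 14) the sketch does nothing; a manual SCPU would bypass the health check and go straight to the steps.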

Ungraceful Switchover
When it is decided that the active side is inoperable (e.g. power or processor failure, watchdog timeout, exceptions), the **redundant side warm starts** and takes over control. The switchover does not occur immediately: when the redundant side detects loss of heartbeat, it must wait long enough to be sure that the active side is not simply performing a warm start (INI). The timer used to invoke the ungraceful switchover is on the order of 56 seconds.
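
The waiting rule can be sketched like this in Python. This is a simplified illustration under assumptions: the function name and the returned strings are made up, and 56 seconds is just the documented figure quoted above, not a value read from a real system.

```python
UNGRACEFUL_TIMEOUT_S = 56.0  # approximate figure from the description above


def standby_action(now: float, last_heartbeat: float,
                   timeout: float = UNGRACEFUL_TIMEOUT_S) -> str:
    """Decide what the redundant side does given how long the active
    side has been silent (both times in seconds on the same clock)."""
    silence = now - last_heartbeat
    if silence < timeout:
        # The active side may only be doing a warm start (INI): keep waiting.
        return "wait"
    # Silence lasted too long: warm start and take over control.
    return "warm_start_and_take_over"
```

So 10 seconds of silence gives "wait", while 56 seconds or more gives "warm_start_and_take_over".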

Heartbeat
The two CPUs exchange heartbeats to determine whether the other CPU is reachable over the HSP. The heartbeat protocol also carries the health count of each CPU. If the HSP is disconnected, the heartbeat protocol attempts to traverse the ELAN instead.
If the heartbeat cannot be exchanged between the two CPUs, meaning that the connection over both the HSP and the ELAN is lost, the redundant CPU warm starts to become active after a certain period of time.
By optimizing the timeout and threshold parameters used in retries of the heartbeat mechanism, the ungraceful switchover trigger time is reduced to less than 15 seconds. The optimization in the timing leads to a change in the INI policy. When the active core warm starts, the inactive core also reboots, so no swapping of the cores takes place.
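
As a rough illustration of the path fallback described above, here is a small Python sketch. The function names are hypothetical and the logic is reduced to the two links mentioned in the text; it does not model retries or thresholds.

```python
from typing import Optional


def heartbeat_path(hsp_up: bool, elan_up: bool) -> Optional[str]:
    """Pick the link that carries the heartbeat: HSP first, ELAN as fallback.
    Returns None when neither link can carry it."""
    if hsp_up:
        return "HSP"
    if elan_up:
        return "ELAN"
    return None


def standby_sees_active(hsp_up: bool, elan_up: bool) -> bool:
    """The redundant CPU still hears the active CPU if any path exists.
    When this is False, it warm starts after the timeout and becomes active."""
    return heartbeat_path(hsp_up, elan_up) is not None
```

Pulling only the ELAN cable leaves the heartbeat on the HSP, whereas powering off the active cabinet removes both paths at once.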

So by unplugging the ELAN, the health changed, and you got a "graceful" switchover of 6-8 seconds. The heartbeat was still being carried over the HSP. When you powered down the active call server, you got an ungraceful switchover, which takes up to 56 seconds (so the docs say) and also invokes an INI on the offline side. I'd guess that if it also sysloaded, then something was wrong with the offline side. Make sure both CPUs are patched (patch in redundant mode) and then test again; make sure you can boot off both cores and run off both cores without error.

This is all from the System Redundancy NTPs, in the Campus Redundancy section; the description is the same for a non-campus redundant HA configuration.

Matthew - Technical Support Engineer Sr.

http://www.linkedin.com/in/matty79

mmxmaverix · Jul 15, 2011

Thank you for the answer. You are right, I've found it in the documentation. It's not a sysload as I thought, it's a warm start (INI); I was confused by the VxWorks logo shown during the startup of the standby CPU.
