IPSIs lose connectivity like clockwork

vaoldschl

I've got an interesting one for folks to wrap their brains around. It started last week and recurred this week. CM loses connectivity to the IPSI and the port network (obviously) becomes unavailable. It happens in two systems, one running CM 5.0 and one running 4.3. The 5.0 is an ESS with duplicated IPSIs, and the standby does not take control. The second occurrence happened exactly half an hour later in the day than the first. One minute prior to the IPSI alarm I get an ETH-PT error from one of the CLANs indicating a socket is down and a session lost. The problem is corrected by reseating the IPSIs. All IPSIs are running the latest firmware, the same firmware running on all other systems in production, and only 2 out of the lot are experiencing the problem. Here are the alarms from the 4.3:

ID  MO        Source  On Bd  Lvl  Ack  Date
1   LIC-ERR           n      MAJ  Y    Tue Jul 07 03:35:01 EDT 2009
2   IPMEDPRO  01A12   y      MIN  Y    Mon Jul 06 23:45:22 EDT 2009
3   EXP-PN    PN 01   n      MAJ  Y    Mon Jul 06 23:38:13 EDT 2009
4   TTR-LEV           n      MAJ  Y    Mon Jul 06 23:34:22 EDT 2009
5   EXP-PN    PN 01   n      MAJ  Y    Mon Jul 06 23:34:22 EDT 2009
6   PKT-INT   01A     y      MAJ  Y    Mon Jul 06 23:33:21 EDT 2009

The ETH-PT errors are 2313 (aux 9) and 1538 (aux 0) and happen at 23:32. The 1538 indicates some sessions are down, and the other indicates basically the same thing but is supposed to identify which session. The errors/alarms for the 5.0 are similar in nature and timing, but ETH-PT indicates different session numbers.

I am remote to both systems but might need to make a case for traveling to the sites for resolution. Finally, the 4.3 IPSI and server are plugged into a Cajun that links up to a Cisco at our core. I don't believe the tech who did the install assigned a management IP for the Cajun on our internal network (at least not within the convention of my voice VLAN), so I have been unable to verify settings. I would think, though, that if a network blip of some kind on our LAN or WAN were causing this, the IPSI-to-server communication would continue, as they are essentially segmented from the rest of the network by the Cajun. This is not the case; I can get to the server and ping the CLANs and MedPros but get nothing from the IPSIs.
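
For anyone who wants to script that comparison, a rough sketch along these lines works from any box that can reach the voice VLAN. The board names and addresses below are placeholders, not the real ones from these systems, and it simply shells out to the OS ping command with Linux-style flags:

#!/usr/bin/env python3
# Reachability comparison for voice-equipment addresses (placeholder IPs only).
import subprocess

BOARDS = {
    "CLAN 01A02": "192.168.10.11",     # hypothetical CLAN address
    "MEDPRO 01A03": "192.168.10.12",   # hypothetical MedPro address
    "IPSI 01A01": "192.168.10.5",      # hypothetical IPSI address
}

def responds(ip, count=2, timeout=2):
    # True if the address answers ICMP echo (Linux ping flags; adjust for your OS).
    result = subprocess.run(
        ["ping", "-c", str(count), "-W", str(timeout), ip],
        stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
    return result.returncode == 0

for name, ip in BOARDS.items():
    print(name, ip, "responds" if responds(ip) else "NO RESPONSE")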
 
What exact firmware version are you on? This sounds like a problem that was noticed in FV42.
 
If you log onto the server shell and type pingall -a, do the IPSIs respond?
 
Have you tried connecting directly to the services port?
 
The IPSI does not respond to pingall -a from the shell.

The location is remote so I have not, to this point, had the opportunity to attempt the services port. That would be my first step if I do end up traveling.
 
Can you try pingall -a from the bash prompt on the ESS server that is local to the affected IPSI? If that responds and the main server does not get a response, then look at the WAN or the main server's control networks.
 
I am logging in to the ESS when I attempt the ping command on the 5.0. Also, same behaviour on the 4.3, which is standalone.
 
This is way out there but worth a shot.

Check that the duplex settings on the IPSI match the data switch settings.
 
I get the same problems all the time, and the issue is due to the heartbeats between the controlling servers and the remote IPSIs. The heartbeats go one per second, so if the IPSI loses 3 consecutive heartbeats it does a warm reboot. If it sees the servers again after this reboot, control is re-established. If it still doesn't see the servers, it does a cold reboot and looks to the ESS servers for control until the network is stable enough for the IPSIs to re-register back to the main controlling servers.

I've put a more comprehensive explanation below, but I'm not sure if the diagrams explaining it will appear.


IPSI Sockets and Heartbeats

The CM Server communicates with a Port Network via a TCP socket connection to the IPSI, as shown in Figure 1. This connection is critical to all communications that go through the port network (G650). All control signals for all endpoints and adjuncts that connect through the Port Network are multiplexed and sent via this TCP socket connection.

[Figure 1: Server – Port Network Connection]

The server exchanges heartbeats with the IPSIs every second. An IPSI sanity failure occurs when a heartbeat is missed and no other data has been received from the IPSI during the last second. During a Control Network outage, the server and the IPSIs buffer all downstream and upstream messages in queues. If the socket communication is restored before the IPSI Socket Sanity Timeout is reached, the socket communication resumes and all queued messages are sent. This recovery is represented by Region A in Figure 2.

If the IPSI sanity failures last longer than the IPSI Socket Sanity Timeout setting but shorter than 60 seconds, then recovery actions are initiated, including closing and reopening the socket connection (all downstream and upstream messages buffered in queues are lost), resetting the PKTINT (Packet Interface on the IPSI cards) and performing a warm restart of the affected port network. This recovery is represented by Region B in Figure 2.

If the IPSI sanity failures last longer than 60 seconds, the affected port network goes through a cold restart. This recovery is represented by Region C in Figure 2.
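
As a rough illustration of those thresholds (this is not Avaya's code, just the decision logic described above expressed as a sketch; the sanity timeout value is an assumption and should be whatever is administered on the system):

# Toy model of the recovery regions described above.
SANITY_TIMEOUT_S = 15   # assumed IPSI Socket Sanity Timeout (administrable value)
COLD_RESTART_S = 60     # fixed 60-second threshold from the description above

def recovery_region(outage_seconds):
    # Classify a control-network outage by the expected recovery action.
    if outage_seconds < SANITY_TIMEOUT_S:
        return ("Region A: socket recovers, queued messages are sent, "
                "connections preserved")
    if outage_seconds < COLD_RESTART_S:
        return ("Region B: socket reopened, PKTINT reset, warm restart "
                "of the port network (queued messages lost)")
    return "Region C: cold restart of the port network (calls dropped)"

for outage in (5, 30, 90):
    print(outage, "s outage ->", recovery_region(outage))
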
If an alternate control path to the affected port network is available and viable, an interchange to the alternate control path is made after 3 seconds of IPSI sanity failures. The alternate control path is either:

• Secondary IPSI in the same port network, or
• Fiber connection via an ATM switch or Centre Stage Switch

If both primary and secondary IPSI connections have concurrent network outages (most likely due to non-diverse-path routing), the secondary IPSI connection is not viable and thus not available for interchange.

Recovery Behaviour in Region A

If the Control Network outage is shorter than the IPSI Socket Sanity Timeout (Region A in Figure 2), the upstream and downstream data that were blocked and buffered due to the network outage will resume flowing after the TCP recovery. All connections that go through the port network are preserved. All messages are buffered and sent with a delay due to the network outage and recovery. See Table 1 for more details on recovery behaviour in Region A. Refer to Figure 3 when reading Table 1. Note that there are two port networks in this example. The network outage happens in the WAN. The port network at the remote site is affected by the network outage, but the port network at the main site is not affected by the network outage.
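
Just to illustrate the buffer-and-flush idea in Region A (queue depths and message names here are made up; the real queues live on the server and the IPSI):

# Illustration only: messages are buffered during the outage and flushed,
# in order, once the socket recovers within the sanity timeout (Region A).
from collections import deque

class BufferedLink:
    def __init__(self):
        self.up = True
        self.queue = deque()
        self.delivered = []

    def send(self, msg):
        # Deliver immediately if the link is up, otherwise buffer the message.
        if self.up:
            self.delivered.append(msg)
        else:
            self.queue.append(msg)

    def outage(self):
        self.up = False

    def restore(self):
        # Region A recovery: the link returns before the sanity timeout,
        # so everything queued is still sent, only delayed.
        self.up = True
        while self.queue:
            self.delivered.append(self.queue.popleft())

link = BufferedLink()
link.send("call setup 1")
link.outage()
link.send("call setup 2")   # buffered during the network blip
link.restore()
print(link.delivered)       # both messages delivered, in order
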
Recovery Behaviour in Region B

If the Control Network outage is longer than the IPSI Socket Sanity Timeout period but shorter than 60 seconds (Region B in Figure 2), then the port network goes through a warm restart and the Packet Interface (PKTINT) is reset. This results in lost upstream and downstream messages, LAPD links being reset, and C-LAN socket connections being closed and reopened. Most stable calls will stay connected. Calls in transition may be lost. See Table 2 for more details on recovery behaviour in Region B.

Recovery Behaviour in Region C

If the Control Network outage is longer than 60 seconds, the recovery of the affected port network requires a cold restart of the port network. This drops all calls going through the affected port network. Only the shuffled IP-to-IP calls that are not using any port network resources will stay connected until the user drops the call. See Table 3 for more details on recovery behaviour in Region C.

[Started on Version 3 software 15 years ago]
 
Look for PSN001380u on the Avaya website. Most likely you have one analog board in your system which is causing this.
The first thing you can try is disabling test number 51 from Communication Manager to see if the problem stops occurring.

 
I don't have any TN2215s in the PN so that isn't the problem.

inerguard, I get the whole IPSI <=> CM heartbeat thing and can see that happening in the logs. The thing I don't get is that there's no recovery. Your diagrams didn't come through, but I'm guessing that's where the cold restart needs to happen, and I'm accomplishing that by restarting only the IPSIs, not the entire PN. But if I had a network outage lasting more than 60 seconds, our NOC would see more than just the PBX crashing. Also, if something were causing the network to go down every Monday night/Tuesday morning, why would it (at both locations) increment itself by exactly half an hour?

I have redundant IPSIs in the 5.0 location and am going to verify all the port settings today as well as make sure there is as much redundancy in the network as I can. I am anticipating that tonight at 12:45 AM the proverbial will once again hit the fan.
 
Update: I still haven't seen the root cause of the original outages, but there was no outage this week. The logs show no sanity check failures except from where I moved the standby IPSI to a new blade on our switch at the 5.0 location. Fingers crossed. I continue to monitor and wait for Tier 4 to tell me something I don't already know. This weekend we upgrade firmware, then I'm thinking about moving to the latest and greatest dot upgrade.
 
I have a pair of IPSIs that need their firmware upgraded. I was able to upgrade one, but the other (which is now the standby) won't let me upgrade it. I have had various messages.

IPSI Related Error: sftp cmd failed; try to run ssh: (null)
This is the most common error.

Or this morning it says: Security Operation Error: Download Stop

Any ideas?

M1kep
 
What type of call volume traffic do you have across your WAN link? We had an issue where the WAN link was engineered using the G.729 codec but was implemented using G.711. We had issues during peak call volumes where the bandwidth would max out with call traffic and the IPSIs would lose connection to the main server long enough to drop everything, then re-establish the connection.

Not sure if that's the same issue, but it might be worth checking bandwidth/codec settings.
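
For a rough sense of the difference, a back-of-the-envelope comparison (assuming 20 ms packetization and 40 bytes of IP/UDP/RTP header per packet; layer-2 and signalling overhead are ignored, so treat the numbers as approximate):

# Approximate per-call WAN bandwidth for G.711 vs G.729 (illustrative only).
def call_bandwidth_kbps(codec_kbps, ptime_ms=20.0, header_bytes=40):
    # header_bytes = IP (20) + UDP (8) + RTP (12); layer 2 is ignored here.
    packets_per_second = 1000.0 / ptime_ms
    payload_bytes = codec_kbps * 1000.0 / 8.0 / packets_per_second
    return (payload_bytes + header_bytes) * packets_per_second * 8.0 / 1000.0

g711 = call_bandwidth_kbps(64.0)   # roughly 80 kbps per call
g729 = call_bandwidth_kbps(8.0)    # roughly 24 kbps per call
print("G.711: %.0f kbps/call, G.729: %.0f kbps/call" % (g711, g729))
wan_kbps = 1544.0                  # e.g. a single T1, purely as an example link
print("Calls that fit: G.711 ~%d, G.729 ~%d" % (wan_kbps // g711, wan_kbps // g729))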
 
Turns out our Security Analysts implemented a new scan and didn't exclude the voice equipment subnets. When the scan touched the IPSI boards they went blind and brought down the PNs. I found the IP address of the scanner using the Logmanager Debug and lots of coffee. Thanks for all the input.
 