IPSIs lose connectivity like clockwork

vaoldschl

I've got an interesting one for folks to wrap their brains around. It started last week and recurred this week. CM loses connectivity to the IPSI and the port network (obviously) becomes unavailable. It happens in two systems, one running CM 5.0 and one running 4.3. The 5.0 is an ESS with duplicated IPSIs, and the standby does not take control. The second occurrence happened exactly half an hour later in the day than the first. One minute prior to the IPSI alarm I get an ETH-PT error from one of the CLANs indicating a socket is down and a session lost. The problem is corrected by reseating the IPSIs. All IPSIs are running the latest firmware, the same firmware running on all other systems in production, and only 2 out of the lot are experiencing the problem. Here are the alarms from the 4.3:

ID  MO        Source  On Bd  Lvl  Ack  Date
1   LIC-ERR           n      MAJ  Y    Tue Jul 07 03:35:01 EDT 2009
2   IPMEDPRO  01A12   y      MIN  Y    Mon Jul 06 23:45:22 EDT 2009
3   EXP-PN    PN 01   n      MAJ  Y    Mon Jul 06 23:38:13 EDT 2009
4   TTR-LEV           n      MAJ  Y    Mon Jul 06 23:34:22 EDT 2009
5   EXP-PN    PN 01   n      MAJ  Y    Mon Jul 06 23:34:22 EDT 2009
6   PKT-INT   01A     y      MAJ  Y    Mon Jul 06 23:33:21 EDT 2009

The ETH-PT errors are 2313 (aux 9) and 1538 (aux 0) and happen at 23:32. The 1538 indicates some sessions are down, and the other indicates basically the same thing but is supposed to identify which session. The errors/alarms for the 5.0 are similar in nature and timing, but ETH-PT indicates different session numbers.

I am remote to both systems but might need to make a case for traveling to the sites for resolution. Finally, the 4.3 IPSI and server are plugged into a Cajun that links up to a Cisco at our core. I don't believe the tech who did the install assigned a management IP for the Cajun on our internal network (at least not within the convention of my voice VLAN), so I have been unable to verify settings. I would think, though, that if a network blip of some kind on our LAN or WAN were causing this, the IPSI-to-server communication would continue, as they are essentially segmented from the rest of the network by the Cajun. This is not the case; I can get to the server and ping the CLANs and MedPros but get nothing from the IPSIs.
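
For anyone who wants to script that comparison, a rough sketch along these lines works from any box that can reach the voice VLAN. The board names and addresses below are placeholders, not the real ones from these systems, and it simply shells out to the OS ping command with Linux-style flags:

#!/usr/bin/env python3
# Reachability comparison for voice-equipment addresses (placeholder IPs only).
import subprocess

BOARDS = {
    "CLAN 01A02": "192.168.10.11",     # hypothetical CLAN address
    "MEDPRO 01A03": "192.168.10.12",   # hypothetical MedPro address
    "IPSI 01A01": "192.168.10.5",      # hypothetical IPSI address
}

def responds(ip, count=2, timeout=2):
    # True if the address answers ICMP echo (Linux ping flags; adjust for your OS).
    result = subprocess.run(
        ["ping", "-c", str(count), "-W", str(timeout), ip],
        stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
    return result.returncode == 0

for name, ip in BOARDS.items():
    print(name, ip, "responds" if responds(ip) else "NO RESPONSE")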
 
What exact firmware version are you on? This sounds like a problem that was noticed in FV42.
 
If you log onto the server shell and type pingall -a, do the IPSIs respond?
 
Have you tried connecting directly to the services port?
 
The IPSI does not respond to pingall -a from the shell.

The location is remote so I have not, to this point, had the opportunity to attempt the services port. That would be my first step if I do end up traveling.
 
Can you try pingall -a from the bash prompt on the ESS server that is local to the affected IPSI? If that responds and the main server does not get a response, then look at the WAN or the main server's control networks.
 
I am logging in to the ESS when I attempt the ping command on the 5.0. Also, same behaviour on the 4.3, which is standalone.
 
This is way out there but worth a shot.

Check that the duplex settings on the IPSI match the data switch settings.
 
I get the same problems all the time, and the issue is due to the heartbeats between the controlling servers and the remote IPSIs. The heartbeats go one per second, so if the IPSI loses 3 consecutive heartbeats it does a warm reboot. If it sees the servers again after this reboot, control is re-established. If it still doesn't see the servers, it does a cold reboot and looks to the ESS servers for control until the network is stable enough for the IPSIs to re-register back to the main controlling servers.

I've put a more comprehensive explanation below, but I'm not sure if the diagrams explaining it will appear.


IPSI Sockets and Heartbeats

The CM Server communicates with a Port Network via a TCP socket connection to the IPSI, as shown in Figure 1. This connection is critical to all communications that go through the port network (G650). All control signals for all endpoints and adjuncts that connect through the Port Network are multiplexed and sent via this TCP socket connection.

[Figure 1: Server – Port Network Connection]

The server exchanges heartbeats with the IPSIs every second. An IPSI sanity failure occurs when a heartbeat is missed and no other data has been received from the IPSI during the last second. During a Control Network outage, the server and the IPSIs buffer all downstream and upstream messages in queues. If the socket communication is restored before the IPSI Socket Sanity Timeout is reached, the socket communication resumes and all queued messages are sent. This recovery is represented by Region A in Figure 2.

If the IPSI sanity failures last longer than the IPSI Socket Sanity Timeout setting but shorter than 60 seconds, then recovery actions are initiated, including closing and reopening the socket connection (all downstream and upstream messages buffered in queues are lost), resetting the PKTINT (Packet Interface on the IPSI cards) and performing a warm restart of the affected port network. This recovery is represented by Region B in Figure 2.

If the IPSI sanity failures last longer than 60 seconds, the affected port network goes through a cold restart. This recovery is represented by Region C in Figure 2.
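
As a rough illustration of those thresholds (this is not Avaya's code, just the decision logic described above expressed as a sketch; the sanity timeout value is an assumption and should be whatever is administered on the system):

# Toy model of the recovery regions described above.
SANITY_TIMEOUT_S = 15   # assumed IPSI Socket Sanity Timeout (administrable value)
COLD_RESTART_S = 60     # fixed 60-second threshold from the description above

def recovery_region(outage_seconds):
    # Classify a control-network outage by the expected recovery action.
    if outage_seconds < SANITY_TIMEOUT_S:
        return ("Region A: socket recovers, queued messages are sent, "
                "connections preserved")
    if outage_seconds < COLD_RESTART_S:
        return ("Region B: socket reopened, PKTINT reset, warm restart "
                "of the port network (queued messages lost)")
    return "Region C: cold restart of the port network (calls dropped)"

for outage in (5, 30, 90):
    print(outage, "s outage ->", recovery_region(outage))
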
If an alternate control path to the affected port network is available and viable, an interchange to the alternate control path is made after 3 seconds of IPSI sanity failures. The alternate control path is either:

• Secondary IPSI in the same port network, or
• Fiber connection via an ATM switch or Centre Stage Switch

If both primary and secondary IPSI connections have concurrent network outages (most likely due to non-diverse-path routing), the secondary IPSI connection is not viable and thus not available for interchange.

Recovery Behaviour in Region A

If the Control Network outage is shorter than the IPSI Socket Sanity Timeout (Region A in Figure 2), the upstream and downstream data that were blocked and buffered due to the network outage will resume flowing after the TCP recovery. All connections that go through the port network are preserved. All messages are buffered and sent with a delay due to the network outage and recovery. See Table 1 for more details on recovery behaviour in Region A. Refer to Figure 3 when reading Table 1. Note that there are two port networks in this example. The network outage happens in the WAN. The port network at the remote site is affected by the network outage, but the port network at the main site is not affected by the network outage.
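
Just to illustrate the buffer-and-flush idea in Region A (queue depths and message names here are made up; the real queues live on the server and the IPSI):

# Illustration only: messages are buffered during the outage and flushed,
# in order, once the socket recovers within the sanity timeout (Region A).
from collections import deque

class BufferedLink:
    def __init__(self):
        self.up = True
        self.queue = deque()
        self.delivered = []

    def send(self, msg):
        # Deliver immediately if the link is up, otherwise buffer the message.
        if self.up:
            self.delivered.append(msg)
        else:
            self.queue.append(msg)

    def outage(self):
        self.up = False

    def restore(self):
        # Region A recovery: the link returns before the sanity timeout,
        # so everything queued is still sent, only delayed.
        self.up = True
        while self.queue:
            self.delivered.append(self.queue.popleft())

link = BufferedLink()
link.send("call setup 1")
link.outage()
link.send("call setup 2")   # buffered during the network blip
link.restore()
print(link.delivered)       # both messages delivered, in order
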
Recovery Behaviour in Region B

If the Control Network outage is longer than the IPSI Socket Sanity Timeout period but shorter than 60 seconds (Region B in Figure 2), then the port network goes through a warm restart and the Packet Interface (PKTINT) is reset. This results in lost upstream and downstream messages, LAPD links being reset, and C-LAN socket connections being closed and reopened. Most stable calls will stay connected. Calls in transition may be lost. See Table 2 for more details on recovery behaviour in Region B.

Recovery Behaviour in Region C

If the Control Network outage is longer than 60 seconds, the recovery of the affected port network requires a cold restart of the port network. This drops all calls going through the affected port network. Only the shuffled IP-to-IP calls that are not using any port network resources will stay connected until the user drops the call. See Table 3 for more details on recovery behaviour in Region C.

[Started on Version 3 software 15 years ago]
 
Look for PSN001380u on the Avaya website. Most likely you have one analog board in your system which is causing this.
The first thing you can try is disabling test number 51 from Communication Manager to see if the problem stops occurring.

 
I don't have any TN2215s in the PN so that isn't the problem.

inerguard, I get the whole IPSI <=> CM heartbeat thing and can see that happening in the logs. The thing I don't get is that there's no recovery. Your diagrams didn't come through, but I'm guessing that's where the cold restart needs to happen, and I'm accomplishing that by restarting only the IPSIs, not the entire PN. But if I had a network outage lasting more than 60 seconds, our NOC would see more than just the PBX crashing. Also, if something were causing the network to go down every Monday night/Tuesday morning, why would it (at both locations) increment itself by exactly half an hour?

I have redundant IPSIs in the 5.0 location and am going to verify all the port settings today as well as make sure there is as much redundancy in the network as I can. I am anticipating that tonight at 12:45 AM the proverbial will once again hit the fan.
 
Update: I still haven't seen the root cause of the original outages, but there was no outage this week. The logs show no sanity check failures except from where I moved the standby IPSI to a new blade on our switch at the 5.0 location. Fingers crossed. I continue to monitor and wait for Tier 4 to tell me something I don't already know. This weekend we upgrade firmware, then I'm thinking about moving to the latest and greatest dot upgrade.
 
I have a pair of IPSIs that need their firmware upgraded. I was able to upgrade one, but the other (which is now the standby) won't let me upgrade it. I have had various messages.

IPSI Related Error: sftp cmd failed; try to run ssh: (null)
This is the most common error.

Or this morning it says: Security Operation Error: Download Stop

Any ideas?

M1kep
 
What type of call volume traffic do you have across your WAN link? We had an issue where the WAN link was engineered using the G.729 codec but was implemented using G.711. We had issues during peak call volumes where the bandwidth would max out with call traffic and the IPSIs would lose connection to the main server long enough to drop everything, then re-establish the connection.

Not sure if that's the same issue, but it might be worth checking bandwidth/codec settings.
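
For a rough sense of the difference, a back-of-the-envelope comparison (assuming 20 ms packetization and 40 bytes of IP/UDP/RTP header per packet; layer-2 and signalling overhead are ignored, so treat the numbers as approximate):

# Approximate per-call WAN bandwidth for G.711 vs G.729 (illustrative only).
def call_bandwidth_kbps(codec_kbps, ptime_ms=20.0, header_bytes=40):
    # header_bytes = IP (20) + UDP (8) + RTP (12); layer 2 is ignored here.
    packets_per_second = 1000.0 / ptime_ms
    payload_bytes = codec_kbps * 1000.0 / 8.0 / packets_per_second
    return (payload_bytes + header_bytes) * packets_per_second * 8.0 / 1000.0

g711 = call_bandwidth_kbps(64.0)   # roughly 80 kbps per call
g729 = call_bandwidth_kbps(8.0)    # roughly 24 kbps per call
print("G.711: %.0f kbps/call, G.729: %.0f kbps/call" % (g711, g729))
wan_kbps = 1544.0                  # e.g. a single T1, purely as an example link
print("Calls that fit: G.711 ~%d, G.729 ~%d" % (wan_kbps // g711, wan_kbps // g729))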
 
Turns out our Security Analysts implemented a new scan and didn't exclude the voice equipment subnets. When the scan touched the IPSI boards they went blind and brought down the PNs. I found the IP address of the scanner using the Logmanager Debug and lots of coffee. Thanks for all the input.
 