Remote office G650 resets 1

ck459 · Jun 30, 2006

Hi Group,

First of all I would like to say that I am not an Avaya specialist, so it could be that I am saying things that do not make a lot of sense. Apologies for that.

We have a topology where an S8710 in our datacentre communicates with several remote G650 gateways (with IPSI, CLAN, MEDPRO). The interconnecting network is an MPLS cloud, and all signalling traffic (IPSI to S8710) is prioritized (strict priority queue).

From time to time (on average once every 2 weeks, at random times) we experience resets of the remote G650 gateways, although we do not see disconnects at the network level. All our monitoring systems see the gateways as active (we ping at an interval of 1 second).

We have been told that if the G650 loses its connection to the S8710 for more than 3 seconds, it automatically resets (after 28 seconds). Latency of the MPLS network is 20ms, with peaks to 100ms.

There is also a G650 in the same location as the S8710, and this one does not lose connectivity. (latency is 1ms here)

Once again, we do not see an outage of any kind on the network (from data perspective).

My question is: are we overlooking something? Is there a setting that we have to tweak when the signalling passes over a WAN? Can we increase the timers, so that the gateways do not lose connectivity that fast?

Any input is welcome, as our Avaya guys are blaming the network.

Thanks

Kurt

MaxEMEA · Jul 1, 2006

Kurt,

In the log files of CM there will more than likely be 'Sanity' errors.
The CM maintains constant communication with the IPSI via the PCD (Packet Control Driver) and, to keep it simple, sends a 'Heartbeat' to the IPSIs every 1 second, to which the IPSI MUST respond. If the PCD misses three responses in a row, the PCD tears down the TCP sockets and drives any interchanges to keep the PN online.
Then an immediate recovery process begins, and when the sockets are restored the IPSI will reset and come online. If this is the only IPSI in the PN, the PN WILL restart.
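
To make the "three in a row" rule concrete, here is a minimal Python sketch of the counting logic as described above. It is only an illustration of the rule, not Avaya's actual PCD code; the function name and the list-of-booleans input are made up for the example.

```python
# Illustration only: models the rule described in the post (one heartbeat
# per second, failure on three consecutive missed responses).
# The names below are hypothetical, not part of any Avaya software.
MAX_MISSED = 3  # from the post: three missed responses in a row


def first_sanity_failure(responses):
    """Return the index of the heartbeat on which the third consecutive
    miss occurs, or None if that never happens.

    `responses` holds one boolean per heartbeat: True if the IPSI
    answered, False if the response was missed.
    """
    missed = 0
    for index, answered in enumerate(responses):
        if answered:
            missed = 0  # any answer resets the counter
        else:
            missed += 1
            if missed == MAX_MISSED:
                return index
    return None


# Isolated misses are tolerated; three in a row are not.
print(first_sanity_failure([True, False, False, True, False, True]))
print(first_sanity_failure([True, False, False, False, True]))
```

With one heartbeat per second, the index is roughly the number of seconds since the first heartbeat in the list, which is why an interruption of about 3 seconds is already enough.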

If you have dual IPSIs and the PN is resetting, then BOTH IPSIs missed their heartbeats, which is a sure sign there has been a network issue. The problem is that the data world doesn't consider 3 seconds to be an outage.

Things to check:
IPSIs locked to 100 Mbps FULL DUPLEX
If running QoS, ensure the QoS settings are correct on the IPSI. (Configured on the IPSI itself)

Hope this helps.

ck459 · Jul 2, 2006

Hi MaxEMEA

First of all, thanks for your response. This is indeed what Avaya told us as well (sanity errors), and you are absolutely right that 3 seconds in the data world is not a real outage. We have therefore modified our monitoring tool so that it polls every second. We don't see an outage, however...

I have noticed that when the problem occurs, there is a higher delay on the network (100ms range instead of 20ms range). Could this cause these errors?

Most important question: is there a way to modify this heartbeat interval, or how many heartbeats can be missed before such a 'sanity error' happens? Avaya told us no, but I find that hard to believe... Or is there maybe another workaround?

Thanks

Kurt

nohuhu · Jul 2, 2006

ck459,

there's no way to modify the heartbeat interval, because it would defeat the integrity and operation of the phone system. unlike the ordinary data transmission world, telephony operates in REAL TIME and it has to be so, otherwise it just wouldn't work.

and yes, the increase in delay can cause these errors. the best you can do is tune your network so that the control network traffic (between media servers and ipsi boards) has the absolute highest priority over all other kinds of data. don't be afraid, the control network traffic itself is not too bandwidth-consuming, so it will not paralyze your network.

it's just very sensitive to delays. if the remote mpls circuits have less than 2 Mb/s of bandwidth you should consider packet fragmentation as well.

rjblanch · Jul 3, 2006

Hi all,

I am glad to see that someone got an answer out of Avaya. We had asked Avaya here in Aus and the only thing we got back was "must be a network issue". Nothing apart from that.

Did you get any other errors associated with these, or was it just the PNs dropping off? We had a lot of errors and alarms, for example EAL, RSL, RNG-GEN, PFL and MPCL. I am just wondering if we are suffering from the same issue.

Thanks.

nohuhu · Jul 4, 2006

MaxEMEA,

can you tell me in more detail what is to be found in the system logs and where, to confirm that it's a sanity error and nothing else? i have a customer with a similar problem, one remote epn resets spontaneously. no communication manager errors are logged, and there's just too much data in the system logs in the web interface, i can't find it.

MaxEMEA · Jul 4, 2006

Dwalin,
This is what you'll see in the logs:
06253: 20060626:121121826:1106242:pcd(16553):HIGH:[[5:0] checkSlot: sanity failure (1)]
06257: 20060626:121122827:1106246:pcd(16553):HIGH:[[5:0] checkSlot: sanity failure (2)]
06266: 20060626:121123826:1106255:pcd(16553):HIGH:[[5:0] checkSlot: too many sanity failures]

You'll see:
Message no, date/time, sequence no, message type (pcd), priority, IPSI location ([5:0] = IPSI 6a), process (checkSlot), error (sanity failure), and (x) = which heartbeat was missed.

So, you can see from the log that IPSI 6a missed three heartbeats in a row and failed (too many sanity failures). This is when the PCD will interchange the IPSIs and try to recover the failed one.
If you are seeing checkSlot errors against both IPSIs at the same time, then the PN will reset, and that is a sure sign the PCD message didn't get through.....network!

Just to explain the IPSI location [x:y]..
X = PN - 1
Y = A or B IPSI. 0 = A and 1 = B
So, [5:0] = 6a, get it?
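
For anyone who wants to decode these tags in a script, the mapping above can be sketched in a few lines of Python. The function name is made up for this example; only the [x:y] rule itself comes from the explanation above.

```python
import re


def decode_ipsi_location(tag):
    """Turn a log tag such as "[5:0]" into a label such as "6a".

    Rule from the post: X = PN - 1, and Y is 0 for the A IPSI or 1 for
    the B IPSI. The function name is hypothetical.
    """
    match = re.fullmatch(r"\[(\d+):([01])\]", tag.strip())
    if match is None:
        raise ValueError(f"not an IPSI location tag: {tag!r}")
    port_network = int(match.group(1)) + 1  # X is zero-based
    side = "ab"[int(match.group(2))]        # 0 -> a, 1 -> b
    return f"{port_network}{side}"


print(decode_ipsi_location("[5:0]"))  # 6a
```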

Do a search in the logs for 'checkSlot' and you'll find them but remember the uppercase 'S' in checkSlot.

You can also FTP the log file to your local machine and use 'windows grep'(Just search on google) to search the logs.
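
If you would rather script the search than use a grep tool, something like the following Python sketch would do. It assumes only that each relevant line contains the "[x:y] checkSlot: ..." part shown in the excerpt; the function name and the sample lines are made up for illustration, and the match is case-sensitive, as noted above.

```python
import re
from collections import defaultdict

# Matches the "[x:y] checkSlot: <error text>" part of a log line.
# Case-sensitive on purpose: note the uppercase 'S' in checkSlot.
CHECKSLOT = re.compile(r"\[(\d+):([01])\] checkSlot: ([^\]]+)")


def checkslot_events(lines):
    """Group checkSlot messages by IPSI (e.g. "6a"), in log order."""
    events = defaultdict(list)
    for line in lines:
        match = CHECKSLOT.search(line)
        if match:
            ipsi = f"{int(match.group(1)) + 1}{'ab'[int(match.group(2))]}"
            events[ipsi].append(match.group(3).strip())
    return dict(events)


# Sample fragments modelled on the excerpt above (not a real log file).
sample = [
    "HIGH:[[5:0] checkSlot: sanity failure (1)]",
    "HIGH:[[5:0] checkSlot: sanity failure (2)]",
    "some unrelated log line",
    "HIGH:[[5:0] checkSlot: too many sanity failures]",
]
print(checkslot_events(sample))
```

To run it against a downloaded log, pass an open file object instead of the sample list. If more than one IPSI of the same PN shows up with failures at the same time, that is the "both IPSIs" case described above.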

Hope this helps.

rjblanch · Jul 4, 2006

Can you please let us know which log I should look in for this error?

I have gone to the web interface and I am not sure which log to look into.

Thanks.

ck459 · Jul 4, 2006

Hi,

This is indeed what we are seeing in the logs. There is no other error, so only sanity failures.

Now, today something really odd happened: at 11am we noticed that there was a network outage at the central location (our old-fashioned data network monitoring tool was detecting an outage, wow). This outage lasted for 4 seconds, and after that, connectivity was restored.

We opened a conf call with Avaya, and they told us there was no outage on any of the PNs (8 in total). So no sanity check errors in the log!!

This is where I start to be very suspicious of Avaya: before, we did not notice any outage, yet all PNs except the local one went down; today nothing went down, although there was a full outage on that site!!!

I cannot believe this product is designed like this, and I have asked the Avaya tech support to escalate this to a higher level, as this is abnormal behaviour to me.

Anyway, I just wanted to express my frustration, as all signalling traffic between IPSI and S8710 is prioritized in a strict priority queue, and there is more voice traffic going over these links that is not experiencing any problem (except for the outage of this morning, of course).

I cannot believe there are no other companies experiencing the same problem, and I would like to know from this board if there are other people/companies experiencing similar problems (apart from the people who have already replied).

Regards

Kurt

MaxEMEA · Jul 4, 2006

Check the 'Logmanager debug trace' and search for 'checkSlot'.

MaxEMEA · Jul 4, 2006

One way to prove it once and for all.......Put a sniffer between the control network port of the server and LAN and one at the remote end between the IPSI and the LAN.

That way you'll see the PCD message go in and come out.

nohuhu · Jul 6, 2006

MaxEMEA,

thanks a lot, i did find the sanity errors in the log. but the strange thing is that the ipsi board does warm resets (level 1) and calls do not drop! however, tn2214b vintage 12 circuit packs do not survive the warm reset and go out of service for up to twenty minutes. that's why these resets were noticed in the first place... very weird case.
also i second ck459, there's something wrong with sanity heartbeats in the 3.1.1 software. i've been monitoring the network performance for quite a long time and it's just like clockwork, round trip time of icmp echo packets (ping) from the media server to the guilty ipsi was never more than 6 milliseconds, 2 ms on average. i can't imagine that such a small delay can cause a missing heartbeat.

i'll be opening a case with avaya on that, ck459 please update this thread if you get any results from avaya. i'll post anything i get also.

jimtr · Jul 6, 2006

dwalin,

I have the same problem with one of my customers. The difference is that the PN that does the resets is not a remote one.
Anyway, the Avaya engineer that is on this told me that if the connection to the IPSI is restored within 60 seconds since the third lost heartbeat then the pn will only undergo a warm restart.
When connection to the IPSI is restored after 60 seconds the PN will do a COLD restart in order to return to normal operation. This is required because after 60sec the PN has got out of sync with the Server so much that WARM restart would not be sufficient. When COLD restart takes place all circuit packs in the PN are reset, calls get dropped.
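
As a quick illustration of the rule the engineer described, here is a small Python sketch. The 60-second figure comes from his explanation; how exactly 60 seconds is treated is my assumption (counted as warm here), and the function name is made up.

```python
WARM_WINDOW_S = 60  # per the Avaya engineer: restored within 60 s -> warm


def restart_type(seconds_until_restored):
    """Return "warm" or "cold" for a given time between the third lost
    heartbeat and the moment the IPSI connection is restored.

    Assumption: exactly 60 seconds is treated as still within the window.
    """
    if seconds_until_restored < 0:
        raise ValueError("duration cannot be negative")
    return "warm" if seconds_until_restored <= WARM_WINDOW_S else "cold"


print(restart_type(14))  # warm: calls are preserved
print(restart_type(90))  # cold: circuit packs reset, calls dropped
```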
We are still searching for the root cause of this.
I noticed that on the switch (Avaya C360) that we use for the connectivity between the servers and the IPSI (dedicated control network), when I issue the command "show module 1" I see that the status for "Fans" is failed, so maybe the switch overheats from time to time and shuts down until the temperature falls below a threshold.
If this is resolved I will post the outcome here.

nohuhu · Jul 6, 2006

jimtr,

thank you, your input is very much appreciated. it seems that we're experiencing the same as you, epns are being reset warmly and it seems that ipsi connectivity restores before 60 seconds.
however, we can't possibly have any problems with switches and routers, it was the first thing we checked and rechecked. the network works like a swiss watch, no problems whatsoever. i started a ping session on the media server to check ipsi connectivity, and in this time interval not one packet was lost, average round trip delay was 2.5 milliseconds and maximum delay was 9 ms. however, media server reported one sanity error during this time. i wonder how a 9 ms delay can affect ipsi operation?! time to get a baseball bat and ask avaya for an answer.

mdbnh · Jul 18, 2006

Hi, Did you ever find a solution?

nohuhu · Jul 19, 2006

mdbnh,

yes, but it was a network issue after all, however a very nasty one, and it was very hard to find.

servertek · Jul 31, 2006

dwalin,
having similar issues, can you elaborate?

nohuhu · Jul 31, 2006

servertek,

yes. this customer has several offices all over the city. all offices are connected to the same provider, with a metropolitan area network (ethernet) as the main connection and serial links as backup. all of a sudden the main connection would fail and the cisco routers would switch over to the backup one. from the data network point of view the transition goes smoothly (not one packet is lost), but from the ipsi point of view it looks like three consecutive heartbeats are lost on the tcp socket (i think the tcp sequence numbers didn't match, we didn't go too deep into this) and it goes to reboot. after fourteen seconds or so it restores the connection to the media servers and performs a warm reboot instead of a cold one. we haven't found a solution to the network outage problem but did move the ipsi traffic to the more stable backup links. the problem is still here but it does not affect business because the resets are warm and call preserving.
we wouldn't have known about this problem in the first place if not for the tn2214b vintage 12 circuit packs: these boards didn't survive the warm reset and went out of service for up to 20 minutes after it. we had to replace these boards with the more recent tn2214cp and the problem was solved. of sorts.
