Bad audio quality when call volume reaches 400-600 calls


redyps90210

IS-IT--Management
May 18, 2015
Hi guys,

We have an Avaya Aura system set up that is fully IP. Calls come in to our Avaya SBC-E via SIP trunks (100 Mbps), then go to Session Manager and then to Communication Manager.

However, when incoming calls reach around 400, we experience bad voice quality. At first we thought it could be a capacity issue, but new calls don't drop at that point.

I set up a daily task in Avaya Site Administration to collect the trunk and DSP measurements. But we can't investigate the issue until it happens again, which is very random; it's usually during peak seasons or when there's an outage in the customer's network.

Any thoughts on what else we can check? I know this is a very broad question, but I hope someone can suggest some steps. So far we have done the following:


1. Checked the Ethernet ports between the SBC, SM and CM.
- All are within the same location, so this is very unlikely; only a switch (or two) separates them.
- That leaves either the ingress of the SBC or the end of the call flow, which is the CM.

2. For the ingress SIP traffic, even if the link carries both G.711 (95.2 kbps per call) and G.729 (39.2 kbps per call), it can still handle the calls. If we max it out, say using the G.711 bandwidth of around 95.2 kbps per call, the 100 Mbps link can handle about 1050 concurrent calls (a rough check of this math is sketched just after this list).

3. That leaves a possible issue within the CM. But I checked the DSP resources and there is plenty of headroom (around 1292 DSPs) distributed across 4 G450s in the same network region.
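
Below is a rough sanity check of the bandwidth figures in point 2. It is only a sketch: it assumes the quoted per-call rates (95.2 kbps for G.711, 39.2 kbps for G.729) already include packet overhead and that the whole 100 Mbps is available for media.

# Rough link-capacity check using the per-call figures quoted above.
# Assumes those figures already include IP/UDP/RTP overhead and that the
# full 100 Mbps is available to media (no policer, no data traffic).

LINK_KBPS = 100_000          # 100 Mbps SIP trunk link

PER_CALL_KBPS = {
    "G.711": 95.2,           # figure quoted in point 2
    "G.729": 39.2,
}

for codec, kbps in PER_CALL_KBPS.items():
    max_calls = int(LINK_KBPS // kbps)
    print(f"{codec}: about {max_calls} concurrent calls on a 100 Mbps link")

# G.711 works out to roughly 1050 calls, so raw bandwidth alone does not
# explain problems starting at around 400 concurrent calls.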


Thanks in advance!


 
To add, we have 3 SIP trunks (255 channels each), going to Session Manager 1 and Session Manager 2 respectively.
 
This may depend heavily on your SIP trunks, the type of connection to your SIP trunk providers, and other factors. You mention you have 100 Mb, but does that include all 3 SIP trunks? Is it purely for your SIP trunks (versus shared with other data), or does it go through the internet? The codec you use, both at the SIP trunk level and in your phone/network region, should tell you how much bandwidth you may be using. Is voice on your 100 Mb connection being tagged EF or otherwise given highest priority? Does your LAN side have QoS voice tagging as well?
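
One way to answer the EF-tagging question is to take a capture (for example on a SPAN port facing the SBC or the provider link) and count packets per DSCP value. A minimal sketch using scapy; the capture filename is just a placeholder.

# Count packets per DSCP value to verify that voice is actually marked
# EF (DSCP 46) where you expect it to be. Sketch only; "capture.pcap"
# is a placeholder for a capture taken near the SBC or provider link.
from collections import Counter
from scapy.all import rdpcap, IP, UDP

dscp_counts = Counter()
for pkt in rdpcap("capture.pcap"):
    if IP in pkt and UDP in pkt:
        dscp = pkt[IP].tos >> 2          # DSCP = top 6 bits of the ToS byte
        dscp_counts[dscp] += 1

for dscp, count in dscp_counts.most_common():
    label = " (EF)" if dscp == 46 else ""
    print(f"DSCP {dscp}{label}: {count} packets")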
 
Thanks texeric for responding to my question. Here are some more details.

We have three links going to each datacenter (2 sites, one of which is just a backup site): one at 100 Mbit (burstable to 1 Gbit), one at 100 Mbit flat and one at 10 Mbit.

All providers have SLAs to achieve at least 95% at any given time; the second 100 Mbit link is even well above that.

The incoming calls are distributed on a per-client basis. Our two largest clients go over the Gigabit link, a few smaller ones over the 100 Mbit link, and one very small client goes over the 10 Mbit link. There is minimal data traffic on the Gigabit link outside of production hours (backups); the others are 100% dedicated to voice.
It's important to note that when we see a peak for just one client (going over just one link), all our voice traffic on the entire network is affected, on all of the links.

The links to our 4 callcenter sites are 10 Mbit each, fully dedicated to voice. None of these sites takes more than 120 calls at a time over G.729, and our links to the callcenters have never seen more than 40% utilization. I think these links can generally be ruled out because the voice problems happen on calls that are still in queue and therefore local to the main datacenter.

We have one firewall/router (a Juniper SRX 550, which is firewall and router in one device) at the edge of each datacenter, to which all provider/client links connect. Behind that is one Avaya/Altura-provided Ethernet switch that connects all other equipment, including the SBCs. This switch was set up by Altura and all links on it appear to be operating in Gigabit mode. It is important to note that there is only one device (the Juniper SRX 550) between the provider and the switch. During load tests the Juniper has been pushed up to 800 Mbit, and even then its CPU load was below 40%. It is rated by Juniper at up to 5.5 Gbps of firewall throughput.

The network diagram should give you all other detail but on a high level the signaling and voice are taking the following paths:

Signaling:
Provider/client
Juniper SRX 550
Ethernet switch (Avaya provided)
SBC
Session Manager
(Possibly an IVR for some call flows; the call is blind-transferred to the CM, so the IVR leg does not stay active)
Communication Manager

Voice:
Provider/client
Juniper SRX 550
Ethernet Switch (Avaya provided)
SBC
(Possibly an IVR for some call flows; the call is blind-transferred to the CM, so the IVR leg does not stay active)
Communication Manager

It is important to note that the Juniper should generally only handle the incoming/outgoing traffic from/to the provider. Once the traffic hits the switch, it should stay local and not go back to the Juniper (unless a call is transferred out to a callcenter site, but that does not seem relevant since calls already have bad quality while in queue). Therefore the majority of traffic flows should stay entirely within the Avaya infrastructure.

I do agree this is likely a network- or resource-related problem. I would, however, like to point out that it seems unlikely to be caused by any external network segment, because as soon as traffic spikes on just one of the three external interfaces, voice quality drops on all of them. The only common network equipment directly managed by us is the Juniper SRX 550. We have investigated this device extensively, taken measurements and looked in detail at all of our bandwidth monitoring, down to the interface level. We cannot see any degraded performance on the equipment, and CPU load was only around 30% at the time of the last incident (in fact the all-time CPU high, unrelated to these events, was 50%).

I do not want to disregard any problems on the network, and generally a problem like this would point directly to network congestion. However, even with all the research we have done, we have been unable to find anything on our own so far. Network congestion can be caused by a variety of factors, including WAN links, firewall/router performance, the Ethernet switch, or even any of the servers involved in handling the traffic (possible resource problems, etc.).

I really appreciate your help on this! Thanks!
 
Hi guys, hope someone can take a look at this.

Thanks!
 
I would look for QoS drops and see if that EF traffic is defined properly.

It would be a little messy to wireshark an SBC and pick 1 of 400 bad streams, but it should be doable if that's the only option.
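
If it does come down to capturing on the SBC, one way to avoid clicking through hundreds of streams in Wireshark is to rank them by sequence-number gaps first. A rough sketch (assumes unencrypted RTP, ignores wraparound; the capture filename is a placeholder):

# Rank RTP streams in a capture by apparent sequence-number gaps so the
# worst few of several hundred streams can be found quickly.
# Sketch only: assumes plain RTP (not SRTP) and ignores sequence wraparound;
# "sbc_span.pcap" is a placeholder filename.
import struct
from collections import defaultdict
from scapy.all import rdpcap, IP, UDP

streams = defaultdict(list)                  # SSRC -> list of RTP sequence numbers

for pkt in rdpcap("sbc_span.pcap"):
    if IP in pkt and UDP in pkt:
        payload = bytes(pkt[UDP].payload)
        if len(payload) < 12 or payload[0] >> 6 != 2:   # RTP version must be 2
            continue
        seq = struct.unpack("!H", payload[2:4])[0]
        ssrc = struct.unpack("!I", payload[8:12])[0]
        streams[ssrc].append(seq)

def gap_ratio(seqs):
    expected = max(seqs) - min(seqs) + 1     # packets we should have seen
    return 1.0 - len(set(seqs)) / expected   # fraction missing (rough)

worst = sorted(streams.items(), key=lambda kv: gap_ratio(kv[1]), reverse=True)
for ssrc, seqs in worst[:10]:
    print(f"SSRC 0x{ssrc:08x}: {len(seqs)} packets, ~{gap_ratio(seqs):.1%} gaps")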
 
>That leaves a possible issue within the CM. But I checked the DSP resources and there is plenty of headroom (around 1292 DSPs) distributed across 4 G450s in the same network region.

I'd look at the network connections to the G450 devices. Can you check the ports for errors and collisions?

I'd also look at all the ports for errors, clear the counters, and check again after the next outbreak.
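
A trivial way to track that baseline-then-recheck workflow is to diff two counter snapshots and only look at the ports that moved. Sketch only; the port names and values below are made up, and in practice the snapshots would come from the switch/gateway CLI or SNMP.

# Compare two snapshots of per-port error counters (one taken now, one
# taken after the next incident) and report only the ports that changed.
# The port names and counter values here are hypothetical placeholders.

baseline = {"ge-0/0/1": 0, "ge-0/0/2": 3,   "ge-0/0/3": 0}
after    = {"ge-0/0/1": 0, "ge-0/0/2": 512, "ge-0/0/3": 17}

for port in sorted(after):
    delta = after[port] - baseline.get(port, 0)
    if delta:
        print(f"{port}: +{delta} errors since the baseline snapshot")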


Take Care

Matt
I have always wished that my computer would be as easy to use as my telephone.
My wish has come true. I no longer know how to use my telephone.
 
Thanks, kyle! I'll check that with our network engineer.

Hi Matt, this is quite interesting and I'll give it a try. To add, I remembered that when the issue happened, the music-on-hold and whisper announcements were also affected. And to think that the G450s are just loaded with a VAL media module and the DSP resources.
 
Can you also confirm which of these applies:

less than 400 calls - all calls good
more than 400 (say 650) calls - all 650 calls bad

or

less than 400 calls - all calls good
more than 400 (say 650) calls - first 400 calls good, remaining calls bad



Take Care

Matt
I have always wished that my computer would be as easy to use as my telephone.
My wish has come true. I no longer know how to use my telephone.
 
Hi Matt,

I just checked our G450s and can't see any port errors or drops. However, I'll check on the switch side as well.

The 1st scenario applies.

less than 400 calls - all calls good
more than 400 (say 650) calls - all 650 calls bad including announcements
 
The 1st scenario applies.

less than 400 calls - all calls good
more than 400 (say 650) calls - all 650 calls bad including announcements

That does sound more like EF starvation.
That is, the routers are policing the EF traffic, and once the policer rate is hit they will drop all EF-marked packets until the rate falls back below the policer rate. Obviously this depends on your provider's configuration and the configuration of your equipment. You'll see this as lost packets on the voice stream.

I'd ask the provider what EF rate you have on the circuits - it will (most likely) be less than the total bandwidth of the circuit. Compare this to a rough calculation of the bandwidth in use (I use 100 kbps for G.711 to allow for overhead, etc.).
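
To make that comparison concrete, here is a sketch using the 100 kbps-per-G.711-call rule of thumb above. The EF policer rate in the sketch is a placeholder; the real figure has to come from the provider.

# Compare estimated concurrent-call bandwidth against an EF policer rate,
# using ~100 kbps per G.711 call (rule of thumb above, incl. overhead).
# EF_POLICER_KBPS is a placeholder; get the real figure from the provider.

PER_CALL_KBPS = 100
EF_POLICER_KBPS = 40_000        # e.g. 40 Mbps of EF allowed on a 100 Mbps circuit

for calls in (300, 400, 500, 650):
    used = calls * PER_CALL_KBPS
    verdict = "within EF rate" if used <= EF_POLICER_KBPS else "exceeds EF rate -> drops"
    print(f"{calls} calls: {used / 1000:.0f} Mbps of EF traffic ({verdict})")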

Take Care

Matt
I have always wished that my computer would be as easy to use as my telephone.
My wish has come true. I no longer know how to use my telephone.
 
I would look at the data switches on both sides of the SBC. Regardless of the line speeds they "support", they very often do not have enough buffers to avoid output drops during heavy load.
 
Thanks, guys, for your advice! We're still checking the network side. However, we were able to replicate the issue by sending SIP calls to our system, and we noticed the following:

1. During the audio issue yesterday, we noticed this alarm on the Avaya SBC, although our call traffic was not really that high:

"Max Concurrent Audio Session Limit Reached"

We came across these release notes, but our current SBC version is 6.2.1.Q18:

List of the issues fixed in 6.2.1.Q05 from 6.2.0 SP5
AURORA-1586 Calls blocked by "Max Concurrent Audio Session" erroneously reached



2. We were able to simulate the issue by sending SIP traffic with RTP, and we found the logs below.


Event  Event                     Event   Event   First        Last         Evnt
Type   Description               Data 1  Data 2  Occur        Occur        Cnt

 706   No VOIP Resource          2       39      06/05/03:43  06/05/05:43  255
3706   No VOIP Resource          1       26B     06/05/03:43  06/05/06:07  255
2093   Can't start announcement  1       26B     06/05/03:43  06/05/06:07   96
3708   No time slot on MG        2       39      06/05/05:32  06/05/05:43   40


- Not sure what resource it is referring to, but upon checking, the DSPs are not being maxed out. Although I did see an * on G3 and G4, which are physically located in our remote redundant site.




                   IP DSP RESOURCE H.248 GW SUMMARY REPORT

                      G711 Equivalent DSP                             Total GW
GW   GW    Peak  Net  Rsrc   Rsrc  Usage  IGC    DSP   IGC   Denied  %    % Out
Num  Type  Hour  Reg  Capty  Peak  (Erl)  Usage  Pegs  Pegs  Pegs    Den  Of Srv
G1   g450  500   1    320    249   117.6  0.5    2522  44    0       0    0
G2   g450  500   1    320    249   119.6  0.2    2573  27    1       0    0
G3   g450  500   1    320*   248    87.6  0.1    2105  25    0       0    0
G4   g450  500   1    320*   236    92.2  0.1    2108  20    0       0    0
G5   g450  500   5     80      3     0.1  0.1      23   3    0       0    0


The “*” indicates that the media processor capacity changed during the measurement hour.
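
For what it's worth, tabulating the figures above per gateway shows each hourly peak below its G.711-equivalent capacity, though the * on G3 and G4 means the capacity itself changed during the hour, so the usable pool was at times smaller than the Capty column suggests.

# Per-gateway peak DSP usage versus G.711-equivalent capacity, taken
# straight from the summary report above.
gateways = {        # GW: (Rsrc Capty, Rsrc Peak)
    "G1": (320, 249),
    "G2": (320, 249),
    "G3": (320, 248),   # capacity changed during the hour (the "*")
    "G4": (320, 236),   # capacity changed during the hour (the "*")
    "G5": (80, 3),
}

for gw, (capacity, peak) in gateways.items():
    print(f"{gw}: peak {peak} of {capacity} G.711-equivalent DSPs "
          f"({peak / capacity:.0%})")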



                            HARDWARE ERROR REPORT

Port  Mtce      Alt   Err   Aux   First/Last    Err  Err  Rt/  Al  Ac
      Name      Name  Type  Data  Occurrence    Cnt  Rt   Hr   St

005   MED-GTWY        769   0     06/05/04:15   1    0    1    r   y
                                  06/05/04:15
003   MED-GTWY        769   0     06/05/05:10   1    0    1    r   y
                                  06/05/05:10
003   MED-GTWY        769   0     06/05/05:12   1    0    1    r   y
                                  06/05/05:12
003   MED-GTWY        769   0     06/05/05:32   1    0    1    r   y
                                  06/05/05:32
003   MED-GTWY        769   0     06/05/05:42   1    0    1    r   y
                                  06/05/05:42
004   MED-GTWY        769   0     06/05/05:43   1    0    1    r   y
                                  06/05/05:43
004   MED-GTWY        769   0     06/05/05:47   1    0    1    r   y
004   MED-GTWY        769   0     06/05/05:50   1    0    1    r   y
                                  06/05/05:50
004   MED-GTWY        769   0     06/05/05:52   1    0    1    r   y
                                  06/05/05:52
003   MED-GTWY        769   0     06/05/05:53   1    0    1    r   y
                                  06/05/05:53
003   MED-GTWY        769   0     06/05/06:07   1    0    1    r   y


Error Type 769 is a transient error, indicating that the link has unregistered with the Media
Gateway. If the Media Gateway re-registers, the alarm is resolved. If the Link Loss Delay
Timer (LLDT) on the primary server expires, Error Type 1 is logged.


- The timestamps of the MGW alarms look almost instantaneous. Could it be a glitch? But I checked the Ethernet ports of all MGWs and I didn't see any drops or collisions.


                            HARDWARE ERROR REPORT

Port  Mtce      Alt   Err   Aux   First/Last    Err  Err  Rt/  Al  Ac
      Name      Name  Type  Data  Occurrence    Cnt  Rt   Hr   St

100   SIP-SGRP        3585  11    02/02/04:52   255  0    0    n   n
                                  06/05/06:07
101   SIP-SGRP        3585  11    02/02/04:52   255  0    0    n   n
                                  06/05/05:51
102   SIP-SGRP        3585  11    02/02/04:52   255  0    0    n   n
                                  06/05/05:50
210   SIP-SGRP        3585  11    06/04/10:29   27   1    3    n   n
                                  06/05/05:43

Error Type 3585: IP Signaling Far-end Status Test failed. The far-end is not available. See
IP Signaling Group Far-End Status Test (#1675) for more information.

 