UCM DOWN, 1-way audio, dropped calls, audio fade in and out 2

MitelInMyBlood · Mar 31, 2011

CUCM 8.5, 3 nodes (1 Pub & 2 Subs)
second Sub recently added (middle of last week)
1400 users

The first report came 2 days ago (this past Tuesday): one end user reported a dropped call. Her phone (a 7942) shows "UCM fail" and then goes to re-registering.

Yesterday (Wednesday) about 5 more complaints, various but similar symptoms, some reporting 1-way audio, some reporting dropped call.

This is occurring internally within the local network on station-to-station calls as well as on external (trunk) calls. On internal failures, one party will see "UCM Down" and the other party will see "Fail".

Last night we forced everything over to the *Pub* but today (Thurs) we're still getting isolated reports.

Recent changes: Last week we added the 2nd sub. The original sub was 150 miles away and prior to adding a local sub everything was registering to the local *Pub* (bad design I know, but I didn't do it, the VAR did) - anyway after adding the 2nd sub (which is now local to us) we moved everything over from the Pub to the local Sub. That was a week ago today.

No problem reports last Friday or this past Monday. Problem reports started coming in on Tuesday of this week, 5 days after introducing the new (local) Sub to the mix and moving everyone over to it.

Ideas welcomed.
Thanks!!

Original MUG/NAMU Charter Member

VinceWhirlwind · Jun 5, 2012

Agent6376:
"I suspect that the OP's issue was resolved by changing the ARP timers because the CUCM was learning too many."

But the OP said they changed it from the non-default 5-minutes to the Cisco default 4 hours, which fixed the problem.

So learning *too many* can't have been the problem.

Seeing as the problem coincided with a CCM upgrade, I'd be interested in looking into how the CCM has changed the way it discovers handsets and how it reacts to a handset becoming unknown.
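To put rough numbers on the two timer settings being compared, here is a back-of-the-envelope sketch. It assumes one ARP entry per phone (using the OP's 1400 users) and that every entry fully expires and is re-resolved exactly once per timeout period; real IOS refresh behaviour is more nuanced, so treat the figures as an upper bound and not a measurement. The function and constant names are made up for illustration.

```python
# Rough arithmetic only: assumes one ARP entry per phone and that every entry
# expires and is re-resolved exactly once per timeout period.
SECONDS_PER_DAY = 24 * 60 * 60


def arp_refreshes_per_day(hosts: int, arp_timeout_s: int) -> int:
    """Upper-bound count of ARP re-resolutions per day for `hosts` entries."""
    return hosts * (SECONDS_PER_DAY // arp_timeout_s)


PHONES = 1400               # user count quoted by the OP
FIVE_MINUTES = 5 * 60       # the non-default timeout that was in place
FOUR_HOURS = 4 * 60 * 60    # the Cisco default the OP went back to

short = arp_refreshes_per_day(PHONES, FIVE_MINUTES)
default = arp_refreshes_per_day(PHONES, FOUR_HOURS)
print(short, default, short // default)
```

Under those assumptions the 5-minute setting produces 48 times as many re-resolutions per day as the 4-hour default. That says nothing about root cause on its own, only how different the two settings are in terms of ARP churn.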

VinceWhirlwind · Jun 6, 2012

Just thinking about this - the calls were getting interrupted, right?

- handset to handset, ethernet path, all switches are seeing a continual flow of packets between them, so no ARP entries are going to time out relating to that traffic flow.
- The Wireshark traces showed "intermittent packet loss" - would be good to know what this actually means and whether it can be related to the call's QoS stats
- a bit of packet loss doesn't end a call/cause a tear-down to be sent.
- the CUCM log shows the Call Manager unregistering the phone
- this isn't a call being torn-down, it's a handset, active in a conversation, being unregistered from the system.
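The distinction in the list above, between losing media packets and losing the registration, can be sketched as a toy keepalive counter. This is not CUCM's actual logic: the class, the `simulate` helper and the three-miss threshold are all hypothetical, chosen only to show why scattered loss on the signalling path is harmless while a short burst of consecutive losses unregisters a handset even though its audio stream is still flowing.

```python
from dataclasses import dataclass


@dataclass
class Registration:
    """Toy model of a handset's registration state (hypothetical)."""
    max_missed: int = 3      # assumed threshold, not a documented CUCM value
    missed: int = 0
    registered: bool = True

    def on_keepalive_tick(self, ack_received: bool) -> None:
        """Process one keepalive interval on the signalling path."""
        if not self.registered:
            return
        if ack_received:
            self.missed = 0              # any ack resets the counter
        else:
            self.missed += 1
            if self.missed >= self.max_missed:
                self.registered = False  # handset drops off mid-call


def simulate(acks, max_missed=3):
    """Return the 1-based tick at which the phone unregisters, or None."""
    reg = Registration(max_missed=max_missed)
    for tick, ack in enumerate(acks, start=1):
        reg.on_keepalive_tick(ack)
        if not reg.registered:
            return tick
    return None


# Scattered losses never accumulate; three in a row does the damage.
print(simulate([True, False, True, False, False, True]))
print(simulate([True, False, False, False, True]))
```

The point of the model is that the media stream never appears in it at all: registration depends only on the phone-to-server signalling flow, so anything that briefly black-holes that flow can produce a "UCM Down" on a call whose audio path is healthy.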

So you "fixed" this by fiddling with ARP timeouts.
Ethernet works just fine, regardless of ARP timeout value, so it could be an issue of configuration on your network in relation to frame flooding, or an application problem within the CCM.

I haven't played with CCM for a long time, so this question is very vague, but what are your timeout settings within CCM for handset registrations?
Could it be caused by a weird combination of low handset registration timeout, lower ARP timeout, and some kind of flood control config?

Having said that, running a new VoIP system on a network run on an ancient IOS is a bad idea, as is bodging your entire LAN switch configuration just to allow the server guys to perform unplanned cowboy work.

In this situation, I would be trying to convince Management that an 8-year-old network core is due for a refresh, and get them to fork out for a nice new VSS pair of 6500s, PLUS, a gentle re-architecture of the network to provide a proper distribution layer, especially, get all the server connections out of the network core and onto Server Data Centre switches.

MitelInMyBlood · Jun 7, 2012

Thanks Vince.

We recently purchased a pair of 7000s and brought in some hired guns from some (Gartner Group recommended) 3rd-party outfit (not Cisco) to do a full network re-architecture for us (long, long overdue). I believe this path was chosen to distance us safely from Cisco's zeal to up-sell us $50MM in new network hardware and to keep that from influencing the decision of what truly needs to be done. It may well cost us $50MM in the end, but with recommendations coming from networking pros with no skin in the game it's a lot easier to believe what they tell us vs listening to the VAR.

Re the comment about the ancient IOS, again you're preaching to the choir. The "data" network just ran and ran for years on end. Then when we wanted to migrate the phones to VoIP someone blurted out some statement like "CoS is all you need" and suddenly the bus was loading to take us all to Abilene (Abilene Paradox - how we all got to Abilene when no one wanted to go), AKA "how to turn peanut oil into jet fuel".

Whatever the actual root cause of the problem, resetting the ARP timers back to default values either solved it or sufficiently masked the issue, for now anyway.

Original MUG/NAMU Charter Member

layer8ninja · Oct 10, 2012

I'm seeing a similar issue at just one of my 10 MPLS sites. It's a small office with just 10 phones, connected to HQ via MPLS over a T1 circuit. It is so sporadic but frustrating.

Where exactly do you change ARP timers? My switches are 2960s and I didn't see a similar command except for mac address-table aging-time.
