
UCM DOWN, 1-way audio, dropped calls, audio fade in and out 2


MitelInMyBlood
CUCM 8.5, 3 nodes (1 Pub & 2 Subs)
second Sub recently added (middle of last week)
1400 users

First report came 2 days ago (this past Tuesday): one end user reported a dropped call; her phone (a 7942) showed "UCM fail," then went to re-registering.

Yesterday (Wednesday) brought about 5 more complaints with various but similar symptoms, some reporting 1-way audio, some reporting dropped calls.

This is occurring internally within the local network on station-to-station calls as well as on external (trunk) calls. On internal failures, one party will see "UCM Down" and the other party will see "Fail".

Last night we forced everything over to the *Pub* but today (Thurs) we're still getting isolated reports.

Recent changes: Last week we added the 2nd sub. The original sub was 150 miles away and prior to adding a local sub everything was registering to the local *Pub* (bad design I know, but I didn't do it, the VAR did) - anyway after adding the 2nd sub (which is now local to us) we moved everything over from the Pub to the local Sub. That was a week ago today.

No problem reports last Friday or this past Monday. Problem reports started coming in on Tuesday of this week, 5 days after introducing the new (local) Sub to the mix and moving everyone over to it.

Ideas welcomed.
Thanks!!

Original MUG/NAMU Charter Member
 
Back & forth, now TAC is again leaning toward the network. Opened a new case, this time w/Backbone, looking at the core (two 6500's w/ a couple 10g blades) they noted sporadic high CPU activity at one point hitting 96% and several hits in the 80 percentile over the past 72 hours. Something's definitely afoot, the core normally averages 15% CPU utilization occasional peaks never above 25%. Took a mtce window late last night, made sign of the cross & rebooted the core. Have to wait now for Monday's call volume to pick up to see if this made any improvement. (problem not seen when light traffic volume)

The 1-way audio occurs both internally (peer to peer) and externally (set to PRI). Initially the call starts out fine; several minutes into the call the receive audio begins to garble and is subsequently lost for up to 15~20 seconds. One instrument (the one with loss of receive audio) also displays "UCM Down Features Disabled," indicating it has lost a path common to its peer and to the CUCM (keep-alives lost). Wireshark traces have caught several of the events, confirming packet loss. Thought about spanning tree reconverging, but that's not it.

Interestingly, sets connected via either of 2 wiring closets on the 3rd floor (physically nearest the CUCM) have never lost receive audio nor shown the UCM Down error, but they have been in conversations w/ peers on other floors/wiring closets where the reverse is true, always experiencing the failure inbound to them (different networks, different voice VLANs). Not sure what this is telling us.

We (phone techs, network techs, 3 onsite SEs, and TAC) are now fairly well convinced it's an obscure network issue, but my management is still leaning toward it being a CUCM load issue & lobbying us to "go back" to 8.0 because their feet are in the fire (and the problem, of course, has manifested itself on the phone system, where it logically would be seen first). We're presently dropping or experiencing intermittent 1-way audio on at least 1/4 to 1/3 of all call traffic (at 2500 calls per hour peak) on 1400+ lines. Users are screaming. Senior mgmt screaming. ***FIX THE F'ING PHONE SYSTEM*** The only problem with "going back" is it's now been 25 days since 8.5 loaded and we have a very dynamic database: lots of daily MACs, & we just finished a huge call tree 2 weeks ago with several CTI RPs, translations, and a couple of hunt groups. "Going back" from a reprogramming perspective would be disastrous, and could easily take 3 days at the keyboard to rebuild all that's changed.


Original MUG/NAMU Charter Member
 
All TAC cases this week are P2

Original MUG/NAMU Charter Member
 
Sunday 4/10:

I now have 4 test phones strategically located around the building in wiring closets where the 1-way audio, etc. has been most prolific. Calls between them are nailed up, & there are 4 laptops w/ Wireshark running to hopefully capture something.

The traces previously captured definitely show intermittent packet loss, as though QoS weren't enabled, though it definitely is. Packets aren't delayed or being resent; they're just plain gone. Wireshark is seeing it, but wtf's causing it???
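(Aside, not from the original post: one way to quantify "just plain gone" from captures like these is to export the RTP sequence numbers with tshark and look for gaps per stream. A rough sketch, assuming the export step shown in the comment; the file names are examples only. Reordered packets will show up as spurious gaps, so treat the output as a pointer, not proof.)

#!/usr/bin/env python3
"""Sketch: flag missing RTP packets in a capture, per stream (SSRC).
Assumes the sequence numbers were exported first with something like:
    tshark -r capture.pcap -Y rtp -T fields -e rtp.ssrc -e rtp.seq > rtp_seq.txt
The capture/export file names are examples, not from this thread."""

from collections import defaultdict

def find_gaps(path: str) -> dict:
    """Return {ssrc: [(expected_seq, seen_seq), ...]} for every sequence jump."""
    last_seq = {}                # last sequence number seen per SSRC
    gaps = defaultdict(list)
    with open(path) as fh:
        for line in fh:
            parts = line.split()
            if len(parts) != 2:
                continue         # skip blank or malformed lines
            ssrc, seq = parts[0], int(parts[1])
            if ssrc in last_seq:
                expected = (last_seq[ssrc] + 1) & 0xFFFF   # RTP seq is 16-bit and wraps
                if seq != expected:
                    gaps[ssrc].append((expected, seq))
            last_seq[ssrc] = seq
    return gaps

if __name__ == "__main__":
    for ssrc, jumps in find_gaps("rtp_seq.txt").items():
        # (seen - expected) mod 2^16 approximates how many packets went missing;
        # out-of-order packets will inflate this, so it is a rough indicator only
        lost = sum(((seen - exp) & 0xFFFF) for exp, seen in jumps)
        print(f"SSRC {ssrc}: {len(jumps)} gap(s), roughly {lost} packet(s) missing")
        for exp, seen in jumps[:10]:
            print(f"  expected seq {exp}, next packet seen was {seen}")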

SolarWinds (Engineer's Toolset 9.x) w/ network monitoring enabled on literally every upstream & downstream backbone fiber interface, as well as on the core, is not showing any congestion or latency anywhere on the LAN. Nothing above 30% utilization, and most interfaces coasting at 10~15%. Is SW telling us the truth?

The IOS on the core is unarguably a museum piece (12.2(18)SXD9, circa 2004) & desperately needs upgrading, but it has been stable for so many years that an upgrade will require special dispensation from many gods for us to push a new one to them.

Everyone is looking at each other and asking "what changed?" because CUCM had been running without issue since it was turned up last July. Even the recent upgrade to CUCM 8.5(1) ran flawlessly from 3/15 until 3/29, when the first report of trouble was called in. Since 3/30 our lives have been a living hell. Blank check for overtime, but not the kind of OT anyone wants. We're all about to have a stroke or heart attack. This office ran nearly 24 consecutive years on the old Mitel SX2000 without more than an hour total of unscheduled downtime. 9 months into Cisco & our feet have been to the fire for all of the last 2 weeks, & the decision-makers who promoted this thing are looking to duck & cover. Rebooting the core Saturday didn't fix it.



Original MUG/NAMU Charter Member
 
Is 12.2(18)SXD9 CatOS? I would bet you money that it is not on the compatibility matrix with CUCM 8.5, as it is too old and therefore not supported.
I don't know a thing about the SolarWinds network monitoring tool, but it obviously cannot be trusted.
In your words:
" looking at the core (two 6500's w/ a couple 10g blades) they noted sporadic high CPU activity at one point hitting 96% and several hits in the 80 percentile over the past 72 hours."

Obviously SolarWinds did not register that either, so stop relying on it for this issue.

Bottom line: the issue is not with CUCM but with a faulty network. You are in a converged network now, and everything affects everything.
The network is dropping packets and that is a fact. So until that is fixed, the phone system will suffer. You will need to find out where those packets are being dropped and eliminate/fix that device. I wouldn't be surprised if it is the IOS on your core.
Upgrade the 6500s and see what happens. I can't believe TAC has not asked you to do this yet. Something is failing on it, whether it is the IOS, a blade, or the backplane becoming an issue. Hard to say.
 
FOLLOW-UP

The fire is out, but its origin seems strange. A year or so ago we had forced the ARP timers on all of our LAN switches (3560/48) to 300 seconds (5 minutes).

Rather than take a shotgun approach & then never know what fixed it, we applied TAC's recommended changes en masse, then backed the changes out one by one over the following week until we started getting complaints again.

The ARP timer turned out to be the trigger. Cisco's default value for this is 4 hours.

According to our network engineer, we had set this to 5 minutes to more or less accommodate the frequent comings and goings of various servers being changed out while needing to reuse the old address on the new hardware. In this way the server team could go about their daily routine without having to pester the network team to flush the ARP cache every time a server or other piece of hardware was replaced. Okay, in retrospect maybe this wasn't such a good idea.
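(Aside, not from the original follow-up: just to put numbers on why a 5-minute ARP timer is so much noisier than the 4-hour IOS default, here's a back-of-the-envelope sketch using the ~1400 phones mentioned in this thread. It's plain arithmetic under simplifying assumptions, not a claim about the actual failure mechanism.)

#!/usr/bin/env python3
"""Back-of-the-envelope: how often ARP entries for the voice endpoints age out
(and must be re-resolved) with a 300 s timer vs the Cisco IOS interface default
of 4 hours. The endpoint count comes from the thread; everything else is a
simplifying assumption (steady state, one entry per endpoint, no gratuitous
ARP refreshes)."""

ENDPOINTS = 1400              # phones mentioned in the thread
DEFAULT_TIMEOUT_S = 4 * 3600  # Cisco IOS default ARP timeout: 14400 s (4 hours)
TUNED_TIMEOUT_S = 300         # the 5-minute setting the site had applied

def expirations_per_hour(endpoints: int, timeout_s: int) -> float:
    """Average ARP entry expirations per hour if each entry ages out every timeout_s."""
    return endpoints * (3600 / timeout_s)

if __name__ == "__main__":
    for label, t in (("default 4 h", DEFAULT_TIMEOUT_S), ("tuned 5 min", TUNED_TIMEOUT_S)):
        rate = expirations_per_hour(ENDPOINTS, t)
        print(f"{label:>12}: ~{rate:,.0f} ARP re-resolutions per hour "
              f"(~{rate/3600:.2f}/second) just for the phones")
    # Result: ~16,800 expirations/hour at 5 minutes vs ~350/hour at the default,
    # i.e. roughly 48x more ARP churn and process-level work on the L3 device.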

What puzzles me is why we were able to run 1400+ VOIP phones in this environment for 11 months without issue, yet only shortly after upgrading CUCM from 8.0(2) to 8.5(1) did the problem begin showing up. IOW, what changed in 8.5 to make it so susceptible to ARP activity?



Original MUG/NAMU Charter Member
 
Assuming this is a rhetorical question, here is my thesis:
Lowering that timeout would most likely increase CPU load, and if the CPU load increased enough it would potentially affect the stability of the 6500, and thereby the stability of any network which depends on that device.
You mentioned that the 6500s were running at very high CPU during those incidents, so there is your possible cause.

Why now and not earlier? Who knows? Maybe it coincided with the upgrade but was unrelated to it (I'd bet beers on this).

By the way, I've seen ARP issues cause total havoc on networks, and that's what you experienced firsthand.

I'm glad it's fixed though.
 
Thanks.

By the way, just as a note in passing, all 1400+ phones are registered to the PUB, not to the Sub. This has been the configuration from day one. Our VAR did this to us..... We have a Sub, but it is not local; it is located offsite, 150 miles away in a Data Foundry hot site. It's online, but nothing was ever registered to the (remote) Sub and still isn't.

We later learned (not from our VAR) that this was most likely an unsupported configuration, and on the advice of 2 CCIEs we ordered another Sub to install here in the same rack as the PUB (where the Sub logically belongs).

We did that, changed the fail-over order, and actually forced over the 1400+ local instruments to the *new* local Sub as one of the steps in upgrading the system from 8.0 to 8.5. However, the fit hit the shan 2 weeks later w/ one-way audio, UCM Down, etc., and so as part of a shotgun approach we forced everyone back to the PUB and physically shut down the new Sub. This is how it remains today: still offline, cold in the rack, with everything running on the PUB and the only running Sub still 150 miles distant.

We took such a whipping, & these intermittent one-way audio drops were so high-profile and so very disruptive to our business, that our management is understandably gun-shy about re-introducing that new (local) Sub back into the mix. By contrast, I think we're living on borrowed time with everything registered to the PUB. I'd sleep better with that new Sub up and running and the phones registered to it as they should be, but I'm unsure how to do it myself (I'm barely qualified to do daily MACs) and feel like I really can't trust the VAR. I feel like my VAR is TFW, totally (expletive deleted) worthless.

Thoughts?



Original MUG/NAMU Charter Member
 
Hi Mitel. We also recently upgraded to 8.5, from 6.1, and are also getting the exact issues you are encountering: phones resetting on their own, with or without a call, one-way audio, call history issues, devices dummying up after a reset. Nothing prior to 8.5. We have 7 clusters with over 80 Subscribers and close to 120k devices (yes, 120k). Like you, everything has the look and feel of the network being the issue, but we've looked and it seems to be running as it should. I will check the issue you ran into with the ARP table timers. I feel like it's the calm before the storm, but it's not a normal it-will-go-away storm, it's like the "Perfect Storm." Who the f$%#$% does Cisco have working for them? Where did they get this piece of s#$^$? Cisco used to stand for stability, but not now. Mitel, let me know if you find anything else with this, will you plz. Thx for the details and info.
 
Is the issue on all the phones and all the clusters?
Your description is very generic, unless you are just here to vent.
Did your phones upgrade firmware during the upgrade?
Did nothing else happen on your network around the same time as this? No other network upgrades or changes?

 
Whykap, I am assuming you're talking to me and not Mitel. If so, so far it's just happening on two of our largest clusters. We have not heard anything from the other clusters, but that's not to say there aren't issues; the users may just be putting up with it.

Our phones' firmware was pre-upgraded from 8.5 to the current 9.1 prior to any of the CUCM upgrades.

No other network work or outages were reported by our network group. The smaller clusters seem to run fine with no issues, but the larger ones seem to come up with weird things/symptoms that had never happened prior to the 8.5 upgrade.

Symptoms: phones reset on their own, 1-way audio, call history not consistent, and the ring list and our company directories show a "Host not found" error. These symptoms are very sporadic and may be happening to more end users who just aren't reporting it because of the recent activities.

The larger clusters have a centralized TFTP design (meaning 2 TFTP servers in the DHCP option 150 scope, and those have the alternates defined). This design is shared between 3 clusters, the first 2 running CUCM 8.5(1) and the last running 6.1(4). It's very hectic here since we have been doing the upgrades over the past 6 months (7 cluster upgrades in 6 months).
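(Aside, not from the original post: with a centralized TFTP design like this, one basic sanity check is whether each option 150 server actually answers a TFTP read request from the affected VLANs. Below is a rough, illustrative probe; the server addresses are placeholders, and the test file is just one that CUCM's TFTP service commonly serves. It only proves reachability, nothing more.)

#!/usr/bin/env python3
"""Sketch: check that each TFTP server handed out via DHCP option 150 answers a
read request at all. Server addresses and the test filename are placeholders;
any file the CUCM TFTP service is known to serve would do. The probe does not
complete the transfer, it only looks at the first reply (RFC 1350 framing)."""

import socket

OPTION_150_SERVERS = ["10.10.10.5", "10.10.10.6"]   # placeholder TFTP/CUCM addresses
TEST_FILE = "XMLDefault.cnf.xml"                     # a config file CUCM TFTP commonly serves
TIMEOUT_S = 3

def probe_tftp(server: str, filename: str) -> str:
    """Send a TFTP read request (RRQ) and report what, if anything, comes back."""
    # RRQ packet: opcode 1, filename, NUL, transfer mode, NUL
    rrq = b"\x00\x01" + filename.encode() + b"\x00" + b"octet" + b"\x00"
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(TIMEOUT_S)
        sock.sendto(rrq, (server, 69))
        try:
            data, _addr = sock.recvfrom(1024)
        except socket.timeout:
            return "NO RESPONSE (down, blocked, or filtered?)"
    opcode = int.from_bytes(data[:2], "big")
    if opcode == 3:                       # DATA: server is serving the file
        return "OK - server answered with DATA"
    if opcode == 5:                       # ERROR: reachable, but refused the request
        msg = data[4:].split(b"\x00", 1)[0].decode(errors="replace")
        return f"reachable, but returned TFTP error: {msg}"
    return f"unexpected reply, opcode {opcode}"

if __name__ == "__main__":
    for srv in OPTION_150_SERVERS:
        print(f"{srv}: {probe_tftp(srv, TEST_FILE)}")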

Yes, multiple TAC cases opened, and the same response: give us the logs. Just like Mitel above, they see a network disconnect. I see some major alterations in CUCM 8.5 which I am not happy with. The key is that ONLY after the CUCM 8.5 upgrade did this start happening.



 
Are any phones on the 6.X cluster experiencing similar issues?
 
CISCO4ME:

Our "fire" is out and the incorrect setting of the ARP timers was determined to be the cause, as we felt like we had eliminated everything else. We believe that some obscure coding change in release 8.5 may have triggered it, but that's only a guess. Our severely reduced staff size and day-to-day workload does now allow us the luxury of spending time to pinpoint the actual root cause.

Our management was adamant this had to be a "phone system" problem and not a network problem, as everyone in mgmt. was still thinking from the old TDM point of view, where all calls passed through the PBX switch fabric for the duration of the call. In their mind 1-way audio and calls dropping had to be the PBX's fault, especially since we had just upgraded from 8.0 to 8.5. It took a lot (plural days) of whiteboarding and bringing in several CCIEs and opening P1 TAC cases, including Routing & Switching & Backbone to get the decision makers to finally understand how VOIP systems work.

Once they (mgmt) finally accepted this they were still pointing fingers at the possibility of the new set firmware being the cause, so to disprove this they had us downgrade all the instruments to the previous firmware load. This went poorly by the way, as it caused every user to lose all of their phone customizations (custom ring settings, background images, etc.) and still did not solve "the problem".

We wound up with laptops stuck in wiring closets all over the building running Wireshark. Traces seemed to point to network congestion, even though our network traffic never exceeded 40%.

With each passing day the fires were becoming hotter and of course the brand new phone system was directly in the crosshairs and being perceived as a complete POS. It no longer mattered whether it was a "network issue" or not. The end users saw it for exactly what it was; after more than 20 years with a "stable as Jesus" phone system, suddenly their NEW PHONES no longer worked. The month of April 2011 won't soon be forgotten around our place.

Through most of this crisis we had been trying to methodically troubleshoot & try different approaches, even putting our execs' phones on a separate, dedicated, non-routed network straight to the CUCM. Of course that put out "their" fires, which had the additional benefit of giving credence to the original diagnosis: that this was never a phone problem, it was always a network problem.

I do not know when, or who, came up with the idea to put the ARP timers back at their factory default settings. It was part of a "shotgun approach" that seemed to finally "fix it," so for several days no one knew exactly what the silver bullet was, until we (1 step and 1 day at a time) began undoing the shotgun attack. Imagine everyone's surprise when we finally got down to the ARP timers: within 1 hour of putting them back to the 300-second (5-minute) setting they had previously been set at (for a couple of years without issue or incident), all of a sudden the 1-way audio and UCM Down problems were back with a holy vengeance.

OK, ARP timers - but now we've got a phone system that still to this day remains torn asunder, with execs on dedicated fibers running back to the phone room, the 6500 core split (no longer redundant), and everyone walking on eggs. We're satisfied that the ARP timer setting was the most likely cause, but we're all still licking our wounds from last April and gun-shy of even touching what's working. We're even barred from making any further system OS upgrades, even in the face of known vulnerability CERTs. You walk past the core and feel like you need to make the sign of the cross..... Breaking a heretofore bulletproof (and business-vital) phone system in the corporate headquarters of a Fortune-50 energy corporation has that effect on you.

Bear in mind that you have only a very limited window of opportunity "to go back" to your prior load and with each passing day the amount of rework that will go along with a rollback grows exponentially. Every MAC you've done since the upgrade will likely have to be redone. Unless you're archiving your daily backups, you'll no longer have a viable 6.1 database to go back to in a matter of (I think) 10 days - could be less, I forget.

Good luck.

Original MUG/NAMU Charter Member
 
Can you maybe share the TAC case number here, or the other suggestions TAC made, besides the ARP change, that corrected the problem?

We are having a very similar problem with CUCM 8.5.1 and have not yet found the cause. TAC says it is not CUCM; we get the TCP keepalive socket error.

Thanks
 
Mitel & all,

This adventure has me so scared that I want to run away from my present job, as we are planning to migrate from an Aastra (former Ericsson) MD110 system to CUCM in the coming weeks. I burnt my fingers a few years ago implementing VoIP, trying to convince my users that the issues were network issues and had to be sorted with a different approach. It has its advantages, but when you have issues, you need 100 wise guys to bring you out of the sh.. This thread has made my belief stronger. Also, the fact that I have limited networking skills and serious doubts about the design of our current network is already causing sleepless nights. The current topology seems far behind what's needed to handle this convergence of voice, data & video.

Anyhow guys, I wish you all good luck & do pray for my adventure. In fact, I will very much rely upon your experience & knowledge in the coming weeks.

Cheers,

""The truth about action must be known and the truth of inaction also must be known; even so the truth about prohibited action must be known. For mysterious are the ways of action""
 
The dust having long since settled, and having had considerable time to go back and revisit all the various things we tried, we are today 100% certain that the problem was (is) network related, and that putting the ARP timers back to factory default merely masked a larger network design issue that most likely is still lurking out there and will one day bite us again. At Cisco's recommendation ($$, naturally) we are today in the midst of replacing our 6500 (redundant core) with a pair of new 7000s. We will also be conducting a network re-architecture starting sometime in the 1st quarter of 2012.

We are also (today, finally) 100% certain that CUCM (and the 8.5.1 load) had nothing to do with the problem beyond anecdotal coincidence (and finger-pointing). The problem did not appear until a week or more after 8.5.1 was introduced. Granted, the new load was introduced around the Easter time period, during which a number of people were out of the office on holiday leave and network activity was therefore low.

CUCM did not (ever) go down. RTMT was not reporting any unusual call volume. After deploying a half dozen packet sniffers we occasionally began to see some of the dropped packets, but with blowtorches blazing up our backsides (in the corporate office of a Fortune-500 company) I have to admit some of our work was shotgunning rather than based on any carefully-detailed scientific forensics. (When you're up to your neck in alligators, it can be awfully hard to focus on draining the swamp). Things were especially high-profile as we had only recently replaced a 20+ year rock-stable/dependable (Mitel) system with what the customer suddenly perceived as junk. It did not matter that the data network was ultimately at fault - what the customer saw was that their phones were no longer dependable. An absolute public relations nightmare.

To anyone going through this the only advice I have is to remain focused. VOIP by its very nature is an extremely delicate communications medium. If a gnat farts anywhere in your network, your VOIP phones will be first to smell it. Stay focused... it's a network problem.

To anyone planning a VOIP phone system rollout: please, for goodness' sake, bite the bullet and pay the $$ to have your network architecture professionally assessed, and follow the recommended course of action. At least that way you'll have some recourse if the fit hits the shan.




Original MUG/NAMU Charter Member
 
Dear All,

Kindly note that we are facing a similar problem and have (hopefully) identified the root cause. Our CUCMs (8.5.x) are behind 65xxs with an FWSM module. Whenever we add/delete/change any FWSM rule and hit SAVE/APPLY, it randomly resets 40% of the phones. Will keep you posted once we get any resolution from Cisco, as the TAC case is still open.

I don't understand why hundreds of other applications are working fine (thank God) and only the IP phones are giving problems.

I can only request at this time that you all take a maintenance window in your environment, add/delete/change a rule on the FWSM (as we have), apply the change, AND at the same time monitor whether phones get reset.

Hope this helps.

Regards,

Mohammad Ali
 
To all: I hope this helps. The problem here has been found. It caused very much grief to me and my whole department.

Since I received the notification of the last post by Mohammad, I thought it necessary to let you know what we found.

Our environment has the exact same CUCM 8.5, behind 65xx with FWSM.

Our problem was that our CUCM is in the data center, and multiple members of our server staff were running backups of applications over the production network.

Our firewall/FWSM interfaces were getting congested to over 100% utilization, causing very erratic problems with IP phones and remote voice gateways registered to the CUCM subscriber behind the firewall.

We removed all the backup traffic and are replacing the FWSM with ASA devices.

This problem was a killer.


John
 
Something interesting to note on page 61 of the CUCM SRND

"The recommendation to limit the number of devices in a single Unified Communications VLAN to
approximately 512 is not solely due to the need to control the amount of VLAN broadcast traffic. For
Linux-based Unified CM server platforms, the ARP cache has a hard limit of 1024 devices. Installing
Unified CM in a VLAN with an IP subnet containing more than 1024 devices can cause the Unified CM
server ARP cache to fill up quickly, which can seriously affect communications between the Unified CM
server and other Unified Communications endpoints. Even though the ARP cache size on
Windows-based Unified CM server platforms expands dynamically, Cisco strongly recommends a limit
of 512 devices in any VLAN regardless of the operating system used by the Unified CM server platform."

We ran into a similar issue and had to segment the broadcast domains. I suspect that the OP's issue was resolved by changing the ARP timers because the CUCM was learning too many ARP entries.
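(Aside, not from the original post: CUCM is a closed appliance, so you can't poke at it directly, but on any ordinary Linux host the neighbor-table limits the SRND is alluding to are visible under /proc, and it's easy to sanity-check a voice VLAN's subnet size against them. The sketch below is purely illustrative; the example subnet is a placeholder, not a value from this thread.)

#!/usr/bin/env python3
"""Sketch: compare a Linux host's ARP/neighbor-table limits against the number of
possible hosts in a given voice VLAN subnet, in the spirit of the SRND's
~512-devices-per-VLAN guidance. Runs on any ordinary Linux box (CUCM itself is a
closed appliance); the example subnet is a placeholder, not from the thread."""

import ipaddress
from pathlib import Path

VOICE_VLAN_SUBNET = "10.20.0.0/21"   # placeholder: a /21 allows up to 2046 hosts
SRND_RECOMMENDED_MAX = 512           # per-VLAN device guidance quoted above

def neigh_thresholds() -> dict:
    """Read the IPv4 neighbor (ARP) table garbage-collection thresholds."""
    base = Path("/proc/sys/net/ipv4/neigh/default")
    return {p: int((base / p).read_text()) for p in ("gc_thresh1", "gc_thresh2", "gc_thresh3")}

if __name__ == "__main__":
    net = ipaddress.ip_network(VOICE_VLAN_SUBNET)
    hosts = net.num_addresses - 2            # exclude network and broadcast addresses
    limits = neigh_thresholds()
    print(f"Subnet {net}: up to {hosts} hosts")
    print(f"Kernel neighbor-table thresholds: {limits}")
    if hosts > limits["gc_thresh3"]:
        print("WARNING: subnet can hold more hosts than the hard neighbor-table limit")
    if hosts > SRND_RECOMMENDED_MAX:
        print(f"NOTE: subnet exceeds the SRND's suggested ~{SRND_RECOMMENDED_MAX} devices per VLAN")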

Hope this helps someone.

 
Thanks Agent6376

What puzzles me is why we were able to run 1400+ VOIP phones in this "toxic" ARP environment for 11 months without issue, yet only shortly after upgrading CUCM from 8.0(2) to 8.5(1) did the problem begin showing up. IOW, what changed in 8.5 to make it so susceptible to ARP activity? Something had to have changed.

We recently moved on to 8.6.2(a) and believe me, with last year's experiences still fresh in our minds, everyone's sphincters were tightly puckered when we switched over. Fortunately no major issues and we've been on 8.6.2(a) for a couple months now.

As you can probably imagine, our shop was a really scary place to be a year ago last April when all the trouble hit. Fortunately cooler heads prevailed, but there was a point last year that our Cisco Acct. exec. was afraid to show his face for fear that our Sr. execs were going to tell him to yank it out.

Original MUG/NAMU Charter Member
 
