Phones reboot daily - HeartbeatRecvTimeoutHandler.cpp;273

Status
Not open for further replies.

Madikus

IS-IT--Management
Jan 15, 2013
6
0
0
US
Hi,

We are in the middle of a new implementation of a 3300. Everything is up and working (Mitel did the entire install) with one exception. We haven't gone live yet because every phone at any location connected to our data center over MPLS experiences a heartbeat timeout at random, at least once a day (this does not happen to devices connected inside the data center). The timeout causes the phone to reboot, which drops the network port and disconnects the attached PC. The phone comes straight back up and we are good for the rest of the day. Mitel has been working on this for 3 months without a resolution. They continue to blame our SonicWall, our network configs and our virtual hardware. The SonicWall is our central gateway and has all the subinterfaces built for our VLANs. SonicWall VoIP engineers have looked at our configs from top to bottom and insist it isn't the SonicWall causing the issue.

The phone reboots happen at random times from random locations (we have 15 remote sites, all connected via Netsolutions MPLS). The only thing consistent is that it happens to every location at least once per day. Sometimes more. Other than this random reboot, the phones work perfectly. All locations have roughly 15 phones. They are connected using Mitel-supplied HP ProCurve PoE switches. The power consumption is nominal. The handsets are 5330s. DHCP is handed out via each location's Adtran MPLS router. We've disabled ICMP redirects on a few of the locations with no success.

Here's an example description:

Code:
ICP has lost contact with (10.0.122.4 (08-00-0F-6D-38-16)), eAtlas cluster pool free: 6144 and low water mark: 5972
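For anyone wanting to see whether the drops cluster by site, a minimal Python sketch along these lines could tally the messages. It assumes the log wording matches the line quoted above exactly and that each remote site sits in its own /24 (a guess for illustration, not something stated in this thread); `parse_lost_contact` and `count_by_subnet` are hypothetical helper names.

```python
import re
from collections import Counter

# Matches the message format quoted in the post above. Other software
# loads may word the message differently (assumption).
LOST_CONTACT = re.compile(
    r"ICP has lost contact with \((?P<ip>\d{1,3}(?:\.\d{1,3}){3}) "
    r"\((?P<mac>(?:[0-9A-Fa-f]{2}-){5}[0-9A-Fa-f]{2})\)\)"
)


def parse_lost_contact(line):
    """Return (ip, mac) if the line is a lost-contact message, else None."""
    match = LOST_CONTACT.search(line)
    if match is None:
        return None
    return match.group("ip"), match.group("mac")


def count_by_subnet(lines):
    """Count lost-contact events per /24 (one /24 per site is an assumption)."""
    counts = Counter()
    for line in lines:
        parsed = parse_lost_contact(line)
        if parsed is not None:
            ip, _mac = parsed
            # Drop the last octet to get the /24 network prefix.
            counts[ip.rsplit(".", 1)[0] + ".0/24"] += 1
    return counts
```

Feeding it an exported maintenance log, one line per entry, would show quickly whether every site is affected equally or whether a few subnets dominate.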

We have tried giving a couple of locations priority on the SonicWall for all traffic bound for the controller at the data center, without success. We've increased TCP and UDP timeouts for all traffic to the controllers, without success. We've isolated all controller, MPLS and gateway NICs to the same physical switch, with no success.

Any suggestions? I'm out of patience waiting for them to figure it out...
 
Forgot to add:

DHCP lease is set to 7 days. LRO was adjusted per Mitel bulletin. All MiNet traffic bound for the vMDC is priority 1 on the firewall.
 
Wow, I don't think I have the expertise to help on something like this. However, I would say it has to be something in your setup, given that it does not happen on anything within the data center. What about running the phones as teleworkers (through an MBG)? Teleworkers might not be as strict about the heartbeat, as they typically connect over the internet where there is no QoS. All the phones in our office run off head office over MPLS, as do all our remote offices, but I don't have any visibility of the setup past the Cisco router that connects us to MPLS.

I'd tell you a UDP joke but I'm afraid you won't get it. TCP jokes are the best because you always get them.
 
Are you still having this problem? We just started having it after a recent upgrade of our 3300 ICP. My network is not as complicated: we have two buildings connected through two pairs of fiber.
 
We had the same heartbeat problem with our 3300 environment a few years back and figured out that it was due to a scheduled nightly reboot of the WatchGuard Firebox devices (routers) that connected our separate sites. After disabling the scheduled reboots, the heartbeat errors went away.
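A scheduled cause like that tends to show up as a spike at one hour of the day. As a rough sketch, assuming you can export the heartbeat error times as ISO 8601 strings (the input format and the `events_by_hour` name are assumptions, not anything the 3300 provides as-is), bucketing by hour in Python looks like this:

```python
from collections import Counter
from datetime import datetime


def events_by_hour(timestamps):
    """Count events per hour of day from ISO 8601 timestamp strings."""
    counts = Counter(datetime.fromisoformat(ts).hour for ts in timestamps)
    # Sort by hour so a spike is easy to spot when printed.
    return dict(sorted(counts.items()))
```

A flat spread across the day points away from a scheduled job, while a single dominant hour is worth checking against reboot or maintenance schedules on the devices in the path.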

Do you see any heartbeat errors for phones that are local to the site where your actual 3300 controller is located or is it only happening to phones at the remote locations?
 
We don't have any phones yet at remote locations. All of our phones are in one of two buildings connected locally.
 
Maybe try setting the PoE priority to high on a few of the ports on your switches and checking to see if those specific phones reboot along with the others? I'm no Mitel expert, just an intermediate user, so I can't really give you much advice as far as the 3300 goes, but it does sound like some device between the 3300 and the phones is the culprit.

We have, however, had quite a few hardware issues with the 3300s themselves that have caused really odd problems (random 3300 reboots, etc.): failing hard drives, failed power supplies, and the like.
 
Thanks for the info, guys. Turns out we were chasing the wrong issue. The Adtran router (which was handing out DHCP locally to the devices) was not renewing leases. That caused the phones to reboot, which in turn caused the heartbeat check to fail. We moved DHCP to a domain controller at the data center and the problem went away. Mitel is dumbfounded and Adtran cannot provide us with any answers, so this is our solution. I have to say, I'm really disappointed with Mitel from top to bottom. I rolled out about 4000 Cisco IP phones with CallManager a few years back and it was nothing like this. Anyway, thanks again for the ideas.
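For reference when checking a renewal problem like this: under the RFC 2131 defaults a DHCP client tries to renew at T1 = 50% of the lease and falls back to rebinding at T2 = 87.5%. The Python sketch below works out those points for the 7-day lease mentioned earlier in the thread. It assumes the phones follow the default timers; a server can override them with DHCP options 58 and 59, and nothing here confirms what the 5330 or the Adtran actually did. The function names are illustrative only.

```python
from datetime import datetime, timedelta


def dhcp_timers(lease_seconds):
    """Return (T1, T2) in seconds using the RFC 2131 default ratios."""
    t1 = lease_seconds // 2        # renewal: 50% of the lease
    t2 = lease_seconds * 7 // 8    # rebinding: 87.5% of the lease
    return t1, t2


def renewal_schedule(lease_start, lease_seconds):
    """Map a lease start time to its renew, rebind and expiry times."""
    t1, t2 = dhcp_timers(lease_seconds)
    return {
        "renew_at": lease_start + timedelta(seconds=t1),
        "rebind_at": lease_start + timedelta(seconds=t2),
        "expires_at": lease_start + timedelta(seconds=lease_seconds),
    }
```

Comparing these times against the moments a phone drops (from a packet capture or the server's lease table) is one way to tell whether unanswered renewals line up with the reboots.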
 
Is it Mitel, Netsolutions or your dealer you are disappointed in? This forum is probably full of people like myself who have deployed numerous Mitel MCD phone systems without any issues whatsoever. Ultimately they were correct in their original diagnosis, in that it was not the MCD controller that was the issue; it was the Adtran. Why do you feel that is Mitel's fault?

I'd tell you a UDP joke but I'm afraid you won't get it. TCP jokes are the best because you always get them.
 
Bet the DHCP server in the Mitel is better than the one in the Adtran router as well...
 
It seems to me that this is the same as a known issue on small Cisco routers, where the DHCP server does not respond to the lease renewal. I thought Mitel might have spotted it sooner. I've never seen it published, but it has been quoted to me several times over the phone.
 
LoopyLou,

We deal directly with Mitel and Netsolutions (same company). No 3rd-party reseller or implementation team. I am disappointed because they sold this entire package to us as an off-the-shelf solution that would take 3 months to deploy. They began the project 11 months ago. All the networking equipment is theirs, all the configs are theirs, and every step of the way, absolutely everything that could go wrong has gone wrong (at least it feels that way). So you ask why I think this is their fault. The answer is simple: it is. We didn't change anything from their recommended build. They designed it top to bottom; we signed the paperwork and gave them access to the facilities they needed to do their job.

Anyway, my intent certainly wasn't to nitpick them. 11 months into a 3-month project, they finally have us turning up sites on the new phones.
 
I'll also add this - now that it's working, it's great. The implementation teams have been working around the clock to get it all dialed in now that it's in production.
 
Not trying to be defensive, but network design is only 10% of the equation.

Network implementation of the design is the meat and potatoes (so to speak).

From my experience, Mitel will not touch your network. They will only tell you how it needs to function.

The failure you experienced was outside of their responsibility and it is perfectly reasonable for them to stand back and wait for you to repair whatever is broken in the network.

I can't tell you how many times I've been in this exact position and had to insist that the network was at fault before someone would even troubleshoot their network.

Basically, put yourself in their shoes. How would you have done it differently?

**********************************************
What's most important is that you realise ... There is no spoon.
 
The problem is that it was their network design and equipment. Netsolutions provided the Adtran router, did all the programming on it and shipped it, and Mitel installed it. Netsolutions's datanet division programmed all the switches (every switch in every building was provisioned and installed by them). The only portion of the network we control is the firewall. The rest is their equipment and programming. And in my experience, Mitel and Netsolutions have touched every aspect of our network with the exception of the firewall.

Anyway, I'm taking this thread off notice - I appreciate all the feedback and suggestions.
 
Then your scenario is very unusual in every respect, in my experience, and you have every right to be disappointed.

Something like that is relatively easy to troubleshoot IMHO, especially for those familiar with, and with access to, all the components.

Hopefully now things will take an upswing for you.



**********************************************
What's most important is that you realise ... There is no spoon.
 
This proves the rule: If you want it right, do it yourself. My 2c
 