
Losing UDP or 3-digit dial ability over network on IP Office on COMCAST MPLS network?

Status
Not open for further replies.

PTG

Programmer
Mar 12, 2013
10
0
0
US
I'm hoping someone here has experienced or seen this issue and can give us some insight into the cause. We have exhausted every idea on our end over the last 4 months, with no result.
Overview:
Our customer has four office sites, with an IP Office V500 control unit on firmware 8.1(69) at each site and VM Pro on a server at site A. As far as I know, the network has a SonicWall device, a Comcast Ciena router, and an unmanaged 48-port switch. The sites are connected via fiber with a Comcast "copper over Ethernet" MPLS-style network. They have a PRI at the main site (let's call it site A) and only a few analog lines at each of the other three sites (B, C, and D); there is NO PRI at those three sites. All the inbound traffic (phone calls) comes in to the three sites on the analog lines, and all the outbound traffic goes out over the network on the PRI at site A. This was designed to keep the analog lines open for inbound traffic and to save the customer the cost of a PRI at each site. They dial 9 to get a trunk over the network and dial out on a PRI channel at site A. We have programmed IP routes in the IPO at each site, from each site to every other site in a "mesh", so when one site 3-digit dials another, the call passes directly and not through the main site.
PROBLEM: At random intervals, the sites LOSE the ability to 3-digit dial the main site A. They can still 3-digit dial the other sites. (So sites B, C, and D can all talk to each other, but site A loses connection with the others.) Along with this they of course lose access to the PRI, so calls cannot go out over the network and grab a PRI channel, and they lose voicemail since VM Pro resides at site A as well. This can happen as often as twice a day, but it can also go as long as 2 weeks without incident. When it happens, you just get a "BUSY" tone when you try to 3-digit dial. The strange thing is that Avaya System Status shows NO issues or errors. I can also log in to System Status and PING site A's IP address from any of the sites, even though they cannot reach it by voice call over the network. For this reason we think it's a UDP issue and that something is blocking the voice packets. It's not like the voice is being "stripped" off and the call arrives with no audio; it is a busy tone. When we watch a call in System Status we see the call attempt go out and then fail. We have traced calls with the various tracing programs, but we cannot decipher what we are seeing.
What we have done so far:
We replaced the entire control unit at site A. The customer's network tech, who has been working with us this whole time, has worked with SonicWall to check and verify all the settings in the firewall. They feel that if it were a setting, it would either work or not work, not fail just sometimes. We have also found that it does not seem to be directly related to heavy call traffic; it can happen in the middle of the night. They can leave the office with it working fine, and the next morning it's down when they come in. Comcast support has been of very little help and we are frankly very frustrated with them. They say it is nothing on their side. The only thing they did try was to increase a setting on their end called "PPS", or packets per second, which as we understand it is a compression / limit setting on voice calls, etc. It looked promising at first, but the issue still remains.
Our Ideas:
Could there be some sort of "threshold" that is being reached that severs the "handshake" between the IPO and the outside network as far as the voice or UDP side is concerned? Our senior programmer/engineer, who has installed hundreds of IPO systems, has run out of ideas at this point.
RECAP: When we reboot the IP Office at main site A, the problem goes away for a while, but never for the same amount of time: anywhere from 1 day to 1 week.
Does anyone have any ideas? Thanks in advance.
 
I don't care what they say, this sounds like a Comcast issue. What makes it hard to solve is that it only happens once in a while, which is most likely why they are having a hard time pinpointing the issue.

acss sme acis sme acss cm 5.2.1 acss cm and cmm acss aura messaging.
 
That's what we are thinking as well. But in the back of our minds we are still wondering if it could be some obscure thing in the network or firewall, although SonicWall doesn't think so. It's just so weird that we don't lose network connection entirely: the Avaya SCN is still up, we can still PING from one control unit to the other, and we see no errors. There is just NO 3-digit dialing capability. Then a simple reboot fixes it, and there is no pattern whatsoever. Lately it has become more frequent. We had to reboot 4 times in the last 3 days...
 
Understood. If it were a firewall issue it would be broken all of the time. It could be some flaky networking gear on Comcast's end; I still believe that it is on their end. You may have to monitor the network using Wireshark to see exactly what is happening here, or just get another circuit installed.

acss sme acis sme acss cm 5.2.1 acss cm and cmm acss aura messaging.
 
Regarding your mesh: do you have actual network connections from each site to every other site to support your mesh programming? Or is it really a star?

Sysmonitor traces from each site would give good information on the cause of the issue.

 
It is a mesh. Every control unit has an IP route to every other box. But we only seem to keep losing the one site. When they lose the 3-digit dial ability, the other sites can all still communicate with each other, but not with the one site, and that "down" site can't reach the others. BUT we can still ping from System Status to that site and vice versa. So we are not losing the network connection completely, only the ability to pass voice traffic (3-digit dial, access the VM, etc.). And what makes it critical is that the site we lose is the one with VM Pro and the PRI, which everyone trunks out through to call out.
Just wondering if anyone out there has had this problem. None of my other customers have had an issue like this, but we only have two using the Comcast MPLS setup, and this is one of them. When they try to make a 3-digit call they just get a busy tone, and in System Status we see the call attempt, but it fails...
It's just weird, because if it were a setting, it would either work all the time or not at all. And once it stops working, it never works again until we reboot the IPO at that site.
 
Until you can solve this issue, create an overflow ARS for each of sites B, C, and D to use the local copper lines if the PRI at site A is unreachable. Make sure the copper trunks are set to both ways. I still believe this is a Comcast issue.

acss sme acis sme acss cm 5.2.1 acss cm and cmm acss aura messaging.
 
It could be that there is a duplicate IP address that hits your LAN 1, where only a reboot will resolve it, or it could be that the LAN 1 port has issues and this is the symptom. Wireshark traces will find a duplicate IP address, and only replacing the chassis will solve a bad LAN 1 port.

Just because you have programmed routes into the IPO does not mean that you have an actual mesh. You need your provider to have network connections from every site to every other site to have an actual mesh. Otherwise your not-so-mesh can have routing issues as it tries to send packets around the not-real mesh. Wireshark running on port mirrors at each site can tell you where the packets end up.
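To make the mesh-versus-star distinction concrete, here is a minimal sketch that models the four sites as a graph and checks which way a B-to-C call would have to travel. It only illustrates the idea; the link lists are hypothetical and say nothing about how the Comcast MPLS network is actually built.

```python
from collections import deque

def shortest_path(links, src, dst):
    """Breadth-first search over undirected site-to-site links."""
    neighbours = {}
    for a, b in links:
        neighbours.setdefault(a, set()).add(b)
        neighbours.setdefault(b, set()).add(a)
    queue = deque([[src]])
    seen = {src}
    while queue:
        path = queue.popleft()
        if path[-1] == dst:
            return path
        for nxt in sorted(neighbours.get(path[-1], ())):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None  # no route between the two sites

sites = ["A", "B", "C", "D"]

# Hypothetical full mesh: every site has a real connection to every other site.
full_mesh = [(a, b) for i, a in enumerate(sites) for b in sites[i + 1:]]

# Hypothetical star: every remote site only has a connection to hub site A.
star = [("A", s) for s in sites if s != "A"]

print(len(full_mesh), shortest_path(full_mesh, "B", "C"))  # 6 ['B', 'C']
print(len(star), shortest_path(star, "B", "C"))            # 3 ['B', 'A', 'C']
```

In the star model every B-to-C call has to cross site A, so B and C staying reachable while A is "down" would point at something other than a plain hub outage. A traceroute between the remote sites would show which model matches the real network.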




 
Can you reboot any other piece of equipment to restore voice? Or is it only rebooting the IPO that fixes it? I agree with the others, Wireshark at site A while the system is misbehaving will tell the tale.
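Since ping keeps answering during the outage, a quick complement to a Wireshark capture is to test a signalling port directly from a PC at a remote site while the fault is present. The sketch below is only that, a sketch: the address is a placeholder, and port 1720 (the standard H.323 call signalling port) is an assumption about what the trunk between the sites uses. It tests TCP only and cannot tell you anything about UDP voice traffic.

```python
import socket

def tcp_port_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False

def probe(host, ports, timeout=3.0):
    """Check several TCP ports on one host and return {port: reachable}."""
    return {port: tcp_port_open(host, port, timeout) for port in ports}

# Example call (placeholder address, assumed port), run from a remote site:
# print(probe("192.0.2.10", [1720]))
```

If ping succeeds but the signalling port stops accepting connections only while calls are failing, that narrows things down to either the IPO at site A or something in the path that filters by port, rather than a general loss of connectivity.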
 
Thanks for the input, guys. The only thing that fixes the issue is a reboot of the IP Office at site A. We tried rebooting other sites and nothing changes. And I stand corrected, it IS a "star" topology, not a mesh. But when someone 3-digit dials another site (such as site B to C), the call doesn't route through site A, which is why we never lose 3-digit dialing between those sites, only to site A. And if there were a routing issue in the programming, it wouldn't work at all. I got a call again about an hour and a half ago and it was down again. I waited about an hour before I rebooted to see if it would recover, but it did not.
In response to your idea, Joe: we didn't set up an overflow group, but they can dial 8 to grab an analog line instead when the issue occurs and they can't call out on the PRI. I think we tried an overflow, but the issue is that the system would have to see the PRI as "not available" in order to go to an overflow automatically, no? It thinks the PRI is still there; the call just fails.
As for the IP address conflict, I believe we looked at that too. I think we went with a different IP address on the IPO just in case, with no change. I will verify this with the IT guy though.
As for the last idea of changing the whole chassis, I believe we did that as well. But I will verify with our tech that we didn't just switch out all the modules in the chassis. That being said, wouldn't a bad LAN 1 port cause us to lose network connection (TCP) altogether? I can still ping the control unit when they are down, and I never see an error in any of the control units at the other sites saying there was a problem reaching that site (the "no response from IP xxxxxx", etc. error).
 
He meant: does rebooting the router restore connectivity, not just the IP Office? I have seen a few issues with Drayteks where rebooting the system or the router would fix SCN issues. It turned out to be stale NAT sessions or something like that. So rebooting the system helped, but only because it gave the router issue time to clear. The issue was the router(s) :)

 
How would sites B and C avoid routing through site A if you have a star where A is the hub? Is your MPLS provider routing at their facilities?

 
You are correct, I stand corrected. It IS a "mesh" topology, NOT a "star". My fellow programmer, who actually built most of this system, misinformed me. Sites B, C, and D can use their analog lines to call out when site A goes offline. They just lose the ability to reach that site and to access the PRI for outbound calls and VM Pro.
And to double-check the comment someone made about a duplicate IP: when we reboot site A, I get an error in System Status at the other sites showing that IP as unreachable during the reboot. If there were a duplicate IP on the network we wouldn't see that error, right? They would still see a device pinging back...
 
My guess is that you have a star network with IP routes programmed in the IPO as if you had a mesh.
It feels like a mesh since calls route over the analog trunks when the SCN link to the main site is down.
Main site here means the site with the centralized voicemail installed.

When there is a duplicate address on the network, the ARP table entry will be overwritten with the MAC address of the device that has the same IP address as the IPO control unit. When you ping, you still receive a response, albeit from the wrong MAC address. Running arp -a on the VM Pro computer will list the IP address to MAC address assignments. Wireshark will also show the incorrect MAC-to-IP assignment. One big tell is that Manager can report network issues when trying to retrieve a config and at times will not load the cfg.
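The arp -a check described above can be scripted so it can be run on the VM Pro computer each time the fault appears. This is a rough sketch, not a finished tool: the function names are made up for illustration, the addresses in the usage comment are placeholders, and the parser simply looks for an IPv4 address and a MAC address on the same line, which should cover the usual Windows and Linux output layouts but is not guaranteed to match every platform.

```python
import re
import subprocess

IP_RE = re.compile(r"\b(\d{1,3}(?:\.\d{1,3}){3})\b")
MAC_RE = re.compile(r"\b([0-9A-Fa-f]{1,2}(?:[:-][0-9A-Fa-f]{1,2}){5})\b")

def normalise_mac(mac):
    """Lower-case, colon-separated, zero-padded form, e.g. 00:0a:1b:2c:3d:4e."""
    return ":".join(part.zfill(2).lower() for part in re.split(r"[:-]", mac))

def parse_arp(text):
    """Map each IP address to the MAC address shown on the same line."""
    table = {}
    for line in text.splitlines():
        ip = IP_RE.search(line)
        mac = MAC_RE.search(line)
        if ip and mac:  # header lines carry no MAC and are skipped
            table[ip.group(1)] = normalise_mac(mac.group(1))
    return table

def mac_matches(table, ip, expected_mac):
    """True/False if the IP is in the table, None if there is no entry for it."""
    if ip not in table:
        return None
    return table[ip] == normalise_mac(expected_mac)

def read_arp_table():
    """Run the system's arp -a command and parse its output."""
    out = subprocess.run(["arp", "-a"], capture_output=True, text=True, check=True)
    return parse_arp(out.stdout)

# Example (placeholder values): record the control unit's real MAC while the
# system is healthy, then compare during an outage.
# print(mac_matches(read_arp_table(), "192.0.2.10", "00-11-22-33-44-55"))
```

A False result during an outage, after a True result while healthy, would support the duplicate-address theory; a True result in both states would argue against it.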

 