
450 Stops Responding to Ping Periodically, but Always Forwards Frames


AbstractZero (IS-IT--Management)
Apr 29, 2003
This is weird. Sometimes one of my 450 switches stops responding to ping. While it is unresponsive, I can still ping THROUGH the switch to equipment on the far side. My configuration is:

Me ---- 450a ---- 8600 ---- 450b ---- 450c ---- host

The problem is with 450b. Sometimes pings to it will time out for a minute or more. During this time, the host at the end can still be reached! Further, even the 8600 switch cannot ping 450b when it is "down." But end-to-end connectivity never stalls.

WHAT IN THE WORLD IS THIS????

--Eric
 
Ok here is my guess.

In version 3.1 of the 450 firmware, one could always forward AND ping.

Version 4.0 was so much slower that they biased it toward functionality over manageability, so sometimes it is hard to connect to (the larger the stack, the worse this is).

By 4.2.0.22 they had the code tightened up to where it could forward and ping all the time.

I am betting that 450b is on firmware 4.0, 4.1, or a very early 4.2. If so, you can upgrade it.

I tried to remain child-like, all I achieved was childish.
 
That was my guess, too, but I did not want to influence the answer. I noticed that the switch responds to pings in 20-40ms, while devices attached to its ports respond in <1ms. That led me to believe that the code puts frame forwarding far above management functions in order of priority, but I would not have expected it to drop management functions altogether. Your comments about the switch code versions are well received, but I seem to be on version 4.4.0.6. See the login banner below. You think maybe it is the firmware?

*******************************************************
* Nortel Networks *
* Copyright (c) 1996,2003 *
* All Rights Reserved *
* BayStack 450-24T *
* Versions: HW:RevL FW:V1.48 SW:v4.4.0.6 ISVN:2 *
*******************************************************
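The 20-40 ms versus <1 ms gap is easy to put numbers on. Below is a minimal sketch (Python, assuming a Unix-like station with the standard ping utility and Linux-style flags; both addresses are placeholders, not the real ones from this thread) that pings the switch's management IP and a host reached through it, then prints the round-trip stats side by side:

# Minimal sketch: compare ICMP round-trip times to a switch's management IP
# versus a host reached THROUGH that switch. Assumes a Unix-like system with
# the standard "ping" utility on PATH; both addresses are placeholders.
import re
import subprocess

SWITCH_MGMT_IP = "10.0.0.2"   # placeholder: the 450's management address
HOST_BEHIND_IP = "10.0.0.50"  # placeholder: a host reached through the 450

def ping_rtt_ms(target, count=5):
    """Send `count` pings and return the individual RTTs (ms) that came back."""
    result = subprocess.run(
        ["ping", "-c", str(count), "-W", "1", target],
        capture_output=True, text=True,
    )
    # Reply lines look like: "64 bytes from 10.0.0.2: icmp_seq=1 ttl=64 time=23.4 ms"
    return [float(m) for m in re.findall(r"time=([\d.]+) ms", result.stdout)]

for label, target in (("switch mgmt IP", SWITCH_MGMT_IP),
                      ("host behind it", HOST_BEHIND_IP)):
    rtts = ping_rtt_ms(target)
    if rtts:
        print(f"{label}: min={min(rtts):.1f} ms  avg={sum(rtts)/len(rtts):.1f} ms  replies={len(rtts)}/5")
    else:
        print(f"{label}: no replies")

If the second set of numbers stays flat while the first one swings around or disappears, that points at the management plane rather than the forwarding path.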
 
Darn, I confess I stopped at 4.2.0.22 and have not upgraded further (not broke, don't fix it), so I cannot report whether later versions went back to 'slow'.

Version 1.48 of the low-level code is the same as I use.

In all honesty, I cannot recommend downgrading the switches, so that is out unless you have a spare to test with; then I would load the last version 3.1 code on it.

What I can recommend is that you download the 90-day trial of Optivity Switch Manager and run tests to see that your VLANs, duplex, and multilink trunks are all identical on each end of your backbone cables.

I tried to remain child-like, all I achieved was childish.
 
Hi

I am seeing a similar issue on a Passport 8100 running v3.3.3.0 code. Management contact to the VLAN is lost, although all traffic through the switch appears to be passing normally.

Does this ring bells with anyone?

EB
 
I have the same problem with a Passport 8600.
When I use Device Manager and ping the Passport at the same time, I get many timeouts.


mewi


 
Hi Folks,

Now this issue sounds very familiar; it's exactly what we're experiencing, but we have one slight twist.

We run two 6x450T-24 switch stacks. All stack members are connected using cascade modules (400-ST1 using white cables).

The two stacks are then connected using a four-port multilink trunk running over fibre (400-4FX).

Stack 1 has an IP of 10.190.50.100 FW1.48 SW 4.2.0.9 H/W Rev L
Stack 2 has an IP of 10.190.50.200 FW1.48 SW 4.2.0.9 H/W Rev L

The stacks are configured with a single VLAN. So nothing fancy here.

From my machine connected to Stack 1, I ping Stack 2; at what appear to be random times, the ping times out exactly as detailed in previous posts.

The twist is that we DO get packet/frame loss. This came to light as we kept getting random disconnects from a FW1 box connected to Stack 2. The firewall was rebuilt on completely new hardware, thereby eliminating it from the fault-finding process; however, the disconnects still continue. Now that we knew what to look for, we found we also have random timeouts on various other devices connected to Stack 2, but not all of them.

Moving the firewall from unit 1 (base) to unit 2 (the switch with the fibre module) has reduced the timeouts by 95%+.

The two have been found to coincide by running a continual ping to Stack 2 and watching as first our firewall, then other devices, disconnect, and then the switch goes on holiday too!
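For anyone who wants to reproduce that test, here is a rough sketch (Python, assuming a Unix-like monitoring station with the standard ping utility; only Stack 2's address is from this post, the rest are placeholders for your own devices) that pings a handful of targets every few seconds and logs any sweep where one or more of them fails to answer, so you can see whether the firewall, the other devices, and Stack 2's management IP drop out together:

# Rough sketch: ping several targets once per sweep and log the ones that do
# not answer, to see whether they time out together. Assumes a Unix-like host
# with the standard "ping" utility; most addresses are placeholders.
import subprocess
import time
from datetime import datetime

TARGETS = {
    "stack2-mgmt": "10.190.50.200",  # Stack 2's management IP (from the post)
    "firewall":    "10.190.50.10",   # placeholder for the FW1 box
    "server-a":    "10.190.50.20",   # placeholder for another server on Stack 2
}
INTERVAL_S = 5

def is_up(ip):
    """One ping, one-second wait; True if a reply came back."""
    return subprocess.run(
        ["ping", "-c", "1", "-W", "1", ip],
        stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL,
    ).returncode == 0

while True:
    down = [name for name, ip in TARGETS.items() if not is_up(ip)]
    if down:
        print(f"{datetime.now().isoformat(timespec='seconds')} no reply from: {', '.join(down)}")
    time.sleep(INTERVAL_S)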

Stack 2 is under a medium-to-heavy load, with about 10 high-performance servers running from it and around 90% port usage by workstations. The MLT shows an average of 0.5% usage.

Looking at the logs doesn't reveal anything useful.

Stack 1 does not seem to suffer from the problem; it has lower usage.

Any help would be appreciated. Do you think 4.4 would be a good way to go, or should I spend more time looking at the cascade modules/cables?

Many thanks

Simon

ManxMann


Stack 1
Unit 1 - Base (400-ST1 + 400-4FX)
Unit 2 - (400-ST1 + 400-4FX)
Unit 3 - (400-ST1)
Unit 4 - (400-ST1)
Unit 5 - (400-ST1)
Unit 6 - (400-ST1)

Stack 2
Unit 1 - Base (400-ST1)
Unit 2 - (400-ST1 + 400-4FX)
Unit 3 - (400-ST1)
Unit 4 - (400-ST1)
Unit 5 - (400-ST1)
Unit 6 - (400-ST1)

MLT Config (Stack 1): 1/25 1/26 2/25 2/26
MLT Config (Stack 2): 2/25 2/26 2/27 2/28
 
hi everyone,

We did have similar problems with BPS2000 switches. No management, but the data through the switch was fine. We had this happen on several switches during a small period of time.

After a couple of minutes we could access the switch again, and we saw that the switch had reloaded, because the uptime was next to nothing.

We have also had switches that showed an uptime of about 130 years (???). My guess is that after a certain uptime the switch resets the management part, which explains the uptime of next to nothing.

I am not sure exactly what the software version is but I will post that later...

Since it has no impact on the data that passes through the switch (only on management), and we are changing to Optera Metro 1400 Ethernet Switch Modules, we are not putting much effort into solving this problem, but I am still curious as to why this happens.
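If you want to test the theory that the management part is restarting, one low-effort check is to poll the switch's sysUpTime over SNMP and see whether it goes backwards at the same moment the pings start failing. Here is a rough sketch (Python, assuming net-snmp's snmpget command is installed on the monitoring host and the switch answers the "public" read community; the IP address is a placeholder):

# Rough sketch: poll sysUpTime (MIB-II OID 1.3.6.1.2.1.1.3.0) once a minute
# and flag any poll where the value goes backwards, which would mean the SNMP
# agent (management plane) restarted. Assumes net-snmp's "snmpget" is on PATH
# and a read community of "public"; the target IP is a placeholder.
import re
import subprocess
import time

TARGET = "10.0.0.2"              # placeholder management IP of the switch
OID = "1.3.6.1.2.1.1.3.0"        # sysUpTime, in hundredths of a second

def sys_uptime_ticks(host):
    """Return sysUpTime ticks, or None if the switch did not answer."""
    result = subprocess.run(
        ["snmpget", "-v1", "-c", "public", "-t", "2", "-r", "0", host, OID],
        capture_output=True, text=True,
    )
    match = re.search(r"Timeticks:\s*\((\d+)\)", result.stdout)
    return int(match.group(1)) if match else None

last = None
while True:
    ticks = sys_uptime_ticks(TARGET)
    if ticks is None:
        print("no SNMP response (management plane unreachable?)")
    elif last is not None and ticks < last:
        print(f"sysUpTime went backwards ({last} -> {ticks}): looks like a management restart")
    else:
        print(f"uptime: {ticks / (100 * 86400):.2f} days")
    if ticks is not None:
        last = ticks
    time.sleep(60)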

InDenial

 
Similar problem with a BS470 on the latest BoSS 3.1.
We also lose ICMP connectivity to this box from our NetView.
The workaround is to use a smaller subnet mask to force the packets via the next router.
Apparently an ARP-cache issue. Error messages appear as follows on the console port:

BayStack 470 - 48T HW:#05 FW:3.0.0.5 SW:v3.1.0.78 ISVN:2

ip: 163.157.162.71
mask: 255.255.254.0
GW: 163.157.162.1

######################################################################################
f15008o-a#ping 163.157.162.160
0x1dd62a0 (tPingTx4): arpresolve: can't allocate llinfo0x1dd62a0 (tPingTx4): are
f15008o-a#sh arp
Port IP Address MAC Address
---- --------------- -----------------
1 163.157.162.1 AA:00:04:00:0E:14
1 163.157.162.5 AA:00:04:00:0F:14
f15008o-a#ping 163.157.162.160
0x1dd62a0 (tPingTx5): arpresolve: can't allocate llinfo0x1dd62a0 (tPingTx5): are
f15008o-a#ping 163.157.162.123
Host is reachable. time=30 ms
f15008o-a#sh arp
Port IP Address MAC Address
---- --------------- -----------------
1 163.157.162.1 AA:00:04:00:0E:14
1 163.157.162.5 AA:00:04:00:0F:14
1 163.157.162.123 00:04:DC:3E:26:47
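To see why the smaller subnet mask helps: the 470 only needs an ARP entry of its own for destinations inside its configured subnet; anything outside goes to the gateway, whose entry (163.157.162.1) is already in the table above. A small illustration (Python; the addresses are the ones from this post, and the /25 is just an example of a narrower mask, not a recommendation):

# Illustration of the subnet-mask workaround: whether the 470 must ARP for a
# destination itself (and hit "arpresolve: can't allocate llinfo") or can hand
# the packet to its gateway depends on whether the destination falls inside
# the switch's own configured subnet. /23 is the mask from the post; /25 is an
# arbitrary narrower example.
import ipaddress

switch_ip = "163.157.162.71"                          # the 470's own address
target = ipaddress.ip_address("163.157.162.160")      # the host it cannot ping

for prefix in (23, 25):
    subnet = ipaddress.ip_network(f"{switch_ip}/{prefix}", strict=False)
    if target in subnet:
        print(f"/{prefix}: {target} is on-link -> the switch must ARP for it itself")
    else:
        print(f"/{prefix}: {target} is off-link -> traffic goes via the gateway 163.157.162.1")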
 
I have had ping loss on 8600s. The throughput seems to be fine, but it is a pain not being able to trust the NMS when it indicates that the 8600s have gone critical. The 8600s are currently on 3.3.4 and we are waiting for downtime to upgrade to 3.5.3.
 
ManxMann,

We had an identical problem to yours a while back. It took Nortel ages to find the fault as it was intermittent. However, they eventually identified a problem with code prior to 4.2.0.16 when using MLTs, as per this release note:

[Q00500160]—In some previous releases, a large stack that had DMLT configured could experience dropped packets and connectivity problems when the same MAC addresses were learned on the various DMLT interfaces. Inconsistencies in the MAC table and CAM could develop. This has been fixed with software version V4.2.0.16.


Upgrade to release 4.2.0.16 or above and hopefully this should cure your problem.

Let us know how you get on.

AC
 
Hello, we have the same problem with a BS450-48T stack of 2 units. Unit 1 has a Gigabit uplink to the core and to a WAN router. Unit 2 has a 100-Mbps FX module to connect to another BS450 unit in a second closet.

The problem appeared after changing all the user ports from static 10 Mbps half-duplex to autonegotiation.
The ports are now working at 100 Mbps full-duplex, and we now receive periodic ICMP ping timeouts.
The device is installed in a remote location.
Interesting workaround:
As soon as I clear the ARP cache on the next router, the device is reachable again.
Apparently ARP broadcasts are still answered while ICMP pings are not. No problems with connectivity for the users.
Also, the single BS450 connected behind the stack is still reachable.
What a cowshit !!


 
I have been following this post with a great deal of interest.
At our core is an 8610 with ATM, MM fiber, SM fiber, and TX copper ports.
An 8606 is at the edge with ATM, TX copper, and MLT SM fiber.
Our 350 & 450 switches have 4.2.0.9, 4.2.0.16, and 4.2.0.22 code.
None of our switches are exhibiting the behaviour mentioned here.
Some of the 350 switches are direct TX connections to the 8610. Most of our switches are on the other end of an ATM/FR cloud from the 8610 (usually an ARN and a 350 or 450 switch).
Between other 350/450 switches is fiber and an 1100 or 1200 switch with TX or fiber connections to stacks of 350 or 450 switches.
The 8606 at the edge with ATM has fiber MLT trunks to 470 switches with BoSS 3.0 code.
We monitor this equipment with Optivity, Device Manager, WUG, MRTG, and SolarWinds (plus telnet, ping, and tracert).
I do not see any of the problems you have mentioned.
I look forward to the resolution of this issue.

Rick Harris
SC Dept of Motor Vehicles
Network Operations
 
We have this problem on 470's. It happens when multicast traffic is high (checking the IGMP snoop table shows lots of connections). The 470, 8600, and other switches place responding to pings way down the list of importance, behind things like forwarding packets. If the switch gets too busy, it won't respond to pings.

We often have people from the server groups complain that they're getting long ping responses from their gateway on the 8600's and assume that this means their connections are slow. It doesn't mean this at all. Pinging the 8600 IP interfaces is not a usable performance troubleshooting tool.
 
It's funny you should say that about multicast. We have had a problem with ping responses from the 8600 for some time. The main multicast is located on a separate VLAN with IGMP at the edge and DVMRP in the core. However, our default VLAN, with the users and the ping responses, only had IGMP. We enabled DVMRP on the 8600 for that VLAN a couple of weeks ago... so far no ping loss. There are only a couple of low-bandwidth sources on the default VLAN, but this may have solved the problem.

It is early days still and I am still monitoring the logs.
 
Good suggestion about multicast. Where do I check to see if that might be a problem on my 8600 or 450?
 
On the 8600 you can enable DVMRP globally, but make sure IGMP is disabled first. You then enable it per VLAN. Be careful when you do this, as there will be a period when both are disabled and the multicasts may be free to flood your network. DVMRP is meant to route multicasts, but we don't use it for that, only to learn and prune the multicast trees.

IGMP should be enabled per VLAN on the 450 (I haven't used these, only the 460s and 5510s). Check to make sure the edge switches are all on IGMPv2. This cannot be done via JDM but can be done through telnet. If any devices are on IGMPv1, then all the devices may fall back to that mode. I have been told this last snippet, but have not seen it documented.
 
With the 8600, the system gives ICMP a very low CPU priority, so if the CPU is busy doing other things, the ping will either return extremely long times or time out. This is by design.

With that in mind, I would look at what is going on with the switch to make sure there is not something else spiking the CPU. As referenced, multicast has caused us issues in the past (Symantec Ghost).

One of the hardest things to get our support staff to understand is that longer ping times to the 8600 gateways are acceptable and can no longer be used to verify traffic (they were used to pinging a BCN).
 
We recently upgraded code on a few of our pure 450 stacks and have encountered the dropped ping issue. In speaking with Nortel support, we were told there was a known issue with dropped pings. I have not received any additional information but will pass it along once I get it.

We currently have a case open with them on it.
 
continued...

Little more info...

If we placed a BPS in the stack, making it hybrid, the dropped ping issue went away. This only appears in a pure 450 stack in our testing.
 