Network slowness

Status
Not open for further replies.

itstaff2011 (IS-IT--Management)
Oct 11, 2011
Description of the issue:

My company is experiencing network slowness (3000ms+ ping times). It typically happens only on Monday mornings between roughly 8am and 11am, but in the last two weeks it seems to happen on other days as well; even as I type this, we are seeing 1000ms+ ping times. We have two bonded T1s coming into the building, which connect to a Cisco 2800 router, which in turn connects to a SonicWall firewall. I have had SonicWall look at the FW config and logs multiple times, but nothing looks "odd". We have had our ISP remote into the 2800 router during the slowness and ping out with normal responses (~30ms), and it showed we are only using about 25% of our bandwidth. If I connect directly to the LAN port on the FW I get normal responses, but within 3-5 minutes of plugging the core switch back in, the slowness reappears. The issue first appeared around the time we replaced all of our switches, about three months ago.

What we are looking for:

Basically, we are looking for a way to monitor our network traffic during these periods of slowness so we can determine what is causing the issue.

If you need more detail to be able to help, let us know and we will get it!

Any ideas or thoughts will be appreciated.

Thank you in advance for your help!
 
Try running Wireshark during the slowdown times. A misconfiguration on your switches could be causing a storm when everyone turns on their PCs, or something similar to that.


Run a trace with this program and see if anything weird is happening traffic-wise.
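If it helps, here are a few standard Wireshark display filters for spotting storm-type traffic in a trace (just filter syntax, nothing specific to your network):

```
# broadcast frames only; a storm shows up as a flood of these
eth.dst == ff:ff:ff:ff:ff:ff

# ARP, which usually makes up the bulk of a broadcast storm
arp

# spanning-tree BPDUs, to see which devices are sending them
stp
```

Statistics > IO Graph with one of these filters applied will show whether that traffic spikes during a slowdown.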
 
Before you start with packet captures, it might be quicker to simply check all your switchport utilisation during the time of the slowdown.

Presumably you will see most/all of your switchports with high Tx utilisation. Try to find the switchport originating this traffic by checking which ports show the same high utilisation on their Rx side; you can very quickly follow the broadcast back to its source this way and find, perhaps, a server with two active interfaces but a misconfigured IP and routes on it.
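On Catalyst switches like the ones in this thread, a quick way to do that check from the CLI (standard IOS show commands; exact spelling and output vary a bit by version, and the port in the last command is just an example):

```
! input/output rate per port, averaged over the load-interval
show interfaces | include is up|rate

! per-port packet counters in one table; run it twice and watch which Rx counters climb
show interfaces counters

! once you find the hot port, see what MAC addresses are behind it
show mac address-table interface GigabitEthernet1/0/24
```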
 
On baddos's point, I'd try digging through your STP configuration to make sure there's no potential for layer 2 loops (e.g. a single server bridging traffic between two NICs towards two switches without spanning tree could do it).

Since it's intermittent, and the WAN has been ruled out, I'd say either the LAN is getting saturated by an application that is running at certain times of the day, or some action is triggering a layer 2 loop and terminating it.

Look at the transmit/receive usage stats on the switch trunks during the slowdown, setting the load-interval option on the interfaces down to 30 seconds (the IOS minimum) so the rate counters react faster. If usage is exploding, do a 'show spanning-tree' on all your non-root switches and make sure it is still a loop-free topology. Look at the syslogs for MAC-flap errors, or (if supported) configure storm control on the switch ports to see if anything triggers it. If the topology is loop-free and traffic is still exploding, I would suspect an end device as the culprit and try to follow the high-usage flow to the correct switch cabinet, and eventually to the correct port(s).
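A rough IOS sketch of those steps (port ranges and the 20% storm-control threshold are examples, not recommendations for your environment):

```
! shorten the rate-averaging window to the IOS minimum
interface range GigabitEthernet1/0/1 - 48
 load-interval 30

! flag broadcast floods above 20% of line rate without shutting the port
interface GigabitEthernet1/0/5
 storm-control broadcast level 20.00
 storm-control action trap

! during the slowdown, confirm the topology is still loop-free
show spanning-tree summary
show spanning-tree vlan 5
```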

CCNP, CCDP, CCIP
Core Network Planner, ISP
 
Thanks for the quick responses. We are actually in the process of trying Wireshark; we had someone look at it and found out we needed to mirror some ports on the switch so it could see all the traffic. What do you mean by "run a trace with this program"? I don't know much about Wireshark.

I will definitely look at the Tx and Rx utilization on the core switch (gateway for all PC traffic).

I am not sure how to dig into the STP configuration. How would I do that?

A little more detail: our servers have 2-4 NICs which are teamed using LACP, and the core switches (2 stacked 3750s) are EtherChanneled using LACP on the ports with a server or switch plugged into them. I'm sorry if this last little bit doesn't make sense; I'm not the best when it comes to the Ciscos.
 
So the 3750s are *stacked*, and are acting as the L3 gateway for the servers... Logically they're one switch from an STP standpoint, so less loop potential there.

Are you seeing any "mac flapping" or duplicate mac address errors in the logs? Are you seeing anything of note with a "show log", assuming the log buffer is enabled?

For port mirroring with Wireshark, yes, usually you would configure SPAN on the switch to mirror traffic from one or more source ports onto a destination port, and connect the Wireshark machine to that port. You'd use the "monitor session" source and destination commands in global config mode to set that up.
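For reference, a minimal SPAN setup on IOS looks something like this (the session number and port numbers are examples; check your platform's configuration guide for source/destination limits):

```
! mirror the firewall-facing port, both directions, to the port
! where the Wireshark machine is plugged in
monitor session 1 source interface GigabitEthernet1/0/1 both
monitor session 1 destination interface GigabitEthernet1/0/48

! verify the session
show monitor session 1
```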

CCNP, CCDP, CCIP
Core Network Planner, ISP
 
Or you could just insert a hub in the segment in question, since a hub repeats traffic out all ports. Then plug in a laptop with Wireshark on it to capture all the packets.
 
I did a 'show log' on the stacked switches and have attached what it displayed. I didn't see anything that looked like duplicate MAC addresses, but it also didn't look like it was showing much.

I have heard about the hub solution, but we don't have one, so we wound up using the 'monitor session' command instead. The people we had check over the Wireshark log didn't seem to find anything really unusual...
 
 http://dl.dropbox.com/u/19481227/192-168-99-2%28logging%29.log
Nothing useful from Wireshark? Are they experienced in using it? What segment did they plug into? Because the issue is obviously between the FW and the core switch, or the switch itself.

is the connection you're pinging across part of the etherchannel? If so you should inspect those ports.

Also, I'm seeing a lot of up/downs in the logs. Are you causing those?
 
From what I know they are experienced with it, but I'm taking their word that they know what they are doing. I believe we still have one of the logs that I could post (I'll need to check). I am about to ask some dumb questions, so bear with me...

What do you mean by "What segment did they plug into"?

What do you mean by "is the connection you're pinging across part of the etherchannel"?

All the up/downs look to be from the 2 servers that we rebooted (one yesterday and one this morning).

One thing is that I don't really know whether the log is even set up to record anything that would be helpful in a situation like this.

 
Was this wireshark trace done DURING a network slowdown?

Also what is it that is logging into the switch via HTTPS several times a second?

Also, while probably not related to the slowdown I can see from the logs that you've got a duplicate IP detected (192.168.0.2) on vlan 5.

Could you do a "show spanning vlan 5"? If that switch is a stack, and L3 gateway for the servers, it should be the only switch on the block, so to speak.

CCNP, CCDP, CCIP
Core Network Planner, ISP
 
Yes, they were on looking at it while it was slow, and they said the traffic looked normal (close to when we weren't slow). There was some mention that there could be a storm, but I'm not sure what he meant by that (I didn't talk with him, my coworker did). The IP that was showing up a lot was my PC's IP. When I saw that, I unplugged my PC and we were still slow. I think I may have been running a constant ping, so it could be from that?

Show spanning vlan 5 results:

lvl_burl_core#show spanning vlan 5

VLAN0005
Spanning tree enabled protocol rstp
Root ID Priority 8197
Address 30e4.db05.dc80
This bridge is the root
Hello Time 2 sec Max Age 20 sec Forward Delay 15 sec

Bridge ID Priority 8197 (priority 8192 sys-id-ext 5)
Address 30e4.db05.dc80
Hello Time 2 sec Max Age 20 sec Forward Delay 15 sec
Aging Time 300 sec

Interface Role Sts Cost Prio.Nbr Type
------------------- ---- --- --------- -------- --------------------------------
Gi1/0/1 Desg FWD 19 128.1 P2p
Gi1/0/15 Desg FWD 4 128.15 P2p Edge
Po1 Desg FWD 3 128.512 P2p
Po2 Desg FWD 3 128.520 P2p
Po3 Desg FWD 3 128.528 P2p
Po4 Desg FWD 4 128.536 P2p
Po6 Desg FWD 3 128.552 P2p
Po7 Desg FWD 3 128.560 P2p Edge
Po8 Desg FWD 4 128.568 P2p
Po9 Desg FWD 3 128.576 P2p
Po10 Desg FWD 3 128.584 P2p
Po11 Desg FWD 3 128.592 P2p
Po12 Desg FWD 3 128.600 P2p
Po14 Desg FWD 3 128.616 P2p
Po16 Desg FWD 3 128.632 P2p
Gi2/0/15 Desg FWD 4 128.71 P2p Edge
 
I assume all those port-channel interfaces are facing servers, from your descriptions? Are there any cases where multiple port channels exist between the same server and the switch?

By "storm", they're probably referring to a broadcast storm, which is what I'd also suspect. Are any of your servers intended to run Spanning Tree?

CCNP, CCDP, CCIP
Core Network Planner, ISP
 
The port-channels each point to either a server or a switch. There is never more than one port-channel to a single device, and no device sits on multiple port-channels. I have attached the log, and it looks like all the ports are set for spanning tree.
 
 http://dl.dropbox.com/u/19481227/192-168-99-2%2820111015%29.log
Wait, how many switches do you have in this topology? What version(s) of spanning tree are they all running? Are they ALL running RPVST+ as that one is, or are there MST/VST instances out there? Is this a multi-vendor environment? Do you have a general diagram of the switches in this network? Have you taken steps to hard-code bridge priorities? I notice at least for that switch that STP instances for vlans 1, 5, 10 and 15 have hard-coded priorities lower than the default, but are those all the vlans?

During the network slowdown, get the "show spanning" output on the other switches. Make certain that all switches agree on which switch should be root bridge for each STP instance, and make sure that ports are in the blocking state where appropriate to prevent a loop.
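Concretely, on each switch during the slowdown, something like this (standard IOS commands, using vlan 5 as the example instance):

```
show spanning-tree vlan 5 root        ! every switch should report the same root bridge
show spanning-tree blockedports       ! redundant links should appear here as blocked
show spanning-tree inconsistentports  ! loop guard / root guard violations
show logging | include SPANTREE       ! topology-change and flap messages
```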

CCNP, CCDP, CCIP
Core Network Planner, ISP
 
I don't think I've seen any info about his switchport utilisation yet.

I *really* think the first step is to follow the trail of high-utilisation (Rx) back to the offending host, if that's what's happening.
It's a quick and easy bit of troubleshooting that could answer the question.
 
I have been engaged to investigate several 'network meltdowns', and the one big spanner in all of them is the customer wanting the network 'back'. This hinders troubleshooting and methodical fault-finding: when you inevitably find the offending device/cable/application, remove it, and restore 'normal' operations, there is usually no way the customer will let you reconnect, switch on, or start the offender again.

The issue here definitely sounds like it's layer 2: a broadcast storm, an undetectable loop, etc. I'd happily come and have a look, fix it and make some recommendations for you if you can stomach my daily rate...

Andy
 
@ Quadratic
We have 10 Cisco switches in our network (2 of which are at a different location). I was unaware there were different versions of STP, so I'm not sure on that or where to look.

One shows this (3560 at location 2, not main location):
spanning-tree mode pvst
spanning-tree loopguard default
spanning-tree extend system-id

One shows this (2950 at location 2, not main location, just recently added):
spanning-tree mode rapid-pvst
spanning-tree loopguard default
no spanning-tree optimize bpdu transmission
spanning-tree extend system-id
spanning-tree vlan 1,5,10 priority 8192

The remaining switches show this:
spanning-tree mode rapid-pvst
spanning-tree loopguard default
spanning-tree extend system-id

I didn't set up any of the switches (an outside company did it), but the only place I saw priorities was on the 3750s and the 2950 (I just copied from an old config). Just in case the thought crosses your mind, we had the slowness issues before that switch was in place.

I didn't do a 'show spanning' while we were slow today because I hadn't seen this reply yet.

@ Vincewhirlwind
I did look at the port utilization during one of the slow times, and the only ports with "high" utilization were servers, uplinks/downlinks, the port connected to the firewall, and the port connecting to the router handling phone calls.
 
 http://www.mediafire.com/?e1saej1ulkm7coj
So, from the sound of things, it isn't a broadcast storm causing this. That explains why your Wireshark capture didn't help.

You will have to narrow down the problem: which ping times are showing "slowness"? That is, which paths are affected? Which hosts are slow to which servers?
Start from the beginning: set up a continuous ping to your default GW. Ping another PC on a different VLAN. Ping your email server, proxy server, file server, AD server, and any others you can think of.
Is there slowness everywhere?
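For example, from a Windows PC (the addresses and hostnames here are placeholders for your own):

```
:: continuous ping to the default gateway; leave it running through a slowdown
ping -t 192.168.0.1

:: 20-packet samples to a PC on another VLAN and to key internal servers
ping -n 20 192.168.5.50
ping -n 20 fileserver01
ping -n 20 mailserver01

:: and one outside address for comparison
ping -n 20 8.8.8.8
```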

Check CPU utilisation on each switch and router.

Give us your core switch config and an edge switch config so we know what kind of network you have.

 
Sorry about the delay; I thought I had replied to this. The pings that are slow are pings to sites on the other side of the firewall. The ISP checked, and during the slow times we are only using about a quarter of our bandwidth. All pings to internal servers are normal. We put a switch between the firewall and the router, and ping times were normal; we moved it back to the inside of the FW and it slowed back down. It seems like a FW issue, but the utilization is low and we aren't exceeding the connection capacity. We have also had SonicWall remote in and look over the FW, and they haven't found anything. It seems that something on our network is hogging bandwidth, but we just can't find it. We have had 3 companies look at our switch configs, and they all said they look normal. The utilization on the switches is low (usually around 20%). There are two routers inside the network that control our voice traffic and the traffic from this location to the other end of our private T1. The first link is the core switch config and the other is the edge (about the same on all).


 
