Tek-Tips is the largest IT community on the Internet today!

Entire Network Down - 100% Network Utilization - Please Help!

Status
Not open for further replies.

link470

Technical User
Sep 1, 2005
19
CA
Alright guys. So as you can imagine, network goes down, I'm hoping to get this resolved soon. Here's what happened.

It's 12:30 P.M. [Do YOU know what your network's doing?]. Everything is running great. 3 main switches in the server room, 48 port managed switches [Dell PowerConnect 3348's] and all is well. Everything functions normally, all servers are online, all desktops are happily happenin'. I go out to get a shipment of 50 new machines that arrived and start piling them outside my office. Next thing I know, I have lots of requests saying the entire network is down and nobody can access anything. I quickly head back to the server room wondering if a UPS went down, if a server restarted, if a switch turned off, anything. But what do I see? Absolutely nothing out of the ordinary. Everything is functioning great.

But wait...no it's not. I can't get on the internet. I can't ping ANY computers, I can't remote desktop into the servers, the RDC sessions I DO have open to servers all fail, and everything is extremely slow. So I call the school board office. They head over with their handy $18,000 Fluke meter. They plug it into one of our switches, and it measures our network and quickly throws back at us a 100% Network Utilization. The guy from the board office goes WHOA!!! I've never EVER seen it that high before.

So we try swapping the first switch in the stack under suspicion it may be bad. We put in a 3448, Dell's next model of the 48 port 10/100 PowerConnect switch and take out the 3348. We use patch cables to link them together in a chain setup and see if that works.

In the end, the switch switch [lol] did nothing. I still can't ping any machine in the school or get out of the network. I checked our main router, and it's functioning normally. I restarted the servers and they all appear to be functioning normally. So I think to myself, what would cause 100% network utilization? I noticed that ONE ping got through, but only 1 out of the 4 pings got a reply. So I knew the infrastructure itself was probably ok, but I had the assumption that something was looping back.

So off I go around the network. I documented every single wall jack and port in the school [took 7 hours] and checked EVERY switch we have for any type of loopback that might be possible, like a jack plugged into a switch, and another port on the switch plugged into another jack. I also shut down every machine and every network printer I came across so the network would essentially have nothing to broadcast [except for powered NICs, but since the machine isn't on there's less chance of anything malicious running on the machine itself]. Nothing from what I could see in any of the labs was like that. All the jacks had either a direct connection to a PC, or a connection to a switch that contained only other connections to PCs, not back to a wall jack.

So here I am guys. It's the weekend, Friday night, school's not in for the weekend so I've got 2 days to work and go in for some overtime. Anyone have any suggestions of what to try next? Thanks a ton you guys. I really appreciate this community and hope we can get this resolved!

Network Notes:
-Windows NT Based Network running Windows Server 2003 servers and 250 XP Professional client stations.
-6 Windows Server 2003 servers total
-3 main switches in server room, 7 others around school. All checked thoroughly and restarted.

Take care and have a good weekend. We have multiple IP's to the school so since I can't access the internet inside the school because of extreme lag, I'll plug in to the main external switch plugged into the modem and grab an IP for my laptop if I go in so I can check in on answers.

Thanks guys!
 
You have a flat network and an undetected loop occurred - probably a broadcast storm that just kept replicating. Do you know the physical & logical topology of the network? Based on what you have written it looks like you don't; i.e. your first troubleshooting action was to replace a switch.
I suggest you get the network topology accurately documented, and then get someone with the appropriate knowledge to look at it and recommend some changes, as it obviously needs some.

Andy
 
Thanks for your reply :)

I do know the layout and I think I'm the only one in the district who actually documents lol. It was the guys at the main board office who came and replaced the switch thinking that was the problem. I'm looking now into broadcast storms and shutting down each machine and disconnecting the switch uplinks so I can isolate which switch holds the problem NICs.
 
Well I think you are on the right path. Sounds like a broadcast storm, maybe an ARP fight, or it could be a rogue device hidden on the network by a naughty student. The correct course of action would be to shut everything down and get a baseline going to see if you can get a stable network, then start bringing devices online until it breaks again. You could also bring up something like Ethereal (Wireshark) and sniff packets to see if you can isolate the problem device. If you can get into your managed switches you should be able to see port counters of some kind and probably ID the problem device. You would be looking for a port with a very high number of broadcast packets. Anyway I hope you figure it out, sucks working all weekend on that stuff.
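To make the port counter idea concrete, here is a rough Python sketch, assuming you have read the cumulative broadcast counter for each port twice, a known number of seconds apart. The port names and numbers are made up, and `PortSample` and `broadcast_rates` are hypothetical helpers, not anything that ships with the switch.

```python
from dataclasses import dataclass


@dataclass
class PortSample:
    port: str
    rx_broadcast: int  # cumulative broadcast frames received on the port


def broadcast_rates(before, after, interval_s):
    """Return (port, frames_per_second) pairs, highest rate first."""
    earlier = {s.port: s.rx_broadcast for s in before}
    rates = []
    for s in after:
        if s.port not in earlier:
            continue  # port appeared between samples; no baseline to compare
        delta = s.rx_broadcast - earlier[s.port]
        if delta < 0:
            continue  # counter reset or wrapped; skip rather than guess
        rates.append((s.port, delta / interval_s))
    return sorted(rates, key=lambda r: r[1], reverse=True)


# Hypothetical readings taken 10 seconds apart (made-up port names and values).
before = [PortSample("1/e1", 1_200), PortSample("1/e2", 900), PortSample("1/g1", 50_000)]
after = [PortSample("1/e1", 1_260), PortSample("1/e2", 915), PortSample("1/g1", 1_050_000)]

for port, rate in broadcast_rates(before, after, interval_s=10):
    print(f"{port}: {rate:.0f} broadcast frames/s")
```

The port at the top of the list is the one to unplug or trace first. Comparing two readings matters because a single cumulative counter can be large simply because the switch has been up for months.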



RoadKi11

"This apparent fear reaction is typical, rather than try to solve technical problems technically, policy solutions are often chosen." - Fred Cohen
 
I would start at the top and work your way down... what I mean is, I would start with your main switch, make sure no other switches are plugged in, and test your connectivity. Then I would plug one switch in at a time till your utilization goes back up. When it does, unplug that last switch you brought online and make sure that utilization is back to normal. If you have switches cascading off of switches, you can go to that suspect switch and do the same thing, making sure other switches are disconnected and plugging in the suspect switch. It might be that switch, or it might be one of the switches down the line. One thing to also keep in mind: it could be a jabbering NIC in one of your PCs, it could very well be a loop someone introduced, and, as has happened to me before, it could be a virus that is saturating the network with traffic.

I have never used Dell switches, but I know with my HP ProCurve switches, I'm able to use the management software that came with them to see my top talkers and also get the IP address of the PC that's plugged into that port. I can then disable on a port by port basis. Maybe the Dells have something similar.
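The one-switch-at-a-time procedure above can be written down as a small Python sketch. Everything here is hypothetical: `connect`, `disconnect` and `utilization` stand in for you physically plugging in an uplink and reading the meter, `FakeNetwork` is just a stand-in so the example runs, and the 0.8 threshold is an arbitrary assumption.

```python
def find_storm_sources(switches, connect, disconnect, utilization, threshold=0.8):
    """Reconnect switches one at a time; return those that push utilization over threshold."""
    for sw in switches:
        disconnect(sw)  # start from the main switch alone
    suspects = []
    for sw in switches:
        connect(sw)
        if utilization() >= threshold:
            suspects.append(sw)
            disconnect(sw)  # pull it again so the next test starts clean
    return suspects


class FakeNetwork:
    """Stand-in for the real network: utilization spikes while a 'bad' switch is connected."""

    def __init__(self, bad):
        self.bad = set(bad)
        self.connected = set()

    def connect(self, sw):
        self.connected.add(sw)

    def disconnect(self, sw):
        self.connected.discard(sw)

    def utilization(self):
        return 1.0 if self.connected & self.bad else 0.05


# Made-up switch names; "spare-switch-4" plays the part of the looped switch.
net = FakeNetwork(bad={"spare-switch-4"})
switches = ["lab-a", "lab-b", "spare-switch-4", "library"]
print(find_storm_sources(switches, net.connect, net.disconnect, net.utilization))
# prints ['spare-switch-4']
```

For cascaded switches, the same routine can be repeated one level down with the suspect switch playing the role of the main switch.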
 
Thank you all very much for your reply.

Thank you all for your replies. Much appreciated! I ended up taking a laptop into work, ran Wireshark, and found a TON of packets, like, in the 100,000 range almost instantly. I ended up separating our switch stacks, isolated it to one switch, and that switch was looped into another switch...twice. Everything is back up and running after disconnecting just one of those cables.

What's strange is I think it's been like that for quite a while and nothing ever happened before. I may be wrong, but does this sound possible? As of now, the entire network is up and running again, and I thank you all so much for your support and quick suggestions and replies. I'm just chillin' at home now very happy but still wondering if it's possible that there could have been a delay and that the storm of broadcasting didn't catch on till later? The setup was that the main switch [switch 1 out of 3 switches connected together via gigabit uplinks] was plugged into a spare 4th switch down below that the previous tech had added because there weren't enough places to plug things in [the patch panel had more ports than the switches could support in that room]. Only 4 things were plugged in. 2 of them were from patch panel locations to connect wall jacks around the school, and 2 were the redundant connections plugged into switch 1. After removing 1 of those, everything worked again.

Any ideas if it's possible for a delay to happen and it not really get to the point of this until now? Any ideas of what triggered it so suddenly to be problematic?

Either way, it's up and running. Thanks a ton!
 
Could anything code-wise or configuration-wise have changed? Again, not familiar with Dell switches, but spanning tree could have been enabled before, causing the looped ports to go into a blocked state and everything work ok. Some config change or code change could have disabled spanning tree so that ports were no longer being blocked and BAM!, broadcast storm.

Just a guess... glad you're back up.
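To picture what "blocked" means here, below is a heavily simplified Python sketch. It is not the real spanning tree protocol (no bridge IDs, path costs or BPDUs); it just builds a breadth-first tree from an assumed root and reports which cables would be left out of it. The switch names follow the layout described earlier in the thread; `blocked_links` is a hypothetical helper.

```python
from collections import deque


def blocked_links(links, root):
    """Return indices of links left out of a breadth-first spanning tree rooted at `root`."""
    neighbours = {}
    for idx, (a, b) in enumerate(links):
        neighbours.setdefault(a, []).append((b, idx))
        neighbours.setdefault(b, []).append((a, idx))

    visited = {root}
    tree = set()
    queue = deque([root])
    while queue:
        node = queue.popleft()
        for other, idx in neighbours.get(node, []):
            if other not in visited:
                visited.add(other)
                tree.add(idx)  # this cable becomes part of the loop-free tree
                queue.append(other)
    return [idx for idx in range(len(links)) if idx not in tree]


links = [
    ("switch1", "switch2"),
    ("switch2", "switch3"),
    ("switch1", "spare4"),  # first cable to the spare switch
    ("switch1", "spare4"),  # second, redundant cable: this forms the loop
]
print(blocked_links(links, root="switch1"))
# prints [3]
```

With spanning tree running, one of the two cables to the spare switch sits idle and the loop is harmless. Turn spanning tree off and both cables forward, which is the situation described above.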
 
As I said in my first post - you had an undetected loop in the network, and switches being switches, they simply replicate any broadcast they receive out of every port except the one it was received on. If a loop occurs - i.e. the replicated traffic is received by the same switch that replicated it - it just goes on and self-propels, leading to a 'meltdown'.
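Here is a toy Python model of that replication rule, assuming no spanning tree and ignoring host ports and link speed. It follows a single broadcast through the switches and counts how many copies are still being forwarded at each step; `flood` and the topology lists are illustrative only.

```python
def flood(links, origin, steps):
    """Count frames forwarded at each step after one broadcast enters at `origin`.

    A switch repeats a broadcast out of every link except the one it arrived on.
    """
    ports = {}
    for idx, (a, b) in enumerate(links):
        ports.setdefault(a, []).append((idx, b))
        ports.setdefault(b, []).append((idx, a))

    # Each in-flight frame is (link it travels on, switch it is heading to).
    in_flight = [(idx, peer) for idx, peer in ports.get(origin, [])]
    history = []
    for _ in range(steps):
        history.append(len(in_flight))
        nxt = []
        for arrived_on, switch in in_flight:
            for idx, peer in ports[switch]:
                if idx != arrived_on:
                    nxt.append((idx, peer))
        in_flight = nxt
    return history


loop_free = [("switch1", "switch2"), ("switch2", "switch3"), ("switch1", "spare4")]
looped = loop_free + [("switch1", "spare4")]  # the second cable to the spare switch

print("loop-free:", flood(loop_free, "switch1", 6))
print("looped:   ", flood(looped, "switch1", 6))
```

In the loop-free case the count drops to zero after a couple of hops. In the looped case it never does: the copies keep circling between the two switches and keep spraying duplicates at everything else, and every new broadcast adds more circulating copies on top.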

If as I suspect this is a 'flat' network (i.e. no VLAN segmentation) then make sure STP is enabled and you know the topology - i.e. which ports are forwarding and which are blocking. You should also implement some protection mechanisms and management, however I don't know much about Dell switches so I can't offer any configuration advice.

Andy
 
Maybe a power outage (system-wide or someone tripped a breaker?) reset the "guilty" switch and when it came back up the spanning-tree was disabled. And as a result so was your network.

Just a guess.

Joe B
 
I'm not a switching expert (and am fighting my own STP design issues) but the 48 port Dell switches - or at least mine - are not like the smaller managed switches. Config changes are not immediately saved to flash, so if STP was enabled and things were fine but the change was not saved and the power cycled, it would not have preserved the STP settings. This is my experience with the PowerConnect 5448 (3 of them). The smaller ones I have, 8 port and 24 port, don't seem to have as many or as complex config options and seem to save the changes immediately on making them through the web browser (whereas on the 5448s I do the configs through Telnet since it has the capability and I'm a CLI junkie).

Mark / TNG
 