07-16-2015 05:56 AM - edited 03-08-2019 12:59 AM
Hi everyone,
I've been troubled for a while by a storm problem that brings down parts of my network. The storm is triggered randomly, both during and outside business hours. It is not a storm of one specific protocol (like ARP); rather, it looks like everything is looped until the CPU utilization starts screaming. The strange thing is that we have no physical loop in the topology. After we blocked some protocols (like NBNS and LLMNR) just to limit the scale of the problem, other protocols kicked in. We have also implemented storm control on the user ports and the uplinks. No user ports get blocked, but when the problem takes place the uplinks are the ones that go errdisabled. Wireshark has not pointed us in a specific direction, since we never see the beginning of the issue; we only start recording in the middle of the storm, which doesn't help much. Other times we see specific storms from specific PCs, but each time it is a different PC. I would also like to add that we have a number of developers with VMs, and I am hunting the ones in bridged mode to switch them to shared (Parallels on Mac) or NAT (Windows) etc. Still, we have never seen such a PC's link go errdisabled from storm control (edit: only the uplinks).
Our equipment is 3750-X, and there were no design or other changes in the network when the problem started; it has been happening almost since the network was first built. Any ideas how I can track down the root cause?
07-16-2015 06:42 AM
Hi,
From the information you provide, it is difficult to see what the problem is.
But if you have storm-control enabled and you still see the issues, I don't think it is a broadcast problem but much more likely a loop issue.
What is the cause for the errdisable on the uplinks?
Could you maybe provide a diagram of the situation and indicate which links get blocked etc.?
Regards,
Markus
07-16-2015 07:43 AM
Thanks for the reply Markus,
Exactly, there is a non-physical loop that multiplies broadcasts, especially query messages from various protocols. The cause of the errdisable on the uplinks is the storm control I have configured everywhere, including the user ports. But the user ports do not shut down, apart from a couple of times when a Mac mini did go errdisabled. That was 2 out of 30 storms.
The uplinks that usually have the problem are those of A and B in the diagram. If the storms are not stopped in time by errdisable, they can spread to the rest of the company.
Now the strange part. Those two switches, A and B, serve many developers who do have VMs, and I am already hunting some of them that are in bridged mode. However, a user port serving bridged VMs has never gone errdisabled. Nevertheless, I have already changed many of them to either shared or NAT mode, and will continue until I find them all.
Edit: Also, everything in the diagram is Layer 2 up to the core, so all the switches are in the same domain, with a wired network of /23 and wireless of /22. The left part of the network (3 switches) is one building, and the right part is another building. C connects to the core by CWDM fiber. Line-of-sight distance is around 150 meters; the real distance is much longer.
07-16-2015 10:02 AM
What does spanning tree say?
Is it stable?
sh spanning-tree detail | i from|exec|topo --> check topo changes?
Also try enabling MAC move notification to see any offending MACs:
conf t
mac address-table notification mac-move
The above may not be exactly correct, but you can figure it out I hope... else let me know.
Thanks,
Madhu
07-16-2015 12:17 PM
Spanning tree looks OK. The last change was during a storm, because the switch was cut off. And we are almost sure it is a loop, not a broadcast storm. I have already enabled MAC move notification, so next time we will be chasing some cables.
I will also look much more closely at which ports' broadcast counters increase the most during a loop.
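One way to see which ports' broadcast counters grow the most is to save `show interfaces counters` twice, a few minutes apart, and diff the InBcastPkts column. The following is only a rough Python sketch: it assumes the usual Catalyst layout with a header row containing `InBcastPkts` followed by five-column rows, the function names are made up for illustration, and the sample numbers are invented, not taken from this network. It does not handle counters that were cleared or wrapped between the two snapshots.

```python
def parse_in_broadcasts(text):
    """Map port name -> InBcastPkts from saved 'show interfaces counters' text."""
    counters = {}
    in_table = False
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("Port"):
            # Only the input table has an InBcastPkts column; skip the output table.
            in_table = "InBcastPkts" in line
            continue
        if not in_table:
            continue
        fields = line.split()
        if len(fields) == 5 and all(f.isdigit() for f in fields[1:]):
            counters[fields[0]] = int(fields[4])
    return counters


def rank_broadcast_growth(before_text, after_text):
    """Return (port, delta) pairs sorted by largest broadcast increase first."""
    before = parse_in_broadcasts(before_text)
    after = parse_in_broadcasts(after_text)
    deltas = {port: after[port] - before[port] for port in after if port in before}
    return sorted(deltas.items(), key=lambda item: item[1], reverse=True)


# Invented sample snapshots, for illustration only.
BEFORE = """\
Port            InOctets    InUcastPkts    InMcastPkts    InBcastPkts
Gi1/0/1             1000             10              5            100
Gi1/0/2             2000             20              5            300

Port           OutOctets   OutUcastPkts   OutMcastPkts   OutBcastPkts
Gi1/0/1             9999             99             99           9999
"""

AFTER = """\
Port            InOctets    InUcastPkts    InMcastPkts    InBcastPkts
Gi1/0/1             9000             15              6           5100
Gi1/0/2             2500             25              6            350

Port           OutOctets   OutUcastPkts   OutMcastPkts   OutBcastPkts
Gi1/0/1            19999            199            199          19999
"""

if __name__ == "__main__":
    for port, delta in rank_broadcast_growth(BEFORE, AFTER):
        print(port, delta)
```

Ports at the top of the list are the ones receiving the most new broadcasts in that interval, which narrows down where to start tracing cables or hosts.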
07-17-2015 03:23 AM
I see something weird in sh spanning-tree detail. The last change was during the time of the loop, which I think is normal because that switch lost its connection to the root during the issue. However, I don't understand this:
Number of topology changes 5 last change occurred 22:54:25 ago
from StackPort2
In detail:
VLAN0212 is executing the rstp compatible Spanning Tree protocol
Bridge Identifier has priority 32768, sysid 212, address 1cde.a7b7.e600
Configured hello time 2, max age 20, forward delay 15, transmit hold-count 6
Current root has priority 24788, address bc16.f550.6580 < -- this mac is correct. it is the core
Root port is 1 (GigabitEthernet1/0/1), cost of root path is 4
Topology change flag not set, detected flag not set
Number of topology changes 5 last change occurred 23:04:39 ago
from StackPort2
Times: hold 1, topology change 35, notification 2
hello 2, max age 20, forward delay 15
Timers: hello 0, topology change 0, notification 0, aging 300
This is the same for every stack and for every VLAN. Every stack has the second switch as master, so I guess StackPort2 means stack port 2 of the master switch. But on every other site I have checked with the same design, I see topology changes from normal ports and EtherChannel uplinks, not stack ports.
My StackWise rings are closed normally at the full 32G and they are connected like this:
StackWise port 2 to port 1 of the next switch, and port 2 of the last switch to port 1 of switch 1.
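For comparing stacks and sites, the topology-change lines in `show spanning-tree detail` can be pulled out with a short script. This is only a minimal Python sketch, assuming the command output has been saved as text; the function name `summarize_topology_changes` is made up for illustration, and the patterns are based solely on the lines quoted in this post, so other IOS versions may need adjustments.

```python
import re


def summarize_topology_changes(text):
    """Return one dict per VLAN with change count, age of last change, and source port."""
    summaries = []
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        vlan = re.match(r"(VLAN\d+) is executing", line)
        if vlan:
            current = {"vlan": vlan.group(1), "changes": None,
                       "last_change": None, "from": None}
            summaries.append(current)
            continue
        if current is None:
            continue
        change = re.match(
            r"Number of topology changes (\d+) last change occurred (\S+) ago", line)
        if change:
            current["changes"] = int(change.group(1))
            current["last_change"] = change.group(2)
            continue
        source = re.match(r"from (\S+)", line)
        # Only take the 'from' line that follows the topology-change line.
        if source and current["changes"] is not None and current["from"] is None:
            current["from"] = source.group(1)
    return summaries


# Excerpt of the output quoted in this post.
SAMPLE = """\
VLAN0212 is executing the rstp compatible Spanning Tree protocol
Bridge Identifier has priority 32768, sysid 212, address 1cde.a7b7.e600
Topology change flag not set, detected flag not set
Number of topology changes 5 last change occurred 23:04:39 ago
from StackPort2
Times: hold 1, topology change 35, notification 2
"""

if __name__ == "__main__":
    for entry in summarize_topology_changes(SAMPLE):
        print(entry)
```

Running it against output from several stacks makes it easy to spot a VLAN whose change count keeps climbing or whose changes come from an unexpected port.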
07-20-2015 12:53 AM
Update:
- No MAC flapping, no host move during the loop
- The "from StackPort" STP update is due to the fact that my master is not the switch with the uplink.
- Only 2 VLANs are stormed, data and voice. No wireless, no printers, nothing else, and we have many. In detail, the spamming users of these VLANs are spamming any broadcast that is allowed. Usually it is ARP to find the default gateway, or it can even be for an old decommissioned printer.
When a user plugs his laptop into the phone, I can see the phone's MAC twice, once for each VLAN (data and voice), plus the user's NIC MAC on the user VLAN. Like this:
sh mac address-table | i 2/0/9
10 28d2.4467.8fb7 DYNAMIC Gi2/0/9 <-user mac on user vlan
10 580a.20fb.b357 DYNAMIC Gi2/0/9 <-- phone mac on user vlan
200 580a.20fb.b357 DYNAMIC Gi2/0/9 <--- phone mac on voice vlan
Is this normal? Or should there be only one entry in VLAN 10 and one in VLAN 200, for the user and the phone respectively?
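To check how widespread this pattern is across a whole switch, the MAC table can be scanned for addresses that appear in more than one VLAN on the same port. The following is a minimal Python sketch, assuming the `show mac address-table` output has been saved as text in the four-column form shown above (VLAN, MAC, type, port); the function name `macs_in_multiple_vlans` is made up for illustration.

```python
import re
from collections import defaultdict

# VLAN, dotted MAC, entry type, port; anything after the port is ignored.
ENTRY = re.compile(
    r"^\s*(\d+)\s+([0-9a-fA-F]{4}\.[0-9a-fA-F]{4}\.[0-9a-fA-F]{4})\s+(\S+)\s+(\S+)")


def macs_in_multiple_vlans(text):
    """Return {(port, mac): [vlans]} for MACs learned in more than one VLAN on a port."""
    seen = defaultdict(set)
    for line in text.splitlines():
        match = ENTRY.match(line)
        if match:
            vlan, mac, _entry_type, port = match.groups()
            seen[(port, mac.lower())].add(int(vlan))
    return {key: sorted(vlans) for key, vlans in seen.items() if len(vlans) > 1}


# The three entries quoted in this post.
SAMPLE = """\
10 28d2.4467.8fb7 DYNAMIC Gi2/0/9
10 580a.20fb.b357 DYNAMIC Gi2/0/9
200 580a.20fb.b357 DYNAMIC Gi2/0/9
"""

if __name__ == "__main__":
    for (port, mac), vlans in macs_in_multiple_vlans(SAMPLE).items():
        print(port, mac, vlans)
```

With the entries above, only the phone MAC is reported, in VLANs 10 and 200; a laptop or VM MAC showing up in the voice VLAN would stand out immediately.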
07-20-2015 01:32 AM
Hi,
No, you should not see the MAC of the laptop in the voice VLAN.
Is it possible that your phone is misconfigured and is somehow bridging between the voice and data VLANs?
Regards,
Markus
07-20-2015 04:08 AM
I see the above when the user is plugged in. If there is no user, I have only the last two records (the phone MAC twice, in VLANs 10 and 200). The setup is tagged voice VLAN, tagged user VLAN. I get the same feedback from other sites of the company: the phone MAC shows up everywhere. Only they do not have storms.
At this moment, every active employee's port has three records. They can't all have bad cables, especially with BPDU guard enabled. As for the config, I will have to talk with the telecoms team, though I know they have done the same config for other cities.
The last thing left for me to check is some ports with extreme broadcast packet counts. Their cabling is OK, but they do have VMs. According to the users the VMs are in NAT mode, not bridged, but I will have to verify that myself. Otherwise I am kind of at a dead end.
I am thinking that in NAT mode it may be even worse. If anything loops within the host, with no BPDU reaching the switch ports, then these broadcasts (like the default gateway ARP?) would be sent out toward the default gateway. Sadly my VM experience is far less than my network experience, so I am not sure which software implements what, and what kind of networks it can bridge or just storm. So far I have seen some of our users running Oracle VirtualBox, Parallels on Mac, and VMware Player.
Thanks for the replies guys. It's been a great help so far.
07-20-2015 09:39 AM
Hello,
The phone MAC address can be seen on both the data and voice VLANs due to CDP packets. It is expected on the 3750.
Thanks,
Madhu
07-17-2015 03:22 AM