Solved: Instability in the Spanning Tree

Mike P · ‎03-20-2015

Okay, so yesterday we noticed a lot of STP churn on the network. There were mass TCNs for multiple VLANs, since Brocade also uses PVST. After reviewing the effect of TCs on the network, I realize a TC doesn't necessarily mean a total STP re-convergence event; it only changes STP for the affected direct/indirect links participating in the spanning tree. However, I am focusing on insignificant topology changes, which are clients powering their machines off or on. Now, we should be running RSTP exclusively in our network, but I've come across multiple devices configured either for 802.1d or for no STP at all. This mix probably changes the behavior of TCs and makes it harder to predict what is going on.

So, given that the network has none of the STP best practices implemented, a lot of access ports are not configured with BPDU Guard or PortFast, which is still used in 802.1w to mark an edge port. For 802.1w, a TC occurs when a port moves into the forwarding state (not into the blocking state). That means that if a user restarts their computer or any other device, and its port is not configured for PortFast, the port triggers a TCN when it begins forwarding. This TCN propagates through the 802.1w devices, causing each device that receives it to flush its CAM table. Then the switch has to relearn all MAC addresses except those that are active and communicating.

For a single client, this is not a big deal and is considered mostly cosmetic. But if enough clients on a VLAN change their status (BLK/FWD for 802.1d, FWD for 802.1w), then you will see TCNs being sent to each switch in the spanning tree. Regardless of the STP type, this eventually causes those switches to flush their CAM tables, which causes brief outages for users while the switches repopulate them.
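To make the 802.1w rule above concrete, here is a minimal Python sketch of it. It is only a toy model, not switch code: the `Port` class, the port names, and the function names are all made up for illustration. It shows that a port entering forwarding counts as a topology change only when the port is not marked as an edge port.

```python
from dataclasses import dataclass


@dataclass
class Port:
    name: str                 # hypothetical port name
    edge: bool = False        # True when PortFast / edge port is configured
    state: str = "discarding"


def move_to_forwarding(port: Port) -> bool:
    """Move a port to forwarding; return True if this counts as a TC."""
    port.state = "forwarding"
    # In 802.1w only a non-edge port entering forwarding is a topology change.
    return not port.edge


def link_down(port: Port) -> bool:
    """Take a port down; in 802.1w this alone does not count as a TC."""
    port.state = "discarding"
    return False


ports = [Port("client-1"), Port("client-2", edge=True)]
tc_events = [p.name for p in ports if move_to_forwarding(p)]
print(tc_events)  # ['client-1']
```

Only `client-1`, the port without the edge setting, produces a TC when its client comes up.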

Now, I need to go back to review the logs on the switches, but since I am restricted to 100 lines in the buffer I doubt I'll be able to catch a lot of information unless I am right there when it is happening.

So, my working theory is this: edge ports are not configured properly, so customers power cycling their devices cause TC events because spanning tree considers those ports part of the forwarding topology. Hence we have situations where packets are dropped, or devices are not reachable until a switch floods a frame, hears back from the owner, and knows where to send it.

There are also other events I need to isolate and determine the cause of, such as a switch not receiving a BPDU from the root, electing itself as the root, then hearing a BPDU from the root, re-designating its place in the network as a bridge, and determining its root port. This event coincides with the TC event and makes me wonder how the two are related. As far as I have read/can see, the TC should only cause MAC addresses to be flushed from the CAM, and only those that are inactive. Doesn't that mean it is also flushing out the MAC of the root? And wouldn't the BPDU messages ensure that MAC is active in the CAM? This is really the next leap I need to make to answer this question. Any thoughts on the subject?

Jon Marshall · ‎03-21-2015

Mike

With RSTP, not having portfast configured is a double hit.

Not only does an end device generate a TCN when the port goes up or down but, as you say, a switch running RSTP then flushes its mac address table.

It only flushes the entries for non-edge ports, but without portfast it doesn't know the end device ports are edge ports.

So the first thing you need to do is enable portfast on all end device ports, or "portfast trunk" if any end device is trunking.

Without it, if enough clients are connecting and disconnecting, your network can be in an almost permanent state of topology change and this causes a lot of flooding.

I would do that as a priority and then see what other symptoms you are still seeing.

Jon

Mike P · ‎03-21-2015

Jon,

I appreciate the response. I'll definitely start marking client ports as edge ports to cut down on the topology changes. And if I am understanding your response correctly, then because the non-edge ports are constantly being flushed, it is even flushing entries for the root bridge? I thought RSTP would maintain active entries in the CAM even during a TCN event. Would that include bridge-to-bridge communication? If it is a complete flush, then that would explain why a switch loses its root bridge, and from what log messages I can capture, the events coincide (TCN/Root Bridge lost). And it does find its RBID within a second or so. Yeah, that might be the case. I'll be back in the office tomorrow, able to see what is going on and to get a better grasp of what is/is not configured correctly. Thanks again for your help!

v/r

Mike

Jon Marshall · ‎03-22-2015

Mike

And if I am understanding your response correctly, then because the non-edge ports are constantly being flushed, it is even flushing entries for the root bridge?

Potentially yes, as all non-edge ports, except the one the TCN was received on, are flushed.

RSTP cannot maintain any entries for non-edge ports because it can't know the topology hasn't changed, and the entries it has could now be pointing the wrong way, i.e. to a path that has failed.
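As a rough illustration of the flush described above, here is a small Python sketch. It is only a toy model: the function name `flush_on_tc`, the port names, and the MAC addresses are hypothetical, and a real switch does this in hardware. It keeps entries on edge ports and on the port the TC arrived on, and drops everything else.

```python
def flush_on_tc(mac_table, edge_ports, received_on):
    """Return the MAC table left after a TC is received on `received_on`.

    mac_table maps MAC address -> port name. Entries learned on edge ports
    and on the receiving port survive; all other entries are flushed.
    """
    return {
        mac: port
        for mac, port in mac_table.items()
        if port in edge_ports or port == received_on
    }


# Hypothetical table: only gi0/1 has portfast configured.
mac_table = {
    "aa:aa:aa:aa:aa:01": "gi0/1",   # client, portfast configured
    "aa:aa:aa:aa:aa:02": "gi0/2",   # client, no portfast
    "aa:aa:aa:aa:aa:03": "gi0/24",  # uplink where the TC arrived
    "aa:aa:aa:aa:aa:04": "gi0/23",  # another uplink
}

after = flush_on_tc(mac_table, edge_ports={"gi0/1"}, received_on="gi0/24")
print(sorted(after.values()))  # ['gi0/1', 'gi0/24']
```

The client on `gi0/2` is flushed along with the other uplink simply because its port was never marked as an edge port, so traffic to it is flooded until it is relearned.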

The loss of BPDUs from other switches could be more to do with the flooding but you won't know that until you at least get portfast configured.

TCNs are not in themselves a bad thing, but an excessive number of them, together with all entries being flushed (including those for end devices, because portfast has not been configured), can result in a lot of flooded traffic.

You may well have other issues in your network, and by all means use this forum for help, but like I say, if you can eliminate the unnecessary TCNs and flooding, any issue that remains will be easier to narrow down.

Jon

Mike P · ‎03-22-2015

Jon,

I really appreciate you clarifying the underlying mechanisms for me. It's one thing to read about it and another thing to see it in the wild. Between STP best practices and the STP toolkit, I am sure I can resolve the numerous issues we're having, and if I have any more questions about the technology at work I'll be sure to hit the forum. Thanks again.

v/r

Mike

Jon Marshall · ‎03-22-2015

Mike

No problem and please do come back if needed.

One thing I should have answered from your questions but didn't directly was the question of the mac address of the root switch.

The mac address that is important in the root switch election is the one contained in the BPDU, not the source mac address of the BPDU. The source mac address is simply that of the port that transmitted the BPDU.

If a switch flushes its mac address table it would remove that mac address, but that would make no difference as to whether the switch believed it had lost its path to root or not.

In terms of switch to switch communication BPDUs are sent with a multicast destination mac address so removing that mac address has no effect on BPDUs being exchanged.
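A short Python sketch may help show both points. It assumes the standard 802.1D/802.1w BPDU layout (2-byte protocol ID, 1-byte version, 1-byte type, 1-byte flags, then the 8-byte root identifier) and the IEEE bridge group address 01:80:c2:00:00:00; the sample BPDU bytes, the root MAC, and the function names are made up for illustration.

```python
import struct

STP_MULTICAST = "01:80:c2:00:00:00"  # IEEE bridge group address


def is_multicast(mac: str) -> bool:
    """True when the group bit (lowest bit of the first octet) is set."""
    return bool(int(mac.split(":")[0], 16) & 0x01)


def root_id(bpdu: bytes):
    """Extract (priority field, root MAC) from the root identifier of a BPDU."""
    # Skip protocol id (2), version (1), type (1), flags (1) = 5 bytes.
    # The 2-byte priority field also carries the system ID extension.
    priority, mac = struct.unpack_from("!H6s", bpdu, 5)
    return priority, ":".join(f"{b:02x}" for b in mac)


# Hypothetical RST BPDU: header, root identifier, remaining fields zeroed.
sample = (
    struct.pack("!HBBB", 0, 2, 2, 0)
    + struct.pack("!H6s", 32768, bytes.fromhex("001122334455"))
    + bytes(23)  # path cost, bridge id, port id, timers, version 1 length
)

print(is_multicast(STP_MULTICAST))  # True
print(root_id(sample))              # (32768, '00:11:22:33:44:55')
```

The destination is a group address, so it is never something the switch learns or looks up in its mac address table, and the root's identity is read from the BPDU payload rather than from any learned entry.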

So the fact that you are seeing the switch reporting it has lost its path to root is not a direct consequence of the mac address being flushed, because it doesn't need that to send and receive BPDUs.

However, with all the flooding of end device traffic caused by the flushing, an indirect consequence may be that BPDUs are getting lost.

Apologies for not making that clearer.

Jon

Mike P · ‎03-22-2015

Jon,

This really does help me out a lot, at least being able to understand the processes that are going on behind the scenes. I think, like you suggested, if I begin fixing small issues and moving STP to a stable and predictable topology, a lot of these issues will disappear. Thanks again, and if something makes me scratch my head I'll be sure to come back!

v/r

Mike