Spanning Tree Changes

MP13
Level 1

Hi,

I am currently reviewing our STP setup because we are getting regular storm control alerts, even though the threshold is set to 10% of a 10Gbit port for multicast and broadcast traffic and port utilisation for all traffic is typically only around 1Gbit. I think we have issues on some ports, as we're also seeing STP transition from listening to blocking every few minutes on some VLANs.

We've also found that if we add or remove a VLAN from the trunk ports, multicast and broadcast traffic grows over the course of about 15 minutes until the Nexus loop detection kicks in and stops MAC learning for 180 seconds. That causes roughly a 3 minute outage of the network, after which it returns to normal.

During a maintenance window I'm planning to make our STP topology uniform and better configured, and also to run a packet capture on a mirror port to work out exactly what traffic is causing the storm. It may not be STP, but the problem seems to have started since we added some new switches to the network.

I have attached a quick mock-up diagram of our network which shows where we have Rapid PVST and where we have just PVST. My plan is to change all of the switches to Rapid PVST, which will hopefully also help with our convergence time. Current convergence time is about 30-60 seconds. From reading around the community and some other sites, although it's OK to mix modes, the few-second convergence that Rapid PVST should bring won't happen while some devices are still running plain PVST.

My plan for all of the switches currently running PVST is to simply issue "spanning-tree mode rapid-pvst". I just want to check whether there is anything in particular I should watch out for when doing this, or whether any additional steps are recommended?
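
From what I've read, the change and a quick check afterwards would look something like the below on each switch (just a sketch; the verification commands are what I'd expect to use rather than anything I've confirmed for every platform):

spanning-tree mode rapid-pvst
!
! afterwards, confirm the mode and per-VLAN state
show spanning-tree summary
show spanning-tree vlan 125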

We also have a couple of customer networks that link off of ours, which I have shown on the diagram. To my knowledge, at the moment our STP merges with theirs for the VLANs that we provide them. Should we be making some changes to block STP at our edge to them, so that their STP is completely separate from ours?

Last of all, root bridges. At the moment we don't specify the root bridges with any priorities. Would it make sense to change this and make the Nexus 5Ks just under the ASRs the primary and secondary root bridges?
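
From what I've read, specifying the root would just be a couple of commands on the two Nexus 5Ks, something like the below (a sketch only; I understand "root primary/secondary" is a macro that works out the priority values for you, and the VLAN range is just an example):

! on the intended primary root
spanning-tree vlan 1-4094 root primary
! on the intended secondary root
spanning-tree vlan 1-4094 root secondary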

I've tried to read up on things a fair bit, but I'm really just looking to clarify some additional queries here in relation to our own network, using the attached topology diagram.

44 Replies

Yes, you are right.
Peer(STP) means the neighbouring switch is running legacy STP, not RSTP.

Which one of the 4 switches in the middle is the root? Is it the same root for all VLANs? Do you see any 'mac address flapping between ports' messages in the log? I ask because you could have different switches preferring ports with higher priority, which would stop spanning tree from working correctly. I agree with @MHM Cisco World on where the problem is located.
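
Something along these lines should show it quickly on each of the four switches (a sketch; the syntax may differ slightly between IOS and NX-OS):

show spanning-tree root
show spanning-tree summary

The first shows the root bridge ID and root port per VLAN, and the summary should tell you which VLANs that switch believes it is the root for.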

The root bridges are a bit all over the place, with some switches being root bridges for various VLANs and others being root for others. This was part of why I was querying whether we should change things so that, for example, the top left Nexus is the root bridge and the top right Nexus is next in priority order so that it's the secondary.

We don't see messages about MACs flapping between ports on the Nexus switches in the middle, but we do on the 2960s. From looking into that, it seems to mostly be where devices use switch-independent NIC teaming in Windows Server, and we've been gradually changing those to active/passive instead, which stops the messages.

On the Nexus switches, when we have the broadcast/multicast issue after changing trunk ports anywhere on the network, we start to see the message "A network loop has been detected in the Nexus core". Once it has gone through its disabling of dynamic MAC learning a couple of times, the issue stops, after about 15 minutes in total.

On the 6503 a few times a day we get "A storm control event has been detected" on a few of the ports. The limit on those for storm control is set to 10% for broadcast and multicast and there is no unicast limit set. 

I'm not sure if there is a tool that could map the STP topology and make this easier to visualise?

Root bridges being all over the place is not a good thing for the network in my experience. If you look at the topology of a network, the logical choices for the root are usually pretty obvious. My preference has been to set the priority so that one switch will always be the root, and then make a second switch slightly less preferred so that it acts as the backup. You can go further, but you will likely have more severe problems if 2 of your key switches are out of commission.

I am more than a little puzzled by the topology of those 4 switches in the middle. I suspect it evolved that way, but it looks less than ideal to me. You would have far fewer spanning tree issues if you tied your access switches (I assume that is what the 2960 switches are) to a Nexus vPC cluster with LACP port channels. Then you would be able to use both uplinks instead of having one of them in a blocking state.
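
Very roughly, the vPC side of that would look something like the below on the pair of Nexus switches (a sketch only; the keepalive addresses, domain ID and interface numbers are examples, not taken from your network):

feature vpc
feature lacp
!
vpc domain 10
 peer-keepalive destination 192.0.2.2 source 192.0.2.1
!
interface port-channel10
 switchport mode trunk
 vpc peer-link
!
interface port-channel20
 description Downlink to access switch
 switchport mode trunk
 vpc 20
!
interface Ethernet1/20
 switchport mode trunk
 channel-group 20 mode active

The access switch then runs an ordinary LACP port channel across its two uplinks, one to each Nexus, and sees the pair as a single switch, so neither uplink has to sit in a blocking state.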

Thank you. So if I put priority 4096 on the top left and 8192 on the top right and then just let the rest work out from there?
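
I assume that would just be something like the below, applied on each of the two switches (a sketch; the VLAN range is only an example):

! top left Nexus (intended primary root)
spanning-tree vlan 1-4094 priority 4096
! top right Nexus (intended secondary root)
spanning-tree vlan 1-4094 priority 8192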

Is there anything to watch out for when changing the 2960s from PVST to Rapid PVST? I assume we'll see a re-convergence, but given that the two are compatible we shouldn't see any other issues from the change or need to make any other allowances?

The 2 ASRs and the 2 top switches are in a meet-me room where all of our fibres come into the building. The 2 switches below that are in our data suite, and from there they feed our access switches. This is the reason for the 4 switches in the middle.

One final question is how I should deal with the external networks. Should I be blocking BPDUs from those so that is the edge of our STP domain and they then have their own?

I have seen very few occasions where it has been necessary to block BPDUs, so I wouldn't start there. You can try forcing the root bridge to the one that makes the most sense, with a secondary. I still think a vPC cluster of the Nexus switches would make a better and more resilient core, but the above certainly gives you somewhere to start.

Just to clarify, the blocking of BPDUs would only be where we connect to an external network. From the diagram this would be the 2 switches/networks shown out to the left. I was mainly thinking that STP shouldn't be passing between our own network and an external network.
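
If we did go that way, my understanding is it would just be something like the below on each customer-facing interface (a sketch only; the interface is made up, and I appreciate bpdufilter removes loop protection on that port so it needs care):

interface Ethernet1/30
 description Handoff to customer network
 switchport mode trunk
 spanning-tree bpdufilter enable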

The VPC cluster is new to me. I'll do some reading up on this. 

You list 2 switches as 'external network'. That sounds like it could be redundant links. I would not start blocking BPDUs unless you have a specific problem that BPDUs are causing. All IMHO, of course.

Essentially they are other tenants in the same building who piggyback off our network, so we are providing them an internet connection, but in some cases that does include redundant links.

We will leave it as it is in that case. Thank you.

I will report back here on the changes once we make them at the weekend.

MP13
Level 1

We have now completed the maintenance window and have set the top left switch to priority 4096 and the top right switch to 8192. This is now looking a lot better, with the top left being the root bridge for the VLANs.

We then also worked through all of the switches, changing them all to Rapid PVST.

Following the changes, we changed the VLANs on one of the trunk ports and didn't see the same build-up of multicast traffic over a 15 minute period leading to the 3 minute outage we were seeing before. We are also not seeing the Nexus 5Ks report a loop detected when this happens. It may be a bit early to say for sure, but initially it seems better at least, even if the problem isn't gone entirely.

To update on the earlier STP output for confirmation, please find it below:

VLAN0125
Spanning tree enabled protocol rstp
Root ID Priority 4221
Address 547f.eed3.5981
This bridge is the root
Hello Time 2 sec Max Age 20 sec Forward Delay 15 sec

Bridge ID Priority 4221 (priority 4096 sys-id-ext 125)
Address 547f.eed3.5981
Hello Time 2 sec Max Age 20 sec Forward Delay 15 sec

Interface Role Sts Cost Prio.Nbr Type
---------------- ---- --- --------- -------- --------------------------------
Po1 Desg FWD 1 128.4096 P2p
Eth1/2 Desg FWD 2 128.130 P2p
Eth1/6 Desg FWD 2 128.134 P2p
Eth1/7 Desg FWD 2 128.135 P2p
Eth1/8 Desg FWD 2 128.136 P2p
Eth1/18 Desg FWD 4 128.146 P2p
Eth1/20 Desg FWD 4 128.148 P2p

On the 6503 we are still getting regular storm control events, and via monitoring we're still seeing periods where at least a few of the VLANs drop packets for about 20 seconds and then come back. This happens perhaps once every 2-4 hours. Combining the two events, it would seem something still isn't quite right.

On the 6503 the ports showing the issue have the below storm control settings:

interface TenGigabitEthernet1/1
storm-control broadcast level 10.00
storm-control multicast level 10.00

interface TenGigabitEthernet3/1
storm-control broadcast level 5.00
storm-control multicast level 5.00

Te1/1 is higher because we kept increasing it to see if the traffic was genuine and we just needed to find the right level, but with the limit still being triggered at 10% of a 10Gbit port that seems unlikely, so I'm wondering if this may be linked to the STP issue too. Some of the affected VLANs go across these 2 ports, but one that's affected doesn't.
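
I'm planning to keep an eye on the discard counters for those ports with something like the below (I believe this is the right command on the 6503, but I'll double-check the exact syntax as it seems to vary by platform):

show interfaces TenGigabitEthernet1/1 counters storm-control
show interfaces TenGigabitEthernet3/1 counters storm-control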

I did try a monitor port on Te1/1 for a short period but had to stop it after a couple of minutes as my device was struggling. Sorting by the header and identification columns on the traffic I captured, I couldn't see any noticeable repeats of the same packets. I can probably arrange a better device to do a longer capture if this is the right path to go down.
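
For the next attempt the rough plan is a SPAN session on Te1/1 with the filtering done at the capture end, something like the below (a sketch; the destination interface is just an example):

monitor session 1 source interface Te1/1 both
monitor session 1 destination interface GigabitEthernet2/1

On the capture device we'd then restrict the capture to broadcast and multicast frames (e.g. an "ether multicast or ether broadcast" capture filter) so the laptop doesn't get overwhelmed again.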

@Elliot Dierksen @MHM Cisco World - Just to add to the above, it looks like we are actually seeing some regular drop-outs since the change. It's quite random, sometimes an hour apart, sometimes several hours apart.

I have noticed an increase in the number of storm control events on the 6503 since the changes as well, so it could be related. I'll conduct another capture in Wireshark shortly and see if I can find anything, but I'm just looking for any advice either of you may have as well.

I have done a quick review and confirmed that the storm-control events don't line up with when we see the losses. The losses look like they're probably something STP-related, but if it only happens periodically I'm not sure of the best way to capture it.

What do you see in "show spanning-tree vlan 125 detail"? In particular I am interested in the topology change information. When there is a topology change, bridges must flood frames instead of filtering them. Something like this is what I mean.

EBD-3850#sh spanning-tree vl 3 det

 VLAN0003 is executing the rstp compatible Spanning Tree protocol
  Bridge Identifier has priority 4096, sysid 3, address 0042.5ac4.e600
  Configured hello time 2, max age 20, forward delay 15, transmit hold-count 6
  We are the root of the spanning tree
  Topology change flag not set, detected flag not set
  Number of topology changes 39 last change occurred 4d17h ago
          from Port-channel47
  Times:  hold 1, topology change 35, notification 2
          hello 2, max age 20, forward delay 15
  Timers: hello 0, topology change 0, notification 0, aging 300

I have included the information below, which doesn't suggest a significant number of topology changes.

VLAN0125 is executing the rstp compatible Spanning Tree protocol
Bridge Identifier has priority 32768, sysid 125, address f07f.06ae.2440
Configured hello time 2, max age 20, forward delay 15, transmit hold-count 6
Current root has priority 4221, address 547f.eed3.5981
Root port is 1 (TenGigabitEthernet1/1), cost of root path is 2
Topology change flag not set, detected flag not set
Number of topology changes 15 last change occurred 01:35:31 ago
from TenGigabitEthernet3/1
Times: hold 1, topology change 35, notification 2
hello 2, max age 20, forward delay 15
Timers: hello 0, topology change 0, notification 0, aging 300

I can include all of the per-port information as well if you need it?

The last time we saw the losses was more than 1hr 35m ago, but is it worth running this again straight after we next see the issue?

On the 6503 the storm control limit is set to 10%, which on a 10Gbit port works out to 1Gbit of broadcast/multicast, yet the average for all traffic on the port that triggers storm control is only about 1.06Gbit/s, so the bursts would have to be almost entirely broadcast/multicast for the limit to be hit.

Does the time of the topology change correspond to when your network was experiencing problems? What is downstream on port Te3/1?

Sorry, I think I ran that one from the 6503, as I was checking whether it agreed on the root bridge since the 6503 is where we are getting the storm control alerts. Te3/1 is a point-to-point connection to another site about 30 miles away, with a few VLANs spanning across the sites. There is only a single switch in that other site, so it's a bit like the downstream 2960s, but it's an SX550X.
