Spanning Tree Changes

MP13
Level 1

Hi,

I am currently reviewing our STP setup as we are getting regular storm control alerts, even though the threshold is set to 10% of a 10Gbit port for multicast and broadcast traffic and typical port utilisation for all traffic is only around 1Gbit. I think we have issues on some ports, and we're seeing STP transition from listening to blocking every few minutes on some VLANs.

We've also found that if we add or remove a VLAN on the trunk ports, multicast and broadcast traffic grows over the course of about 15 minutes until the Nexus loop detection kicks in and stops MAC learning for 180 seconds. That causes roughly a 3 minute outage of the network, after which it returns to normal.

During a maintenance window I'm planning to get our STP topology uniform and better configured, and also to run a packet capture on a mirror port to work out exactly what traffic is causing the storms. It may not be STP, but the problem seems to have started since we added some new switches to the network.

I have attached a quick mock-up diagram of our network which shows where we have Rapid PVST and where we have just PVST. My plan is to change all of the switches to Rapid PVST, which should also help with our convergence time. Convergence is currently about 30-60 seconds; from reading around the community and some other sites, mixing modes is supported, but the few-second convergence that Rapid PVST should bring won't happen while some devices are still running plain PVST.

My plan for all of the switches still running PVST is simply to issue "spanning-tree mode rapid-pvst". I just want to check whether there is anything in particular I should watch out for when doing this, or whether any additional steps are recommended?
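
In other words, something like this on each PVST switch during the window, followed by a quick check that the mode has taken effect (this is just my intended plan, not something we've run yet):

spanning-tree mode rapid-pvst
! afterwards, confirm the mode and watch the per-VLAN state settle
show spanning-tree summary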

We also have a couple of customer networks that link off ours, which I have shown on the diagram. To my knowledge our STP currently merges with theirs for the VLANs we provide them. Should we be making changes to block STP at our edge towards them so that their STP is completely separate from ours?
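
From reading around, I think the usual approach is a BPDU filter (or BPDU guard) on the customer-facing ports. A rough sketch of the filter option on an IOS edge port, purely as an illustration (the interface is a placeholder):

interface GigabitEthernet1/0/24
 ! placeholder for a customer-facing port
 ! stops BPDUs being sent or processed on this port so the customer's
 ! STP domain stays separate; only safe where a loop back into our
 ! network over this link is not possible
 spanning-tree bpdufilter enable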

Last of all is root bridges. At the moment we don't pin the root bridges anywhere by setting priorities. Would it make sense to change this and make the Nexus 5Ks just under the ASRs the primary and secondary root bridges?
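
If that's the right approach, I assume the configuration would be roughly the following on the two Nexus 5Ks (the VLAN range is just a placeholder for the VLANs we actually carry):

! on the intended primary root (first N5K)
spanning-tree vlan 1-3967 priority 4096
! on the intended secondary root (second N5K)
spanning-tree vlan 1-3967 priority 8192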

I've tried to read up on this a fair bit, but I'm really just looking to clarify these additional queries in relation to our own network using the attached topology diagram.

44 Replies

Number of topology changes 15 last change occurred 01:35:31 ago <<- if the network is stable then this value should be lower than 15.

I am starting to suspect that CoPP on the NSK is dropping some BPDU frames, which makes the other switches re-elect a new root bridge; maybe that is the case here.

I've just seen the losses on another VLAN and ran the command again, and confirmed that the losses were at the same time as an STP change.

VLAN0345 is executing the rstp compatible Spanning Tree protocol
Bridge Identifier has priority 4096, sysid 345, address 547f.eed3.5981
Configured hello time 2, fex hello time 12, max age 20, forward delay 15
We are the root of the spanning tree
Topology change flag not set, detected flag set
Number of topology changes 458 last change occurred 0:06:45 ago
from port-channel1
Times: hold 1, topology change 35, notification 2
hello 2, max age 20, forward delay 15
Timers: hello 0, topology change 0, notification 0

It seems to happen at different times for different VLANs. Just to note, VLAN 345 won't appear in the previous outputs; there are about 100 VLANs in use, so I previously cut the output down to show just a handful of them.

There is a serious issue: 458 topology changes is far too high a number.
Check CoPP on the NSK for BPDU drops.

Not sure if this is the right thing for me to be checking:

sh copp status
Last Config Operation: None
Last Config Operation Timestamp: None
Last Config Operation Status: None
Policy-map attached to the control-plane: copp-system-policy-default

Doesn't look like there are any drops in any part of the policy:

sh policy-map interface control-plane | i violated
violated 0 bytes;
violated 0 bytes;
violated 0 bytes;
[the "violated 0 bytes;" line repeats for every remaining class in the policy - all zero]

The only classes matching traffic are igmp, bridging, lldp_dcx and cdp, and those show only conformed packets.

With you highlighting where the STP topology change comes from, I've had a look: at the root bridge it says the change comes from the port facing the 6503.

On the 6503 it says the change comes from a port facing one of the 2960s. On the 2960 it says the change came from the 6503, on the same port the 6503 reports.

Does that make it likely the issue is the 6503, given that it says the 2960 caused the change while the 2960 says it was the 6503, or is that just the two of them agreeing on which link it was?
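
For reference, this is roughly how I've been following the change source from switch to switch, per VLAN (VLAN 345 is just the example from the earlier output):

show spanning-tree vlan 345 detail
! the "Number of topology changes ... from <interface>" line shows the
! port the last change arrived on; repeat on the neighbour that port faces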

Share the STP config from both switches and let me check.

On the 6503:

spanning-tree mode rapid-pvst
spanning-tree extend system-id
no spanning-tree vlan 91

On the 2960s:

spanning-tree mode rapid-pvst
spanning-tree etherchannel guard misconfig
spanning-tree extend system-id

The no spanning-tree vlan 91 is for a port carrying an ISP feed, which they said we had to disable STP on.

I'm going to run a packet capture as well and see what is behind the storm control events we are seeing.

Not sure if it's relevant, but on the 2960s we have broadcast and multicast storm control limits set to 1% on Te1/0/1 and Te1/0/2.

We are not seeing any storm control events in syslog for these, so I don't think we're dropping packets there and causing the issues, but I'm raising it as a config item:

Interface Filter State Upper Lower Current
--------- ------------- ----------- ----------- ----------
Te1/0/1 Forwarding 1.00% 1.00% 0.00%
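
For completeness, the storm-control configuration on those two uplinks is along these lines (typed from memory, so treat it as approximate):

interface TenGigabitEthernet1/0/1
 storm-control broadcast level 1.00
 storm-control multicast level 1.00
! the same two lines are applied under TenGigabitEthernet1/0/2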

The packet capture hasn't shown anything obviously at fault so far, but I may need to run another one as I only caught one storm-control event.

We did make some changes whilst the capture was ongoing, and the previous issue, where we would get the loop detected on the Nexus core and then a complete outage, does seem to be resolved since the changes at the weekend. So it appears we're just left with this new issue, which is probably related, just not as severe.

I did manage to capture one of the STP changes; the details of one of the packets exchanged are below. This happens for about 15 VLANs. I'm not sure if it helps at all, but I'm just trying to gather as much information as possible. It happened after adding a new VLAN to some interfaces using the "switchport trunk allowed vlan add" command, so the VLANs that then sent the STP packets should not have been affected.

Frame 1407370: 64 bytes on wire (512 bits), 64 bytes captured (512 bits)
IEEE 802.3 Ethernet
Logical-Link Control
Spanning Tree Protocol
Protocol Identifier: Spanning Tree Protocol (0x0000)
Protocol Version Identifier: Rapid Spanning Tree (2)
BPDU Type: Rapid/Multiple Spanning Tree (0x02)
BPDU flags: 0x3c, Forwarding, Learning, Port Role: Designated
Root Identifier: 4096 / 169 / 54:7f:ee:d3:59:81
Root Path Cost: 0
Bridge Identifier: 4096 / 169 / 54:7f:ee:d3:59:81
Port identifier: 0x8088
Message Age: 0
Max Age: 20
Hello Time: 2
Forward Delay: 15
Version 1 Length: 0
Originating VLAN (PVID): 169

Looking at the rest of the captured traffic, there isn't any clear source of the storms. I did notice, however, that when a storm happened on a different port from the one I was monitoring on the 6503, I got no traffic on my capture port for about 30-60 seconds.

I have about 40 minutes' worth of capture data, so I can search for particular traffic during that period if it will help at all.

Whilst doing this we did find a switch where VLAN 1 was not allowed on the trunk ports, and it was generating a native VLAN mismatch error even though the native VLAN was 1 on both sides. We've added VLAN 1 to it now and that message has stopped. I guess that would also stop BPDUs getting across in RSTP mode, which could be an issue?

"VLAN 1 was not on the trunk ports" <<- I previously mentioned that you must make absolutely sure VLAN 1 is allowed on the trunks, and you have now confirmed it is allowed.
All of this mess can certainly happen from not allowing VLAN 1 on a trunk.
You must also make sure the native VLAN matches across the whole network.
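
A quick way to confirm both points on each switch is the trunk summary:

show interfaces trunk
! IOS: lists mode, native VLAN and allowed VLANs for every trunk port
! (on NX-OS the equivalent is "show interface trunk")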

Thanks. I had seen you mention that previously, which is why I raised it now; we normally add VLAN 1 to all trunk ports between switches, but this one for some reason was an outlier.

I'll monitor it over the next few hours and see if we get any further events and report back here.

Unfortunately this wasn't the cause and we are still seeing the topology changes:

VLAN0345 is executing the rstp compatible Spanning Tree protocol
Bridge Identifier has priority 4096, sysid 345, address 547f.eed3.5981
Configured hello time 2, fex hello time 12, max age 20, forward delay 15
We are the root of the spanning tree
Topology change flag not set, detected flag set
Number of topology changes 481 last change occurred 0:38:01 ago
from Ethernet1/8
Times: hold 1, topology change 35, notification 2
hello 2, max age 20, forward delay 15
Timers: hello 0, topology change 0, notification 0

One good update on this is that since adding VLAN 1 to the trunk port it was missing from, the storm control events seem to have stopped; we've now had no new alerts for about 14 hours, whereas they were happening every 30 minutes before.

On the STP issue, I worked to simplify the paths so that there is only one path to the 2960s. I then started to see better stability, but one particular VLAN still had an issue, and incidentally it also seems to be the worst affected:

Number of topology changes 252792 last change occurred 0:00:57 ago

With only a single path now and this VLAN clearly still having STP issues, I followed where STP said the change was triggered from, and it turned out to be a port on one of the 2960s which faces a server. This is just a standard access port.

I've shut that port and now had the longest period without any STP changes that we've had in the last few days. I'll keep monitoring it over the course of the day and report back later if the STP changes remain stopped.

If they do, I'll then work to re-enable the paths and see if the issue stays stable and if it's been down to that server. 

Once I have confirmation from monitoring I'll update here.
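
As an aside, if the server-facing port does turn out to be the trigger, my understanding is that rather than leaving it shut, configuring access ports like that as PortFast edge ports stops link flaps generating topology changes; something like this on the 2960 (the interface is a placeholder):

interface GigabitEthernet1/0/10
 ! placeholder for the server-facing access port
 spanning-tree portfast
 ! optionally err-disable the port if a BPDU ever arrives on it
 spanning-tree bpduguard enable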

Is it a server with VMs on it?

Unfortunately I don't have access to it, or to the person who manages it, at the moment to find out. Shutting the port hasn't solved the issue anyway, but it has led to fewer STP changes on that particular VLAN.

Since simplifying things and making it single path, we seem to have had better stability on most VLANs, but there are still a handful which are changing topology around once per hour.

When we see the STP changes and trace which link caused them, we get down to the bottom 2960 distribution switches and find the below:

6503 -> Topology change from TenGigabitEthernet1/3 
2960 -> Topology change from  TenGigabitEthernet1/0/1

These two ports connect to each other. Given they both point at the same link as the source of the change, is that just the two of them agreeing that this link was the cause?

The 2960 currently only has a handful of ports active: two which face external networks and two which are just access ports. The VLAN that these topology changes come from is on none of those ports and is currently only active on the trunk port between the two switches.

I have accepted this as the solution, as it identified the right problem, just not on the NSK.

We did some further work on this and found a particular 2960: if we moved devices to switches above it, the problem went away, and the topology changes were electing switches below this 2960 as the new root and then reverting to the NSK.

We're not sure if the switch was bad or just couldn't handle the traffic, but it was seeing some CRC errors and OutDiscards. The CRC errors turned out to be due to a faulty daughterboard providing the dual personality ports on the 2960, so once we stopped using those ports the CRC errors stopped, but the STP changes continued.

We had a maintenance window for some other changes, so we ended up replacing the 2960 with a spare Nexus 3K, which is better suited to the volume of packets and traffic being exchanged on that switch. Since then the STP changes have been stable and only aligned with actual topology changes.

Thank you for your help on this.
