Re: Bizarre DHCP snooping causing spanning tree issues on 3650s on reboot

hemmerling · ‎09-05-2024

Can someone explain what could possibly be causing the problem we've been seeing recently?
We have had to implement ip dhcp snooping and it's having a bizarre interaction with spanning tree when certain 3650 devices reboot.

Maybe it's because we're writing the DHCP binding table to flash so it survives reboots, and that is causing a weird interaction at boot. I don't know.

But what seems to be happening is that certain 3650 switches running 16.12.10 or higher that have ip dhcp snooping applied will, randomly on a reboot, start advertising the MAC address of the lowest interface as a new root for all allowed VLANs. This is often the unused or completely disabled Gi 0/0 (RP management port).

Because we have loop and root guard applied everywhere this will start blocking various trunk ports. "%SPANTREE-2-LOOPGUARD_BLOCK: Loop guard blocking port GigabitEthernet1/1/1 on VLAN0xxx"
If the RP port (or lowest MAC on the device) isn't disabled, you can see its MAC address on the other switches that log the spanning tree root change. It shows up as "%SPANTREE-5-ROOTCHANGE: Root Changed for vlan xxx: New Root Port is GigabitEthernet1/1/1. New Root Mac Address is xxxx.xxxx.xxxx"
It doesn't matter that the port is unplugged and down/down, it still somehow is now advertising that it's the new root.
If it's admin shut (which most of ours are) the other switch doesn't show the mac address, the only mac you'll see is when the root changes back to what it was before it started blocking.

The only reason I know that it's DHCP snooping related is that it was the last change on the devices, and when we remove it the new root advertisement stops. It absolutely stops when "ip dhcp snooping vlan xxx" is removed from the offending switch that is doing the root advertisement from its RP port.

This is clearly a bug. We had no issues for weeks after configuring DHCP snooping, but once we started doing switch IOS upgrades it began happening on some of the switches following their reboots.

I can find no other case of this online, and I know it seems crazy to say they're connected but they are connected, somehow.

Here are the various errors someone will see when it's happening:
%SPANTREE-5-ROOTCHANGE: Root Changed for vlan xxx: New Root Port is GigabitEthernet1/1/1. New Root Mac Address is xxxx.xxxx.xxxx
%SPANTREE-2-LOOPGUARD_BLOCK: Loop guard blocking port GigabitEthernet1/1/1 on VLAN0xxx
%SPANTREE-2-LOOPGUARD_UNBLOCK: Loop guard unblocking port GigabitEthernet1/1/1 on VLAN0xxx.
%SPANTREE-5-TOPOTRAP: Topology Change Trap for vlan xxx
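
For anyone sifting through a syslog dump to find which MAC keeps claiming root, here is a minimal Python sketch. It only assumes the ROOTCHANGE message format quoted above; the VLAN numbers and MAC addresses in SAMPLE_LOG are invented for illustration, and the function names are hypothetical.

```python
import re
from collections import Counter

# Pattern follows the ROOTCHANGE message format quoted in this thread.
ROOTCHANGE = re.compile(
    r"%SPANTREE-5-ROOTCHANGE: Root Changed for vlan (?P<vlan>\d+): "
    r"New Root Port is (?P<port>\S+?)\. "
    r"New Root Mac Address is "
    r"(?P<mac>[0-9a-fA-F]{4}\.[0-9a-fA-F]{4}\.[0-9a-fA-F]{4})"
)


def root_changes(lines):
    """Return (vlan, port, mac) for every ROOTCHANGE message in lines."""
    events = []
    for line in lines:
        match = ROOTCHANGE.search(line)
        if match:
            events.append(
                (int(match["vlan"]), match["port"], match["mac"].lower())
            )
    return events


def claims_per_mac(events):
    """Count how many root changes named each MAC as the new root."""
    return Counter(mac for _vlan, _port, mac in events)


# Hypothetical sample lines: VLAN numbers and MACs are made up.
SAMPLE_LOG = [
    "%SPANTREE-5-ROOTCHANGE: Root Changed for vlan 10: New Root Port is "
    "GigabitEthernet1/1/1. New Root Mac Address is 00AA.BBCC.0001",
    "%SPANTREE-2-LOOPGUARD_BLOCK: Loop guard blocking port "
    "GigabitEthernet1/1/1 on VLAN0010",
    "%SPANTREE-5-ROOTCHANGE: Root Changed for vlan 20: New Root Port is "
    "GigabitEthernet1/1/1. New Root Mac Address is 00aa.bbcc.0001",
    "%SPANTREE-5-ROOTCHANGE: Root Changed for vlan 10: New Root Port is "
    "GigabitEthernet1/1/2. New Root Mac Address is 00aa.bbcc.0099",
]

if __name__ == "__main__":
    for mac, count in claims_per_mac(root_changes(SAMPLE_LOG)).most_common():
        print(mac, count)
```

A MAC that shows up across many VLANs and is not your configured root is the one to chase. This only helps when the MAC is actually logged, which, as noted, is not the case when the RP port is admin shut.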

It will happen on every allowed VLAN. If the RP port (or whichever interface has the lowest MAC address) is not disabled, you will see its MAC advertised as the new root. If it's admin shut, you won't see who the new root is, just that there is one, followed by the blocking. You will have to prune to find the offending device.

It did it on 16.12.10 and 16.12.11 on the 3650s, but the spanning tree changes impact every switch sharing the same VLANs.
I'm mostly asking this here so that the next unlucky network tech has a hope of knowing why their network is going crazy when it happens to them.

Has anyone seen this before?

MHM Cisco World · ‎09-05-2024

There is no relation between dhcp snooping and stp.

MHM

MHM Cisco World · ‎09-05-2024

Before the switch reboots, check the CPU.

DHCP snooping can affect the CPU, so the switch cannot process BPDUs, and that leads to a topology change.

MHM

David Ruess · ‎09-05-2024

Hello,

For starters, based on your entire explanation, spanning tree is using the default priorities, as you mention it's falling back to the MAC address for the election. You could just change the priorities of your root bridges to be lower than the default advertised by all other switches, and the switches won't be able to use their MAC addresses to take over as root bridge.

Secondly, I have never heard of IP DHCP snooping affecting spanning tree, so I'd lean more to a bug as you mentioned. Also you mention DHCP snooping was working for weeks without issue but after the "upgrade" you started experiencing issues. I would say it may be an image issue rather than a snooping issue as well.

Have you tested it on other device makes/models? What were the results?

-David

hemmerling · ‎09-05-2024

We have the priority of the actual root set to 8192, and we are running rapid-pvst everywhere.

I like the idea of it being the CPU being pegged in the offending devices and that being the reason why it starts behaving erratically.
The issue is I can't test it now, I'd have to turn dhcp snooping back on and wait for it to happen again. And because it uses the RP port and because we admin shut them, I can't easily see which switch is the offending one when there are many switches coming back online following a power failure or a reboot.
I was taught not to set the root to 0 on the core switch, to give flexibility for the future. But why would a switch be falling to a priority less than 32768 in the first place, let alone below 8192, if that is what is happening?

It is absolutely related to DHCP snooping being enabled on the offending switches, because when it's removed, the switch instantly stops declaring that its admin-shut RP port is root to all that can hear it. The reason why needs to be addressed by Cisco.

I need to capture traffic while it's happening to better understand what mac it's sending when the RP is admin shut and what priority it's advertising out on the invisible mac, but it is absolutely some sort of bug with the 3650s and only happens when you have dhcp snooping enabled.

David Ruess · ‎09-05-2024

When you test it again and another switch tries to take over the root, log into that switch and do a show spanning-tree summary. It should give the priority/MAC it's using in its calculation.

It's definitely odd it takes weeks for the problem to appear but only seconds for it to go away when DHCP snooping is enabled/disabled.

Check logs and debugs of spanning-tree and dhcp snooping while they are configured together. Maybe it will indicate there.

hemmerling · ‎09-05-2024

Would if I could, but while it's happening the management VLAN is blocking. When it stops I can get in, but by then the root is back to normal. I'd have to be consoled in.
I'm going to try to recreate it in a controlled way so that I can observe it.
It will just keep cycling for days between who's root if I let it (it happened over a weekend the first time, on only 2 switches).
It works fine for weeks, but when a device reboots (while DHCP snooping is on for one of the VLANs), the chance of it doing it when it comes back up goes up.
I really want to see what priority the RP port has when it's doing this. It's either 0 or 4096 to override the existing root.
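
To make the "0 or 4096" reasoning concrete, here is a minimal Python sketch of the standard bridge ID comparison. It assumes the extended system ID behaviour used by PVST+ and Rapid PVST+, where the advertised priority is the configured priority plus the VLAN ID, and the lowest (priority, MAC) pair wins. VLAN 10 and all MAC addresses are hypothetical, as are the function names.

```python
def bridge_id(priority, vlan, mac):
    """Build a comparable bridge ID: (priority + VLAN ID, MAC as an integer).

    Assumes extended system ID, so priority must be a multiple of 4096.
    """
    if priority not in range(0, 61441, 4096):
        raise ValueError("priority must be a multiple of 4096, 0 to 61440")
    if not 1 <= vlan <= 4094:
        raise ValueError("vlan must be between 1 and 4094")
    return (priority + vlan, int(mac.replace(".", ""), 16))


def would_take_root(challenger, current_root):
    """The lower bridge ID wins: priority first, MAC only as a tie-breaker."""
    return challenger < current_root


# Hypothetical values: the real root at priority 8192 on VLAN 10.
ROOT = bridge_id(8192, 10, "00aa.bbcc.ff00")

if __name__ == "__main__":
    # A default-priority switch cannot win, however low its MAC is.
    print(would_take_root(bridge_id(32768, 10, "0000.0c00.0001"), ROOT))
    # A switch advertising 4096 or 0 wins, however high its MAC is.
    print(would_take_root(bridge_id(4096, 10, "ffff.ffff.fffe"), ROOT))
    print(would_take_root(bridge_id(0, 10, "ffff.ffff.fffe"), ROOT))
```

Under those assumptions a low MAC alone is not enough to displace a root configured at 8192. The offending switch would have to advertise a priority below 8192, or the same priority with a lower MAC, which is why a packet capture of its BPDUs would be so useful.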

paul driver · ‎09-05-2024

Hello

@hemmerling wrote:
loop and root guard applied everywhere

%SPANTREE-5-ROOTCHANGE: Root Changed for vlan xxx: New Root Port is GigabitEthernet1/1/1. New Root Mac Address is xxxx.xxxx.xxxx
%SPANTREE-2-LOOPGUARD_BLOCK: Loop guard blocking port GigabitEthernet1/1/1 on VLAN0xxx
%SPANTREE-2-LOOPGUARD_UNBLOCK

RED FLAG! These two commands are mutually exclusive, meaning they should NOT be applied together. If they are applied to the same port, root guard will be disabled. By the look of the above log buffer, are you actually doing this?

Please rate and mark as an accepted solution if you have found any of the information provided useful.
This then could assist others on these forums to find a valuable answer and broadens the community’s global network.

Kind Regards
Paul

Giuseppe Larosa · ‎09-05-2024

Hello @hemmerling ,

>> Because we have loop and root guard applied everywhere this will start blocking various trunk ports. "%SPANTREE-2-LOOPGUARD_BLOCK: Loop guard blocking port GigabitEthernet1/1/1 on VLAN0xxx"

I had a customer that had serious issues with the command:

spanning-tree loopguard default

that was applied to all switches at distribution and access layer.

After a major incident that caused a 3-hour outage, we, as the incoming tech support partner, had the task of investigating.

(our support contract was starting then).

spanning-tree loopguard default was not a wise command.

Our remediation campaign was to remove the commands

spanning-tree loopguard default

spanning-tree rootguard default

and to configure the appropriate protection on a per-interface basis:

root guard on distribution layer interfaces ( pointing away from the root bridge(s) )

loop guard on access layer uplinks ( direction pointing to the root bridge(s) )

The most important action to improve STP, PVST in those times, was to manually prune VLANs on trunks ( the list of allowed VLANs on trunks) so that the number of PVST instances running on each access layer switch was minimized.

I would suggest that you review the STP design of your network.

In theory there shouldn't be a direct interaction between DHCP snooping and STP, however DHCP snooping may be causing the sudden reload of some switch that is followed by the cycle of events that you describe.

Hope to help

Giuseppe

hemmerling · ‎09-09-2024

Unfortunately we are forced to try to have both on to be compliant with CISC-L2-000090 and CISC-L2-000110.
I didn't mean to imply that we had both commands on the same interface, the log examples were just examples from various switches at the time it happened.
We do have loop guard enabled globally per that STIG, and all trunks to access switches have root guard applied.
We enable loop guard on the access switches going back towards the core and root guard to any other access switches connected to those switches.

paul driver · ‎09-09-2024

Hello
loopguard globally and root guard on all trunks to access switches

Same thing, I'm afraid - you need to take loop guard off globally and apply it specifically at interface level


Kind Regards
Paul

hemmerling · ‎09-09-2024

@paul driver wrote:
Hello
loopguard globally and root guard on all trunks to access switches

Same thing i’m afraid - you need to take loop guard globally off and apply it specifically at interface level

Sadly, I can't. It's a CAT II finding. But I'm pretty sure that on our old Layer 3 switch it's one or the other anyway, which means loop guard isn't actually being applied to the interfaces that already have root guard specifically configured.

So that still leaves me with the question of why, with DHCP snooping enabled for the user VLAN on these 3650s, about 1 in 30 claims after a reboot that its admin-shut RP port is the new root.

I know why the logs were showing the blocking errors; I was looking for the link as to why DHCP snooping can trigger it. The layer 3 switch is doing its job: it's blocking the access switch that claims to be root every 3 minutes or so following a reboot if DHCP snooping is enabled on that 3650 (and using a local flash-based binding database file, if that's relevant).
Again, if I can log into the 3650 during that 3-minute window when the VLANs aren't blocking and turn off DHCP snooping, it stops claiming its RP port is root, instantly.

Giuseppe Larosa · ‎09-09-2024

Hello @hemmerling ,

You mean you need to follow a security recommendation like the following:

https://app.xylok.io/reference/benchmark/cisco_ios_xe_switch_l2s_stig/check/f57199f6-535d-449d-b894-f5fb28b45fc8/?version=524af28f-4764-4894-b235-e79e167e340d

where there is loop guard default enabled :

spanning-tree mode pvst
spanning-tree loopguard default

As I have explained this is not a good thing.

The existence of a command at global level is not a good reason to use it.

Again, you should review your network STP design. You should have enough evidence of the bad effects of the command together with DHCP snooping in your case.

Hope to help

Giuseppe

paul driver · ‎09-09-2024

Hello

@hemmerling wrote: The layer 3 switch is doing its job, it's blocking the access switch that claims to be root every 3 minutes or so following a reboot if DHCP snooping is enabled on that 3650. (and using a local flash-based binding database file if that's relevant).
Again, during that 3 minute window when the vlans aren't blocking if I can log into the 3650 and turn off DHCP snooping, it stops claiming its RP port is root, instantly.

If DHCP snooping is blocking an untrusted port and, as a consequence, uplink ports are disabled, which then causes STP to reconverge, then you may have to consider which rogue DHCP devices are active on the network behind untrusted ports.


Kind Regards
Paul

hemmerling · ‎09-09-2024

But the L2 switches advertising their RP interface (which is admin shut, BTW) as root following a reboot have their uplink to the L3 switch trusted for DHCP snooping. There is no DHCP snooping applied to the layer 3 switch. When the RP port isn't admin shut and is merely down/down, I can see it advertising that interface MAC as a new root. When it's admin shut, the new root is still going out but the MAC disappears from the other switches; they still see a "new root this way", you just can't find it as easily.
I think I just need to catch it doing it and packet capture what the offending switches are sending out when they do this.

The way the bug seems to work in case anyone encounters it is:
IOS-XE 16.12.10 or higher on a 3650, with DHCP snooping enabled and the binding database persisted locally on flash, "sometimes" causes the device to advertise the Gi 0/0 RP (or lowest MAC) interface as a new root after a reboot. It does this every few minutes until you remove the "ip dhcp snooping vlan xxx" command, and then it stops. Putting the command back doesn't make it start again; it only happens following a reboot.
