Solved: Loopback Error Detected with STP configured

fgasimzade · ‎08-04-2022

Hello Everyone,

We were introducing a new pair of switches into the network. MSTP was configured on both switches and all around the network.

There was a loop between those 2 switches while we were configuring EtherChannel, and some of the ports around the network were error-disabled due to Loopback Error Detected. I understand that the keepalive mechanism was responsible for this. What I don't understand is how the keepalives were faster than STP kicking in on the switches where the loop occurred.

Is it possible that the keepalives propagated faster than STP detected the loop?

Peter Paluch · ‎08-04-2022

Hello,

If I may ask - to be sure about the exact event - does any of the switches that experienced the err-disabled ports still have the event logged in their logs ("show logging")? If so, could you share those lines from one or two switches?

In particular, are you certain this was due to loopback error, and not due to EtherChannel Misconfiguration Guard? If you configured a static EtherChannel (no PAgP, no LACP, only "mode on"), that one could be more likely.

Thank you!

Best regards,
Peter

fgasimzade · ‎08-04-2022

Hello Peter,

The error log was %PM-4-ERR_DISABLE: loop-back error detected on Gi2/0/48, putting Gi2/0/48 in err-disable state
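For anyone sorting through "show logging" output to tell the two causes apart, here is a minimal sketch in Python. It assumes the messages follow the shape of the two %PM-4-ERR_DISABLE lines quoted in this thread; the function name classify_err_disable and the field names are made up for illustration, and other software versions may word the message differently.

```python
import re

# Pattern modelled only on the two log lines quoted in this thread (assumption:
# other err-disable causes use the same "<cause> error detected on ..." shape).
ERR_DISABLE = re.compile(
    r"%PM-4-ERR_DISABLE: (?P<cause>\S+) error detected on (?P<detected_on>[^\s,]+), "
    r"putting (?P<port>\S+) in err-disable state"
)


def classify_err_disable(line):
    """Return cause, detecting interface and disabled port, or None if no match."""
    match = ERR_DISABLE.search(line)
    return match.groupdict() if match else None


loopback = classify_err_disable(
    "%PM-4-ERR_DISABLE: loop-back error detected on Gi2/0/48, "
    "putting Gi2/0/48 in err-disable state"
)
misconfig = classify_err_disable(
    "*Aug 5 11:45:28.681: %PM-4-ERR_DISABLE: channel-misconfig error detected on Po1, "
    "putting Et0/0 in err-disable state"
)
print(loopback["cause"], misconfig["cause"])
```

A cause of "loop-back" points at the keepalive mechanism, while "channel-misconfig" points at the EtherChannel Misconfig Guard.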

SW1 - SW2 - SW3--SW4

EtherChannel was being configured between SW3 and SW4. The keepalive error occurred on SW1, on its uplink to SW2.

It looks like keepalive propagation from SW1 was faster than the STP action on SW3 and SW4.

Is it possible?

Peter Paluch · ‎08-05-2022

Hello,

I suspect that this event must have been a consequence of a series of events, not a single one. Facts to consider:

Keepalive frames are sent with both their source and destination MAC address set to the same value - which is the MAC address of the port that originates the frame. Once again, this is very important to emphasize, in keepalive frames, Source MAC = Destination MAC.
Any well-behaved switch receiving such a frame can only take three forwarding actions:
1. Flood it through all remaining ports in the same VLAN, which is the case when it doesn't have the destination MAC in the CAM table yet. As a collateral, the switch will also learn the source MAC (identical to the destination MAC) on the incoming port.
2. Blackhole it completely, which is the case when it has the destination MAC in the CAM table already and the MAC points to the same port that the frame arrived on. As a collateral, the switch will also refresh the source MAC (identical to the destination MAC) learned on the incoming port.
3. Forward it out a single interface, which is the case when it has the destination MAC in the CAM table already but the MAC points to a different port. This is a rather pathological case because it means that the sender of the keepalive frame has apparently moved to a different port. As a collateral, the switch will also move the source MAC (identical to the destination MAC) to the new incoming port.
The logical consequence is that Sw2 could not have flooded the keepalive back to Sw1 by looping it internally. What could have happened is that Sw2 flooded the keepalive through other ports, and due to some other switching loop in the network, when the frame came back to Sw2, it forwarded it to Sw1 and caused Sw1 to declare a loopbacked port.
The same consideration goes for Sw3 if it received the keepalive from Sw1 flooded by Sw2.
The only chance for a loop in this network topology is the double connection between Sw3 and Sw4 that was intended to operate as an EtherChannel. However, we don't know yet how exactly it was configured and put into production.
MST comes into play as a mechanism that, whenever a non-edge port becomes Forwarding, propagates a Topology Change (TC) event across the entire network. Upon receiving the TC notification, all switches flush their CAM tables for all non-edge ports (except the port where the TC notification was received). This mechanism on its own would be responsible for switches forgetting where Sw1 is.
Assuming that the EtherChannel, for some so far unexplained reason, misbehaved, what could have happened is that both the links between Sw3 and Sw4 came up, and as a result, MST advertised a TC event, effectively flushing the CAM tables across the switches in the topology. When Sw1 sent out its keepalive frame, it flooded through Sw2 to Sw3 then to Sw4 over one of the links, and Sw4 flooded it out the other link back to Sw3, Sw3 switching it back to Sw2, and Sw2 switching it back to Sw1.

This theory does have a few holes but so far, I can't come up with any better one.
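The three forwarding actions listed above can be sketched in a few lines of Python. This is only an illustration of the reasoning, not how any real switch is implemented: the Switch class, the port names, and the MAC value are hypothetical, and the sketch assumes the destination lookup happens before the source MAC is learned, which is what the description above implies.

```python
from dataclasses import dataclass, field


@dataclass
class Switch:
    name: str
    ports: list
    cam: dict = field(default_factory=dict)  # MAC address -> port it was learned on

    def forward(self, in_port, src, dst):
        """Return the egress ports for one frame and update the CAM table."""
        known_port = self.cam.get(dst)  # destination lookup first (assumption)
        self.cam[src] = in_port         # then learn, refresh, or move the source MAC
        if known_port is None:
            # 1. Unknown destination: flood out all other ports.
            return [p for p in self.ports if p != in_port]
        if known_port == in_port:
            # 2. Destination lives on the ingress port: drop the frame.
            return []
        # 3. Destination known on a different port: forward out that port only.
        return [known_port]


sw2 = Switch("Sw2", ["to_sw1", "to_sw3"])
keepalive_mac = "aa:aa:aa:aa:aa:01"  # hypothetical MAC of Sw1's uplink port

# Keepalive frames carry the same MAC as both source and destination.
first = sw2.forward("to_sw1", keepalive_mac, keepalive_mac)     # CAM empty: flood
second = sw2.forward("to_sw1", keepalive_mac, keepalive_mac)    # learned: drop
came_back = sw2.forward("to_sw3", keepalive_mac, keepalive_mac)  # returns via a loop
print(first, second, came_back)
```

The third call is the pathological case: the frame only reaches Sw1 again because it re-entered Sw2 on a different port, which requires a loop somewhere else in the network.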

"Looks like keepalive propagation from SW1 was faster, than STP action on SW3 and SW4."

That is not how I understand switches to operate. When ports are down, their switching logic is programmed to block all VLANs. When they come up, they come up as blocked already, and it must in fact be MSTP deciding that it's safe to unblock them. The only reason a switchport would come up implicitly unblocked would be having it configured as a PortFast (that is, edge) port. I certainly hope you have not configured your connection between Sw3 and Sw4 like that.

Can you tell us the exact types of switches Sw1 through Sw4 and the software versions they are running?

Best regards,
Peter

fgasimzade · ‎08-05-2022

Hello Peter,

Thank you for the detailed explanation. In fact, I was thinking along the same lines: the keepalive frame from SW1 travelled all the way through SW2, SW3 and SW4, was looped back from SW4 to SW3 and SW2, and eventually returned to SW1 on the same port it originated from. This must have happened because the CAM tables on the switches in this chain were flushed due to the TCN event of a port changing its status to UP while the EtherChannel was being introduced. I just double checked with our network engineer, first he configured Etherchannel on both switches, but somehow it did not work out on one of the switches - ports went to err-disable state. Then he removed etherchannel config completely from this switch only, leaving ethernet cables connected. I believe this is what caused this chain of events. However, I still have no clue how the keepalives traversed the network faster than STP detected the loop between SW3 and SW4.

MHM Cisco World · ‎08-05-2022

BPDUs are processed and modified by each switch, but the loopback keepalive is a small frame that is forwarded without any processing.
And you certainly have an L2 loop, since the keepalive was received on the same port that sent it.

Peter Paluch · ‎08-05-2022

Hello,

"I just double checked with our network engineer, first he configured Etherchannel on both switches, but somehow it did not work out on one of the switches - ports went to err-disable state."

Aha. I suspect that the error message there looked something similar to this one:

*Aug 5 11:45:28.681: %PM-4-ERR_DISABLE: channel-misconfig error detected on Po1, putting Et0/0 in err-disable state

Would I be correct?

This would be the EtherChannel Misconfig Guard jumping into action. In short, the EtherChannel Misconfig Guard observes the source MAC address of the STP BPDUs received across an EtherChannel. If the neighboring switch sends them from an active EtherChannel itself, they will all be sourced from the same MAC. However, if the neighboring switch is not bundling the ports into an EtherChannel and continues to treat them as individual switchports, the source MAC addresses of the sent BPDUs will be different. EtherChannel Misconfig Guard detects this and eventually err-disables the receiving EtherChannel to prevent bad things from happening.

Most often, this happens when one switch is configured for the static EtherChannel ("channel-group ... mode on") while the other is not configured yet. It takes less than a minute for EtherChannel Misconfig Guard to trip. I suspect that this is what happened initially.
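To make the idea concrete, here is a rough Python sketch of the check described above. It is not Cisco's implementation: the class name MisconfigGuardSketch and the MAC values are hypothetical, and it flags the mismatch immediately, whereas the real feature only err-disables the EtherChannel after some time.

```python
from collections import defaultdict


class MisconfigGuardSketch:
    """Track source MACs of BPDUs received on each local port-channel."""

    def __init__(self):
        self.sources = defaultdict(set)  # port-channel name -> source MACs seen

    def on_bpdu(self, port_channel, src_mac):
        """Record one received BPDU; return True if a misconfig is suspected."""
        self.sources[port_channel].add(src_mac.lower())
        # A properly bundled neighbor sources all BPDUs from one MAC, so more
        # than one distinct source suggests it runs the links individually.
        return len(self.sources[port_channel]) > 1


guard = MisconfigGuardSketch()
# Neighbor not bundling: each individual port sends BPDUs from its own MAC.
ok_so_far = guard.on_bpdu("Po1", "00:11:22:33:44:01")
suspected = guard.on_bpdu("Po1", "00:11:22:33:44:02")
print(ok_so_far, suspected)
```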

"Then he removed etherchannel config completely from this switch only, leaving ethernet cables connected."

Aha. And the other switch kept the EtherChannel configured? Because then one scenario for a permanent switching loop would be this one:

Let's assume that the root switch is one of Sw1, Sw2, or Sw3 (any of them would do). If Sw3 had the two links to Sw4 still statically bundled in an EtherChannel, it would be sending out only one BPDU for the entire EtherChannel, not one per link, and the BPDUs would be carried by only one of the two links, always the same one. Let's assume that it's the top one in the diagram. So, to Sw4 that has the EtherChannel removed and the two links operating individually, the port receiving the BPDUs would be the Root port, and the other port that does not receive any BPDUs would eventually become Designated port - and both Root and Designated on Sw4 would be Forwarding. Note that this situation would be stable and STP would be unable to detect, prevent, or resolve it. It would stay indefinitely.

So here, in this situation, Sw4 would have two Forwarding ports connected back to Sw3, and the disaster is done - if, after an MST TC event and global flush of the CAM tables, the keepalive from Sw1 gets flooded through Sw2 to Sw3 and to Sw4, then it will loop back to Sw3 and get switched back to Sw2 to Sw1... et voilà.

Does this match the sequence of events seen across your switched network?
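The path of the keepalive in this scenario can be traced with a small Python simulation. Everything in it is a simplification made for illustration: the port names, the KEEPALIVE_MAC placeholder, and the rule that a bundle always transmits on its first member link are assumptions, all ports are taken to be Forwarding as described above, and CAM tables start empty to mimic the flush after a TC event.

```python
from collections import deque

KEEPALIVE_MAC = "sw1-up"  # placeholder for the MAC of Sw1's uplink (source == destination)

# Physical cabling: Sw1 - Sw2 - Sw3, plus two parallel links between Sw3 and Sw4.
WIRES = {
    ("Sw1", "up"): ("Sw2", "to_sw1"),
    ("Sw2", "to_sw3"): ("Sw3", "to_sw2"),
    ("Sw3", "link_a"): ("Sw4", "link_a"),
    ("Sw3", "link_b"): ("Sw4", "link_b"),
}
WIRES.update({far: near for near, far in list(WIRES.items())})  # both directions


def build_ports(sw4_bundled):
    """Map each switch's logical ports to their physical member ports."""
    return {
        "Sw2": {"to_sw1": ["to_sw1"], "to_sw3": ["to_sw3"]},
        "Sw3": {"to_sw2": ["to_sw2"], "Po1": ["link_a", "link_b"]},  # still bundled
        "Sw4": ({"Po1": ["link_a", "link_b"]} if sw4_bundled
                else {"link_a": ["link_a"], "link_b": ["link_b"]}),
    }


def trace_keepalive(sw4_bundled, max_hops=20):
    """Follow one keepalive from Sw1; return (arrivals, looped_back_to_sw1)."""
    ports = build_ports(sw4_bundled)
    cam = {name: {} for name in ports}     # empty CAM tables, as after a TC flush
    queue = deque([WIRES[("Sw1", "up")]])  # the keepalive leaves Sw1's uplink
    arrivals = []
    while queue and len(arrivals) < max_hops:
        switch, phys = queue.popleft()
        arrivals.append((switch, phys))
        if switch == "Sw1":
            return arrivals, True          # Sw1 received its own keepalive
        in_port = next(lp for lp, members in ports[switch].items() if phys in members)
        known = cam[switch].get(KEEPALIVE_MAC)  # lookup before learning
        cam[switch][KEEPALIVE_MAC] = in_port
        if known is None:
            out = [lp for lp in ports[switch] if lp != in_port]  # flood
        elif known == in_port:
            out = []                                             # drop
        else:
            out = [known]                                        # forward to known port
        for logical in out:
            member = ports[switch][logical][0]  # always the same member link
            queue.append(WIRES[(switch, member)])
    return arrivals, False


broken_path, broken_loop = trace_keepalive(sw4_bundled=False)
healthy_path, healthy_loop = trace_keepalive(sw4_bundled=True)
print(broken_loop, broken_path)
print(healthy_loop, healthy_path)
```

With Sw4 treating the links individually, the frame arrives back on Sw1's uplink. With Sw4 bundling both links, the frame reaches Sw4 and has nowhere else to go.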

Best regards,
Peter

MHM Cisco World · ‎08-04-2022

A loopback error means there is an L2 loop.
Check the spanning tree forwarding (FWD) status of the ports.

Check whether all the port-channel members are flagged "P" or "S".

MHM Cisco World · ‎08-05-2022

In addition to what @Peter Paluch said before, which is 100% correct,
check this link for more detail about how a misconfigured port-channel creates a loop:
https://www.dasblinkenlichten.com/port-channel-loops/

Jitendra Kumar · ‎08-04-2022

It is possible that you are hitting this bug:

https://quickview.cloudapps.cisco.com/quickview/bug/CSCea46385

Keepalives are sent on all interfaces by default in Cisco IOS Software Release 12.1EA-based software. In Cisco IOS Software Release 12.2SE-based software and later, keepalives are not sent by default on fiber and uplink interfaces. For more information, refer to Cisco bug ID CSCea46385 (registered customers only).

The suggested workaround is to disable keepalives and upgrade to Cisco IOS Software Release 12.2SE or later.

Thanks,
Jitendra

Jitendra Kumar · ‎08-04-2022

Temporary solution: use the "no keepalive" interface command to disable keepalives. Disabling keepalives prevents the interface from being err-disabled, but it does not remove the loop.

Thanks,
Jitendra

RachelGomez161999 · ‎08-04-2022

A loopback error occurs when the keepalive packet is looped back to the port that sent the keepalive. The switch sends keepalives out all the interfaces by default.

Regards,

Rachel Gomez