08-04-2022 07:10 AM
Hello Everyone,
We were introducing a new pair of switches into the network. MSTP was configured on both switches and all around the network.
There was a loop between those 2 switches while we were configuring EtherChannel, and some of the ports around the network were error-disabled due to Loopback Error Detected. I understand that the keepalive mechanism was responsible for this; I just don't understand how keepalives were faster than STP kicking in on the switches where the loop occurred.
Is it possible that the keepalives propagated faster than STP detected the loop?
08-05-2022 05:24 AM
Hello,
I just double checked with our network engineer, first he configured Etherchannel on both switches, but somehow it did not work out on one of the switches - ports went to err-disable state.
Aha. I suspect that the error message there looked something similar to this one:
*Aug 5 11:45:28.681: %PM-4-ERR_DISABLE: channel-misconfig error detected on Po1, putting Et0/0 in err-disable state
Would I be correct?
This would be the EtherChannel Misconfig Guard jumping into action. In short, the EtherChannel Misconfig Guard observes the source MAC address of the STP BPDUs received across an EtherChannel. If the neighboring switch sends them from an active EtherChannel itself, they will all be sourced from the same MAC. However, if the neighboring switch is not bundling the ports into an EtherChannel and continues to treat them as individual switchports, the source MAC addresses of the sent BPDUs will be different. EtherChannel Misconfig Guard detects this and eventually err-disables the receiving EtherChannel to prevent bad things from happening.
Most often, this happens when one switch is configured for the static EtherChannel ("channel-group ... mode on") while the other is not configured yet. It takes less than a minute for EtherChannel Misconfig Guard to trip. I suspect that this is what happened initially.
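The check described above can be sketched roughly as follows. This is only an illustration of the idea (one source MAC means the neighbor is bundling, several mean it is not); the function name and MAC addresses are made up, and the real feature involves timers and details that are not modelled here.

```python
def misconfig_suspected(bpdu_source_macs):
    """Return True when BPDUs received on one EtherChannel carry more than one source MAC."""
    return len(set(bpdu_source_macs)) > 1


# Hypothetical MAC addresses, for illustration only.
neighbor_bundled = ["aabb.cc00.0100", "aabb.cc00.0100", "aabb.cc00.0100"]
neighbor_unbundled = ["aabb.cc00.0100", "aabb.cc00.0110", "aabb.cc00.0100"]

print(misconfig_suspected(neighbor_bundled))    # False
print(misconfig_suspected(neighbor_unbundled))  # True
```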
Then he removed etherchannel config completely from this switch only, leaving ethernet cables connected.
Aha. And the other switch kept the EtherChannel configured? Because then one scenario for a permanent switching loop would be this one:
Let's assume that the root switch is one of Sw1, Sw2, or Sw3 (any of them would do). If Sw3 had the two links to Sw4 still statically bundled in an EtherChannel, it would be sending out only one BPDU for the entire EtherChannel, not one per link, and the BPDUs would be carried by only one of the two links, always the same one. Let's assume that it's the top one in the diagram. So, to Sw4 that has the EtherChannel removed and the two links operating individually, the port receiving the BPDUs would be the Root port, and the other port that does not receive any BPDUs would eventually become Designated port - and both Root and Designated on Sw4 would be Forwarding. Note that this situation would be stable and STP would be unable to detect, prevent, or resolve it. It would stay indefinitely.
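To make the role outcome on Sw4 concrete, here is a heavily simplified sketch. It assumes Sw4 has only these two ports and a single upstream neighbor, and it replaces the real spanning-tree priority comparison with a plain sort on hypothetical port names, so treat it as an illustration rather than how MSTP is actually implemented.

```python
def sw4_port_roles(bpdu_seen):
    """Map each port to (role, state); bpdu_seen[port] is True if superior BPDUs arrive on it."""
    roles = {}
    root_chosen = False
    for port in sorted(bpdu_seen):  # simplification: name order stands in for the real tie-break
        if not bpdu_seen[port]:
            # Nobody better is heard on this segment, so the port claims Designated.
            roles[port] = ("Designated", "Forwarding")
        elif not root_chosen:
            roles[port] = ("Root", "Forwarding")
            root_chosen = True
        else:
            # A second port hearing the same neighbor would be held back.
            roles[port] = ("Alternate", "Discarding")
    return roles


# Sw3 still bundles the links, so BPDUs arrive on only one of them:
print(sw4_port_roles({"e1": True, "e2": False}))
# If BPDUs arrived on both links, one port would stay Discarding:
print(sw4_port_roles({"e1": True, "e2": True}))
```

In the first case both ports end up Forwarding, which is the stable loop described above.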
So in this situation, Sw4 would have two Forwarding ports connected back to Sw3, and the disaster is done - if, after an MST TC event and a global flush of the CAM tables, the keepalive from Sw1 gets flooded through Sw2 to Sw3 and on to Sw4, then it will loop back to Sw3 and get switched back to Sw2 and to Sw1... et voilà.
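The path of that keepalive can be traced with a small simulation. This is a sketch under several assumptions that are mine, not from the logs: port names are hypothetical, every port is Forwarding, the CAM tables start empty (just flushed), Sw3 always sends out of the first member of Po1, and each switch looks up the destination before it records the source (the keepalive is modelled with source and destination set to the same MAC).

```python
from collections import deque

# Physical cabling for the SW1 - SW2 - SW3 == SW4 chain; port names are hypothetical.
CABLES = [
    (("Sw1", "up"), ("Sw2", "a")),
    (("Sw2", "b"), ("Sw3", "a")),
    (("Sw3", "e1"), ("Sw4", "e1")),
    (("Sw3", "e2"), ("Sw4", "e2")),
]
LINKS = {}
for left, right in CABLES:
    LINKS[left] = right
    LINKS[right] = left


def keepalive_returns(sw4_bundled, max_hops=16):
    """Return True if the keepalive sent by Sw1 arrives back on Sw1's own port."""
    # Logical port -> member physical ports. Sw3 always keeps its static Po1.
    logical = {
        "Sw2": {"a": ["a"], "b": ["b"]},
        "Sw3": {"a": ["a"], "Po1": ["e1", "e2"]},
        "Sw4": {"Po1": ["e1", "e2"]} if sw4_bundled else {"e1": ["e1"], "e2": ["e2"]},
    }
    cam = {sw: {} for sw in logical}       # empty tables: just flushed by the TC
    mac = "sw1-up-mac"                     # keepalive source and destination MAC
    in_flight = deque([("Sw1", "up", 0)])  # (sending switch, physical port, hops)
    while in_flight:
        sw, phys, hops = in_flight.popleft()
        peer_sw, peer_phys = LINKS[(sw, phys)]
        if peer_sw == "Sw1":
            return True                    # looped back to the originating port
        if hops >= max_hops:
            continue
        ports = logical[peer_sw]
        ingress = next(name for name, members in ports.items() if peer_phys in members)
        known = cam[peer_sw].get(mac)      # destination lookup happens first...
        cam[peer_sw][mac] = ingress        # ...then the source is learned
        if known is None:
            egress = [p for p in ports if p != ingress]  # unknown unicast: flood
        elif known != ingress:
            egress = [known]               # known destination: switch it
        else:
            egress = []                    # destination is behind the ingress port: drop
        for port in egress:
            in_flight.append((peer_sw, ports[port][0], hops + 1))
    return False


print(keepalive_returns(sw4_bundled=False))  # True: Sw4 hands the frame back to Sw3
print(keepalive_returns(sw4_bundled=True))   # False: the frame dies at Sw4
```

With Sw4 treating the links individually, the frame comes back into Sw3 on the other Po1 member, Sw3 already knows the MAC toward Sw2, and the frame is switched back to Sw1.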
Does this match the sequence of events seen across your switched network?
Best regards,
Peter
08-04-2022 09:23 AM
Hello,
If I may ask - to be sure about the exact event - do any of the switches that experienced the err-disabled ports still have the event in their logs ("show logging")? If so, could you share those lines from one or two switches?
In particular, are you certain this was due to loopback error, and not due to EtherChannel Misconfiguration Guard? If you configured a static EtherChannel (no PAgP, no LACP, only "mode on"), that one could be more likely.
Thank you!
Best regards,
Peter
08-04-2022 11:14 PM
Hello Peter,
The error log was %PM-4-ERR_DISABLE: loop-back error detected on Gi2/0/48, putting Gi2/0/48 in err-disable state
SW1 - SW2 - SW3--SW4
EtherChannel was being configured between SW3 and SW4. The keepalive error occurred on SW1, on the uplink to SW2.
Looks like keepalive propagation from SW1 was faster than STP action on SW3 and SW4.
Is it possible?
08-05-2022 02:38 AM
Hello,
I suspect that this event must have been a consequence of a series of events, not a single one. Facts to consider:
This theory does have a few holes but so far, I can't come up with any better one.
Looks like keepalive propagation from SW1 was faster than STP action on SW3 and SW4.
That is not how I understand the switches to operate. When ports are down, their switching logic is programmed to block all VLANs. When they come up, they come up already blocked, and it must in fact be MSTP deciding that it's safe to unblock them. The only reason a switchport would come up implicitly unblocked would be having it configured as a PortFast (that is, edge) port. I certainly hope you have not configured your connection between Sw3 and Sw4 like that.
Can you comment on the exact types of switches and their operating system versions? What exact types of switches are Sw1 through Sw4, and what software is running there?
Best regards,
Peter
08-05-2022 04:02 AM
Hello Peter,
Thank you for the detailed explanation. In fact, I was thinking along the same lines: that the keepalive packet from SW1 travelled all the way through SW2-SW3-SW4, was looped back from SW4 to SW3-SW2, and eventually arrived back at SW1 on the same port it originated from. This must have happened because the CAM tables on the switches in this chain were flushed due to the TCN event of a port changing its status to UP when the EtherChannel was being introduced. I just double checked with our network engineer: first he configured EtherChannel on both switches, but somehow it did not work out on one of the switches - ports went to err-disable state. Then he removed the EtherChannel config completely from this switch only, leaving the Ethernet cables connected. I believe this is what caused this chain of events. However, I still have no clue how keepalives traversed the network faster than STP detected a loop between SW3 and SW4.
08-05-2022 04:12 AM
BPDUs are processed and modified by each switch, but the loopback (keepalive) frame is a small message that is forwarded without any processing.
And you certainly have an L2 loop, since the loopback message was received on the same port that sent it.
08-04-2022 11:32 AM
A loopback error means there is an L2 loop.
Check the spanning-tree forwarding status.
Check whether all port-channel members are flagged "P" or "S".
08-05-2022 05:41 AM
In addition to what @Peter Paluch said before, which is 100% correct,
check this link for more detail on how a misconfigured port channel creates a loop:
https://www.dasblinkenlichten.com/port-channel-loops/
08-04-2022 10:40 PM
It is possible that you are hitting this bug:
https://quickview.cloudapps.cisco.com/quickview/bug/CSCea46385
Keepalives are sent on all interfaces by default in Cisco IOS Software Release 12.1EA-based software. In Cisco IOS Software Release 12.2SE-based software and later, keepalives are not sent by default on fiber and uplink interfaces. For more information, refer to Cisco bug ID CSCea46385 (registered customers only).
The suggested workaround is to disable keepalives and upgrade to Cisco IOS Software Release 12.2SE or later.
08-04-2022 10:46 PM
Temporary solution:
Issue the "no keepalive" interface command in order to disable keepalives. Disabling keepalives prevents the interface from being err-disabled, but it does not remove the loop.
08-04-2022 11:35 PM
A loopback error occurs when the keepalive packet is looped back to the port that sent the keepalive. The switch sends keepalives out all the interfaces by default.
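As a rough illustration of that check, here is a minimal sketch. The class, the frame layout, and the MAC addresses are invented for the example, and it assumes the port recognises its own keepalive by its own MAC address; it is not how any particular platform implements it.

```python
class SwitchPort:
    def __init__(self, name, mac):
        self.name = name
        self.mac = mac
        self.err_disabled = False

    def build_keepalive(self):
        # Modelling assumption: the keepalive is identified by the sending port's own MAC.
        return {"type": "keepalive", "src": self.mac}

    def receive(self, frame):
        # Seeing our own keepalive come back in means the frame was looped somewhere.
        if frame["type"] == "keepalive" and frame["src"] == self.mac:
            self.err_disabled = True
        return self.err_disabled


port = SwitchPort("Gi2/0/48", "0000.0c00.0048")  # hypothetical MAC
foreign = {"type": "keepalive", "src": "0000.0c00.0099"}

print(port.receive(foreign))                 # False: someone else's keepalive
print(port.receive(port.build_keepalive()))  # True: own keepalive came back
```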
Regards,
Rachel Gomez