Solved: Re: How does a loop form in a misconfigured Etherchannel?

Peter Paluch · ‎12-04-2010

Dear friends,

It is a commonly seen and practically proven issue that if two switches are interconnected by a number of parallel links which are bundled into an Etherchannel on one switch (obviously using the on mode) while being unbundled on the second switch, a Layer2 loop may well be created. However, I do not understand the exact mechanism of the formation of this loop.

I am well aware of the basic principles behind this: I know that STP treats the Port-channel interface as a single interface, and thus all member links bundled in this Etherchannel share the same STP state/role. I also understand that a broadcast/multicast/unknown unicast frame sent by a port in the Etherchannel will reach the opposite switch and get flooded over all remaining links, eventually arriving back at the switch with the configured Etherchannel.

And this precise moment is where my understanding ends: The frame arrived back and its destination is still unknown. However, from the viewpoint of the switch, the frame came in through a particular Port-channel interface. If this switch floods the frame, it will flood it through all remaining ports except the port through which the frame came in, meaning that the frame will never be sent back through the Port-channel. How does the loop get created, then?

Thank you very much for helping me out with this!

Best regards,

Peter

Jon Marshall · ‎12-05-2010

Peter

A viewpoint, the situation is no different, and yet, no looping ensues - even with physical address learning, otherwise the Etherchannel would be unusable per se.

That's a good point; it would be no different from an Etherchannel receiving an unknown unicast, or a broadcast for that matter, from a fully functioning Etherchannel at the other end.

Jon

Jon Marshall · ‎12-05-2010

I don't understand how you can be confused.

Ohh, that can happen all too easily

No problem, must be the way I was reading it.

Jon

Edison Ortiz · ‎12-05-2010

r-godden · ‎12-05-2010

I have come across loops twice at different customer sites caused by using mode on; desirable is the recommendation in the 4500/6500 best practices (not sure if available to non-partners). Just steer clear of mode on, as it has this hidden feature every now and then.

lamav · ‎12-05-2010

"That is just an extension of the usual rule that a frame is never going to be sent back its own ingress port.."

Just an FYI, this basic rule is changing with VEPA aka "VEB in the switch." In such a case, the adjacent hardware bridge will switch traffic between VMs, as opposed to the software bridge in the hypervisor.

Victor

Peter Paluch · ‎12-06-2010

Edison, Jon and everybody,

I think I've got it. Please bear with me while I explain my thoughts.

As we have discussed here, a pure Etherchannel between just two switches, with one switch being configured for Etherchannel and the other not, may result in frames being reflected (bounced) back but not in frames going around in circles. A frame received on a member port of an Etherchannel bundle will not be forwarded back through any member port of the same bundle, so a loop cannot form that way. I have tested this in our lab - I connected two switches together with two links, one switch configured for Etherchannel, the other not. No broadcast storm occurred as a result, although I intentionally flooded the topology with broadcast traffic (a video stream sent to a broadcast IP address). As soon as I stopped transmitting the stream, the topology fell silent - no frames got caught in a loop, and no network collapse ensued whatsoever.
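The flooding rule described above can be sketched as a small simulation. This is only a rough model under stated assumptions: the link names "A" and "B" are hypothetical, all ports are taken to be forwarding, and a bundle's hash choice is reduced to "always use the first member link". It counts how often one broadcast crosses an inter-switch link before it dies out.

```python
from collections import deque

# Hypothetical two-switch topology: links "A" and "B" run between DS and AS.
# DS bundles both links into Po1; AS (misconfigured) treats them as separate ports.
# A logical port with an empty link list stands for a host-facing edge port.
TOPOLOGY = {
    "DS": {"host": [], "Po1": ["A", "B"]},
    "AS": {"host": [], "Fa0/2": ["A"], "Fa0/3": ["B"]},
}


def count_link_crossings(topology, start_switch, ingress_port, limit=100):
    """Flood one broadcast; return how often it crosses an inter-switch link."""
    # Map each link to the logical port that owns it on each switch.
    owner = {}
    for switch, ports in topology.items():
        for port, links in ports.items():
            for link in links:
                owner.setdefault(link, {})[switch] = port

    pending = deque([(start_switch, ingress_port)])
    crossings = 0
    while pending and crossings < limit:
        switch, in_port = pending.popleft()
        for port, links in topology[switch].items():
            if port == in_port or not links:
                continue  # skip the ingress logical port and host-facing ports
            link = links[0]  # assumption: the bundle always picks its first member
            crossings += 1
            for peer, peer_port in owner[link].items():
                if peer != switch:
                    pending.append((peer, peer_port))
    return crossings


if __name__ == "__main__":
    # The frame goes DS -> AS on one link, bounces back on the other, then stops.
    print(count_link_crossings(TOPOLOGY, "DS", "host"))
```

In this model the frame is reflected once and then discarded, because DS sees it arrive on Po1 and never floods it back out of Po1. That is consistent with the lab observation that traffic stopped as soon as the source stopped.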

What I realized is that the loop must be formed by an additional redundant link in the topology that somehow - when combined with a misconfigured Etherchannel - results in multiple paths between switches, thus forming a loop. The networks we have seen collapse under Etherchannel misconfiguration were always more redundant than just a single Etherchannel between switches, so I began to suspect that additional redundancy must be present in the network for the loop to form.

I have therefore analyzed the simplest scenario fulfilling this requirement - adding a third standalone link between the two switches. To better visualize the concept, please have a look at the exhibit Example1.png I have attached. There are two switches - the Distribution Switch (DS) and the Access Switch (AS). The standalone link is the Fa0/1. Furthermore, DS has ports Fa0/2-3 bundled in an Etherchannel while AS has no Etherchannel configured. The DS is configured as STP Root. What will now happen?

1. Because DS is the STP Root, all its ports (Fa0/1 and Po1) are Designated Forwarding.
2. AS will receive BPDUs via Fa0/1 and via exactly one of the links of the Etherchannel from DS. Let's assume that this link is Fa0/2.
3. AS will declare Fa0/1 to be its Root port (the lowest sender port ID) and Fa0/2 as Alternate Discarding.
4. However, AS receives no BPDUs on Fa0/3. Therefore, it declares Fa0/3 as Designated Forwarding.
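The role decisions on AS listed above can be mimicked with a few lines of Python. This is a deliberately simplified sketch, not an STP implementation: it assumes DS is the Root, all links have equal cost (so only the sender port ID breaks the tie), and the port IDs used, (128, 1) for Fa0/1 and (128, 65) for Po1, are hypothetical values chosen for illustration.

```python
def assign_roles(received_bpdus):
    """Return the STP role of each port on a non-root switch.

    received_bpdus maps a local port name to the sender port ID
    (priority, number) of the BPDUs heard there, or None if none arrive.
    """
    hearing = {port: pid for port, pid in received_bpdus.items() if pid is not None}
    # The port hearing the lowest sender port ID becomes the Root port.
    root_port = min(hearing, key=hearing.get) if hearing else None

    roles = {}
    for port, pid in received_bpdus.items():
        if pid is None:
            # No BPDU heard: the switch believes it is the best bridge on this segment.
            roles[port] = "Designated Forwarding"
        elif port == root_port:
            roles[port] = "Root Forwarding"
        else:
            roles[port] = "Alternate Discarding"
    return roles


# DS sends the Po1 BPDU on only one member link, assumed here to be Fa0/2.
AS_BPDUS = {"Fa0/1": (128, 1), "Fa0/2": (128, 65), "Fa0/3": None}

if __name__ == "__main__":
    for port, role in assign_roles(AS_BPDUS).items():
        print(port, role)
```

The point of the sketch is the `None` branch: a port that hears no BPDU at all ends up Designated Forwarding, which is exactly what happens to Fa0/3.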

And voila! - we have a loop here - two links completely unblocked and forwarding: Fa0/1 and Fa0/3. A single broadcast now starts the usual broadcast storm that I have been striving for so long!
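To see why this combination loops, here is a rough flooding model of the three-link case. It rests on several assumptions: the link names "L1" to "L3" are hypothetical, MAC learning is ignored (the frame is a broadcast), the bundle's hash is reduced to "always use the first member link", and a frame is declared to be circulating once it has crossed inter-switch links `limit` times.

```python
from collections import deque

# Hypothetical link names: "L1" is the standalone Fa0/1 link,
# "L2" and "L3" are the two links that DS bundles into Po1.
TOPOLOGY = {
    "DS": {"host": [], "Fa0/1": ["L1"], "Po1": ["L2", "L3"]},
    "AS": {"host": [], "Fa0/1": ["L1"], "Fa0/2": ["L2"], "Fa0/3": ["L3"]},
}
BLOCKED = {("AS", "Fa0/2")}  # the only Alternate Discarding port


def still_circulating(topology, blocked, start_switch, ingress_port, limit=1000):
    """Return True if one broadcast is still being flooded after `limit` crossings."""
    # Map each link to the logical port that owns it on each switch.
    owner = {}
    for switch, ports in topology.items():
        for port, links in ports.items():
            for link in links:
                owner.setdefault(link, {})[switch] = port

    pending = deque([(start_switch, ingress_port)])
    crossings = 0
    while pending:
        if crossings >= limit:
            return True
        switch, in_port = pending.popleft()
        if (switch, in_port) in blocked:
            continue  # a discarding port drops whatever it receives
        for port, links in topology[switch].items():
            if port == in_port or not links or (switch, port) in blocked:
                continue  # skip ingress port, host ports, and discarding ports
            link = links[0]  # assumption: the bundle always picks its first member
            crossings += 1
            for peer, peer_port in owner[link].items():
                if peer != switch:
                    pending.append((peer, peer_port))
    return False


if __name__ == "__main__":
    print(still_circulating(TOPOLOGY, BLOCKED, "DS", "host"))
```

In this model the frame travels down L1, comes back up L3 into Po1, and is flooded down L1 again, without end. Adding `("AS", "Fa0/3")` to the blocked set, which is what STP would do if that port heard BPDUs, makes the function return `False`.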

And then it all suddenly began making sense. The true loop with frames endlessly circulating is not actually created by the presence of the misconfigured Etherchannel itself but rather by the modified operation of STP over an Etherchannel bundle - that the BPDUs for a particular VLAN are sent through a single link in the entire bundle. All other bundled ports that do not carry BPDUs can be mistakenly considered as eligible for Designated Forwarding by the switch with a missing/misconfigured Etherchannel, and that forms the basis of the actual loops.

I have subsequently analyzed a rather common scenario with two distribution layer switches and an access switch connected to both distribution switches. Please see first the Example2.png. The DS1 is configured as STP Root, the DS2 is configured as STP Secondary Root. In this exhibit, the AS has the ports Fa0/1-2 unconfigured by mistake (or not yet). Assuming that the bundle on AS towards DS2 becomes the root port (Etherchannel has a lower cost than individual links so with enough links in an Etherchannel of an appropriate speed, this may happen by default), one of the ports Fa0/1-2 on AS becomes Alternate Discarding (because it receives BPDUs from DS1) and the other becomes Designated Forwarding as it receives no BPDUs. A loop is thus formed.

The Example3.png depicts another common scenario with ports Fa0/5-6 unbundled on AS. DS1 is again STP Root, DS2 is STP Secondary Root. Here, the bundle on AS towards DS1 will be declared Root port, and because the DS2 has a lower BID than AS (it is Secondary Root), its bundle towards AS will be declared as Designated Forwarding. Again, AS will declare one of the ports Fa0/5-6 as Alternate Discarding and the other as Designated Forwarding, and here we have the loop again.

I assume this is actually what brought down the networks with misconfigured Etherchannels.

I have experimentally tested all three scenarios in our lab and I was able to easily create a broadcast storm in each of these cases. Furthermore, I did not deactivate the STP Etherchannel Misconfig Guard. Even with this guard left active, there was absolutely no problem in creating these loops as described earlier. The reason is that this guard basically reacts to the arrival of BPDUs sent from differing MAC addresses on ports bundled in an Etherchannel, which is not expected in a correct configuration. However, in my particular topology, each bundle consisted of two links. Whenever AS declared one of these links as Alternate Discarding and the second as Designated Forwarding, the Etherchannel received BPDUs only via the Forwarding link and so the EC Misconfig Guard had no reason to kick in. It would probably be different if my bundles consisted of at least 3 links, but for two-link bundles this guard is helpless (which is logical - it makes only local decisions and so has very limited information).
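The guard behaviour described above can be expressed as a one-line check. This is only a model of the rule as stated in this post (differing BPDU source MACs seen on members of one bundle), not the actual IOS implementation; the MAC addresses and the extra port Fa0/4 are hypothetical.

```python
def guard_triggers(bpdu_source_macs):
    """Return True if BPDUs from more than one source MAC arrive on one bundle.

    bpdu_source_macs maps each member port of the bundle to the source MAC
    of the BPDUs received on it, or None if that member receives no BPDUs.
    """
    seen = {mac for mac in bpdu_source_macs.values() if mac is not None}
    return len(seen) > 1


# Two-link bundle: the peer's Alternate port is silent, only its Designated port sends.
TWO_LINK = {"Fa0/2": None, "Fa0/3": "00:00:5e:00:53:03"}

# Three-link bundle: the peer has two Designated ports, each sending from its own MAC.
THREE_LINK = {"Fa0/2": None, "Fa0/3": "00:00:5e:00:53:03", "Fa0/4": "00:00:5e:00:53:04"}

if __name__ == "__main__":
    print(guard_triggers(TWO_LINK), guard_triggers(THREE_LINK))
```

With two links the bundle only ever hears one source MAC, so the check has nothing to compare. With three links, two unbundled Designated ports on the far side would give it the mismatch it looks for.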

Phew! I would like to thank everyone who has joined this thread so far and helped me to finally resolve this mystery!

Best regards,

Peter

Edison Ortiz · ‎12-06-2010

Peter,

Excellent job on the testing and explanation. Rated as deserved

Giuseppe Larosa · ‎12-06-2010

Hello Peter,

After more than 30 posts I have found an interesting note:

>> And then it all suddenly began making sense. The true loop with frames endlessly circulating is not actually created by the presence of the misconfigured Etherchannel itself but rather by the modified operation of STP over an Etherchannel bundle - that the BPDUs for a particular VLAN are sent through a single link in the entire bundle. All other bundled ports that do not carry BPDUs can be mistakenly considered as eligible for Designated Forwarding by the switch with a missing/misconfigured Etherchannel, and that forms the basis of the actual loops.

Rated as deserved

PS: I have tested the keepalive on FE ports of a C3750 ME and I can say they are IEEE LOOP frames. If you remember, we discussed this in an old thread some months ago.

Best Regards

Giuseppe

Peter Paluch · ‎12-06-2010

Hi Giuseppe,

Thank you very much, I appreciate it!

>> PS: I had tested the keepalive on FE ports of a C3750 ME and I can say they are IEEE LOOP frames, if you remember an old thread of some months ago we discussed about this.

Yes, I remember, this was the discussion:

https://supportforums.cisco.com/message/3005684

The summary of the entire discussion on my part was that the LOOP frames seemed to be used only to detect self-looped ports. Even though the original Ethernet Configuration Test Protocol was intended to have more functions, they did not seem to be implemented/used on Cisco switches. Arrival of a LOOP frame on the port from which it was originally sent resulted in err-disabling the port. Non-arrival of a LOOP frame was considered correct.

Have you been able to find out anything new/different regarding the LOOP protocol on the C3750 ME? (Interestingly enough, the LOOP protocol does not seem to be covered by any IEEE standard - all current information suggests that it was part of the original Ethernet v2.0 by DIX but was not adopted by IEEE.)

Thanks!

Best regards,

Peter

Jon Marshall · ‎12-06-2010

Peter

Really great work and great explanation too. +5 isn't really enough for all the effort.

Jon

will · ‎12-28-2012

nice post and thx for the testing. helps a lot to understand etherchannel a bit more!

wpalumbo06 · ‎01-10-2014

Thanks Peter,

I always enjoy reading your posts and it was nice to meet you last year at the Cisco RTP campus. I encountered this problem twice last year (one occurrence in a Data Center - ouch) and regardless of the science behind the issue, it's a very nasty and disruptive problem that is easily avoided by using LACP or PAgP when possible. I, too, like to know the underlying reasons behind problems like this, and your post greatly helped me understand something that I knew to be a problem but didn't really know why it was a problem. I believe understanding issues like this at the root level makes us all better engineers and ultimately helps us to build better networks!

Thanks again for the great contributions to these forums.

Bill

devils_advocate · ‎01-10-2014

I know this is a three year old thread but I had not come across it before.

Great read for a lazy Friday in the office!!

Jon Marshall · ‎01-11-2014

Peter

I forgot all about this thread. I said it deserved more than just a rating and now I can

Jon

Julio Carvajal · ‎01-11-2014

Hello Peter,

Wow,

What an amazing Post!!

Kudos to you.

You should do a document about this, it would be great.

Cheers,

Julio Carvajal Segura
http://laguiadelnetworking.com

Julio Carvajal
Senior Network Security and Core Specialist
CCIE #42930, 2xCCNP, JNCIP-SEC