Re: Loops in an STP-enabled network - How?

molnarattila1221 · ‎03-26-2023

Hello,

I've got a question about STP/RSTP (it applies to both, but I'll just say STP). If I connect several switches and the STP algorithm eventually creates a loop-free environment, then how can a single new switch cause a loop?

Let's look at an example scenario:
Let's say I have 50 switches, and a user buys a cheap switch without STP (or an attacker turns it off), and connects two ports on the cheap switch to each other (creating a loop). If this cheap switch is then connected to a single port on one of my switches (this port is on Switch 1, and it has PortFast but not BPDU Guard enabled), and the cheap switch starts sending broadcast frames, those broadcast frames enter my switched network, which is a loop-free environment. So how can they cause a loop? The frames must eventually terminate on one of my switches (because STP creates a single logical path for frames), so they can't circle around from the last switch that they reach (Switch 50) to the first switch (Switch 1) through which they initially entered my switched environment.

The only answer I could come up with is that they can't cause a loop, but they can take down the network. They do it in 2 ways:

1) They drown out all traffic, so no other frames can be forwarded, because the rogue, cheap switch's broadcast frames take up all of the available bandwidth.

2) Because the rogue switch's user frames take up all available bandwidth, both all other user frames and the BPDU frames of my switches as well get drowned out. This includes the BPDU Hellos, so eventually the Max Age timer expires, and the switches must recalculate STP. While the STP calculations are going on, all user frames (including the broadcast frames of the rogue, cheap switch) are filtered. While the STP calculation is going on, they still keep trying to enter my switched environment, and after the STP calculation is done, they can do so a second, third, etc. time, taking down the network again.

Or is it the case that the switches' CPUs get so busy, they don't even have an opportunity to keep track of the Max Age timer's expiry? If that's true, then the only answer is 1).

Can someone please help me understand this? Is there actually a way to create a loop in an STP-enabled environment, or is the harm done in the 2 ways I describe?

Have a nice weekend.
Attila

Flavio Miranda · ‎03-26-2023

Hi

I think I get your point, and the answer is yes: chaos will ensue in this case, because STP is a mechanism to prevent loops, but it will not suppress a loop if the network is configured incorrectly or maliciously, as you mentioned.

If the loop is not avoided through correct switch placement and STP configuration, and broadcasts start spreading, the switches will do what they are meant to do and forward the frames.

Of course, it depends on how your network is segmented. If you have a stack of switches connected to distribution switches, this will probably crash only the stack.

I witnessed a whole stack of 3650s crash because a user plugged two cables into the same phone.

That's why network design is so important.

molnarattila1221 · ‎03-26-2023

Hi Flavio,

Thanks. The only way I can see to create a loop is for a Blocking port to be somehow forced into a Forwarding state. Are you also saying that this can't happen?

MHM Cisco World · ‎03-26-2023

Even though a BPDU is sent as multicast, the cheap switch does not flood it from one port to another.

The cheap switch receives the BPDU on one side and drops it, so the other side never sees it.

As a result, both sides connected to the cheap switch assume they are connected to a host.

Data frames are treated differently: the cheap switch floods multicast, broadcast, and unknown-unicast frames from one port to the others, and that is how the loop happens.

molnarattila1221 · ‎03-26-2023

Hello MHM,

But doesn't the original, single logical path for frames remain the same while the network is being flooded? The only way I can see to break up this single path and create a loop is for a Blocking port to be somehow forced into a Forwarding state. But this can't happen, right?

MHM Cisco World · ‎03-26-2023

"Blocking port be somehow forced into a Forwarding state"

SW1(STP)-SWx(no-STP)-SW2(STP)------SW1(STP)
Take this design, where the no-STP SWx drops the BPDUs.
SW1(STP) and SW2(STP) see BPDUs only over their direct connection, not via the no-STP SWx.
Hence both SW1 and SW2 put their SWx-facing interfaces into Forwarding even though there is a loop.
This is where the loop happens: a broadcast sent from SW1 to SWx and then on to SW2 returns to SW1, and SW1 floods it again. That is certainly a loop.
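
The argument above can be checked with a small flooding model. This is only a sketch under simplifying assumptions: each switch floods a broadcast out every link except the one it arrived on, blocking is modelled per link rather than per port, and the function name `flood` and the link indexes are made up for illustration. It compares the case where STP never sees the loop (SWx drops the BPDUs, so nothing is blocked) with the case where STP did block the direct SW1-SW2 link.

```python
def flood(links, blocked, source, max_hops=20):
    """Flood one broadcast from `source` and return how many copies are
    still in flight after `max_hops` hops (0 means the flood died out).

    links   -- list of (switch_a, switch_b) cables
    blocked -- set of indexes into `links` that STP has blocked
    """
    in_flight = [(source, None)]  # (switch holding a copy, link it arrived on)
    for _ in range(max_hops):
        next_hop = []
        for switch, arrived_on in in_flight:
            for i, (a, b) in enumerate(links):
                if i == arrived_on or i in blocked:
                    continue  # never send back out the ingress link or a blocked link
                if switch == a:
                    next_hop.append((b, i))
                elif switch == b:
                    next_hop.append((a, i))
        in_flight = next_hop
        if not in_flight:
            break
    return len(in_flight)


# SW1(STP)-SWx(no-STP)-SW2(STP)------SW1(STP)
links = [("SW1", "SWx"), ("SWx", "SW2"), ("SW2", "SW1")]

no_block = flood(links, set(), "SW1")   # BPDUs dropped by SWx, nothing blocked
with_block = flood(links, {2}, "SW1")   # STP saw the loop and blocked SW2-SW1

print("copies still circulating, nothing blocked:", no_block)
print("copies still circulating, one link blocked:", with_block)
```

With nothing blocked, copies keep circulating in both directions around the triangle; with one link blocked, the flood dies out after reaching every switch once.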

molnarattila1221 · ‎03-26-2023

Hi,

Thank you. Then in your setup, the ports were originally in Forwarding state, not Blocking, so Blocking ports weren't changed into Forwarding state, correct?

Can a loop occur with this topology?:

SWx(non-STP)-SW1(STP)-SW2(STP)-SW3(STP)----SW1(STP)

So the STP-speaking switches are connected in a triangle, and the non-STP switch is connected to only SW1, and only 1 interface of SW1.

Attila

MHM Cisco World · ‎03-26-2023

Sure it can.
SW1 sends BPDUs out both ports connected to SWx (no-STP) but does not receive any BPDU back,
so it assumes it is connected to a host, not to a switch, and puts the ports into Forwarding.
SW1 then sends a broadcast out one port connected to SWx and receives it on the other port connected to SWx.
This loop never ends.

But can BPDU Guard protect SW1 from this loop? No. As its name suggests, BPDU Guard depends on receiving a BPDU; here SW1 never receives a BPDU from SWx, so BPDU Guard cannot protect it.
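
The BPDU Guard point can be illustrated with a toy port model. This is a minimal sketch, not real switch behaviour: the class `EdgePort` and its fields are hypothetical, and it assumes, as this thread does, that the cheap switch drops BPDUs (a device that floods them instead would hand SW1 a BPDU and trip the guard).

```python
from dataclasses import dataclass


@dataclass
class EdgePort:
    """Toy model of a PortFast edge port (hypothetical, for illustration)."""
    bpdu_guard: bool = False
    state: str = "forwarding"  # PortFast ports go straight to forwarding

    def receive(self, frame_type):
        # BPDU Guard reacts only to a BPDU arriving on the port.
        if frame_type == "bpdu" and self.bpdu_guard:
            self.state = "err-disabled"
        return self.state


# Case 1: SWx drops BPDUs, so SW1 only ever sees looped data frames.
storm_port = EdgePort(bpdu_guard=True)
for _ in range(1000):
    storm_port.receive("broadcast")

# Case 2: a BPDU does reach the port, and the guard shuts it down.
guarded_port = EdgePort(bpdu_guard=True)
guarded_port.receive("bpdu")

print(storm_port.state, guarded_port.state)
```

In the first case the port stays in forwarding no matter how many broadcasts arrive, which is the situation described above.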

molnarattila1221 · ‎03-26-2023

Hi,

Thanks.

"this make SW1 send broadcast through port connect to SWx and receive it from other port connect to SWx"

In my setup, SWx connects to SW1 through only 1 port. Not 2 ports.

If 2 ports on SWx are connected (eg F0/1 and F0/2), and a 3rd port connects to SW1 (F0/3), and SWx receives a broadcast frame through F0/4, then the broadcast frame will loop between F0/1 and F0/2, and it will be sent to SW1 via F0/3, but it won't be sent back by SW1 to F0/3. SW1 won't send it back to F0/3 because then it would be forwarding it through the port that it came in. Or did you have something else in mind?

Attila

MHM Cisco World · ‎03-26-2023

SWx(non-STP)-SW1(STP)-SW2(STP)-SW3(STP)----SW1(STP) <<- does this SW1(STP) at the end not connect back to SWx?
If it does, then this is the case I explained above.

Other case:
SWx has two ports cross-connected (my mistake)
SWx connects to SW1 with only one port

You assume the broadcast originates on one of the two cross-connected ports of SWx,
but assume instead that the broadcast comes from SW1.
SWx receives it and forwards it out all ports (including the cross-connected ports).
SWx then receives the broadcast again on the cross-connected ports and forwards it to SW1; this is where the loop affects the other switches.
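
This second case can be sketched with a port-level flooding model. It is only an illustration under stated assumptions: the SW1/SW2/SW3 port names (`P1`, `P2`, `P3`), the choice of which triangle port STP blocks, and the function `simulate` are all hypothetical, and every switch simply floods a broadcast out all non-blocked ports except the ingress port.

```python
from collections import Counter


def simulate(pairs, blocked, source, rounds):
    """Flood one broadcast from `source` for `rounds` hops.

    pairs   -- list of (port_a, port_b) cables; a port is (switch, name)
    blocked -- set of ports in the STP Blocking state
    Returns (copies received per switch, copies still in flight).
    """
    cables = {}
    for a, b in pairs:
        cables[a], cables[b] = b, a
    ports_of = {}
    for port in cables:
        ports_of.setdefault(port[0], []).append(port)

    pending = [(source, None)]  # (switch, ingress port)
    received = Counter()
    for _ in range(rounds):
        nxt = []
        for sw, ingress in pending:
            for out in ports_of[sw]:
                peer = cables[out]
                if out == ingress or out in blocked or peer in blocked:
                    continue  # blocked ports neither send nor accept data frames
                received[peer[0]] += 1
                nxt.append((peer[0], peer))
        pending = nxt
    return received, len(pending)


pairs = [
    (("SWx", "F0/1"), ("SWx", "F0/2")),  # cross-connect on the cheap switch
    (("SWx", "F0/3"), ("SW1", "P1")),    # single uplink to SW1
    (("SW1", "P2"), ("SW2", "P1")),
    (("SW2", "P2"), ("SW3", "P1")),
    (("SW3", "P2"), ("SW1", "P3")),
]
blocked = {("SW3", "P2")}  # assumption: STP blocked this port of the triangle

looped = simulate(pairs, blocked, "SW1", 10)
plain = simulate(pairs[1:], blocked, "SW1", 10)  # same network, no cross-connect

print("copies returned to SW1 with cross-connect:", looped[0]["SW1"])
print("copies returned to SW1 without it:", plain[0]["SW1"])
```

In this model the single broadcast from SW1 keeps coming back from SWx round after round, while the blocked port in the triangle stays blocked, so the STP topology itself does not change. Without the cross-connect, the same broadcast reaches each switch once and dies out.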

molnarattila1221 · ‎03-26-2023

Thank you for trying to help me make sense of all of this.

"SWx(non-STP)-SW1(STP)-SW2(STP)-SW3(STP)----SW1(STP) <<- this SW1(STP) in end not connect again to SWx ?"

Nope, it doesn't connect again to SWx. As I've written originally:

"So the STP-speaking switches are connected in a triangle, and the non-STP switch is connected to only SW1, and only 1 interface of SW1."

So SWx connects to SW1 via interface F0/3, and only F0/3.

MHM Cisco World · ‎03-26-2023

check above

molnarattila1221 · ‎03-26-2023

Hi,

Yes, I agree that in the second case you describe, there will be a continuous broadcast. But: the topology of my switches won't change, right? So the ports won't change states, right? The "only" negative consequence is going to be that my switches can't forward any other frames, except the rogue frames by SWx, correct?

MHM Cisco World · ‎03-26-2023

"So the ports won't change states, right?" Do you mean the STP state? If yes, then the STP state of the ports does not change.

Note: always remember that STP depends on BPDUs to detect a loop; when the BPDUs are missing, STP is useless.
"The 'only' negative consequence is going to be that my switches can't forward any other frames, except the rogue frames by SWx, correct?" The broadcast storm will congest every port in your network, and we cannot predict which protocols will be affected.

Joseph W. Doherty · ‎03-26-2023

Since most, if not all, recent Enterprise switches are "wire speed", I'm unsure a single port's "flood" could take down upstream switches. Otherwise, any host sending traffic to its port at full wire speed, even if it were all broadcast traffic, might cause the same impact (or so it would seem).

The foregoing, though, might not apply to all switches. For other switches, your example, or my hypothetical host, might indeed crush part of your network. (When Code Red struck a large-scale Enterprise network that I was supporting, our [10 to 30 Gbps capable - huge then] multi-gigabit-port core L3 switches melted down. What was interesting was that it wasn't the volume of the traffic that took the switches down; it was the nature of the Code Red traffic transiting the switches. [Recreating the same "nature" of traffic in a Python script, I was able to melt down one of our core L3 switches with only about 3.5 Mbps of traffic! {I won't go into the details of the "nature" of the traffic, as I don't want to provide DoS instructions.}] Years later, I once again saw a similar meltdown, caused by a flapping link triggering OSPF SPF recalculation.)

In separate testing, I discovered that 3750G L3 switches could easily overload their CPUs with very little processing. Basically, I discovered that many L3 switches have a low-capacity CPU because normally the bulk of the switch's capacity goes into supporting its data plane, not its control plane.

The foregoing is just a way of saying that some kinds of traffic floods, or even some kinds of low-volume traffic, can cause a switch to melt down, or might not. I.e. your specific example might cause anything from no adverse impact at all to every switch in that same L2 domain melting down (the latter taking any other L2 and/or L3 on that switch with it).

If you're going to have a problem, I can "see" situations where lots of looped traffic, i.e. the volume, might cause an issue, but I think it is more likely that the nature of the traffic, possibly combined with its volume, causes the problem.

Such potential problems are why, I believe, new design guides lean more toward L3 and/or smaller L2 domains. If you're going to hit a problem, at least we can try to minimize its overall impact.