
Spanning-tree nightmare

GFWAFBCCO
Level 1

Hello!

I'm having some issues with spanning tree (obviously) and am trying to find the best course of action to resolve them. I maintain a small layer 2 network consisting of about 120 switches. Most of them are 3750s, but I do have about ten 4500s and ten 6500s. Forgive me ahead of time, but this is a military network, so I might not be able to provide complete details about these devices. In any event, here is my issue.

I have two core switches (6509s) that branch out to 5 ITNs (6509s as well). My goal is to end up with a loop-free topology with Core 1 and Core 2 as the primary and alternate root bridges respectively. Right now I have 13 switches that claim they are roots. You can imagine how this is causing issues. Oddly, the network runs well without any issues until a link to one of the aforementioned roots goes down. We end up in a nasty STP storm that requires manually killing interfaces for a couple of hours. I have pinpointed one switch that causes the most terror when a link drops.

Ignoring the rest of my topology (for now), I have Core 1 (Gi3/16) linked to Core 2 (Gi2/9) and I have my servers sitting on SrvSwitch6509. SrvSwitch6509 (Te1/2) connects to Core 1 (Te9/3) and SrvSwitch6509 (Te1/4) connects to Core 2 (Te9/4). There are 3 server Vlans (let's say Vlans 10-13). When I do a show spanning-tree root port, I see Core 1 is the root on Vlan 10, Core 2 is root on Vlan 13, and SrvSwitch6509 is root on Vlan 13 as well. I have 6 other switches claiming to be root for Vlan 12. I haven't taken any measures to fix this because I don't want to cause a storm and I would like to know if my plan is on the right track before disrupting a live network.

My plan is:

1) Manually shutdown all ports network wide that I would think should be in a blocking state

2) Configure all trunk ports on Core 1 with spanning-tree guard root

3) Configure all trunk ports on Core 2 with spanning-tree guard root except Gi2/9 (link to Core 1)

4) Configure all other switches trunks facing away from the Cores with root guard

5) Set Core 1 as spanning-tree vlan root primary for all Vlans

6) Set Core 2 as spanning-tree vlan root secondary for all Vlans

7) Once the network has converged/calmed down, begin enabling trunks one by one and ensure root guard is on the outermost trunks.

I apologize for not having a diagram drawn out, but I don't have access to Visio for the time being. I've been reading every webpage, forum, and book I can get my hands on in regards to STP over the last few weeks and still feel like I am missing some key points. Am I on the right track or did I miss some crucial steps? Also, what kind of behavior should I expect and/or look out for during this process?

Thank you for your time,

SrA Steven Denham, US Air Force

Goodfellow AFB, TX

24 Replies

lgijssel
Level 9

Normally, there is only one switch claiming to be the root. As you have multiple roots, I guess that the exchange of BPDUs is not enabled on all links.

As a result, you get a different root for parts of the network. This is sometimes called discontiguous STP regions. As you have learned by experience, this is not a desirable situation to have and the effects are exactly those you describe.

This network requires a major checkup of its STP configuration. Because of the size (I would not call this SMALL) you must also consider the most suitable version of STP. You may very well end up with MST, but when there are only a few VLANs, you might get away with RSTP.

There will be lots of people here willing to help you sort this out piece by piece (when you rate posts appropriately) but the most effective solution would be to hire someone with sufficient expertise and plan a sufficient maintenance window.

regards,

Leo

lgijssel,

BPDUs being filtered/dropped was my first guess, so I went through every switch to ensure someone didn't mistakenly put bpdufilter, bpduguard, or portfast on any of the trunks. As far as hiring someone, there would be no fun (or learning) from it. I don't know at what point you need to move from RST to MST, but I would imagine I don't need it given the numbers I have calculated for logical ports on each major switch. I mean, these are 6500s and I only have about 35 Vlans.

Thank you for your prompt reply.

Regards,

SrA Denham

The advice to hire someone was merely given because it will likely become a nightmare for your users otherwise.

Of course, if you can get away with fiddling this out yourself without anyone getting angry, then please do so.

At least it is interesting to learn about the availability requirements of the US Air Force.

Having 35 VLANs per switch (total of 70?) makes MST the optimal solution. You will be running only a few STP instances instead of 70 with RSTP.

This gain becomes even bigger as the number of VLANs increases. So at least I would call this a valid alternative.

Further, I disagree with you on missing the opportunity to learn from this.

You could hire me (for example) and I would take as much time as needed to teach you the tricks.

I have some time left during the Christmas holidays.

Now, this might become a bit troublesome because I live not exactly round the corner but I am sure there will be others who are equally willing to provide some additional training on the job.

regards,
Leo
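
For anyone weighing Rapid PVST against MST here, the difference Leo describes can be put into rough numbers. The sketch below assumes the usual rule of thumb that a trunk counts once per spanning-tree instance it carries and an access port counts once. The trunk count, access port count, and number of MST instances are made-up illustration values, not figures from this network.

```python
def stp_instances(vlan_count, mode, mst_instances=2):
    """Spanning-tree instances a switch runs for a given mode."""
    if mode == "rapid-pvst":
        return vlan_count            # one instance per VLAN
    if mode == "mst":
        return mst_instances + 1     # configured instances plus the IST (instance 0)
    raise ValueError(f"unknown mode: {mode}")


def logical_ports(trunks, instances_per_trunk, access_ports):
    """Rule-of-thumb logical port count: each trunk counts once per instance."""
    return trunks * instances_per_trunk + access_ports


# Hypothetical switch: 35 VLANs (from the thread), 20 trunks and 100 access ports (assumed)
vlans, trunks, access = 35, 20, 100

pvst_ports = logical_ports(trunks, stp_instances(vlans, "rapid-pvst"), access)
mst_ports = logical_ports(trunks, stp_instances(vlans, "mst"), access)

print("Rapid PVST logical ports:", pvst_ports)
print("MST logical ports:", mst_ports)
```

Whether the Rapid PVST figure is a problem depends on the platform's documented logical port limit, which is not stated in this thread.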

Leo,

I have a VTP domain that consists of about 35 Vlans, but I don't use Vlan 1 actively. If I'm not mistaken, there are still protocols that use Vlan 1, even if it is shut down. That will be a future effort once things are stable. As far as annoying users, I will try to do this during off-hours so as not to interrupt work flow... assuming I don't break a switch from misconfiguring or "gentle persuasion".

Oh yes, I forgot to mention that I am in the process of pruning Vlans from trunks that don't need them. So many things to fix, so little time.

-Steve

Hi Steve,

IMHO, you should answer following questions yourself first:

Is it really a good idea to run an L2 network consisting of 120 switches?

What would be the network diameter (maximum number of switches = hops from one side to the other) then?

Wouldn't it be more reasonable to create several L2 segments connected via L3 (3750s should be capable of that easily)?

If you still want to run a single L2 network, I agree with Leo:

You should find the root cause of why several of your switches believe they are the root in the same VLAN.

Are you running the same STP mode on all switches? Aren't some of them running RPVST+ while others run PVST+?

Aren't there even some non-Cisco switches involved running mono-STP only?

Is VTP working properly? Do all the switches have the same configuration revision? If not, are the VTP domain name and password configured correctly?

There are many similar questions coming to my mind, but the crucial one is the multiple roots - IMHO, you can't move further without fixing that.

BR,

Milan

Milan,

"Is it really a good idea to run a L2 network consisting of 120 switches?"

I don't see why not. I've been at bases with larger networks that are strictly L2 and have no issues.

"What would the network diameter be?"

Currently it is only 3 hops, but worst case scenario, 5 hops. It's a situation where each ITN has a boatload of switches on it.

"Wouldn't it be more reasonable to create several L2 segments connected via L3 (3750s should be capable of that easily)?"

I'm not 100% sure on that. I am still a bit of a novice in this field. I was reading up on Inter-VLAN routing and the general consensus is that it can be quite a bit slower if not configured properly. I would rather resolve the issue first before going down that road.

"Are you running the same STP mode on all switches?", "Aren't some of them running RPVST+ while other PVST+?"

Yes, as far as I can tell they are all using Rapid PVST.

"Aren't there even some non-Cisco switches involved running mono-STP only?"

We have two Marconi ES1200s that will be getting replaced in the near future, but I don't suspect they are the issue.

"Is VTP working properly? Are all the switches having the same  configuration version? If not, is the VTP domain name and password  configured correctly?"

Yes, yes, and yes. Confirmed it during Vlan pruning. Using VTPv2, everything in client mode except the Cores. No switches are transparent.

"There are many similar questions coming to my mind, but the crucial one  are the multiple roots - IMHO, you can't move further without fixing  that."

That is my goal. I've been reading up on STP debug commands and how to interpret them. It appears the other roots are not receiving BPDUs on their trunks. I am not sure why, since I get no max age timeouts or diameter warnings. This is why I wanted to just define the roots and troubleshoot from there, unless this is the wrong approach.

Regards,

SrA Steve Denham, US Air Force

Goodfellow AFB, TX

Hi Steve,

I'd try some non-intrusive diagnostics first.

If multiple switches believe they are the root, their neighbors should be pointing to them, creating "clouds" around particular roots.

Finding a switch on the edge of such a cloud (a switch whose neighbor starts to point to another root) should help to find the cause of the problem.

Reading your original mail once more, I see:

"Ignoring the rest of my topology (for now), I have Core 1 (Gi3/16) linked to Core 2 (Gi2/9) and I have my servers sitting on SrvSwitch6509. SrvSwitch6509 (Te1/2) connects to Core 1 (Te9/3) and SrvSwitch6509 (Te1/4) connects to Core 2 (Te9/4). There are 3 server Vlans (let's say Vlans 10-13). When I do a show spanning-tree root port, I see Core 1 is the root on Vlan 10, Core 2 is root on Vlan 13, and SrvSwitch6509 is root on Vlan 13 as well. I have 6 other switches claiming to be root for Vlan 12. I haven't taken any measures to fix this because I don't want to cause a storm and I would like to know if my plan is on the right track before disrupting a live network."

Which means you've got two switches connected directly to each other, but both believing they are the root for VLAN 13?

What does "show spanning-tree interface ... detail" show on the ports connecting the switches? Are BPDUs being sent/received?

(BTW, what's a TE port? Some kind of Fibre channel? I never met that personally.)

BR,

Milan
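
The "cloud" approach Milan describes can be sketched offline once you have collected, per VLAN, which root each switch reports (for example from show spanning-tree root). Everything in the data below is hypothetical: the switch names, the reported roots, and the link list are invented for illustration and are not taken from the attached outputs.

```python
from collections import defaultdict

# Hypothetical: the root bridge each switch reports for one VLAN
reported_root = {
    "Core_1": "Core_1",
    "Core_2": "Core_1",
    "ITN_A": "Core_1",
    "Edge_1": "Edge_1",
    "Edge_2": "Edge_1",
}

# Hypothetical physical links between those switches
links = [
    ("Core_1", "Core_2"),
    ("Core_1", "ITN_A"),
    ("ITN_A", "Edge_1"),
    ("Edge_1", "Edge_2"),
]


def root_clouds(reported):
    """Group switches by the root they believe in."""
    clouds = defaultdict(set)
    for switch, root in reported.items():
        clouds[root].add(switch)
    return dict(clouds)


def boundary_links(reported, link_list):
    """Links whose two ends disagree about the root: the places to check first."""
    return [(a, b) for a, b in link_list if reported[a] != reported[b]]


for root, members in root_clouds(reported_root).items():
    print(f"root {root}: {sorted(members)}")
print("check these links:", boundary_links(reported_root, links))
```

A link that shows up in the boundary list is where BPDUs for that VLAN are apparently not getting across, so that is where the trunk configuration on both ends is worth comparing.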

BTW, what's a TE port? >>> TenGig Ethernet.

Milan,

Attached are the outputs from:

show spanning-tree summary

show spanning-tree root port

show spanning-tree blockedports

show spanning-tree interface (int) detail

If I'm not transmitting or receiving BPDUs, what can I check to see why? Debugging produces a ton of information that seems overly technical.

Here are my trunk configs:

SrvSwitch6506# show run int te1/2

interface TenGigabitEthernet1/2
description Link to Core 1 (Te9/3)
switchport
switchport trunk encapsulation dot1q
switchport trunk native vlan 69
switchport trunk allowed vlan 2,3,8-10,12,14,31,32,69
switchport mode trunk
switchport nonegotiate
no ip address
end

SrvSwitch6506#show run int te1/4
Building configuration...

Current configuration : 275 bytes
!
interface TenGigabitEthernet1/4
description Link to Core 2 (Te9/4)
switchport
switchport trunk encapsulation dot1q
switchport trunk native vlan 69
switchport trunk allowed vlan 2,3,8-10,12,14,31,32,69
switchport mode trunk
switchport nonegotiate
no ip address
end

Core_1#show run int gi3/16
Building configuration...

Current configuration : 285 bytes
!
interface GigabitEthernet3/16
description Link to Core 2 (Gi2/9)
switchport
switchport trunk encapsulation dot1q
switchport trunk native vlan 69
switchport mode trunk
switchport nonegotiate
no ip address
speed nonegotiate
udld port aggressive
spanning-tree guard root
end

Core_1#
Core_1#show run int te9/3
Building configuration...

Current configuration : 312 bytes
!
interface TenGigabitEthernet9/3
description Link to SrvSwitch6506 (Te1/2)
switchport
switchport trunk encapsulation dot1q
switchport trunk native vlan 69
switchport trunk allowed vlan 2,3,8-10,12,14,31,32,69
switchport mode trunk
switchport nonegotiate
no ip address
shutdown
spanning-tree guard loop  # I know this isn't correct, but it is shutdown for now.
end

Core_2#show run int gi2/9
Building configuration...

Current configuration : 259 bytes
!
interface GigabitEthernet2/9
description Link to Core 1 (Gi3/16)
switchport
switchport trunk encapsulation dot1q
switchport trunk native vlan 69
switchport mode trunk
switchport nonegotiate
no ip address
speed nonegotiate
udld port aggressive
end

Core_2#show run int te9/4
Building configuration...

Current configuration : 302 bytes
!
interface TenGigabitEthernet9/4
description Link to SrvSwitch6506 (Te1/4)
switchport
switchport trunk encapsulation dot1q
switchport trunk native vlan 69
switchport trunk allowed vlan 2,3,8-10,12,14,31,32,69
switchport mode trunk
switchport nonegotiate
no ip address
spanning-tree guard loop  #Probably not correct... was trying different things before switch crashed
end

Regards,

SrA Steven Denham

Attached is a rough diagram of my topology. I had hostnames sanitized before, but I figure hostnames aren't critical. FYI: 501-26A is that SrvSwitch6506.

Regards,

SrA Steven Denham

So I decided to just log into all the switches that were claiming to be root and discovered one that is hanging off of 701-CC that for some reason would take priority over Core 1 and Core 2 (even though I manually set priority to 32k). After playing with it a bit, enabling spanning-tree root guard on 701-CC, and pruning unneeded VLANs, it still wouldn't give it up.

So I rebooted it. Fixed.... very strange. I ensured running and startup config were synced before I rebooted too! Now to go through all the others and see what is going on. Now, I understand the root bridge is decided by comparing the priority and ID (MAC) of each device on the network. I assume priority is compared first; if so, why would this switch be so rebellious (3560 24-port)?

More updates to come.

Regards,

SrA Steven Denham
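
On the question of comparison order: in standard STP the bridge ID is compared as priority first, with the MAC address used only as a tie-breaker, and the lowest value wins. With per-VLAN spanning tree the VLAN ID is added to the configured priority (the extended system ID). A minimal sketch of that comparison follows; the priorities and MAC addresses are invented for illustration and are not values from these switches.

```python
def bridge_id(configured_priority, vlan, mac):
    """Return a sortable bridge ID: (priority + VLAN ID, MAC as an integer)."""
    mac_value = int(mac.replace(".", "").replace(":", ""), 16)
    return (configured_priority + vlan, mac_value)


def elect_root(bridges):
    """The bridge with the numerically lowest bridge ID becomes root."""
    return min(bridges, key=bridges.get)


vlan = 12

# Hypothetical values: an access switch left with a lower priority than the cores
bridges = {
    "Core_1": bridge_id(24576, vlan, "0011.2233.4401"),
    "Core_2": bridge_id(28672, vlan, "0011.2233.4402"),
    "Access_3560": bridge_id(4096, vlan, "00aa.bbcc.dd01"),
}
print(elect_root(bridges))

# With equal priorities, the lowest MAC decides
tie = {
    "Switch_A": bridge_id(32768, vlan, "0011.2233.4409"),
    "Switch_B": bridge_id(32768, vlan, "0011.2233.4401"),
}
print(elect_root(tie))
```

So a switch only "wins" against the cores if its advertised priority is lower than theirs, or equal with a lower MAC. Comparing the priority each switch actually advertises (show spanning-tree vlan <id>) against what is configured is a quick way to spot a mismatch.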

Hi Steve,

Yeah, it's definitely a good idea to check all the switches that incorrectly believe they are the root.

There might be an IOS bug requiring a switch reboot to apply a new STP priority - I remember a similar problem several years ago.

Looking at the outputs you provided:

I noticed that VLAN 13 is not allowed on the trunk between your Core2 and 501-26A switches.

So the switches can't exchange VLAN 13 STP BPDUs through that trunk and both can believe they are the VLAN 13 root (if there is no other continuous path within VLAN 13 connecting them).

BR,

Milan
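
The allowed-VLAN check above is easy to script when there are many trunks to audit. A small sketch, using the allowed list from the trunk configs posted earlier in the thread; the function names are just illustrative helpers.

```python
def parse_vlan_list(spec):
    """Expand a 'switchport trunk allowed vlan' list such as '2,3,8-10' into a set."""
    vlans = set()
    for part in spec.split(","):
        part = part.strip()
        if "-" in part:
            low, high = part.split("-")
            vlans.update(range(int(low), int(high) + 1))
        elif part:
            vlans.add(int(part))
    return vlans


def carries_vlan(vlan, side_a, side_b):
    """Per-VLAN BPDUs cross the trunk only if the VLAN is allowed on both ends."""
    return vlan in side_a and vlan in side_b


# Allowed list as posted for Te1/4 (SrvSwitch) and Te9/4 (Core 2)
srv_side = parse_vlan_list("2,3,8-10,12,14,31,32,69")
core_side = parse_vlan_list("2,3,8-10,12,14,31,32,69")

for vlan in (10, 12, 13):
    print(vlan, carries_vlan(vlan, srv_side, core_side))

# VLANs allowed on one end but not the other (empty here, since both lists match)
print(sorted(srv_side ^ core_side))
```

With the posted lists, VLAN 13 is carried on neither end, which matches the two separate VLAN 13 roots.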

Milan,

Thank you for the heads-up. I overlooked that. Not a big deal, as this VLAN is used as the blackhole VLAN (unused access ports). I will fix it though. Now a new one that has me scratching my head. On my topology you will see 3453-168 (bottom left). Off of that switch I have a 4506, 3543-211. On the trunk going out of 3453-168, I pruned all but VLANs 9, 14, and 69. On the other side of the link I didn't prune anything (as I thought it wouldn't matter). Why would 3543-211 not release root for all VLANs until I pruned the VLANs off of its trunk? I was showing BPDUs Tx/Rx on that interface. I even applied root guard to that interface first and it still kept root. Do I have a batch of bad IOSes?!? Granted, I haven't updated all of these yet, but that will be top priority once things are stable.

So my process so far is:

1) Remove any spanning-tree vlan priority commands from distant-end switches

2) Ensure pruning is done on both sides of link

3) Enable root guard on the ITN/switch above its trunk

4) Recheck STP root

I find it strange that VLAN pruning would have any effect on STP root election. Am I correct in thinking that BPDUs are still traversing the link even if the VLAN is pruned?

Regards,

SrA Steven Denham

Hi Steve,

I've got a feeling you might be mixing two terms:

VTP pruning (http://www.cisco.com/en/US/customer/tech/tk389/tk689/technologies_tech_note09186a0080094c52.shtml#vtp_pruning )

and

VLAN disabling on a trunk.

When you enable VTP pruning, STP is not affected at all.

But disabling a VLAN on a trunk disables BPDU transfer for that VLAN, too.

Also, I'm not sure if I understand your "I even applied root guard to that interface first and it still kept root." correctly.

Root guard should be applied on the other side of the trunk to prevent the fake root BPDUs from being forwarded to the LAN.

See http://www.cisco.com/en/US/tech/tk389/tk621/technologies_tech_note09186a00800ae96b.shtml .

HTH,

Milan
