
Spanning-tree nightmare

GFWAFBCCO
Level 1

Hello!

I'm having some issues with spanning tree (obviously) and trying to find the best course of action to resolve the issues. I maintain a small layer 2 network consisting of about 120 switches. Most of them are 3750's, but I do have about ten 4500s and ten 6500s. Forgive me ahead of time, but this is a military network, so I might not be able to provide complete details about these devices. In any event, here is my issue.

I have two core switches (6509s) that branch out to 5 ITNs (6509s as well). My goal is to end up with a loop-free topology with Core 1 and Core 2 as the primary and alternate root bridges respectively. Right now I have 13 switches that claim they are roots. You can imagine how this is causing issues. Oddly, the network runs well without any issues until a link to one of the aforementioned roots goes down. We end up in a nasty STP storm that requires manually killing interfaces for a couple of hours. I have pinpointed one switch that causes the most terror when a link drops.

Ignoring the rest of my topology (for now), I have Core 1 (Gi3/16) linked to Core 2 (Gi2/9) and I have my servers sitting on SrvSwitch6509. SrvSwitch6509 (Te1/2) connects to Core 1 (Te9/3) and SrvSwitch6509 (Te1/4) connects to Core 2 (Te9/4). There are 3 server Vlans (let's say Vlans 10-13). When I do a show spanning-tree root port, I see Core 1 is the root on Vlan 10, Core 2 is root on Vlan 13, and SrvSwitch6509 is root on Vlan 13 as well. I have 6 other switches claiming to be root for Vlan 12. I haven't taken any measures to fix this because I don't want to cause a storm, and I would like to know if my plan is on the right track before disrupting a live network.

My plan is:

1) Manually shutdown all ports network wide that I would think should be in a blocking state

2) Configure all trunk ports on Core 1 with spanning-tree guard root

3) Configure all trunk ports on Core 2 with spanning-tree guard root except Gi2/9 (link to Core 1)

4) Configure all other switches trunks facing away from the Cores with root guard

5) Set Core 1 as spanning-tree vlan root primary for all Vlans

6) Set Core 2 as spanning-tree vlan root secondary for all Vlans

7) Once the network has converged/calmed down, begin enabling trunks one by one and ensure root guard is on the outermost trunks.

I apologize for not having a diagram drawn out, but I don't have access to Visio for the time being. I've been reading every webpage, forum, and book I can get my hands on in regards to STP over the last few weeks and still feel like I am missing some key points. Am I on the right track or did I miss some crucial steps? Also, what kind of behavior should I expect and/or look out for during this process?

Thank you for your time,

SrA Steven Denham, US Air Force

Goodfellow AFB, TX


Milan,

Sorry, I meant the opposite side of the trunk. Also, you are correct that I was thinking VTP pruning and vlan pruning were similar. I see the difference now. I am slightly confused by the usage of "packet" and "frame" in the same scope though. Is a VTP packet still layer 2 and they are calling it a packet for simplicity?

Would it be beneficial (or even wise) to remove vlans I want to "prune" from a switch via no spanning-tree vlan #-##? What I am getting at is if a switch doesn't need to know STP information about a vlan (because its users aren't assigned to it), should I still allow that switch to participate in the STP for that vlan? For example, 3453-168 and its sub-switches are only using vlans 9 (management), 14 (users), and 69 (trunking), so I would consider removing all other vlans from STP on that switch.

Regards,

SrA Steven Denham

Hi Steve,

Yes, IMHO, a VTP packet is still layer 2 and they are calling it a packet for simplicity.

ad "Would it be beneficial (or even wise) to remove vlans I want to "prune" from a switch via no spanning-tree vlan #-##?"

Be very careful here!!

no spanning-tree vlan x

would disable STP in VLANx while keeping VLANx on the trunks, which is quite dangerous and you should NEVER do that!

See http://www.cisco.com/en/US/customer/tech/tk389/tk621/technologies_tech_note09186a00800951ac.shtml#keep_stp

If you want to use only a limited number of VLANs on your edge switches, you could disable the unnecessary VLANs on trunks using the

switchport trunk allowed vlan x,y,z,...

command.

Or you could think about removing the switch from the VTP domain and configuring the small number of VLANs locally on the switch.

But the best practice is "not disabling STP unless it's absolutely necessary!".

BR,

Milan

Milan,

I see. I've been using switchport trunk allowed vlan (range) as my pruning method. I guess I was on the right track after all. I will definitely read over this thoroughly when I'm stable. I saw a quick excerpt in there about L3 switching which looks intriguing. I thank you for all your advice thus far. It appears that as I delve deeper into this mystery, I am finding more and more problems. I suppose that is to be expected when you go over your network with a fine-tooth comb. Time for some much needed rest.

One last thing, on my links from Core_1 and Core_2, why do I get a network outage when I apply root guard to both links towards 501-26A (our server switch)? If I remove guard loop on the server side, everything should come up fine, right?

Regards,

SrA Steven Denham

Hi Steve,

ad the root guard)

If you read the feature description carefully (see http://www.cisco.com/application/pdf/paws/10588/74.pdf ), the behaviour is:

"Root guard does not allow the port to become an STP root port, so the port is always STP-designated. If a better BPDU arrives on this port, root guard does not take the BPDU into account and elect a new STP root. Instead, root guard puts the port into the root-inconsistent STP state."

"This root-inconsistent state is effectively equal to a listening state. No traffic is forwarded across this port. In this way, the root guard enforces the position of the root bridge."

So in your case:

If you apply root guard on both links from Core_1 and Core_2 towards 501-26A, what happens?

Based on the root position, either Core_1 or Core_2 receives a superior (better) BPDU on its port and puts the port into the root-inconsistent STP state.

In the worst case, when 501-26A is the root, both Core_1 and Core_2 put their ports into the root-inconsistent STP state and 501-26A is not reachable at all :-(
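The worst case described above can be sketched in a few lines of Python. This is only an illustrative model of the quoted root guard rule, not Cisco's implementation; the `BridgeId` class, the function name, and all priorities and MAC addresses are made up for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True, order=True)
class BridgeId:
    """STP bridge ID; a lower (priority, mac) pair is the better one."""
    priority: int
    mac: str  # same-format lower-case hex, so string order matches numeric order


def root_guard_port_state(best_known_root, received_root):
    """Model of a root-guard port reacting to a received BPDU.

    If the BPDU advertises a root better than the one the switch already
    knows, the port goes root-inconsistent and forwards no traffic.
    """
    if received_root < best_known_root:
        return "root-inconsistent"
    return "designated-forwarding"


# Hypothetical values: the server switch happens to hold the best bridge ID.
server_switch = BridgeId(4096, "0000.0000.00aa")
cores = {
    "Core_1": BridgeId(8192, "0000.0000.0001"),
    "Core_2": BridgeId(12288, "0000.0000.0002"),
}

states = {name: root_guard_port_state(own_id, server_switch)
          for name, own_id in cores.items()}
print(states)  # {'Core_1': 'root-inconsistent', 'Core_2': 'root-inconsistent'}
```

With both core-facing ports root-inconsistent, nothing can reach the server switch, which matches the outage seen when root guard was applied on both links.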

IMHO, you should not need this feature at all within a LAN you are administering yourself.

You should always be able to configure the root where you want it by decreasing the STP priority of the chosen switch.

BR,

Milan

No disrespect intended for the Airman posting this question, but given the horrific shape of this network, I have a feeling it has been designed and managed by military personnel for the last several years. Military training and the technical acumen of military personnel are both woefully inadequate, as my many years in the military taught me. Were it not for the civilian contractors, the military would be in serious trouble.

ex-engineer,

I like to think I am a cut above when it comes to the technical side of my job, but I do agree with you.

Milan and others,

I managed to find the problem and solve it. It was many poorly designed things all conflicting with each other and causing the problem. First problem was revealed when one of the Vlans stopped passing traffic. The line and protocol showed up, but no traffic would traverse the vlan. After digging through the switches having the issue, I learned that said vlan would show nothing when I did a show spanning tree root. It had the correct interface and ID listed for the other vlans. This made it extremely difficult to track down the offending devices. When I looked at the priority of the root bridge on that switch, it showed a value of 1 with a bridge ID of 0001.0000.0000.0000. So, I looked at the ITN it was connected to and it also showed the same thing. I then kept tracing back until I found a switch with the proper information.

I began troubleshooting by enabling root guard on the ITN's outbound interface towards the offending switches. Once I did that, the ITN received the proper STP information. I finally traced it all the way down to a server that was set up on a trunk port with no native vlan specified. I compared the MAC to the Core's ARP table, figured out what IP it had and placed the server on a native trunk for that vlan. Once I did that, everything started working.

This got me thinking and I looked at the STP config on each core. The original designers had put 10 vlans with a priority of 8192 on Core 1 and 24 with a priority of 4096. Core 2's config was reversed. Once I took a good look at it, I noticed there were quite a few vlans missing from the equation (10 to be exact). They also set up each vlan with a standby configuration for HSRP. They mirrored each other but their priorities were backward compared to which switch was root (guess they thought lower number was active standby).

After correcting these mistakes, things began to run MUCH smoother. I set Core 1 with a priority of 4096 for all vlans and Core 2 with 8192. After looking through the rest of the switches and removing oddball spanning-tree vlan (range) priority 8192/4096 on edge switches, I was getting closer to having a completely functional STP network. I then found the bombshell. Someone had applied 'no spanning-tree vlan 1-1000' on a switch. I didn't expect this to cause problems, but once I fixed it, the whole network took a 3 second breath, and traffic began to flow. I was rather impressed with how fast it converged.
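The two selection rules mixed up in the original design can be contrasted with a small Python sketch: STP elects the lowest bridge ID as root, while HSRP makes the highest priority the active router. The MAC addresses, the edge switch entry, and the HSRP priority values are hypothetical, and the HSRP function assumes preemption is enabled and ignores tie-breaking.

```python
def stp_root(bridges):
    """Return the name of the bridge with the lowest (priority, MAC) pair."""
    return min(bridges, key=lambda name: bridges[name])


def hsrp_active(routers):
    """Return the router with the highest HSRP priority (preempt assumed)."""
    return max(routers, key=lambda name: routers[name])


bridges = {
    "Core_1": (4096, "0000.0000.0001"),   # priorities as set in the post above
    "Core_2": (8192, "0000.0000.0002"),
    "edge":   (32768, "0000.0000.0003"),  # 32768 is the STP default priority
}

print(stp_root(bridges))  # Core_1

# If Core_1 disappears, the next-lowest bridge ID takes over as root.
without_core_1 = {name: bid for name, bid in bridges.items() if name != "Core_1"}
print(stp_root(without_core_1))  # Core_2

# For HSRP the comparison is reversed: the higher number wins.
print(hsrp_active({"Core_1": 110, "Core_2": 100}))  # Core_1
```

Keeping the STP root and the HSRP active router on the same core for a given vlan means giving that core the lower STP priority and the higher HSRP priority.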

Now I am down to one last switch that isn't working properly, but isn't affecting the rest of the network. It shows a blank root for a vlan with no bridge ID or priority. This switch was showing an access-port as the root bridge (a Windows Server 2003 box is plugged into it) at one point. It was showing this with portfast, bpdufilter, and bpduguard enabled on the port. I really have no clue why it is doing that. I set the switch it is connected to with root guard on the outbound interface, but said switch still ignores all STP information from Core 1 on that vlan. It shows Core 1 as the root for all other vlans.

I even reduced the allowed vlans to a minimal configuration and still shows blank. When I apply 'debug spanning-tree all' to the switch and monitor it, not a single debug message pops up (even when I shut/no shut all interfaces). Why would this happen? A reboot doesn't resolve it either.

I may have to open a TAC case for this switch, I just hope it is on our Cisco contract...

In any event, I wanted to give everyone a very big thank you. This experience has been very educational.

Regards,

SrA Steven Denham

GFWAFBCCO
Level 1

So I am still having issues with one switch. I was unable to open a TAC case.

The switch in question is a 3750 with IOS 12.2(55) ipbasek9-mz. I have removed all but vlans 9, 17, 22, and 69. The switch claims that vlan 22 has a bridge ID of a MAC that is not on the network at all. Before some tinkering, I saw that port Fa1/0/22 was the designated port with a priority of 1. That port is an access port on vlan 22. The MAC of the device connected to it does not match the MAC it is saying is the root bridge (not even the same OUI). I have the port configured as such:

switchport mode access
switchport access vlan 22
spanning-tree portfast
spanning-tree bpdufilter enable
spanning-tree bpduguard enable
spanning-tree guard root

With or without guard root, it points to that unknown MAC, and devices on the switch above it are also trying to use it as a root bridge. I confirmed that the device plugged into that port is a valid device. I can't go into detail, but I do know it isn't compromised. If I remove the device from the network, the problem persists. I am at a loss as to how to troubleshoot this further. Any recommendations are welcome.

Thank you,

SrA Steven Denham

Hi Steve,

a)   "Fa1/0/22 was the designated port"

you mean the root port?

b) "MAC of the device connected to it does not match the MAC it is saying is the root bridge"

Don't forget the MAC addresses used as part of the STP bridge ID are NOT always the physical MAC addresses of the device NICs (compare your Cisco switch interface MAC with the Bridge ID MAC).

c)  the combination of

spanning-tree portfast
spanning-tree bpdufilter enable
spanning-tree bpduguard enable
spanning-tree guard root

on one port makes no sense:

Bpdufilter makes the port ignore STP BPDUs, Bpduguard disables the port when any BPDU is received, while rootguard puts the port into the STP-inconsistent (=Listen) state when a BPDU better than the current root's is received on the port.
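A rough Python model of why stacking these features on one port is contradictory. The function name and return strings are invented for illustration, and the evaluation order (interface-level bpdufilter first, then bpduguard, then root guard) is an assumption about Catalyst behaviour that should be checked against the configuration guide for the platform in question.

```python
def port_reaction(bpdufilter, bpduguard, rootguard, bpdu_is_superior):
    """Model what a port does with one received BPDU.

    Assumed order: bpdufilter discards the BPDU before anything else can
    look at it, bpduguard reacts to any BPDU, and root guard reacts only
    to a BPDU that is better than the current root's.
    """
    if bpdufilter:
        return "BPDU ignored, port keeps forwarding"
    if bpduguard:
        return "port err-disabled"
    if rootguard and bpdu_is_superior:
        return "port root-inconsistent"
    return "BPDU processed normally"


# The configuration from the post: every feature enabled on one port.
print(port_reaction(True, True, True, bpdu_is_superior=True))
# BPDU ignored, port keeps forwarding

# With the filter removed, bpduguard fires before root guard is consulted.
print(port_reaction(False, True, True, bpdu_is_superior=True))
# port err-disabled
```

Under that assumption, the filter hides the BPDUs from both guards, so neither can ever shut or block the port.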

So I'd disconnect the switch from all the connections to the other switches (might be also access ports, not only trunks), then remove all the advanced features from the suspicious port and watch what BPDU comes. I believe there might be something like a Windows server with bridging enabled possibly?

HTH,

Milan

Milan,

"you mean the root port?"

You are correct, I did mean root port (RP, DP... all looks the same at 3 am).

"Don't forget the MAC addresses used as part of the STP bridge ID are NOT always the physical MAC addresses of the device NICs (compare your Cisco switch interface MAC with the Bridge ID MAC)"
I was aware of this and still cannot find a reference to this MAC/Bridge ID. It doesn't match anything on the network.

"Bpdufilter makes the port ignore STP BPDUs, Bpduguard disables the port when any BPDU is received, while rootguard puts the port into the STP-inconsistent (=Listen) state when a BPDU better than the current root's is received on the port.

So I'd disconnect the switch from all the connections to the other switches (might be also access ports, not only trunks), then remove all the advanced features from the suspicious port and watch what BPDU comes. I believe there might be something like a Windows server with bridging enabled possibly?"
I was trying to get the switch to err-disable, block, or otherwise do SOMETHING other than forward vlan 22 to somewhere else.

I took your advice and tweaked it a bit. I am not able to kill connectivity to this switch, so I did something a bit less intrusive. While looking at debug logs with a condition of interface Gi1/0/22 (it wasn't a Fa port, sorry for the confusion), I noticed it saying "STP CFG: found port cfg Gi1/0/1 (hex number)". Then it continues to Rx and process the STP information. I looked at Gi1/0/1, which is also an access port on vlan 22. Odd, I thought. So I shut down the interface. Still no change on the root bridge. So I decided to remove that vlan from the switch via the trunk allowed command. Then I waited for the STP debug to finish telling me it deleted the vlan. Once that completed, I added vlan 22 back to the trunk. Voilà! The root bridge is now correct. Once I enabled Gi1/0/1, I saw the STP debug go nuts, but it hasn't overridden the bridge ID yet. Here is some of the output (MAC masked for security):

013172: Dec 21 18:25:09.717: STP SW: RX ISR: 0180.c200.0000<-aaaa.aaaa.aaaa type/len 0026
013173: Dec 21 18:25:09.717:     encap SAP linktype ieee-st vlan 22 len 60 on v22 Gi1/0/1
013174: Dec 21 18:25:09.717:     42 42 03 SPAN
013175: Dec 21 18:25:09.717:     CFG P:0000 V:00 T:00 F:00 R:0001 aaaa.aaaa.aaaa 00000000
013176: Dec 21 18:25:09.717:     B:0001 aaaa.aaaa.aaaa 80.01 A:0000 M:1400 H:0200 F:0100
013177: Dec 21 18:25:09.717: STP SW: PROC RX: 0180.c200.0000<-aaaa.aaaa.aaaa type/len 0026
013178: Dec 21 18:25:09.717:     encap SAP linktype ieee-st vlan 22 len 60 on v22 Gi1/0/1
013179: Dec 21 18:25:09.717:     42 42 03 SPAN
013180: Dec 21 18:25:09.717:     CFG P:0000 V:00 T:00 F:00 R:0001 aaaa.aaaa.aaaa 00000000
013181: Dec 21 18:25:09.717:     B:0001 aaaa.aaaa.aaaa 80.01 A:0000 M:1400 H:0200 F:0100
013182: Dec 21 18:25:09.717: STP CFG: found port cfg GigabitEthernet1/0/1 (56FF4B0)

If I understand this correctly, it means it detected another STP bridge for vlan 22 with a priority of 0001 and an ID of aaaa.aaaa.aaaa on Gi1/0/1. It doesn't show that it dropped, blocked, or disabled anything though. Is this going to creep up again or otherwise cause issues?
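To double-check that reading, the CFG and B lines of the debug output can be decoded with a short Python sketch. The field meanings are an assumption based on the order of fields in an 802.1D configuration BPDU (protocol ID, version, type, flags, root ID, root path cost, bridge ID, port ID, then four timers in units of 1/256 second); the debug format itself is not documented here, and the function and key names are invented.

```python
import re

HEX4 = r"([0-9A-Fa-f]{4})"
CFG_RE = re.compile(
    r"CFG P:" + HEX4 + r" V:([0-9A-Fa-f]{2}) T:([0-9A-Fa-f]{2}) "
    r"F:([0-9A-Fa-f]{2}) R:" + HEX4 + r" (\S+) ([0-9A-Fa-f]{8})"
)
BRIDGE_RE = re.compile(
    r"B:" + HEX4 + r" (\S+) (\S+) A:" + HEX4 + " M:" + HEX4
    + " H:" + HEX4 + " F:" + HEX4
)


def seconds(hex_field):
    """BPDU timers are carried in units of 1/256 second."""
    return int(hex_field, 16) / 256


def decode_bpdu(cfg_line, bridge_line):
    """Decode one pair of debug lines into a dict (assumed field layout)."""
    cfg = CFG_RE.search(cfg_line)
    bridge = BRIDGE_RE.search(bridge_line)
    if cfg is None or bridge is None:
        raise ValueError("line does not match the assumed debug format")
    decoded = {
        "root_priority": int(cfg.group(5), 16),
        "root_mac": cfg.group(6),
        "root_path_cost": int(cfg.group(7), 16),
        "bridge_priority": int(bridge.group(1), 16),
        "bridge_mac": bridge.group(2),
        "port_id": bridge.group(3),
        "message_age_s": seconds(bridge.group(4)),
        "max_age_s": seconds(bridge.group(5)),
        "hello_s": seconds(bridge.group(6)),
        "forward_delay_s": seconds(bridge.group(7)),
    }
    # A sender that names itself as root at cost 0 is claiming to be the root.
    decoded["sender_claims_root"] = (
        decoded["root_path_cost"] == 0
        and decoded["root_priority"] == decoded["bridge_priority"]
        and decoded["root_mac"] == decoded["bridge_mac"]
    )
    return decoded


cfg_line = "CFG P:0000 V:00 T:00 F:00 R:0001 aaaa.aaaa.aaaa 00000000"
bridge_line = "B:0001 aaaa.aaaa.aaaa 80.01 A:0000 M:1400 H:0200 F:0100"
bpdu = decode_bpdu(cfg_line, bridge_line)
print(bpdu["root_priority"], bpdu["sender_claims_root"])  # 1 True
print(bpdu["max_age_s"], bpdu["hello_s"])  # 20.0 2.0
```

Under these assumptions the BPDU comes from a device that announces itself as root with a priority field of 0x0001, which agrees with the interpretation above.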

Man it feels good to sort this out after 4 weeks non-stop. WHEW!

Regards,

SrA Steven Denham

Hi Steve,

I'd try issuing "show spanning-tree interface Gi1/0/1 detail" to check whether the switch really received a BPDU with a priority of 0001 in VLAN 22.

If yes, I'd keep investigating the device which is connected to that port. An STP priority of 1 is quite suspicious; it gives a 99% chance of becoming the root.
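A short worked example of why a priority field of 0x0001 is so hard to beat. It assumes the switches run per-VLAN spanning tree with the extended system ID, where the 16-bit priority field is the configured priority (a multiple of 4096) plus the VLAN number; the function names are invented, and Core 1's value of 4096 is taken from earlier in the thread.

```python
def pvst_priority_field(configured_priority, vlan):
    """Priority field a switch advertises with the extended system ID."""
    if configured_priority % 4096 != 0 or not 0 <= configured_priority <= 61440:
        raise ValueError("priority must be a multiple of 4096 up to 61440")
    if not 1 <= vlan <= 4094:
        raise ValueError("vlan must be between 1 and 4094")
    return configured_priority + vlan


def split_priority_field(field):
    """Split the 16-bit field into (priority, system ID extension)."""
    return field & 0xF000, field & 0x0FFF


rogue_field = 0x0001
core_1_field = pvst_priority_field(4096, 22)

print(hex(core_1_field))                       # 0x1016
print(rogue_field < core_1_field)              # True
print(rogue_field < pvst_priority_field(0, 22))  # True
print(split_priority_field(rogue_field))       # (0, 1)
```

Under that assumption, even setting Core 1 to priority 0 for VLAN 22 would advertise 22, which still loses to 1, so lowering priorities alone cannot displace this device. The extension value of 1 rather than 22 is at least consistent with the sender not being a switch running per-VLAN STP in VLAN 22.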

From a paranoid point of view: It might be used by some hacker tool trying to become an STP root and capture a LAN communication.

BR,

Milan
