03-10-2023 07:52 AM
Has anyone seen this behavior before or can comment on the specific syslog messaging being produced?
The problem started out as noticing intermittent ping loss to one switch, an Industrial Ethernet 2000. Looking through logging / syslog output, I tracked the ping loss to an intermittent spanning tree "dispute" between two switches. It lasts for a few microseconds, resolves itself, then reoccurs approximately every two seconds. (Spanning tree "hello" BPDUs are every 2 seconds by default?) From what I observe, the IE2000, for some reason, decides it needs to try to become the root of the single VLAN in the network. (The actual root in the network is a pair of independent (not-stacked) Catalyst 3850 switches.) Also, there are many things that I'd like to change in this network - but I didn't design it; someone else did, and I'm just attempting to provide a root cause analysis to the customer.
As part of -trying- to become the root bridge, this IE2000 transitions its uplink port, Gi1/1 (previously its root port), into a designated port. Of course, the switch connected to this IE2000 doesn't like that (two ports on the same link advertising Designated) and marks its own port toward the IE2000 as 'disputed' due to the inconsistency. Likewise, the IE2000 notices the inconsistency itself and does the same.
Moments later, though, the IE2000 snaps back into reality (receives a superior BPDU from the other switch) and recognizes the lower priority of the actual root bridge in the network. It then falls in line and re-transitions its uplink port (Gi1/1) back to the root port role, which the other switch recognizes, and the two mutually end the dispute, so normal forwarding of traffic resumes. But about 2 seconds later, the IE2000 decides to start this charade all over again.
2023-03-09 17:15:27 Local7 Debug 192.168.80.1 10238: 010234: Mar 9 17:15:06.520 EDT: RSTP(1): transmitting an agreement on Gi1/1 as a response to a proposal
2023-03-09 17:15:27 Local7 Debug 192.168.80.34 37238: 037215: Mar 9 17:15:06.500 EDT: RSTP[1]: Gi1/2 dispute resolved
2023-03-09 17:15:27 Local7 Debug 192.168.80.34 37237: 037214: Mar 9 17:15:06.500 EDT: RSTP(1): received an agreement on Gi1/2
2023-03-09 17:15:27 Local7 Debug 192.168.80.34 37236: 037213: Mar 9 17:15:06.492 EDT: RSTP(1): transmitting a proposal on Gi1/2
2023-03-09 17:15:27 Local7 Debug 192.168.80.34 37235: 037212: Mar 9 17:15:06.484 EDT: RSTP(1): Gi1/2 Dispute!
2023-03-09 17:15:26 Local7 Debug 192.168.80.1 10237: 010233: Mar 9 17:15:06.512 EDT: STP[1]: Generating TC trap for port GigabitEthernet1/1
2023-03-09 17:15:26 Local7 Debug 192.168.80.1 10236: 010232: Mar 9 17:15:06.512 EDT: RSTP(1): synced Gi1/1
2023-03-09 17:15:26 Local7 Debug 192.168.80.1 10235: 010231: Mar 9 17:15:06.512 EDT: RSTP(1): Gi1/1 is now root port
2023-03-09 17:15:26 Local7 Debug 192.168.80.1 10234: 010230: Mar 9 17:15:06.512 EDT: RSTP(1): updt roles, received superior bpdu on Gi1/1
2023-03-09 17:15:26 Local7 Debug 192.168.80.1 10233: 010229: Mar 9 17:15:06.503 EDT: RSTP(1): Fa1/1 not in sync
2023-03-09 17:15:26 Local7 Debug 192.168.80.1 10232: 010228: Mar 9 17:15:06.503 EDT: RSTP(1): Gi1/1 is now designated
2023-03-09 17:15:26 Local7 Debug 192.168.80.1 10231: 010227: Mar 9 17:15:06.503 EDT: RSTP(1): we become the root bridge
2023-03-09 17:15:26 Local7 Debug 192.168.80.1 10230: 010226: Mar 9 17:15:06.503 EDT: RSTP(1): updt roles, information on root port Gi1/1 expired
2023-03-09 17:15:26 Local7 Debug 192.168.80.1 10229: 010225: Mar 9 17:15:06.503 EDT: RSTP(1): Gi1/1 rcvd info expired
I absolutely can provide more details on the network - but I just would like to get some initial feedback from the community. Ultimately, the customer here will very likely need to open a Cisco TAC support ticket; however, they do not have that service contract in place at the moment. I don't know what spanning tree is trying to tell me with "rcvd info expired", which seems to kick off the IE2000's problem.
Attached is a longer output of the 2 second dance that these two IE2000 switches keep doing.
03-10-2023 08:19 AM
Hi
You can tune the STP on this switch and change timers:
Search for "Configuring Optional STP Parameters"
You can increase the timers and reduce how often the problem occurs.
Ultimately, if this switch connects to the 3850 only and cannot cause any loop, you could consider disabling STP on it.
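For reference, the per-VLAN timer knobs that guide describes look roughly like this on IOS. This is a sketch only - VLAN 1 and the values shown are placeholders, and the timers should normally be set on the root bridge so they propagate:

```
! Sketch - VLAN number and values are placeholders, configure on the root bridge.
spanning-tree vlan 1 hello-time 2      ! range 1-10 seconds, default 2
spanning-tree vlan 1 max-age 40        ! range 6-40 seconds, default 20
spanning-tree vlan 1 forward-time 15   ! range 4-30 seconds, default 15
```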
03-10-2023 08:19 AM
Can I see the config on both sides?
03-10-2023 09:37 AM
Yes. These are the two switches in particular having the dispute. M1-NS1 is the switch which tries to become the root bridge of VLAN 1. M18-NS2 is the switch that notices the inconsistency and raises the dispute. I'll provide more details on the network itself after my lunch - just want to get this out for now. Thank you so very much.
03-10-2023 12:46 PM
The configuration is in the prior post. The biggest help anyone could provide me (anyone can answer, please): what the heck do "RSTP(1): Gi1/1 rcvd info expired" and "RSTP(1): updt roles, information on root port Gi1/1 expired" mean or possibly indicate?
It kind of sounds like some kind of 'timeout' limit was reached: maybe the switch thought it was isolated from the rest of the network and, as such, decided it should make its own VLAN 1 with itself as the root bridge. However, microseconds later it gets a superior BPDU, realizes the error it made, abdicates, falls in line, and updates the port role for its uplink port. This settles the "dispute" and normal traffic forwarding resumes - only for the same timer or timeout to be hit, which restarts the whole cycle all over again?
------
The physical layout has the IE2000 units connected in series by fiber optic cable to form 'rings'.
So, for example: c3850 #1 --> IE2k #1 --> IE2k #2 --> IE2k #3 --> IE2k #4 --> IE2k #5 --> IE2k #6 --> IE2k #7 --> c3850 #2 (and c3850 #1 and #2 have a fiber connecting each other on their Gigabit riser card) -- There are four (4) of these types of rings in the network.
But relative to the problem I described in the first post: when the above example ring has its full redundancy in place - no break in the "ring" - there is no spanning tree dispute problem as described. But break the "ring" - disconnect the fiber cable from IE2k #7 going to c3850 #2 - and all of a sudden you get the problem I described.
IE2k #6 will notice IE2k #7 going "rogue" and will block its port toward #7. A fraction of a second later, #7 will receive a superior BPDU, fall in line, and resume normal operation - only to start the 'dispute' again.
03-10-2023 01:38 PM
c3850 #1 --> IE2k #1 --> IE2k #2 --> IE2k #3 --> IE2k #4 --> IE2k #5 --> IE2k #6 --> IE2k #7 --> c3850 #2
All ports must be FWD except one port, which will be BLK - I think that is the 3850#2 - IE2k#7 link.
The root SW must be 3850#1.
1 - Now I see you configured portfast and bpduguard, and that is OK for any link connecting to a host, but it is not needed on links connecting to a SW.
Also, you already run bpduguard in interface mode, so there is no need to enable it in global mode.
2 - You configured loopguard on all SWs!! Loopguard needs to be enabled only on IE2k#7, since it is the only SW that has a BLK port.
3 - Last point: you must check SW by SW to see if all SWs run the same STP mode, R-PVST.
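To make the three points concrete, a hedged IOS sketch - the interface names are placeholders, not taken from the actual configs:

```
! Host-facing access port: portfast + bpduguard belong here.
interface FastEthernet1/1
 spanning-tree portfast
 spanning-tree bpduguard enable
!
! Switch-facing uplink that ends up Blocking (e.g. IE2k#7 toward 3850#2):
! loop guard protects the blocked port if BPDUs stop arriving one-way.
interface GigabitEthernet1/2
 spanning-tree guard loop
!
! To verify the STP mode on each switch:
! show spanning-tree summary    <<- look for "Switch is in rapid-pvst mode"
```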
thanks
MHM
03-13-2023 09:12 AM
Thank you for the input.
Regarding point 3, going SW by SW to check the config - I pulled a copy of all of the switches' configs, both the c3850s and all 36 IE2k. I used Notepad++ to compare each of the IE2k switch configs to M1-NS1 and M18-NS2. There was no difference in STP mode between any of the IE2k switches; all were configured for R-PVST. Both c3850s were R-PVST as well. (That said, I did find unrelated config inconsistencies in switch-to-switch comparisons between the IE2k, and I corrected those, so all 36 IE2k now have completely consistent programming.)
Regarding point 1: The IE2k communicate with each other using their Gig interfaces (Gi1/1 and Gi1/2), hence, yes - no portfast and no bpduguard on those interfaces. (Bpduguard is enabled implicitly (globally) as a safeguard - but explicitly on host interfaces as a visual reminder.)
Regarding point 2: SM fiber connects the IE2k to each other (Gi1/1 & Gi1/2) and back to the 'core' c3850 switches. Since fiber cable is in use between the IE2k inside a 'ring', aren't loopguard and UDLD a necessity to watch for unidirectional traffic (broken or damaged fiber)? Or do I misunderstand the utility of loopguard?
Also, secondary things I've tried: the fiber between M18-NS2 and M1-NS1 (IE2k #6 and IE2k #7 in the example, respectively) has been replaced, as have the SFP modules in both. In fact, just to be certain this couldn't be a hardware issue, the entire M1-NS1 (IE2k #7) was itself replaced with a new out-of-box replacement unit.
03-13-2023 03:59 PM
OK, can you share the output of
show udld <port>   <<- on the ports connecting to both 3850s
Then check the BPDUs sent/received on each port along the path:
show spanning-tree interface x/x detail
A designated port must send ONLY.
Root and BLK ports must receive ONLY.
03-20-2023 10:57 AM - edited 03-20-2023 12:15 PM
Sorry it's taken a while - been doing troubleshooting, testing, and working to build a "product" of sorts to deliver to Cisco TAC.
Regarding "show udld port <<- on the ports connecting to both 3850s" - actually, media converters are used before the connections to the c3850s. The fiber optic cable plugs into a fiber-to-RJ45 Ethernet media converter.
Gi1/1 and Gi1/2 on M1-NS1 show BPDUs being both sent and received, though, as do the corresponding interfaces on the core network switches and the switches in between on the "loop".
The additional testing that I've done has shown that if I remove just one IE2k from the longest loop, the problem disappears. That is, when the total number of IE2k in the loop equals 16, not 17, and the cable between M1-NS1 and the core NS1 (a c3850 switch) is disconnected, then M1-NS1 no longer tries to become the root for VLAN 1. Since it no longer attempts that, of course, there is no STP dispute anymore and M18-NS2 no longer has to block M1-NS1.
So, with regard to RSTP: is there a maximum diameter - a number of switches that can be chained together - that has been exceeded by the customer's network design?
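One way to reason about a diameter limit (my own sketch, not a confirmed diagnosis for this network): in 802.1D/802.1w, the root transmits BPDUs with Message Age 0, each bridge conventionally adds 1 per hop, and root information whose Message Age has reached Max Age (default 20) is no longer accepted - which would surface exactly as "rcvd info expired". A small Python illustration of that arithmetic:

```python
# Illustrative sketch (my own, not from Cisco docs): how BPDU Message Age
# interacts with Max Age as root information travels hop by hop.
# Assumption: each bridge increments Message Age by 1 (the usual convention).

DEFAULT_MAX_AGE = 20  # seconds; effectively also a hop-count budget

def remaining_lifetime(hops_from_root, max_age=DEFAULT_MAX_AGE):
    """Seconds of validity left in root info received hops_from_root away.

    The root sends with Message Age 0; a return of 0 means the info
    is already too old to accept on arrival ("rcvd info expired").
    """
    message_age = hops_from_root
    return max(0, max_age - message_age)

# Intact ring: M1-NS1 is 1 hop from its core switch.
# Broken ring: the only path runs the long way around, through the
# second core switch and ~17 IE2000s, so the hop count nears 20.
for hops in (1, 7, 16, 18, 20):
    print(hops, remaining_lifetime(hops))
```

Under that assumption, at ~18 hops the received root info is valid for only a second or two, so losing even a couple of consecutive BPDUs along the long path could expire it - which would also be consistent with a shorter hello interval helping. This is something to verify against the Max Age / Message Age fields in `show spanning-tree vlan 1`, not a confirmed root cause.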
I am attaching a sanitized diagram of the network for the customer.
03-20-2023 12:11 PM
Another observation: confusingly, this "dispute" is uni-directional. That is, it happens only when the fiber from M1-NS1 going to Core-NS1 is disconnected. If you disconnect the M12-NS1 to Core-NS2 link - which still leaves you with 17 IE2k in the loop - no dispute arises.
But, even still... I found that I can migrate the "dispute" from M1-NS1 to M12-NS1 if I swap the ends of the loop between the 'core' network switches. That is, plug the line from M12-NS1 into Core-NS1 and the line from M1-NS1 into Core-NS2, then disconnect the cable from Core-NS1 to M12-NS1. Then M12-NS1 will throw the same errors that M1-NS1 previously did.
03-21-2023 11:17 AM
... the issue is "semi"-ish resolved? I adjusted the hello timer for RSTP in the network, and while the "dispute" between M1-NS1 and M18-NS2 still happens, it seemingly happens only once and then resolves itself. That is, it is no longer a perpetually repeating problem, and the network is stable with the M1-NS1 to Core-NS1 connection broken.
I don't like this solution, because I ultimately do not know why the problem was happening in the first place, or why more frequent hellos / BPDUs stabilize the RSTP dispute after its first occurrence. Moreover, why doesn't it happen in the reverse direction - or, more specifically, why does it happen only when the connection to the root switch (Core-NS1) is broken? From further testing, I found that if I swap the priorities of Core-NS1 and Core-NS2, making Core-NS2 primary and Core-NS1 secondary, then the spanning tree dispute happens only when M12-NS1 is disconnected from Core-NS2.
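For the record, the change described above amounts to something like the following - a sketch only, since the VLAN number and per-VLAN scope on the root bridge are my assumptions, not taken from the actual configs:

```
! Hypothetical - exact scope/VLAN assumed; configured on the root bridge
! so the shorter hello interval propagates in its BPDUs.
spanning-tree vlan 1 hello-time 1
```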
03-21-2023 01:17 PM - edited 03-21-2023 01:18 PM
Best I can figure, without hearing from an expert at Cisco TAC, is that either my client has reached some Rapid Spanning Tree limit - where beyond 16 switches the performance of RSTP becomes unstable - or I've found a bug in IOS operation, where adjusting the hello timer to 1 second moderates the unstable behavior.
Google results for spanning tree diameter seem kind of conflicting - some saying 7 connected switches is a hard limit, others just a rule of thumb - but they all reference regular (slow) Spanning Tree. I'd have to think that in the 20-30 years since STP/RSTP's introduction there would be enhancements or improvements that would increase the usability beyond 7 switches, but I'm not finding much supportive discussion or documentation (outside of legitimate advice to improve the architectural design, and complementary programming, of the network).
03-21-2023 01:49 PM
I will take a look at the topology you shared, with the new info you provided.
I will update you soon about the points that must be checked.