
Trunk ports dropping every 60 seconds

Mark Schlosser
Level 1

Title correction - Trunk ports dropping every 60 minutes

I had three older 2960 switches at a facility (SW01, SW02 & SW03). On June 29th I added two 2960S switches (SW04 & SW05). They are currently connected with simple, single Gb copper trunk links (no EtherChannel): SW05<->SW04<->SW02<->SW03<->SW01.

Every 60 minutes, switches SW05, SW04 & SW02 go through a Spanning Tree gyration for about a minute, breaking all facility communications. They have done this since they were installed 12 days ago.

SW05 has STP priority 4096 and is the root (confirmed), and RSTP is used on all switches. There are only two VLANs - 1 & 3.
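For anyone wanting to verify the same things on their own switches, a few standard show commands confirm the STP mode, the elected root bridge, and the per-VLAN port roles (a generic sketch, not a capture from these switches):

show spanning-tree summary
show spanning-tree root
show spanning-tree vlan 1
show spanning-tree vlan 3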

I have run

- debug spanning-tree events

- debug spanning-tree root

- debug spanning-tree switch state

- debug pm vp

on all 5 switches. I will have no hits in the logs until about 10 minutes past the hour (the start time is shifting roughly 5 seconds later with each iteration), and then the logs explode for two minutes. Time on all switches is synchronized via NTP against the same Stratum 2 server.
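For reference, the logging/debug setup on each switch was roughly as follows (reconstructed here, not pasted from the configs). In config mode, millisecond timestamps so events can be correlated across switches:

service timestamps debug datetime msec localtime
service timestamps log datetime msec localtime

Then from exec mode, the debugs listed above plus an NTP sanity check:

debug spanning-tree events
debug spanning-tree root
debug spanning-tree switch state
debug pm vp
show ntp status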

In SW05, SW04 & SW02 the first or second lines in the logs are "STP:VLAN0001 we are the spanning tree root" and "STP:VLAN0003 we are the spanning tree root" (kinda like "we are the Borg"), letting me know that STP is now running through its steps, thinking there has been some sort of reconfiguration.

I added "debug pm all" to SW04 and cleared all my logs. Here is the output of this on SW04.

Jul 10 16:10:56.696 pdt: pm_extern_port_link_event: link-down on Gi1/0/28
Jul 10 16:10:56.696 pdt:     pm_port 1/28: during state trunk, got event 5(link_down)
Jul 10 16:10:56.696 pdt: @@@ pm_port 1/28: trunk -> pagp
Jul 10 16:10:56.696 pdt: port_exit_trunk_state: Gi1/0/28
Jul 10 16:10:56.696 pdt: port_enter_pagp_state: Gi1/0/28 operSync OFF
Jul 10 16:10:56.696 pdt: port_remove_trunk_vlans: Gi1/0/28
Jul 10 16:10:56.696 pdt: pm_vlan_rem_port: vlan 1, port 28
Jul 10 16:10:56.702 pdt: pm_vlan_rem_port: vlan 1 is a SVI and it is up
Jul 10 16:10:56.702 pdt:     pm_vp 1/28(1): during state forwarding, got event 7(trunk_remove)
Jul 10 16:10:56.702 pdt: @@@ pm_vp 1/28(1): forwarding -> notforwarding
Jul 10 16:10:56.702 pdt: vp_trunk_notfwd_action: Gi1/0/28(1)
Jul 10 16:10:56.702 pdt:     pm_vlan 1: during state vlan_forwarding, got event 5(port_notfwd)
Jul 10 16:10:56.702 pdt: @@@ pm_vlan 1: vlan_forwarding -> vlan_forwarding
Jul 10 16:10:56.702 pdt:     port_notfwd_action (7 ports, 7 up, 7 fwd, 5 acc)
Jul 10 16:10:56.702 pdt: @@@ pm_vp 1/28(1): notforwarding -> present
Jul 10 16:10:56.702 pdt: vp_trunk_linkdown_action: Gi1/0/28(1)
Jul 10 16:10:56.702 pdt:     pm_vlan 1: during state vlan_forwarding, got event 3(port_linkdown)
Jul 10 16:10:56.702 pdt: @@@ pm_vlan 1: vlan_forwarding -> vlan_forwarding
Jul 10 16:10:56.702 pdt:     port_linkdown_action (7 ports, 6 up, 7 fwd, 5 acc)
Jul 10 16:10:56.702 pdt: @@@ pm_vp 1/28(1): present -> not_present
Jul 10 16:10:56.702 pdt: vp_trunk_remove_action: Gi1/0/28(1)
Jul 10 16:10:56.702 pdt:     pm_vlan 1: during state vlan_forwarding, got event 1(port_remove)
Jul 10 16:10:56.702 pdt: @@@ pm_vlan 1: vlan_forwarding -> vlan_forwarding
Jul 10 16:10:56.702 pdt:     port_remove_action (6 ports, 6 up, 7 fwd, 5 acc)
Jul 10 16:10:56.702 pdt: pm_vlan_rem_port: vlan 2, port 28
Jul 10 16:10:56.702 pdt: pm_vlan_rem_port: vlan 3, port 28
Jul 10 16:10:56.702 pdt:     pm_vp 1/28(3): during state forwarding, got event 7(trunk_remove)
Jul 10 16:10:56.702 pdt: @@@ pm_vp 1/28(3): forwarding -> notforwarding
Jul 10 16:10:56.702 pdt: vp_trunk_notfwd_action: Gi1/0/28(3)
Jul 10 16:10:56.702 pdt:     pm_vlan 3: during state vlan_forwarding, got event 5(port_notfwd)
Jul 10 16:10:56.702 pdt: @@@ pm_vlan 3: vlan_forwarding -> vlan_forwarding
Jul 10 16:10:56.702 pdt:     port_notfwd_action (3 ports, 3 up, 2 fwd, 0 acc)
Jul 10 16:10:56.702 pdt: @@@ pm_vp 1/28(3): notforwarding -> present
Jul 10 16:10:56.702 pdt: vp_trunk_linkdown_action: Gi1/0/28(3)
Jul 10 16:10:56.702 pdt:     pm_vlan 3: during state vlan_forwarding, got event 3(port_linkdown)
Jul 10 16:10:56.702 pdt: @@@ pm_vlan 3: vlan_forwarding -> vlan_forwarding
Jul 10 16:10:56.702 pdt:     port_linkdown_action (3 ports, 2 up, 2 fwd, 0 acc)
Jul 10 16:10:56.702 pdt: @@@ pm_vp 1/28(3): present -> not_present
Jul 10 16:10:56.702 pdt: vp_trunk_remove_action: Gi1/0/28(3)
Jul 10 16:10:56.702 pdt:     pm_vlan 3: during state vlan_forwarding, got event 1(port_remove)
Jul 10 16:10:56.702 pdt: @@@ pm_vlan 3: vlan_forwarding -> vlan_forwarding
Jul 10 16:10:56.702 pdt:     port_remove_action (2 ports, 2 up, 2 fwd, 0 acc)

 

In every instance SW04 leads the event logs by a few milliseconds with its action against Gi1/0/26 (trunk to SW05) and Gi1/0/28 (trunk to SW02). This makes me believe SW04 is the culprit and SW05 & SW02 are simply reacting to the downed ports. The config for these ports is pretty vanilla -

interface GigabitEthernet1/0/26
 description **** xxxxxxxxxxx ****
 switchport mode trunk

interface GigabitEthernet1/0/28
 description **** xxxxxxxxxxx ****
 switchport mode trunk
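For completeness, an even more locked-down version of these trunk ports (DTP disabled, allowed VLANs pruned) would look something like the sketch below - this is not what is currently configured, just noted to show how little there is left to strip down:

interface GigabitEthernet1/0/28
 switchport mode trunk
 ! disable DTP so the link never tries to renegotiate
 switchport nonegotiate
 ! carry only the two VLANs actually in use
 switchport trunk allowed vlan 1,3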

 

What can cause a 2960S (running 15.2(2a)E1) to down its trunk ports every 60 minutes?

 

7 Replies

Peter Paluch
Cisco Employee

Mark,

Is it possible that, for whatever reason, someone has changed the STP timers from their default values, that is, Hello=2s, ForwardDelay=15s, MaxAge=20s, perhaps as a result of running the spanning-tree root primary macro with the diameter argument? If these timers were aggressively tuned down, there is a possibility that BPDUs are being aged out prematurely.

Please check for absolutely any STP-related commands on all your switches. Ideally, there should be absolutely nothing except:

spanning-tree mode rapid-pvst
spanning-tree vlan 1,3 priority ...
spanning-tree portfast default

I would be interested in knowing if there is anything STP-related configured on your switches beyond (at most) these three commands, no matter how benign or insignificant it may seem.
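A quick way to collect that from each switch (just a sketch) is to grep the running configuration for anything spanning-tree related and to check the effective per-VLAN timers:

show running-config | include spanning-tree
show spanning-tree vlan 1
show spanning-tree vlan 3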

Best regards,
Peter

Peter,

First off, thank you for taking the time to reply with your thoughts.

I, too, initially suspected spanning tree timers & priority values. I verified all switches were using Rapid PVST+, and I used the macro "spanning-tree vlan 1-3 root primary" on SW05 and "spanning-tree vlan 1-3 root secondary" on SW04. After I continued to see occurrences of this issue, I manually set the spanning-tree priority of SW05 to 4096, left SW04 at the macro-adjusted value of 28672, and set the other three switches' priority to 40960. Things didn't change.

"show spanning-tree" on all switches reflects a proper choice of the root bridge, with a priority of 4097 for VLAN 1 and 4099 for VLAN 3 (the configured 4096 plus the VLAN ID, per the extended system ID) and the BIA of SW05. All timers are the standard defaults - 2s Hello, 15s Forward Delay & 20s Max Age.

Suspecting that the issue wasn't spanning tree itself, but rather spanning tree reacting to something strange happening within SW04 via the port manager, I went ahead and rolled SW05 & SW04 back from IOS 15.2(2a)E1 (the bleeding edge for the 2960S) to the Cisco-recommended 15.0(2)SE8 last night at 9:00 PM. I checked the logs just prior to reboot to confirm this 60-minute issue had continued up to that point on SW05, SW04 & SW02. I am pleased to say that I have not seen one peep of this issue since, and the switches have been up for 12 hours now.
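In case it helps anyone else hitting this, a downgrade like that can be done with archive download-sw; the TFTP server address and image name below are placeholders, not the ones actually used:

archive download-sw /overwrite /reload tftp://192.0.2.10/c2960s-universalk9-tar.150-2.SE8.tar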

Both SW04 & SW05 had been rebooted prior to this, and the reboot had no effect on the occurrence of this issue.

The only anomaly I can point to is a self-inflicted issue with SW04 during the build-out. SW04 was configured by a new, green junior tech I recently hired. I asked him to get IOS 15.2(2) on the switch and he proceeded to delete the file system. This forced us to use the ROMMON/SWITCH> mode to get a .bin file back into the root of flash:, and from there we were able to boot the switch and use the archive download-sw command to get 15.2(2) on it. However, this left the switch with two oddities (a possible cleanup is sketched after the list):

  1. There is no “credential.lic” file
  2. The unit had c2960s-universalk9-mz.152-2a.E1.bin in the root of the flash: as well as having the c2960s-universalk9-mz.152-2.E1 folder created by the extraction of the tar file.
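For anyone doing a similar recovery, tidying up that second item would be something along these lines (a sketch only - check what the boot statement points at before deleting anything; the file name is the one noted above):

show boot
dir flash:
delete /force flash:c2960s-universalk9-mz.152-2a.E1.bin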

I am uncertain whether either of these two had any impact on the 60-minute cycle these switches were experiencing, but I want to lay every oddity out there in case there is some unpublished bug that I ran into.

Hi Mark,

Thank you for explaining the circumstances. At this point, I am afraid, we can both only guess at the primary root cause of your problem. This problem does, however, appear to be related to the IOS version 15.2(2) you tried to use; indeed, it could be a bug. I do not believe that the "credentials.lic" file or the erased directory in your flash could have had anything to do with it.

Let's see if the links are stable.

Best regards,
Peter

Peter,

Once again - thanks for your time. We have been up for over 18 hours and not one instance of the trunk links going down.

On an interesting side note, the output of the "debug pm vp" command generated several lines that said "during state forwarding, got event 7(trunk_remove)". If you google this (in quotes) you will find it nowhere but this post. How is it that I ran into a Cisco debug message that no one has ever posted on a blog or discussion forum before?

mTarawneh123
Level 1

Hi Mark,

 

It is a known bug on this platform. As I remember, a fix is not available yet, and the only solution is to downgrade the software. You can open a TAC case for further information.

 

hope that helps.

 

regards,

Hello,

This is interesting - can you perhaps point us to other sources that describe a similar symptom?

Best regards,
Peter

mTarawneh123,

Can you elaborate a little on where you got this information? Neither of these units is under SmartNet, so I didn't have TAC to turn to, and I lost a lot of time pinning this down. Some of the debug outputs produce messages that show up nowhere else but this post, like "during state forwarding, got event 7(trunk_remove)" from debug pm vp.
