Strange problem of 2960X switch stacks not passing traffic

nnraymond · ‎09-30-2020

Our school district has four grammar schools with very similar setups that have been working well for years, but for the last month one of the schools has been experiencing issues. The setup at each location is a WS-C3850-48P as the core switch in the server room, plus between one and three data closets with WS-C2960X-48FPD-L switch stacks in them. Each closet is connected to the core switch at its site by a single 10Gig fiber trunk. There is just one VLAN per building (a 255.255.240.0 subnet). Each school has about 70 Meraki access points. A small number of legacy HP ProCurve 2650 switches in some closets are connected to the C2960X stacks.
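
For a sense of scale, here is a minimal Python sketch of what one VLAN with that mask means per building. It assumes the mask is 255.255.240.0 (a /20), and the 10.10.0.0 network address is made up purely for illustration, since the real addressing is not given in this thread.

```python
import ipaddress

# Hypothetical network address; only the 255.255.240.0 mask is assumed from the post.
net = ipaddress.ip_network("10.10.0.0/255.255.240.0")

prefix_len = net.prefixlen            # prefix length implied by the mask
usable_hosts = net.num_addresses - 2  # minus the network and broadcast addresses

print(f"/{prefix_len}: {usable_hosts} usable host addresses in one broadcast domain")
```

Every broadcast and unknown-unicast frame in a building reaches every closet in that one domain, which is worth keeping in mind when reading the MAC flapping reports.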

For the last month at one of the schools, on weekdays, one of the two closets with the WS-C2960X-48FPD-L switch stacks will stop passing traffic, either in the morning around 9am or around midday. Never on weekends, never at night, and not necessarily every day. When a switch closet stops passing traffic I cannot reach any host in that closet (including the switch management interface on the stack) from anywhere else on the network, and those hosts can't reach the rest of the network. If I connect to the 2960X console on the affected closet, everything looks good. The only error message I found was some MAC flapping reported on the 3850 core switch around the time a closet went down. When I saw that, I upgraded the firmware on the core and closets, and I haven't seen any MAC flapping since the upgrades. During these incidents the fiber interface to the affected closet looks fine: I have checked the dB levels and the signal is good. Shutting down the fiber trunk and turning it back on makes no difference, and neither does physically unplugging it and plugging it back in. The only thing I've found that works is rebooting the stack. Once rebooted, the stack is typically good for at least a day. No changes were made to the configuration of this network in the last few months, and there were no issues with it in previous school years or during the summer.

I upgraded the 3850 core to firmware 03.06.09E and the 2960X switch stacks to 15.2(2)E7 but it hasn't fixed this issue. Any thoughts what the problem might be or suggestions on how to isolate the issue would be greatly appreciated.

balaji.bandi · ‎09-30-2020

Here is what I suggest after reading the long explanation.

Please make a small diagram and post the output of show version and show running-config here so we can take a look and make suggestions.

Also give some more input: when you lose everything, what does the MAC address table show? What about the IP ARP information?

CPU processes?

BB

nnraymond · ‎09-30-2020

The next time one of the 2960X stacks stops passing traffic, I will try to grab the MAC address table, IP ARP info, and CPU processes. The tricky thing about this problem is that the schools are doing hybrid learning, and it only happens during periods of fairly high network utilization. That means teachers are in the middle of teaching students, many of whom are remote, when half the building goes down. I've been under a lot of pressure to get things up and running again ASAP, so I haven't had much time to do diagnostics.

The diagram is pretty simple:

    3850 (core switch)
      |          |
    2960X      2960X
    stack      stack

There are three 2960X switches in each of the stacks, and each stack is connected by a single fiber trunk back to the 3850. Let me know if you need something more than that (like how things are connected beyond the building, etc.). I've attached the version and running-config from one of the 2960X stacks to this post (let me know if you need more). I just reviewed what I've attached, and first off I realized that I actually upgraded the 2960X stacks to 15.2(4)E10, as shown in the version output. Also, I noticed something I don't remember seeing before in the running-config - "ip access-list extended CISCO-CWA-URL-REDIRECT-ACL". According to this, it looks like I shouldn't be seeing it:

https://bst.cloudapps.cisco.com/bugsearch/bug/CSCux72245/?rfs=iqvred

The 15.2(4)E10 firmware doesn't appear to be implicated in this bug... any idea what is up with that?

Leo Laohoo · ‎09-30-2020

I have a strong hunch the issue is with the 3850.

Cisco announced the end-of-sale of the 3.6.X train back on 31 October 2016. 3.6.10, the last firmware to come out in that train, was released on 17 May 2019, which makes 3.6.9 older still.

I would recommend you look into something more recent like 16.6.X (no Smart Licensing), or 16.12.X.

NOTE:

We have several stacks running 16.9.5, and I personally would not recommend the 16.9.X train because we are seeing strange behavior on it.
16.12.5 will be released in a few days.

umairali.khan · ‎09-30-2020

Is there any HP or other non-Cisco switch connected to the 2960X stack where this issue occurs?

If yes, then look for any loops on those switches.

I would suggest removing them or replacing them with Cisco switches.

Some non-Cisco small business switches have spanning tree and other loop detection features disabled by default.

nnraymond · ‎10-01-2020

When this happens to a 2960X stack, we've always been able to restore things just by rebooting the affected 2960X stack. We haven't rebooted any HP ProCurve 2650 switches connected to the 2960X stack. I've assumed that if there was a problem on one of the HP switches, it would quickly recur after the Cisco stack came back up, and that hasn't happened. Also, this building's network is pretty much in two halves: the server room and core 3850 are in the middle of the building, the two 2960X switch closets each serve roughly half of the network devices in the building, and this problem has been happening in both switch closets (sort of alternating, but not with any immediately discernible pattern). It seems unlikely that this one building (out of 4 very similar building networks) would suddenly develop problems on more than one HP switch in different parts of the building at the same time.

I just did two things. I upgraded the 3850 core switch to cat3k_caa-universalk9.16.06.08.SPA.bin. I also made sure both spanning tree and HP's loop protection are enabled on all edge ports on the HP switches.

nnraymond · ‎10-08-2020

Recap: I've updated the firmware on the 3850 core and the 2960X stacks, and I also updated our Meraki AP firmware from an older stable release (26.6.1) to the current stable release (27.5), so I could eliminate that as any possible source of issues (FYI we have been running Meraki firmware 26.6.1 on our other networks that aren't showing any issues). I also disconnected the HP switches that were connected to the 2960X stacks. Unfortunately none of that made any difference, and today one of the closets stopped passing traffic again mid-morning. Fortunately I was there at the time, and had just recorded a bunch of info from that closet. Specifically I ran these commands:

show processes cpu sorted 5sec
show controllers cpu-interface
show platform port-asic stats drop
show memory summary
show mac address-table
show ip arp
show interfaces stats
show interfaces switching

I have captured the output of those commands run before the stack acted up in "tny2960-a1 before.txt" and then the output of the commands while the stack was acting up in "tny2960-a1 during.txt". Any insights would be greatly appreciated.
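
For anyone comparing the two captures, a minimal Python sketch along these lines could narrow down what changed. The sample text in it is made up for illustration and is not taken from the attached files.

```python
import difflib
from itertools import islice

def changed_lines(before: str, during: str) -> list:
    """Return only the added (+) and removed (-) lines between two captures."""
    diff = difflib.unified_diff(
        before.splitlines(), during.splitlines(),
        fromfile="before", tofile="during", lineterm="", n=0,
    )
    # Skip the two file-header lines, then keep only real additions/removals.
    return [line for line in islice(diff, 2, None) if line.startswith(("+", "-"))]

# Made-up sample text standing in for the two capture files.
before = "Vlan  Mac Address     Port\n1  aaaa.bbbb.cccc  Gi1/0/1\n1  dddd.eeee.ffff  Te1/0/1"
during = "Vlan  Mac Address     Port\n1  aaaa.bbbb.cccc  Gi1/0/1"

for line in changed_lines(before, during):
    print(line)
```

In practice the two strings would come from reading the capture files. Counters that always increase (packet totals, uptime) will show up as noise, so the interesting lines are likely to be MAC or ARP entries that disappear, or drop counters that jump.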

Leo Laohoo · ‎10-08-2020

Can you try pulling the power cables out of the 3850 (cold boot)?

So pull the power cables out, wait for all the "spinning" noise to end, count to five (seconds), and put the power back on.

See if this helps.

nnraymond · ‎10-08-2020

I'll try that, but keep in mind I recently updated the firmware in that switch to cat3k_caa-universalk9.16.06.08.SPA.bin so it did go through a full reboot cycle a few days ago. Also, what kind of problem would the 3850 have that would be solved by rebooting a 2960X stack each time? Unplugging and replugging the fiber trunk connection between a problem 2960X stack and the 3850 makes no difference while it's happening, but rebooting the 2960X stack makes a difference. I collected this info from the fiber connection going from the 3850 to the 2960X stack that acted up this morning while the problem was ongoing:

TNY3850-mdf#show interface Te1/1/1 summary

 *: interface is up
 IHQ: pkts in input hold queue     IQD: pkts dropped from input queue
 OHQ: pkts in output hold queue    OQD: pkts dropped from output queue
 RXBS: rx rate (bits/sec)          RXPS: rx rate (pkts/sec)
 TXBS: tx rate (bits/sec)          TXPS: tx rate (pkts/sec)
 TRTL: throttle count

  Interface                   IHQ       IQD       OHQ       OQD      RXBS      RXPS      TXBS      TXPS      TRTL
-----------------------------------------------------------------------------------------------------------------
* Te1/1/1                       2         0         0         0    211000       215    317000       184         0
TNY3850-mdf#show interface Te1/1/1 transceiver detail
ITU Channel not available (Wavelength not available),
Transceiver is internally calibrated.
mA: milliamperes, dBm: decibels (milliwatts), NA or N/A: not applicable.
++ : high alarm, +  : high warning, -  : low warning, -- : low alarm.
A2D readouts (if they differ), are reported in parentheses.
The threshold values are calibrated.

                              High Alarm  High Warn  Low Warn   Low Alarm
           Temperature        Threshold   Threshold  Threshold  Threshold
Port       (Celsius)          (Celsius)   (Celsius)  (Celsius)  (Celsius)
---------  -----------------  ----------  ---------  ---------  ---------
Te1/1/1      29.7                   75.0       70.0        0.0       -5.0

                              High Alarm  High Warn  Low Warn   Low Alarm
           Voltage            Threshold   Threshold  Threshold  Threshold
Port       (Volts)            (Volts)     (Volts)    (Volts)    (Volts)
---------  -----------------  ----------  ---------  ---------  ---------
Te1/1/1      3.26                   3.63       3.46       3.13       2.97

                              High Alarm  High Warn  Low Warn   Low Alarm
           Current            Threshold   Threshold  Threshold  Threshold
Port       (milliamperes)     (mA)        (mA)       (mA)       (mA)
---------  -----------------  ----------  ---------  ---------  ---------
Te1/1/1      41.6                   80.0       74.0       12.0       10.0

           Optical            High Alarm  High Warn  Low Warn   Low Alarm
           Transmit Power     Threshold   Threshold  Threshold  Threshold
Port       (dBm)              (dBm)       (dBm)      (dBm)      (dBm)
---------  -----------------  ----------  ---------  ---------  ---------
Te1/1/1      -2.0                    3.4        0.4       -6.4      -10.5

           Optical            High Alarm  High Warn  Low Warn   Low Alarm
           Receive Power      Threshold   Threshold  Threshold  Threshold
Port       (dBm)              (dBm)       (dBm)      (dBm)      (dBm)
---------  -----------------  ----------  ---------  ---------  ---------
Te1/1/1      -4.3                    3.4        0.4       -8.4      -12.4


TNY3850-mdf#show interface Te1/1/1 capabilities
TenGigabitEthernet1/1/1
  Model:                 WS-C3850-48P
  Type:                  SFP-10GBase-LRM
  Speed:                 10000
  Duplex:                full
  Trunk encap. type:     802.1Q
  Trunk mode:            on,off,desirable,nonegotiate
  Channel:               yes
  Broadcast suppression: percentage(0-100)
  Flowcontrol:           rx-(off,on,desired),tx-(none)
  Fast Start:            yes
  QoS scheduling:        rx-(not configurable on per port basis),
                         tx-(2p6q3t)
  CoS rewrite:           yes
  ToS rewrite:           yes
  UDLD:                  yes
  Inline power:          no
  SPAN:                  source/destination
  PortSecure:            yes
  Dot1x:                 yes
TNY3850-mdf#show idprom interface Te1/1/1

General SFP Information
-----------------------------------------------
Identifier            :   SFP/SFP+
Ext.Identifier        :   SFP function is defined by two-wire interface ID only
Connector             :   LC connector
Transceiver
 10GE Comp code       :   Unknown
 SONET Comp code      :   OC 48 short reach
 GE Comp code         :   Unknown
 Link length          :   Unknown
 Technology           :   Unknown
 Media                :   Single Mode
 Speed                :   Unknown
Encoding              :   64B/66B
BR_Nominal            :   10300 Mbps
Length(9um)-km        :   GBIC does not support single mode fibre
Length(9um)           :   GBIC does not support single mode fibre
Length(50um)          :   220 m
Length(62.5um)        :   220 m
Length(Copper)        :   GBIC does not support 50 micron multi mode OM4 fibre
Vendor Name           :   CISCO-AVAGO
Vendor Part Number    :   SFBR-7600SDZ-CS3
Vendor Revision       :   0x31 0x2E 0x34 0x20
Vendor Serial Number  :   AGD1831V189
Wavelength            :   1310 nm
CC_BASE               :   0x0E
-----------------------------------------------

Extended ID Fields
-----------------------------------------------
Options               :   0x01 0x1A
BR, max               :   0x00
BR, min               :   0x00
Date code             :   140803
Diag monitoring       :   Implemented
Internally calibrated :   Yes
Exeternally calibrated:   No
Rx.Power measurement  :   Avg.Power
Address Change        :   Not Required
CC_EXT                :   0x17
-----------------------------------------------

Other Information
-----------------------------------------------
Chk for link status   : 00
Flow control Receive  : ON
Flow control Send     : ON
Administrative Speed  : 10000
Administrative Duplex : full
Operational Speed     : 10000
Operational Duplex    : full
-----------------------------------------------

SEEPROM contents (hex):
 0x00: 03 04 07 40 00 00 00 00 00 00 00 06 67 00 00 00
 0x10: 16 16 00 16 43 49 53 43 4F 2D 41 56 41 47 4F 20
 0x20: 20 20 20 20 00 00 17 6A 53 46 42 52 2D 37 36 30
 0x30: 30 53 44 5A 2D 43 53 33 31 2E 34 20 05 1E 00 0E
 0x40: 01 1A 00 00 41 47 44 31 38 33 31 56 31 38 39 20
 0x50: 20 20 20 20 31 34 30 38 30 33 20 20 68 F0 03 17
 0x60: 00 00 06 E2 CA 89 47 2F 29 7F 76 7B F5 DD 8B 4C
 0x70: D4 5F 03 00 00 00 00 00 00 00 00 00 EE 46 F8 B3
-----------------------------------------------
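
As a sanity check on the transceiver itself, the SEEPROM dump above can be decoded by hand. The following is a minimal Python sketch, assuming the bytes follow the usual SFF-8472 serial ID layout (vendor name at bytes 20-35, part number at 40-55, wavelength at 60-61, serial at 68-83, date code at 84-91, check codes at 63 and 95). Those offsets are my assumption and are not stated anywhere in the output.

```python
# First 96 bytes of the SEEPROM dump above (rows 0x00 through 0x50).
DUMP = """
03 04 07 40 00 00 00 00 00 00 00 06 67 00 00 00
16 16 00 16 43 49 53 43 4F 2D 41 56 41 47 4F 20
20 20 20 20 00 00 17 6A 53 46 42 52 2D 37 36 30
30 53 44 5A 2D 43 53 33 31 2E 34 20 05 1E 00 0E
01 1A 00 00 41 47 44 31 38 33 31 56 31 38 39 20
20 20 20 20 31 34 30 38 30 33 20 20 68 F0 03 17
"""
rom = bytes.fromhex("".join(DUMP.split()))

def text(start, end):
    """Decode an ASCII field (end exclusive) and strip the space padding."""
    return rom[start:end].decode("ascii").strip()

fields = {
    "vendor_name": text(20, 36),
    "part_number": text(40, 56),
    "revision": text(56, 60),
    "serial": text(68, 84),
    "date_code": text(84, 92),
    "wavelength_nm": int.from_bytes(rom[60:62], "big"),
    "bit_rate_mbps": rom[12] * 100,  # assumed unit: 100 Mb/s per count
    "length_50um_m": rom[16] * 10,   # assumed unit: 10 m per count
}

# Check codes: low 8 bits of the sum of bytes 0-62 and of bytes 64-94.
cc_base_ok = (sum(rom[0:63]) & 0xFF) == rom[63]
cc_ext_ok = (sum(rom[64:95]) & 0xFF) == rom[95]

for name, value in fields.items():
    print(f"{name}: {value}")
print("CC_BASE ok:", cc_base_ok, "CC_EXT ok:", cc_ext_ok)
```

The decoded values agree with what the switch printed (CISCO-AVAGO, SFBR-7600SDZ-CS3, 1310 nm, 220 m on 50um fiber) and both check codes match, so the module's ID EEPROM at least reads back consistently.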

Leo Laohoo · ‎10-08-2020

Warm reboot (using the "reload" command) and cold reboot (pull the power cable) are different from each other.

But for the 3650/3850 (and lately the 9300), there is something I want to "catch" that will only get "fixed" in a cold reboot.

And if a cold reboot does not fix the issue, then you've just proven me wrong.

nnraymond · ‎10-20-2020

Leo, so far it looks like you were right! Since cold booting the 3850, the problem has NOT come back - it's been a week of no trouble with the downstream switch stacks. What in the world was going on that a cold boot solved? Is there a better solution?

I'm now wondering if a version of this problem also periodically affects our server VMs... We have VM hosts (which were ESXi, recently we migrated to Proxmox) at our sites with the 3850 core switches, and those VM hosts are plugged directly into the 3850s. About once every couple of weeks at some of these sites a VM will be unable to reach the network. No problem with the VM host or other VMs on the host. The VM is running fine while the problem is going on (much like our 2960X switch stacks) and the 'fix' has been to either reset the virtual network adapter in the VM or soft reboot the VM. There is no pattern to the problem but it has only been affecting our 3850 core switch sites - our high school has a Catalyst WS-C6509-E as its core switch and the VMs there don't ever have this problem.

Leo Laohoo · ‎10-20-2020

@nnraymond wrote:

What in the world was going on that a cold boot solved?

IOS-XE, which the 3850 runs, has one "hidden" flaw: the "reload" command does not have the same "effect" as in classic IOS. Sure, it reloads the switch, but there are some processes that are not meant to "die" during this reload process. A cold reboot, however, is a totally different world. When a cold reboot is performed, all bets are off and nothing will "survive" it. All processes have to stop regardless.

If you have the time to read the Release Notes for various versions for switches running IOS-XE, you'll notice a pattern of bugs for which the workaround (NOTE: I said "workaround" and not a "fix") is a cold reboot of the stack or affected appliance. My two "favorite" bugs are the "MOSFET twins": CSCvj76259 & CSCvd46008.
A MOSFET is a transistor component that is currently present in Cisco switches running IOS-XE (3650/3850 and Catalyst 9k). Over time, the MOSFET component degrades. When this happens, the ports simply stop "talking": no PoE, no traffic, nothing. The only way to "prove" it is the MOSFET is a cold reboot. If a cold reboot "wakes up" the port(s), then the switches are affected by the bugs. For the MOSFET twins specifically, a cold reboot is only a workaround, not a permanent fix. Over time, regardless of whatever firmware the switch/stack is running, the problem will come back. Troubleshooting MOSFET is a PITA because you need two people and time: one person on the (remote) console and another person to move the connection(s) to different parts of the stack.

Apologies for the long response.

nnraymond · ‎10-29-2020

I appreciate the length of the response. A couple of questions: the two bugs you mentioned seem to be related to PoE, but in my case this is happening over 10Gbit SFP+ fiber transceivers. Can these MOSFET bugs impact those as well? Since a cold reboot is just a workaround (and in fact the problem just came back today for the first time after the cold boot you recommended), what is the long-term fix? Replace the 3850 with another 3850, or avoid the 3850 and get a different model entirely?

Leo Laohoo · ‎10-29-2020

The 3650/3850 as well as the Catalyst 9k switching platform have several well-known "features" (a euphemism for bugs) where the optical ports simply stop talking. The reload command does not make any difference; the workaround is to COLD REBOOT the entire appliance or stack.

Again, my emphasis is the importance of the cold reboot vs the more traditional "reload" command.