09-30-2020 07:49 AM
Our school district has four grammar schools with very similar setups that have been working well for years, but for the last month one of the schools has been experiencing issues. The setup at each location is a WS-C3850-48P as the core switch in the server room, plus between one and three data closets with WS-C2960X-48FPD-L switch stacks in them. Each closet is connected to the core switch by a single 10Gig fiber trunk. There is just one VLAN per building (a 255.255.240.0 subnet). Each school has about 70 Meraki access points. A small number of legacy HP ProCurve 2650 switches in some closets are connected to the 2960X stacks.
For the last month at one of the schools, on weekdays, one of the two closets with the WS-C2960X-48FPD-L switch stacks will stop passing traffic, either in the morning around 9am or around midday. Never on weekends, never at night, and not necessarily every day. When a switch closet stops passing traffic I cannot reach any host in that closet (including the switch management interface on the stack) from anywhere else on the network, and those hosts can't reach the rest of the network. If I connect to the 2960X console in the affected closet, everything looks good. The only error message I found was some MAC flapping reported on the 3850 core switch around the time a closet went down; when I saw that I upgraded the firmware on the core and closet switches, and since the upgrades I haven't seen any MAC flapping. During these incidents the fiber interface to the affected closet looks fine; I have checked the dBm readings and the signal is good. I have shut down the fiber trunk and brought it back up, no difference. I have physically unplugged it and plugged it back in, no difference. The only thing I've found that works is rebooting the stack. Once the stack is rebooted, it is typically good for at least a day. No changes were made to the configuration of this network in the last few months, and there were no issues with it in previous school years or over the summer.
I upgraded the 3850 core to firmware 03.06.09E and the 2960X switch stacks to 15.2(2)E7, but it hasn't fixed this issue. Any thoughts on what the problem might be, or suggestions on how to isolate it, would be greatly appreciated.
09-30-2020 01:39 PM
Here is what I suggest after reading your detailed explanation.
Please put together a small diagram and post it, along with the "show version" and "show running-config" output, so we can take a look and make suggestions.
Also give us some more information: when you lose everything, what does the MAC address table show? What about the IP ARP table?
And the CPU processes?
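For example, the next time a closet goes down, capturing something like the following from the console of the affected 2960X stack would help (these are standard IOS show commands and don't change the config; exact options can vary slightly by release):

show processes cpu sorted 5sec   - is the stack CPU pegged?
show processes cpu history
show mac address-table count     - did the address table empty out or fill up?
show ip arp                      - does the stack still resolve the core switch / gateway?
show spanning-tree summary       - any blocked ports or recent topology changes?
show logging                     - MAC flap, storm-control, or err-disable messages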
09-30-2020 07:26 PM
The next time one of the 2960X stacks stops passing traffic, I will try to grab the MAC address table, IP ARP info, and CPU processes. The tricky thing about this problem is that the schools are doing hybrid learning, and this only happens during fairly high network-utilization periods. That means teachers are in the middle of teaching students (many of whom are remote) when half the building goes down, so I've been under a lot of pressure to get things up and running again ASAP and haven't had much time for diagnostics.
The diagram is pretty simple:
        3850 (core switch)
         |             |
   2960X stack    2960X stack
There are three 2960X switches in each of the stacks, and each stack is connected back to the 3850 by a single fiber trunk. Let me know if you need anything beyond that (like how things are connected beyond the building, etc.). I've attached the version and running-config from one of the 2960X stacks to this post (let me know if you need more). Reviewing what I've attached, I realized that I actually upgraded the 2960X stacks to 15.2(4)E10, as shown in the version output. I also noticed something in the running-config that I don't remember seeing before: "ip access-list extended CISCO-CWA-URL-REDIRECT-ACL". According to this, it looks like I shouldn't be seeing it:
https://bst.cloudapps.cisco.com/bugsearch/bug/CSCux72245/?rfs=iqvred
The 15.2(4)E10 firmware doesn't appear to be implicated in this bug... any idea what is up with that?
09-30-2020 07:28 PM
I have a strong hunch the issue is with the 3850.
Cisco announced the end-of-sale of the 3.6.X train back on 31 October 2016, and 3.6.10, the last firmware to come out of that train, was released on 17 May 2019. That makes 3.6.9 even older.
I would recommend you look into something more recent, like 16.6.X (no Smart Licensing) or 16.12.X.
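If you do go to 16.x, note that coming from the 3.6.x train the upgrade is done in install mode. From memory it is roughly the following (treat this as a sketch - double-check the exact syntax and image name against the release notes of whichever version you pick; the TFTP server here is just a placeholder):

dir flash:        - make sure there is enough free space first
copy tftp://<your-server>/cat3k_caa-universalk9.16.06.08.SPA.bin flash:
software install file flash:cat3k_caa-universalk9.16.06.08.SPA.bin
                  - the install prompts you to reload when it finishes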
09-30-2020 09:54 PM
Is there any HP or other non-Cisco switch connected to the 2960X stack where this issue occurs?
If yes, then look for loops on those switches.
I would suggest removing them or replacing them with Cisco switches.
Some non-Cisco small-business switches have spanning tree and other loop-detection features disabled by default.
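Even before pulling them out, the 2960X itself can give you a hint. Something along these lines (standard spanning-tree show commands; the interface name is only an example) will show the topology change counters and which port the last change came in on:

show spanning-tree summary
show spanning-tree detail | include ieee|occurr|from|is exec
show spanning-tree interface Gi1/0/10 detail   - use whichever port faces the HP switch

If the topology change count keeps climbing during the outage window, that points at a loop or a flapping device behind that port.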
10-01-2020 03:52 PM
When this happens to a 2960X stack, we've always been able to restore things just by rebooting the affected 2960X stack. We haven't rebooted any of the HP ProCurve 2650 switches connected to the 2960X stacks. I've assumed that if there were a problem on one of the HP switches, it would quickly re-occur after the Cisco stack came back up, and that hasn't happened. Also, this building's network is pretty much split into two halves: the server room and core 3850 are in the middle of the building, and the two 2960X switch closets each serve roughly half of the network devices in the building. This problem has been happening in both switch closets (sort of alternating, but not with any immediately discernible pattern). It seems unlikely that this one building (out of 4 very similar building networks) would suddenly develop problems on more than one HP switch in different parts of the building at the same time.
I just did two things: I upgraded the 3850 core switch to cat3k_caa-universalk9.16.06.08.SPA.bin, and I made sure both spanning tree and HP's loop protection are enabled on all edge ports on the HP switches.
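For reference, the ProCurve side of that is only a couple of lines per switch, roughly like this (from memory - the exact syntax may differ on the 2650's older software, and the port range is just an example for the edge ports):

configure
spanning-tree          - make sure spanning tree is running globally
loop-protect 1-48      - enable HP's loop protection on the edge ports
write memory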
10-08-2020 08:56 AM
Recap: I've updated the firmware on the 3850 core and the 2960X stacks, and I also updated our Meraki AP firmware from an older stable release (26.6.1) to the current stable release (27.5) so I could eliminate that as a possible source of issues (FYI, we have been running Meraki firmware 26.6.1 on our other networks, which aren't showing any issues). I also disconnected the HP switches that were connected to the 2960X stacks. Unfortunately none of that made any difference, and today one of the closets stopped passing traffic again mid-morning. Fortunately I was there at the time and had just recorded a bunch of info from that closet. Specifically, I ran these commands:
show processes cpu sorted 5sec
show controllers cpu-interface
show platform port-asic stats drop
show memory summary
show mac address-table
show ip arp
show interfaces stats
show interfaces switching
I have captured the output of those commands run before the stack acted up in "tny2960-a1 before.txt" and then the output of the commands while the stack was acting up in "tny2960-a1 during.txt". Any insights would be greatly appreciated.
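In case it helps whoever looks at the attachments: the things I plan to compare first between the 'before' and 'during' files are the top of the 'show processes cpu sorted 5sec' output and whether the MAC and ARP tables shrank, grew, or stayed the same. For the next incident I'm also planning to grab these (standard show commands I left out this time):

show mac address-table count   - quick check whether the address table emptied or hit its limit
show spanning-tree summary     - blocked ports / recent topology changes
show logging                   - MAC flap, storm-control, or err-disable messages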
10-08-2020 02:30 PM
Can you try pulling the power cables out of the 3850 (cold boot)?
So pull the power cables out, wait for all the "spinning" (fan) noise to stop, count to five seconds, and then plug the power back in.
See if this helps.
10-08-2020 02:58 PM
I'll try that, but keep in mind I recently updated the firmware on that switch to cat3k_caa-universalk9.16.06.08.SPA.bin, so it did go through a full reboot cycle a few days ago. Also, what kind of problem would the 3850 have that would be solved by rebooting a 2960X stack each time? Unplugging and replugging the fiber trunk between a problem 2960X stack and the 3850 makes no difference while it's happening, but rebooting the 2960X stack does. I collected this info from the fiber connection going from the 3850 to the 2960X stack that acted up this morning, while the problem was ongoing:
TNY3850-mdf#show interface Te1/1/1 summary

 *: interface is up
 IHQ: pkts in input hold queue     IQD: pkts dropped from input queue
 OHQ: pkts in output hold queue    OQD: pkts dropped from output queue
 RXBS: rx rate (bits/sec)          RXPS: rx rate (pkts/sec)
 TXBS: tx rate (bits/sec)          TXPS: tx rate (pkts/sec)
 TRTL: throttle count

  Interface        IHQ   IQD   OHQ   OQD     RXBS   RXPS     TXBS   TXPS   TRTL
--------------------------------------------------------------------------------
* Te1/1/1            2     0     0     0   211000    215   317000    184      0

TNY3850-mdf#show interface Te1/1/1 transceiver detail
ITU Channel not available (Wavelength not available),
Transceiver is internally calibrated.
mA: milliamperes, dBm: decibels (milliwatts), NA or N/A: not applicable.
++ : high alarm, +  : high warning, -  : low warning, -- : low alarm.
A2D readouts (if they differ), are reported in parentheses.
The threshold values are calibrated.

                           High Alarm  High Warn  Low Warn   Low Alarm
          Temperature      Threshold   Threshold  Threshold  Threshold
Port      (Celsius)        (Celsius)   (Celsius)  (Celsius)  (Celsius)
--------- ---------------  ----------  ---------  ---------  ---------
Te1/1/1     29.7              75.0        70.0        0.0       -5.0

                           High Alarm  High Warn  Low Warn   Low Alarm
          Voltage          Threshold   Threshold  Threshold  Threshold
Port      (Volts)          (Volts)     (Volts)    (Volts)    (Volts)
--------- ---------------  ----------  ---------  ---------  ---------
Te1/1/1     3.26              3.63        3.46        3.13       2.97

                           High Alarm  High Warn  Low Warn   Low Alarm
          Current          Threshold   Threshold  Threshold  Threshold
Port      (milliamperes)   (mA)        (mA)       (mA)       (mA)
--------- ---------------  ----------  ---------  ---------  ---------
Te1/1/1     41.6              80.0        74.0       12.0       10.0

          Optical          High Alarm  High Warn  Low Warn   Low Alarm
          Transmit Power   Threshold   Threshold  Threshold  Threshold
Port      (dBm)            (dBm)       (dBm)      (dBm)      (dBm)
--------- ---------------  ----------  ---------  ---------  ---------
Te1/1/1     -2.0               3.4         0.4       -6.4      -10.5

          Optical          High Alarm  High Warn  Low Warn   Low Alarm
          Receive Power    Threshold   Threshold  Threshold  Threshold
Port      (dBm)            (dBm)       (dBm)      (dBm)      (dBm)
--------- ---------------  ----------  ---------  ---------  ---------
Te1/1/1     -4.3               3.4         0.4       -8.4      -12.4

TNY3850-mdf#show interface Te1/1/1 capabilities
TenGigabitEthernet1/1/1
  Model:                 WS-C3850-48P
  Type:                  SFP-10GBase-LRM
  Speed:                 10000
  Duplex:                full
  Trunk encap. type:     802.1Q
  Trunk mode:            on,off,desirable,nonegotiate
  Channel:               yes
  Broadcast suppression: percentage(0-100)
  Flowcontrol:           rx-(off,on,desired),tx-(none)
  Fast Start:            yes
  QoS scheduling:        rx-(not configurable on per port basis),
                         tx-(2p6q3t)
  CoS rewrite:           yes
  ToS rewrite:           yes
  UDLD:                  yes
  Inline power:          no
  SPAN:                  source/destination
  PortSecure:            yes
  Dot1x:                 yes

TNY3850-mdf#show idprom interface Te1/1/1
General SFP Information
-----------------------------------------------
Identifier             :   SFP/SFP+
Ext.Identifier         :   SFP function is defined by two-wire interface ID only
Connector              :   LC connector
Transceiver
  10GE Comp code       :   Unknown
  SONET Comp code      :   OC 48 short reach
  GE Comp code         :   Unknown
  Link length          :   Unknown
  Technology           :   Unknown
  Media                :   Single Mode
  Speed                :   Unknown
Encoding               :   64B/66B
BR_Nominal             :   10300 Mbps
Length(9um)-km         :   GBIC does not support single mode fibre
Length(9um)            :   GBIC does not support single mode fibre
Length(50um)           :   220 m
Length(62.5um)         :   220 m
Length(Copper)         :   GBIC does not support 50 micron multi mode OM4 fibre
Vendor Name            :   CISCO-AVAGO
Vendor Part Number     :   SFBR-7600SDZ-CS3
Vendor Revision        :   0x31 0x2E 0x34 0x20
Vendor Serial Number   :   AGD1831V189
Wavelength             :   1310 nm
CC_BASE                :   0x0E
-----------------------------------------------
Extended ID Fields
-----------------------------------------------
Options                :   0x01 0x1A
BR, max                :   0x00
BR, min                :   0x00
Date code              :   140803
Diag monitoring        :   Implemented
Internally calibrated  :   Yes
Exeternally calibrated :   No
Rx.Power measurement   :   Avg.Power
Address Change         :   Not Required
CC_EXT                 :   0x17
-----------------------------------------------
Other Information
-----------------------------------------------
Chk for link status    :   00
Flow control Receive   :   ON
Flow control Send      :   ON
Administrative Speed   :   10000
Administrative Duplex  :   full
Operational Speed      :   10000
Operational Duplex     :   full
-----------------------------------------------
SEEPROM contents (hex):
  0x00: 03 04 07 40 00 00 00 00 00 00 00 06 67 00 00 00
  0x10: 16 16 00 16 43 49 53 43 4F 2D 41 56 41 47 4F 20
  0x20: 20 20 20 20 00 00 17 6A 53 46 42 52 2D 37 36 30
  0x30: 30 53 44 5A 2D 43 53 33 31 2E 34 20 05 1E 00 0E
  0x40: 01 1A 00 00 41 47 44 31 38 33 31 56 31 38 39 20
  0x50: 20 20 20 20 31 34 30 38 30 33 20 20 68 F0 03 17
  0x60: 00 00 06 E2 CA 89 47 2F 29 7F 76 7B F5 DD 8B 4C
  0x70: D4 5F 03 00 00 00 00 00 00 00 00 00 EE 46 F8 B3
-----------------------------------------------
10-08-2020 07:55 PM
Warm reboot (using the "reload" command) and cold reboot (pull the power cable) are different from each other.
But for the 3650/3850 (and lately the 9300), there is something I want to "catch" that will only get "fixed" in a cold reboot.
And if the cold reboot does not fix the issue, then you've just proven me wrong.
10-20-2020 08:25 AM - edited 10-20-2020 08:26 AM
Leo, so far it looks like you were right! Since cold booting the 3850, the problem has NOT come back - it's been a week of no trouble with the downstream switch stacks. What in the world was going on that a cold boot solved? Is there a better solution?
I'm now wondering if a version of this problem also periodically affects our server VMs... We have VM hosts (formerly ESXi; we recently migrated to Proxmox) at our sites with the 3850 core switches, and those VM hosts are plugged directly into the 3850s. About once every couple of weeks at some of these sites, a VM will be unable to reach the network. There is no problem with the VM host or the other VMs on that host, and the affected VM keeps running fine while the problem is going on (much like our 2960X switch stacks); the 'fix' has been to either reset the virtual network adapter in the VM or soft-reboot the VM. There is no pattern to the problem, but it has only been affecting our 3850 core-switch sites - our high school has a Catalyst WS-C6509-E as its core switch and the VMs there never have this problem.
10-20-2020 02:41 PM
@nnraymond wrote:
What in the world was going on that a cold boot solved?
IOS-XE, which the 3850 runs, has one "hidden" flaw: the "reload" command does not have the same "effect" as it does in classic IOS. Sure, it reloads the switch, but there are some processes that are not meant to "die" during the reload. A cold reboot, however, is a totally different world. When a cold reboot is performed, all bets are off and nothing will "survive" it - every process has to stop, regardless.
If you have the time to read the release notes for the various versions of switches running IOS-XE, you'll notice a pattern of bugs for which the workaround (NOTE: I said "workaround" and not "fix") is a cold reboot of the stack or affected appliance. My two "favorite" bugs are the "MOSFET twins": CSCvj76259 & CSCvd46008.
A MOSFET is a transistor component that is currently present in Cisco switches running IOS-XE (3650/3850 and Catalyst 9k). Over time, the MOSFET component degrades, and when this happens the ports simply stop "talking": no PoE, no traffic, nothing. The only way to "prove" it is the MOSFET is a cold reboot. If a cold reboot "wakes up" the port(s), then the switches are affected by the bugs. Specifically for the MOSFET twins, a cold reboot is only a workaround, not a permanent fix. Over time, regardless of whatever firmware the switch/stack is running, it will come back. Troubleshooting MOSFET issues is a PITA because you need two people and time: one person on the (remote) console and another person to move the connection(s) to different parts of the stack.
Apologies for the long response.
10-29-2020 07:39 AM
I appreciate the length of the response. A couple of questions: the two bugs you mentioned seem to be related to PoE, but in my case this is happening over 10Gbit SFP+ fiber transceivers. Can these MOSFET bugs impact those as well? Since a cold reboot is just a workaround (and in fact the problem came back today for the first time since the cold boot you recommended), what is the long-term fix? Replace the 3850 with another 3850, or avoid the 3850 and get a different model entirely?
10-29-2020 02:35 PM
The 3650/3850, as well as the Catalyst 9k switching platform, have several well-known "features" (a euphemism for bugs) where the optical ports simply stop talking. The reload command does not make any difference; the workaround is to COLD REBOOT the entire appliance or stack.
Again, my emphasis is on the importance of the cold reboot versus the more traditional "reload" command.