OK, after hours of testing, I think this sums up the issues I'm having with fiber SFPs on my 9300Ls (both the 24P-4G and 48P-4G models).
It does not seem to happen with GLC-T copper SFPs, only with fiber SFPs (Cisco 1000BaseLX and 1000BaseSX, whether GLC-LH-SMD++=, GLC-LH-SM=, or GLC-SX-MMD++=).
If you power up 2 new 9300L switches disconnected, and then connect them together, they work.
The issue is that if you power up the 9300L switches connected together and they start talking before their startup configs hit, their connected fiber SFP ports end up in a sort of limbo: neither will talk to the other, and worse, each holds that limbo state until a port state change happens on both sides (shut/no shut). Unplugging and re-plugging the fiber on either side also counts as a state change for both.
It will do this on brand-new switches with the factory config and on fully configured, baselined switches running 17.3.x or 17.6.x.
Adding "speed nonegotiate" to the mix only confuses the issue. I think that's because applying it acts like a state change while the switch is running, so it can make the link work, but after a reboot it behaves differently with the bug. If both switches have the command and they start back up at the same time, the link still doesn't come up, and now even shut/no-shutting won't work until you remove the command from each side.
It can look like a solution: if you apply it to each side, the state change brings the link up, and subsequent reboots will work as long as you only reboot one switch at a time, since the issue seems to happen after the ports come up but before the configs hit. It does not work if both switches are power cycled at the same time.
The only things that work consistently are shutting/no-shutting both sides (which is hard to do when you can only reach one switch through the other) and making sure "speed nonegotiate" is not configured on either side of the affected link.
Unplugging and re-plugging either end of the fiber will always bring the link back up regardless.
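For anyone doing the manual recovery remotely, the sequence I use looks like this. This is just a sketch; substitute your actual uplink interface for GigabitEthernet1/1/1, and remember it has to be run on BOTH switches:

```
! Run on each switch, on the port facing the other 9300L
configure terminal
 interface GigabitEthernet1/1/1
  ! make sure the problematic command is not present
  no speed nonegotiate
  ! bounce the port to force the state change on this side
  shutdown
  no shutdown
 end
```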
I came up with this EEM script until I can find a real fix for this issue. It checks for a neighbor on the configured link every 5 minutes and bounces the port if none exists. Applying it to both sides solves the problem when the devices in question are hours away and connected to each other. It shouldn't do anything if a neighbor is already present (other than an annoying log entry about an unknown user running enable).
no event manager applet check_neighbor_1/1/1
event manager applet check_neighbor_1/1/1
! fire every 300 seconds (5 minutes)
event timer watchdog time 300
action 1.0 cli command "enable"
! look for a CDP neighbor on the uplink
action 2.0 cli command "show cdp neighbors GigabitEthernet1/1/1 detail | include Device ID"
action 3.0 regexp "Device ID: .*" "$_cli_result"
! _regexp_result is "0" when no neighbor was seen, so bounce the port
action 4.0 if $_regexp_result eq "0"
action 4.5 syslog priority errors msg "No neighbor found, bouncing port 1/1/1"
action 5.0 cli command "config t"
action 6.0 cli command "interface GigabitEthernet1/1/1"
action 7.0 cli command "shutdown"
action 8.0 cli command "no shutdown"
action 9.0 cli command "end"
action 9.5 end
It works for me; please don't blame me if it doesn't work or wrecks everything in your network, you have been warned. Obviously use the correct interface name for your setup, not just "GigabitEthernet1/1/1".
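If you want to confirm the applet registered and is actually firing, the standard EEM show/debug commands should be enough (these only display state, they change nothing):

```
show event manager policy registered
show event manager history events
! optional: watch the applet's CLI actions as they execute
debug event manager action cli
```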
Am I the only person to experience this with the C9300L-48P-4G / C9300L-24P-4Gs?
BOOTLDR: System Bootstrap, Version 17.8.1r[FC2], RELEASE SOFTWARE (P)
17.03.06, 17.06.04, 17.06.05 - CAT9K_IOSXE - INSTALL
So I decided to try the other recommended release, 17.9.3, and it looks like it works correctly, so now I can sleep at night again without fear of power failures.
For any engineers at Cisco reading this: after the MCU upgrade to 17.9.1r, but while the bootloader was still showing 17.8.1r[FC2] and running IOS-XE 17.09.03 (basically on the first reload following the IOS and MCU update), the problem went away.
I thought this had gone away with the fix for bug CSCvy40384; it clearly did not.
Whatever new way Cisco is initializing the SFPs under 17.09.03, it also fixes attached 9300Ls running 3.6.5.
I don't know if it's because 17.09.03 takes longer to boot or what.
So no combination of 17.3.x and 17.6.x would work together following a shared power failure, but it looks like those releases will work as long as one of the switches is running 17.09.03.
Hopefully this bug stays squashed, unlike the other one.
BOOTLDR: System Bootstrap, Version 17.9.1r, RELEASE SOFTWARE (P)
Cisco still should put that bug back as a known caveat for the 17.3.x and 17.6.x trains.
I'd recommend upgrading to 17.9.3; there is no motivation left to make any improvements to the 17.3.x or 17.6.x trains.
NOTE: 17.9.4 comes out at the end of July 2023.
The picture below is a stack made up of 6 x 9300. Since running 17.6.4, I have had to reboot the stack every 3 to 4 months due to a memory leak caused by SNMP polling. However, since upgrading to 17.9.3, the memory leak has drastically slowed down.
Just one last issue with 17.6.5 so it doesn't catch anyone else off guard: it also seems not to copy the contents of an upgrade to the rest of a stack.
That is, if you have a stack running 17.6.5 and you want to upgrade to 17.9.3, and you execute "request platform software package install switch all file flash:cat9k_iosxe.17.09.03.SPA.bin on-reboot new auto-copy verbose", it will say it's extracting and copying the contents of the .bin to the rest of the stack, but it doesn't. If you miss the final error and reboot anyway, every other member of the stack drops to rommon.
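Before reloading after that install command on a 17.6.5 stack, it may be worth sanity-checking that the software really landed on every member. Something like the following (a sketch assuming a two-member stack; adjust the flash-N: names for your stack size):

```
! confirm the new image/packages exist on each member's flash
dir flash-1: | include 17.09.03
dir flash-2: | include 17.09.03
! confirm each switch's mode and boot variable before the reload
show version | include INSTALL
show boot
```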