I built a stack of two C9300-24T switches yesterday to replace a pair of Catalyst 3560X's. Very simple cut-&-paste configuration with some HSRP IPv4 addresses being removed and the HSRP VIP's added as the physical addresses.
About an hour after the swap out of the 3560X's to the C9300's it just stopped working. Everything connected through the switch at L3 just stopped.
There are two CPE routers connected on a /29 subnet running HSRP - one connected to switch #1 and one connected to switch #2. HSRP and L2 connectivity between these two CPE routers was working, however they couldn't ping the SVI interface on the C9300.
From a 2960X L2 switch connected on a trunk to this stack I could see CDP information for the C9300, however I couldn't see the 2960X the other way around from the C9300. From the L2 2960X I couldn't ping the C9300, and from the C9300 I couldn't ping anything.
After troubleshooting for an hour from the console and not getting anywhere I just rebooted it and everything came back. I checked the stacking cables and they were all finger tight, so I don't think it's just a loose stacking cable. I left it for a couple of hours and everything was OK.
However, looking at our monitoring platform this morning, it failed again at 6:15am. I can log on to the CPEs and HSRP is OK between them; they can ping each other but they can't ping the SVI interface on the switch stack. I suspect a reboot of the C9300 will temporarily fix this, but it will reoccur.
Has anyone seen this behaviour? Luckily there are no staff on site today.
Maybe ARP? (The default ARP timeout is 14400 seconds.) Maybe you could have cleared the ARP table rather than rebooting.
Anyway, I hope the reboot resolved it; it may have cleared the ARP table, or the entries expiring at that time may have been a coincidence.
What version of code is the Cat 9300 running? What do the logs show?
It may be worth connecting a device directly to the Cat 9300 to debug?
It's not ARP. It worked all night and failed at 6:15 this morning.
I think it's either a hardware or a stacking issue.
From the two CPE's there is no ARP response for the switch SVI, however the two CPEs can reach each other through the VLAN on the C9300 stack (VLAN 3000), and HSRP is working OK.
I put another pair of these in with a very similar configuration at one of the customer's other sites and they are working fine.
I have raised a TAC case.
As you mentioned, you moved the config from the old switches to the new ones. Maybe some commands got messed up when you moved it (I am thinking).
You also mentioned that the other new pair with a similar simple config works fine, so I suspect a config issue, I guess.
Do you have any logs on the switch?
Note: 17.3.4 is the latest code.
It worked all night without issues. It's not configuration.
The configuration is very simple - VTP off, Rapid-PVST+, several VLANs and SVI interfaces, some trunks to ESXi boxes, 2 x 2 x 1Gbps port-channel trunks to two access C2960Xs, static default route to HSRP VIP on the CPE routers, ip routing enabled. That's pretty much it (AAA, NTP etc all the usual stuff).
It is either hardware or a stacking issue.
The switch stack doesn't halt. It responds fine on the console, it just doesn't have any connectivity.
There was nothing logged. The stack was reloaded from the console with the reload command.
I don't have access to it now as it's not responding on any IPv4 address.
I am just trying to work out if I can configure reverse telnet on a C1100 console port so I can get a console connected to the switch. Doesn't look like reverse telnet works on the C1100 series though....
Just managed to get back to site.
switch#sho platform software status control-processor brief
Load Average
 Slot  Status  1-Min  5-Min 15-Min
1-RP0 Healthy   0.21   0.41   0.29
2-RP0 Healthy   0.41   0.91   0.54

Memory (kB)
 Slot  Status    Total     Used (Pct)     Free (Pct) Committed (Pct)
1-RP0 Healthy  7757632  2863576 (37%)  4894056 (63%)   3820908 (49%)
2-RP0 Healthy  7757632  2809492 (36%)  4948140 (64%)   3242004 (42%)

CPU Utilization
 Slot  CPU   User System   Nice   Idle    IRQ   SIRQ IOwait
1-RP0    0   1.79   0.49   0.00  97.60   0.00   0.09   0.00
         1   0.90   0.50   0.00  98.50   0.00   0.10   0.00
         2   1.00   0.80   0.00  98.19   0.00   0.00   0.00
         3   1.00   0.60   0.00  98.39   0.00   0.00   0.00
         4   0.80   0.50   0.00  98.69   0.00   0.00   0.00
         5   1.20   0.70   0.00  98.10   0.00   0.00   0.00
         6   1.00   0.50   0.00  98.49   0.00   0.00   0.00
         7   1.40   0.50   0.00  98.10   0.00   0.00   0.00
2-RP0    0   1.10   0.90   0.00  98.00   0.00   0.00   0.00
         1   0.90   0.50   0.00  98.60   0.00   0.00   0.00
         2   1.00   0.30   0.00  98.69   0.00   0.00   0.00
         3   1.40   0.40   0.00  98.20   0.00   0.00   0.00
         4   0.80   0.70   0.00  98.50   0.00   0.00   0.00
         5   1.20   0.50   0.00  98.30   0.00   0.00   0.00
         6   1.00   0.50   0.00  98.50   0.00   0.00   0.00
         7   1.20   0.90   0.00  97.89   0.00   0.00   0.00
When switch #1 is active it can't even ping its own SVI interfaces. L2 is working fine, i.e. 'show mac address-table' shows learnt MACs. However, LACP won't work and the switch can't see any CDP neighbours, although they can see it.
I am convinced this is a faulty switch.
Nothing looks suspicious.
Can I see the output to the following commands:
I replaced the switch I suspected was faulty (switch #1) with a spare I had and it's now working, so I am fairly sure it's faulty hardware (chip shortage, corners cut maybe?).
I now have the 'faulty' switch at home and I'll do some testing with it tomorrow and update the TAC case.
It appears that any traffic that should reach the 'control-plane', isn't - i.e. STP, ARP, CDP, LACP. L2 forwarding appears to be the only thing the switch is doing.
Seems a bizarre hardware fault.
So this turned out to be a faulty stacking cable. On the redundantly cabled 2-switch stack, one of the four stacking ports was reporting CRC errors (show switch stack-port detail). The stack worked OK at first, but after a period of time the errors caused the issue we were experiencing. Simply replacing the cable doesn't resolve the issue once it has happened and all control-plane functions have stopped working. The stack doesn't even recognise that the cable was replaced.
Removing and restoring the power (with the faulty cable replaced) got it working again. It has now been working OK for 3 days with no CRC errors reported on the stack ports, so I think it's solved; however, it leaves an unanswered question.
I have got the cable at home and have been testing it on the stack I have here temporarily. It is definitely the cable that is faulty. CRC errors are observed immediately after booting.
I have updated the TAC case with the details and asked why a faulty stacking cable would cause such a catastrophic fault.
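As a rough illustration of the check described above (looking for non-zero CRC counters per stack port in the stack-port detail output), the following Python sketch flags affected ports in a saved copy of that output. The layout in SAMPLE is an assumption about how the output looks (a "<switch>/<port> is ..." line followed by a "CRC Errors" section of name/value lines); it is not copied from this stack, and the counter values are made up. The function name stack_port_crc_errors is hypothetical.

```python
import re

# Assumed layout: a port header such as "1/2 is OK Loopback No", later a
# "CRC Errors" line, then indented "<counter name>   <value>" lines.
PORT_RE = re.compile(r"^\s*(\d+/\d+)\s+is\s+\S+")
COUNTER_RE = re.compile(r"^\s*([A-Za-z][A-Za-z ]*?)\s+(\d+)\s*$")


def stack_port_crc_errors(output):
    """Return {port: {counter: value}} for every non-zero CRC counter."""
    errors = {}
    port = None
    in_crc_section = False
    for line in output.splitlines():
        header = PORT_RE.match(line)
        if header:
            # A new port block starts; leave any previous CRC section.
            port = header.group(1)
            in_crc_section = False
            continue
        if line.strip().lower().startswith("crc errors"):
            in_crc_section = True
            continue
        if in_crc_section and port:
            counter = COUNTER_RE.match(line)
            if counter is None:
                # Anything that is not "name value" ends the CRC section.
                in_crc_section = False
                continue
            value = int(counter.group(2))
            if value:
                errors.setdefault(port, {})[counter.group(1)] = value
    return errors


# Illustrative text only (assumed format, invented numbers).
SAMPLE = """\
1/1 is OK Loopback No
CRC Errors
        Data CRC               0
        Ringword CRC           0
1/2 is OK Loopback No
CRC Errors
        Data CRC            1532
        Ringword CRC           7
"""

if __name__ == "__main__":
    for port, counters in stack_port_crc_errors(SAMPLE).items():
        print(port, counters)
```

Run periodically against captured output, something like this could raise an alert while the stack is still forwarding normally, before the control plane stops responding. The regular expressions would need adjusting to the real output on a given IOS XE release.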