Solved: Cisco C9300-48UXM random rebooting, probable IOS bug

pdmarshall · ‎04-29-2020

I am having issues across the estate where Cisco C9300-48UXM switches, especially those in stacks, reboot at random. The reboots are usually a few weeks apart but can happen at any time, and nothing in the logs shows exactly why.

Apr 28 09:43:55 Cab-NLabsAC_2_RP_0 stack_mgr[13890]: %STACKMGR-1-RELOAD: Reloading due to reason Configuration mismatch

We thought this was due to an IOS bug that was making the switches unstable (i.e. threshold too low, switch always triggering warning/critical, small memory leak, eventual switch crash), as we found a known software defect: CSCvn79101

But the behaviour has been the same across the 3 different IOS versions we have tried and across two gold standards, the latest being 16.9.5

Switch output of the '#sh platform resources' command below.

#sh platform resources
**State Acronym: H - Healthy, W - Warning, C - Critical
Resource              Usage          Max      Warning   Critical   State
----------------------------------------------------------------------------------------------------
Control Processor     1.70%          100%     5%        10%        H
DRAM                  2528MB(33%)    7583MB   90%       95%        H

However, when I look through the Cisco docs, the example output shows the Control Processor thresholds as 90% (warning) and 95% (critical):

https://www.cisco.com/c/en/us/td/docs/switches/lan/catalyst9300/software/release/16-9/command_reference/b_169_9300_cr/interface_and_hardware_commands.html#wp1934298088

Switch# show platform resources
**State Acronym: H - Healthy, W - Warning, C - Critical
Resource              Usage          Max      Warning   Critical   State
----------------------------------------------------------------------------------------------------
Control Processor     7.20%          100%     90%       95%        H
DRAM                  2701MB(69%)    3883MB   90%       95%        H
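The difference between the switch output and the documented example can be checked mechanically. The following is a minimal Python sketch, not a Cisco tool: it assumes the 'show platform resources' text has been captured into a string, and the names parse_rows, odd_thresholds and DOCUMENTED are illustrative. The 90%/95% reference values are simply the ones shown in the documentation example above.

```python
import re

# Reference thresholds (warning, critical) taken from the documentation
# example quoted in this thread. This is an assumption, not a Cisco default table.
DOCUMENTED = (90, 95)

# One resource row: name, usage, max, warning %, critical %, state letter.
ROW = re.compile(
    r"^\s*(?P<name>.+?)\s+(?P<usage>\S+)\s+(?P<max>\S+)\s+"
    r"(?P<warning>\d+)%\s+(?P<critical>\d+)%\s+(?P<state>[HWC])\s*$"
)

def parse_rows(text):
    """Return one dict per resource row found in the captured output."""
    rows = []
    for line in text.splitlines():
        m = ROW.match(line)
        if m:
            rows.append({
                "name": m["name"],
                "warning": int(m["warning"]),
                "critical": int(m["critical"]),
                "state": m["state"],
            })
    return rows

def odd_thresholds(rows):
    """Return rows whose thresholds differ from the documented example."""
    return [r for r in rows if (r["warning"], r["critical"]) != DOCUMENTED]

# The two rows posted from the affected switch.
SAMPLE = """\
Control Processor 1.70% 100% 5% 10% H
DRAM 2528MB(33%) 7583MB 90% 95% H
"""

for row in odd_thresholds(parse_rows(SAMPLE)):
    print(row["name"], row["warning"], row["critical"])
```

Run against the output in this post, only the Control Processor row is reported, which matches the 5%/10% thresholds that look wrong next to the documented 90%/95%.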

So I really want to find out whether this is a Cisco IOS issue or whether we have a QoS settings issue.

Many thanks

Phil

Leo Laohoo · ‎04-29-2020

Post the complete output to the following command:

dir crashinfo-<CRASHED SWITCH MEMBER>:

pdmarshall · ‎04-29-2020

Morning Leo

Thanks for your response, I had switch 2 and 4 out of 5 switches in a stack do this yesterday.

please see the output below.

Thanks

Phil

Cab-NLabsAC#dir crashinfo-2:
Directory of crashinfo-2:/

31553 drwx 36864 Apr 29 2020 09:38:23 +01:00 tracelogs
11 -rw- 0 Jun 20 2018 10:25:07 +01:00 koops.dat
12 -rw- 410460 Aug 24 2018 23:27:10 +01:00 Cab-NLabsAC_2_RP_0_trace_archive_0-20180824-222703.tar.gz
13 -rw- 403904 Aug 24 2018 23:34:04 +01:00 Cab-NLabsAC_2_RP_0_trace_archive_1-20180824-223358.tar.gz
14 -rw- 572086 Aug 25 2018 01:48:41 +01:00 Cab-NLabsAC_2_RP_0_trace_archive_4-20180825-004833.tar.gz
15 -rw- 590746 Aug 25 2018 01:52:05 +01:00 Cab-NLabsAC_2_RP_0_trace_archive_5-20180825-005158.tar.gz
16 -rw- 809989 Aug 25 2018 02:07:36 +01:00 Cab-NLabsAC_2_RP_0_trace_archive_0-20180825-010729.tar.gz
17 -rw- 1164655 Aug 25 2018 02:25:49 +01:00 Cab-NLabsAC_2_RP_0_trace_archive_1-20180825-012542.tar.gz
18 -rw- 1574874 Aug 25 2018 02:48:58 +01:00 Cab-NLabsAC_2_RP_0_trace_archive_0-20180825-014851.tar.gz
19 -rw- 1647484 Aug 25 2018 03:02:49 +01:00 RP_0_trace_archive_0-20180825-020242.tar.gz
20 -rw- 1658861 Aug 25 2018 03:03:04 +01:00 RP_0_trace_archive_1-20180825-020257.tar.gz
21 -rw- 1633446 Aug 25 2018 03:27:37 +01:00 Cab-NLabsAC_2_RP_0_trace_archive_0-20180825-022729.tar.gz
22 -rw- 1380610 Apr 28 2020 09:44:00 +01:00 system-report_2_20200428-094359-BST.tar.gz

1651507200 bytes total (1553989632 bytes free)
Cab-NLabsAC#dir crashinfo-4:
Directory of crashinfo-4:/

7889 drwx 36864 Apr 29 2020 09:34:31 +01:00 tracelogs
11 -rw- 0 Jun 20 2018 10:24:15 +01:00 koops.dat
12 -rw- 462786 Aug 25 2018 01:41:18 +01:00 Cab-NLabsAC_4_RP_0_trace_archive_0-20180825-004111.tar.gz
13 -rw- 490520 Aug 25 2018 01:48:39 +01:00 Cab-NLabsAC_4_RP_0_trace_archive_3-20180825-004832.tar.gz
14 -rw- 546696 Aug 25 2018 01:51:03 +01:00 Cab-NLabsAC_4_RP_0_trace_archive_4-20180825-005056.tar.gz
15 -rw- 557237 Aug 25 2018 01:52:05 +01:00 Cab-NLabsAC_4_RP_0_trace_archive_5-20180825-005158.tar.gz
16 -rw- 663325 Aug 25 2018 02:07:37 +01:00 Cab-NLabsAC_4_RP_0_trace_archive_0-20180825-010731.tar.gz
17 -rw- 920736 Aug 25 2018 02:25:50 +01:00 Cab-NLabsAC_4_RP_0_trace_archive_0-20180825-012543.tar.gz
18 -rw- 1212323 Aug 25 2018 02:48:58 +01:00 Cab-NLabsAC_4_RP_0_trace_archive_0-20180825-014851.tar.gz
19 -rw- 1239533 Aug 25 2018 03:27:35 +01:00 Cab-NLabsAC_4_RP_0_trace_archive_0-20180825-022728.tar.gz

1651507200 bytes total (1556086784 bytes free)
Cab-NLabsAC#
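With five members in the stack, picking out the recent system-report archive from these listings can be scripted. This is a minimal Python sketch under stated assumptions: the 'dir crashinfo-N:' output has been saved as text, the file name is the last field on each line, and the function name system_reports is hypothetical.

```python
def system_reports(dir_output):
    """Return the names of system-report archives in a 'dir crashinfo-N:' listing.

    Assumes the file name is the last whitespace-separated field on each line.
    """
    names = []
    for line in dir_output.splitlines():
        fields = line.split()
        if fields and fields[-1].startswith("system-report_"):
            names.append(fields[-1])
    return names

# Excerpts from the two listings posted above.
CRASHINFO_2 = """\
21 -rw- 1633446 Aug 25 2018 03:27:37 +01:00 Cab-NLabsAC_2_RP_0_trace_archive_0-20180825-022729.tar.gz
22 -rw- 1380610 Apr 28 2020 09:44:00 +01:00 system-report_2_20200428-094359-BST.tar.gz
"""
CRASHINFO_4 = """\
18 -rw- 1212323 Aug 25 2018 02:48:58 +01:00 Cab-NLabsAC_4_RP_0_trace_archive_0-20180825-014851.tar.gz
19 -rw- 1239533 Aug 25 2018 03:27:35 +01:00 Cab-NLabsAC_4_RP_0_trace_archive_0-20180825-022728.tar.gz
"""

print(system_reports(CRASHINFO_2))
print(system_reports(CRASHINFO_4))
```

For the listings above, this finds one archive on switch 2 and none on switch 4.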

Leo Laohoo · ‎04-29-2020

22 -rw- 1380610 Apr 28 2020 09:44:00 +01:00 system-report_2_20200428-094359-BST.tar.gz

Switch 2 showed signs of crash. Switch 4 didn't.

Is there a way for you to post the above file?

Kindly post the output to this command:

remote command 4 sh log on up detail

The command above will give us a hint what caused the reboot of switch 4.

pdmarshall · ‎04-29-2020

Thanks for your reply

Can you please tell me how I would get to the file? I am unable to see it in dir or dir flash:, which is where I thought it would be, and I am also unable to see anything for system report when using ?.

system-report_2_20200428-094359-BST.tar.gz

As for the command below, I am unable to get it to work, and there is nothing under show or at privileged level either. I only get redundancy or renew as options.

remote command 4 sh log on up detail

Sorry about this.

Thanks

Phil

Leo Laohoo · ‎04-29-2020

If there is a TFTP server, try this:

copy crashinfo-2:system-report_2_20200428-094359-BST.tar.gz tftp://<TFTP IP ADDRESS>/system-report_2_20200428-094359-BST.tar.gz

Can you try this command:

sh log on switch 4 up detail

pdmarshall · ‎04-29-2020

Leo

I will try the tftp one tomorrow, but the show log is below.

Many thanks

Phil

Cab-NLabsAC#sh log on switch 4 up detail
--------------------------------------------------------------------------------
UPTIME SUMMARY INFORMATION
--------------------------------------------------------------------------------
First customer power on : 06/11/2018 13:37:09
Total uptime : 1 years 35 weeks 3 days 14 hours 55 minutes
Total downtime : 0 years 10 weeks 4 days 9 hours 22 minutes
Number of resets : 17
Number of slot changes : 1
Current reset reason : EHSA standby down
Current reset timestamp : 04/28/2020 08:49:59
Current slot : 4
Chassis type : 0
Current uptime : 0 years 0 weeks 1 days 5 hours 5 minutes
--------------------------------------------------------------------------------

--------------------------------------------------------------------------------
UPTIME CONTINUOUS INFORMATION
--------------------------------------------------------------------------------
Time Stamp | Reset | Uptime
MM/DD/YYYY HH:MM:SS | Reason | years weeks days hours minutes
--------------------------------------------------------------------------------
06/11/2018 13:37:09 Reload 0 0 0 0 0
06/11/2018 13:50:08 Reload 0 0 0 0 5
06/20/2018 09:05:51 Reload 0 0 0 0 5
06/20/2018 09:13:57 Reload 0 0 0 0 5
06/20/2018 09:25:31 Reload 0 0 0 0 5
08/24/2018 22:24:39 Reload 0 0 0 0 0
08/24/2018 22:29:13 Reload 0 0 0 0 0
08/24/2018 22:36:05 Reload 0 0 0 0 0
08/25/2018 00:58:14 Reload 0 0 0 2 5
08/25/2018 01:09:59 Reload 0 0 0 0 5
08/25/2018 01:30:51 Reload 0 0 0 0 5
08/25/2018 01:39:13 Reload 0 0 0 0 0
08/25/2018 01:51:19 Reload 0 0 0 0 5
08/25/2018 02:13:53 Reload 0 0 0 0 5
08/25/2018 03:17:58 Reload 0 0 0 0 5
02/02/2020 09:24:46 Reload 1 23 0 5 5
02/02/2020 10:05:52 Reload 0 0 0 0 5
04/28/2020 08:49:59 EHSA standby down 0 12 1 22 5
--------------------------------------------------------------------------------

Cab-NLabsAC#
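In a long history like this, the one entry that is not a plain "Reload" is easy to miss. The following is a minimal Python sketch for filtering it out, assuming the UPTIME CONTINUOUS INFORMATION rows have been saved as text in the layout shown above; the names parse_resets and unexpected_resets are hypothetical.

```python
import re
from datetime import datetime

# One history row: MM/DD/YYYY HH:MM:SS, reset reason, then five uptime counters
# (years, weeks, days, hours, minutes).
ROW = re.compile(
    r"^(\d{2}/\d{2}/\d{4} \d{2}:\d{2}:\d{2})\s+(.+?)"
    r"\s+\d+\s+\d+\s+\d+\s+\d+\s+\d+\s*$"
)

def parse_resets(text):
    """Return (timestamp, reason) tuples for every history row in the text."""
    resets = []
    for line in text.splitlines():
        m = ROW.match(line.strip())
        if m:
            when = datetime.strptime(m.group(1), "%m/%d/%Y %H:%M:%S")
            resets.append((when, m.group(2)))
    return resets

def unexpected_resets(resets):
    """Keep only the rows whose reason is something other than 'Reload'."""
    return [(when, reason) for when, reason in resets if reason != "Reload"]

# The last three rows from the output above.
SAMPLE = """\
02/02/2020 09:24:46 Reload 1 23 0 5 5
02/02/2020 10:05:52 Reload 0 0 0 0 5
04/28/2020 08:49:59 EHSA standby down 0 12 1 22 5
"""

for when, reason in unexpected_resets(parse_resets(SAMPLE)):
    print(when.isoformat(), reason)
```

On the rows above, the only entry reported is the 04/28/2020 reset with reason "EHSA standby down".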

pdmarshall · ‎04-30-2020

Leo

Please find the system report attached.

Many thanks

Phil

Leo Laohoo · ‎04-30-2020

ReloadReason=Configuration mismatch
RET_2_RCALTS=1588063435
RET_2_RTS=09:43:55 BST Tue Apr 28 2020

So one of the files spat out this (and only this).

Apr 28 09:43:38 Cab-NLabsAC_2_RP_0 xinetd[12517]: execve /usr/binos/conf/in.telnetd.sh
Apr 28 09:43:55 Cab-NLabsAC_2_RP_0 xinetd[12986]: execve /usr/bin/rsync
Apr 28 09:43:55 Cab-NLabsAC_2_RP_0 stack_mgr[13890]: %STACKMGR-1-RELOAD: Reloading due to reason Configuration mismatch
Apr 28 09:43:56 Cab-NLabsAC_2_RP_0 kernel: LSMPI: Deregister dual stack diverter
Apr 28 09:43:57 Cab-NLabsAC_2_RP_0 pvp[14735]: %PMAN-5-EXITACTION: Process manager is exiting: reload fp action requested
Apr 28 09:43:59 Cab-NLabsAC_2_RP_0 pvp[14796]: %PMAN-5-EXITACTION: Process manager is exiting: rp processes exit with reload switch code
Apr 28 09:43:59 Cab-NLabsAC_2_RP_0 systemd[1]: agetty-iosd.service: Main process exited, code=killed, status=9/KILL
Apr 28 09:43:59 Cab-NLabsAC_2_RP_0 systemd[1]: agetty-iosd.service: Unit entered failed state.
Apr 28 09:43:59 Cab-NLabsAC_2_RP_0 systemd[1]: agetty-iosd.service: Failed with result 'signal'.
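The RET_2_RCALTS value quoted earlier looks like a Unix epoch timestamp. Assuming it is, it can be cross-checked against RET_2_RTS and the syslog lines. This is a minimal Python sketch; the function name epoch_to_bst is hypothetical, and BST is modelled as a fixed UTC+1 offset.

```python
from datetime import datetime, timedelta, timezone

# Assumption: BST is treated as a fixed UTC+1 offset for this check.
BST = timezone(timedelta(hours=1), "BST")

def epoch_to_bst(epoch_seconds):
    """Convert a Unix epoch value to a timezone-aware datetime in BST."""
    return datetime.fromtimestamp(epoch_seconds, tz=BST)

reload_time = epoch_to_bst(1588063435)
print(reload_time.isoformat())
```

The result is 09:43:55 on 28 April 2020 (UTC+1), which agrees with both RET_2_RTS and the %STACKMGR-1-RELOAD syslog entry.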

Another one is this.

04/28/2020 08:49:59 EHSA standby down 0 12 1 22 5

The above was taken from switch 4.

I think you've hit two (2) bugs.

The first bug talks about "configuration mismatch". This usually happens when a stack merge or "split brain" occurs. I cannot find the cause of the split brain.

The second bug is what happened to switch 4: CSCvi15897

You will need to raise this with TAC and get them to identify the first and/or confirm the second bug.

Question: Before the crash, did someone telnet into the switch and leave the telnet session running (until the active/master switch crashed)?

pdmarshall · ‎04-30-2020

Leo

Thanks for your help, and I don't believe anyone had left a session open.

Unfortunately I can't raise a TAC case, as these switches are not under support and I am not sure they will be.

Thanks

Phil

Leo Laohoo · ‎04-30-2020

I've used 16.9.4 but I've never had the opportunity to try out 16.9.5.
I'm currently testing out 16.12.3 and so far it's been OK.