
Specifying a file path for config archive causes stack reboot.

no_prop4500
Level 1

When we specify a file path for our config archive, the standby and member switches reboot. The active switch stays up, but the other switches in the stack reboot as soon as the command is entered. If the command is not removed before they complete their reboot sequence, they boot into ROMmon.

Is this expected behavior, a config issue, or a bug? Any help is greatly appreciated. Thanks!

 

Hardware and versions of the tested stacks:

Bootloader: 16.12.2r

IOS: 16.12.2 and 16.9.4

9300-48A-UXM

9300-48A-P

 

Here are the commands entered:

(config)# archive

(config-archive)# path flash:/Netops/Rollback/
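For reference, a fuller archive configuration typically adds a rollover limit and a trigger; a minimal sketch is below. Only the two commands above were entered when the reboots occurred, and the $h hostname variable, maximum, and time-period values here are illustrative, not our exact config.

(config)# archive
(config-archive)# path flash:/Netops/Rollback/$h-config
(config-archive)# maximum 10
! archive every 1440 minutes (daily) and on every write memory
(config-archive)# time-period 1440
(config-archive)# write-memory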

 

Prior to entering the commands:

                                            H/W      Current
Switch#  Role     Mac Address      Priority Version  State
-------------------------------------------------------------------------------------
*1       Active   XXXX.XXXX.XXXX   15       V02      Ready
 2       Standby  XXXX.XXXX.XXXX    5       V02      Ready
 3       Member   XXXX.XXXX.XXXX    1       V02      Ready

 

Immediately After:

Labs-1FL-SS#sh switch
Switch/Stack Mac Address : XXXX.XXXX.XXXX - Local Mac Address
Mac persistency wait time: Indefinite
                                            H/W      Current
Switch#  Role     Mac Address      Priority Version  State
-------------------------------------------------------------------------------------
*1       Active   XXXX.XXXX.XXXX   15       V02      Ready
 2       Member   0000.0000.0000    0       V02      Removed
 3       Member   0000.0000.0000    0       V02      Removed

 


10 Replies

Mark Malone
VIP Alumni
Hi,
Is there any log before it reboots?

Neither release note currently identifies a bug ID that matches, so it may not be a known issue. You will need to go to TAC, or else try the Cisco CLI Analyzer to see if that gives you a bug ID, or test another image: move off Gibraltar and Fuji and try Amsterdam.

https://www.cisco.com/c/en/us/td/docs/switches/lan/catalyst9300/software/release/16-12/release_notes/ol-16-12-9300.html

https://www.cisco.com/c/en/us/td/docs/switches/lan/catalyst9300/software/release/16-9/release_notes/ol-16-9-9300.html

Also, is that a typo in how you're running the commands? It should be path flash: (just to be sure it's not that).

Thanks. Yes, just a typo in the post here.

 

Yeah, I could not find any bugs either. Here is the log output.

 

001288: .Jan 9 21:03:04.818: Config Sync: Bulk-sync failure due to PRC mismatch. Please check the full list of PRC failures via:
show redundancy config-sync failures prc

001289: .Jan 9 21:03:04.818: Config Sync: Starting lines from PRC file:
archive
! <submode> "archive"
- path flash:/NetOps/Rollback
! </submode> "archive"

001290: .Jan 9 21:03:04.818: Config Sync: Bulk-sync failure, Reloading Standby

001291: .Jan 9 21:03:05.825: %RF-5-RF_TERMINAL_STATE: Terminal state reached for (SSO)
001292: .Jan 9 21:03:06.274: %RF-5-RF_RELOAD: Peer reload. Reason: Bulk Sync Failure
001293: .Jan 9 21:03:06.644: %HMANRP-5-CHASSIS_DOWN_EVENT: Chassis 3 gone DOWN!
001294: .Jan 9 21:03:06.657: %REDUNDANCY-3-STANDBY_LOST: Standby processor fault (PEER_NOT_PRESENT)
001295: .Jan 9 21:03:06.657: %REDUNDANCY-3-STANDBY_LOST: Standby processor fault (PEER_DOWN)
001296: .Jan 9 21:03:06.657: %REDUNDANCY-3-STANDBY_LOST: Standby processor fault (PEER_REDUNDANCY_STATE_CHANGE)
001297: .Jan 9 21:03:06.578: %STACKMGR-6-STACK_LINK_CHANGE: Switch 1 R0/0: stack_mgr: Stack port 2 on Switch 1 is down
001298: .Jan 9 21:03:06.620: %STACKMGR-6-STACK_LINK_CHANGE: Switch 2 R0/0: stack_mgr: Stack port 1 on Switch 2 is down
001299: .Jan 9 21:03:08.149: %RF-5-RF_RELOAD: Peer reload. Reason: EHSA standby down
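The log itself points at the command to run on the active switch for the full list of sync failures; shown here for reference, it should list the same archive submode lines as the PRC file above:

Labs-1FL-SS# show redundancy config-sync failures prc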

My next step would be to collect the show tech and run it yourself through the Cisco CLI Analyzer, available in the Tools section of the website, to see if it can pinpoint an issue; if not, open a TAC case.
Also check that each switch's flash has not generated a crash file. Unfortunately, the log output only shows a standard down log, not the trigger that may have caused it.
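For example, each stack member exposes its own crashinfo filesystem on these platforms, so something along these lines should show any crash files (switch numbers are per your stack; adjust as needed):

dir crashinfo:
dir crashinfo-2:
dir crashinfo-3: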

vortex
Level 1

Hi,

 

I just had this issue.

 

What I did, and it seems to have fixed the issue, was to create the archive folder on each switch's flash.

If I created it only on flash:, it would reboot the other members, since a stack member can't save to a local directory that doesn't exist.

 

Try creating the directories on the other switches.

Example:

mkdir flash-1:Netops/Rollback/

mkdir flash-2:Netops/Rollback/

mkdir flash-3:Netops/Rollback/
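You can then verify that each member sees its directory before configuring the archive path (a quick sanity check; adjust the switch numbers to your stack):

dir flash-1:Netops/
dir flash-2:Netops/
dir flash-3:Netops/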

 

Accepted Solution

Hmm, that's a good workaround, but it would suggest something is definitely up with the software version. We have a lot of C9300 stacks running 16.6.6 and have not seen this issue, with archive and kron running on all of them. With any bug that reboots a switch, you should move off that version if possible.

This problem appeared after I implemented config archiving on Friday on about 20 switches. Five of those were stacked switches (sets of 2-4). Of those five, four stacks were on IOS 16.12.4, two of which had this issue (one stack of 2x 9300s and one of 2x 3850s); the fifth was a 3750 stack of four running 12.2(44)SE5, and it had no issues. After adding the archive folder to the other switches in each stack, I have not had any stacks reload the standby switches into ROMmon anymore.

mp1979
Level 1

This issue is still present on C9300, C9200, C3850, and C3650. It looks code dependent, as it hit almost all of our C9300s running 17.3.4 and 17.6.5, but not all C9200s, and only some C3850s and C3650s, which are running multiple legacy code versions.

The workaround is to create the directory on each stack switch's flash, or not to use a folder in the archive path configuration at all (just store to the root). Still, what would happen if you connect a new switch to a stack with archive already configured? Since the folder does not exist on the new unit, it would most likely end up in ROMmon anyway.
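For example, a root-of-flash path like this sketch avoids the missing-directory condition entirely (the archive-$h filename prefix here is illustrative, not a recommendation):

(config)# archive
! no subdirectory: files land in the root of each member's flash
(config-archive)# path flash:archive-$h-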

I can't believe such a catastrophic issue has gone unaddressed for years: you can literally break your whole network with management-plane-only configuration, and since the only remediation is manually booting from ROMmon via console, recovery can take a very long time.

gnijs
Level 4

Hello

Just got hit by this bug also! On 9200L-48P-4X, version 17.9.4.

Luckily I saw it on the first stack we implemented. I rolled the config back with "no archive" and the stack recovered (luckily it didn't boot into ROMmon, so it recovered automatically after some reboots).
But I have the same remark as above: what happens when config archive is deployed in a directory and you add a switch to the stack later on? It won't be stable until you add the directory manually. And because I feel implementing config archive in the root is messy, I am going to stop using archive. Too bad.
