02-13-2020 07:23 AM
Hi,
We were in the process of upgrading to version 8.4.1 and ran into an issue. We initiated the upgrade on the core. During the software and system integrity check a problem was detected: the system informed me that multiple major in-service software upgrades (ISSU) had been performed and that a binary reload was recommended. I performed the reload, which took approximately 15 minutes to fully execute. Once it finished, the primary supervisor module did not come back online. The secondary supervisor module did come up, so we were up and operational. I then made my way over to the DC and found the primary supervisor module in alarm, so I performed a manual restart of it. It took approximately 21 minutes for the primary supervisor module to fully reload and come back online. I would like to know why the supervisor module failed to initialize the first time. I found the information below in the syslog. Any assistance would be greatly appreciated.
Syslog Messages
2020 Feb 12 06:01:22 M_VDC %SYSMGR-2-STANDBY_BOOT_FAILED: Standby supervisor failed to boot up.
2020 Feb 12 06:03:53 M_VDC %IM-5-IM_INTF_STATE: mgmt0 is UP in vdc 1
2020 Feb 12 06:11:02 M_VDC %USBHSD-STANDBY-2-MOUNT: slot0: online
2020 Feb 12 06:11:02 M_VDC %USBHSD-STANDBY-2-MOUNT: logflash: online
2020 Feb 12 06:05:48 M_VDC %BOOTVAR-5-NEIGHBOR_UPDATE_AUTOCOPY: auto-copy supported by neighbor supervisor, starting...
2020 Feb 12 06:10:01 M_VDC %BOOTVAR-5-AUTOCOPY_FAILED: Autocopy of file /bootflash/n7700-s2-kickstart.8.3.2.bin to standby failed. Read-only file system (Error-id: 0x807b001e)
2020 Feb 12 06:10:03 M_VDC %BOOTVAR-5-AUTOCOPY_FAILED: Autocopy of file /bootflash/n7700-s2-dk9.8.3.2.bin to standby failed. Read-only file system (Error-id: 0x807b001e)
2020 Feb 12 06:37:35 M_VDC %PLATFORM-2-PFM_MODULE_RESET: Manual restart of Module 3 from Command Line Interface
2020 Feb 12 06:37:38 M_VDC %PLATFORM-2-MOD_REMOVE: Module 3 removed (Serial number )
2020 Feb 12 06:48:05 M_VDC %SYSMGR-2-STANDBY_BOOT_FAILED: Standby supervisor failed to boot up.
2020 Feb 12 06:54:58 M_VDC %USBHSD-STANDBY-2-MOUNT: slot0: online
2020 Feb 12 06:54:58 M_VDC %USBHSD-STANDBY-2-MOUNT: logflash: online
2020 Feb 12 06:55:04 M_VDC %BOOTVAR-5-NEIGHBOR_UPDATE_AUTOCOPY: auto-copy supported by neighbor supervisor, starting...
2020 Feb 12 06:56:20 M_VDC %PLATFORM-1-PFM_ALERT: Disabling ejector based shutdown on sup in slot 3
2020 Feb 12 06:58:21 M_VDC %MODULE-5-STANDBY_SUP_OK: Supervisor 3 is standby
2020 Feb 12 06:58:22 M_VDC %PLATFORM-1-PFM_ALERT: Enabling ejector based shutdown on sup in slot 4
Thanks
02-13-2020 08:21 AM - edited 02-13-2020 08:23 AM
Hi
If it has really crashed, you are going to need TAC on a Nexus 7706. It should have generated a core file, which you can list as shown below. Unfortunately, the log snippet above does not show what caused the failure.
From there you can collect the file and pass it to TAC, who have tools that can read it and pinpoint whether a hardware or software fault caused the crash. These tools are not available to the public.
When a process has an unexpected restart or failure, Cisco NX-OS saves a core file that contains details about the event. The content of a core file is useful for Cisco TAC engineers and software developers to diagnose the process failure. The core files should be copied and attached to the TAC case. The following commands determine if there are any core files and copy them to a remote destination. This example uses SCP, but other transport protocols such as SFTP, FTP or TFTP can be used.
n7000# show cores
VDC No Module-num Process-name PID Core-create-time
------ ---------- ------------ --- ----------------
1 8 acltcam 285 Oct 27 09:32
n7000# copy core://8/285 scp://username@x.x.x.x/acltcam-core
If you don't have support, you could check the last reset reason in show version; it may show something. If there are more syslog files, please post them and I can take a look.
Last reset at 366317 usecs after Thu Jun 20 12:26:03 2019
Reason: Reset Requested by CLI command reload
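If more syslog gets posted, it can help to put the messages in strict time order first, since the snippet earlier in this thread lists the 06:11:02 USBHSD messages before the 06:05:48 BOOTVAR message. Below is a minimal Python sketch for doing that offline. The function name parse_syslog, the SAMPLE list and the regular expression are my own assumptions for illustration, not a Cisco tool, and the pattern is based only on the line format seen in this thread.

```python
import re
from datetime import datetime

# Assumed line layout, taken from the messages posted in this thread:
#   2020 Feb 12 06:01:22 M_VDC %SYSMGR-2-STANDBY_BOOT_FAILED: Standby supervisor failed to boot up.
LINE_RE = re.compile(
    r"^(?P<ts>\d{4} [A-Za-z]{3} +\d{1,2} \d{2}:\d{2}:\d{2}) "
    r"(?P<host>\S+) "
    r"%(?P<facility>[A-Z0-9_-]+?)-(?P<severity>\d)-(?P<mnemonic>[A-Z0-9_]+): "
    r"(?P<text>.*)$"
)


def parse_syslog(lines):
    """Parse syslog lines and return them as dicts sorted by timestamp.

    Lines that do not match the assumed layout are skipped.
    """
    events = []
    for line in lines:
        m = LINE_RE.match(line.strip())
        if not m:
            continue
        events.append({
            "time": datetime.strptime(m["ts"], "%Y %b %d %H:%M:%S"),
            "host": m["host"],
            "facility": m["facility"],
            "severity": int(m["severity"]),
            "mnemonic": m["mnemonic"],
            "text": m["text"],
        })
    # sorted() is stable, so messages with the same timestamp keep their order.
    return sorted(events, key=lambda e: e["time"])


# A few lines copied from the original post, in the order they were posted.
SAMPLE = [
    "2020 Feb 12 06:01:22 M_VDC %SYSMGR-2-STANDBY_BOOT_FAILED: Standby supervisor failed to boot up.",
    "2020 Feb 12 06:11:02 M_VDC %USBHSD-STANDBY-2-MOUNT: slot0: online",
    "2020 Feb 12 06:05:48 M_VDC %BOOTVAR-5-NEIGHBOR_UPDATE_AUTOCOPY: auto-copy supported by neighbor supervisor, starting...",
    "2020 Feb 12 06:10:01 M_VDC %BOOTVAR-5-AUTOCOPY_FAILED: Autocopy of file /bootflash/n7700-s2-kickstart.8.3.2.bin to standby failed. Read-only file system (Error-id: 0x807b001e)",
]

if __name__ == "__main__":
    for event in parse_syslog(SAMPLE):
        print(event["time"], event["severity"], event["facility"],
              event["mnemonic"], event["text"])
```

Sorting this way only makes the sequence easier to read. It does not explain the failure, so the core files and reset reason are still what TAC would want.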
02-13-2020 11:36 AM
02-13-2020 11:39 AM
Hi,
There was nothing in show cores. I have attached a copy of that command's output and of show version. When the issue happened, I performed a reload of the box, and the primary sup did not come back online. It failed over to the secondary, and I then had to do a manual reload of the primary sup. I also received the message “Multiple Major ISSU have been performed on this switch. We recommend doing a binary reload instead of upgrading” followed by the prompt: Do you want to continue with the installation (y/n)? [n]
02-13-2020 11:58 AM
Hi
Do you have support? I would really open a TAC case for this if possible. You could also try the command show system reset-reason.
There is also a free tool on the Cisco website called the CLI Analyzer. You can feed it a show tech from a device and it can check for problems or recent issues. I think the latest version supports NX-OS. If the whole system was rebooted, you may not find the cause, as the reboot could have cleared the error, but it is worth a shot if you don't have TAC support.
02-13-2020 10:15 AM
- Could you also execute dir bootflash: in order to check the integrity of the bootflash on both supervisors ?
M.