Nexus 7706 - Supervisor bootup issue (N77-SUP2E)

Quintin.Mayo · ‎02-13-2020

Hi,

We were in the process of upgrading to version 8.4.1 and ran into an issue. We initiated the upgrade on the core. During the software and system integrity check, a problem was detected: the system informed me that multiple major in-service software upgrades (ISSU) had been performed and that a binary reload was recommended. I performed the reload, which took approximately 15 minutes to complete. Once it was done, the primary supervisor module did not come back online. The secondary supervisor module did come up at this time, so we were up and operational. I then made my way over to the DC and found the primary supervisor module in alarm. I performed a manual restart of the primary supervisor module, and it took approximately 21 minutes to fully reload and come back online. I would like to know why the supervisor module failed to initialize the first time. I found the information below in the syslog. Any assistance would be greatly appreciated.

Syslog Messages

2020 Feb 12 06:01:22 M_VDC %SYSMGR-2-STANDBY_BOOT_FAILED: Standby supervisor failed to boot up.
2020 Feb 12 06:03:53 M_VDC %IM-5-IM_INTF_STATE: mgmt0 is UP in vdc 1
2020 Feb 12 06:11:02 M_VDC %USBHSD-STANDBY-2-MOUNT: slot0: online
2020 Feb 12 06:11:02 M_VDC %USBHSD-STANDBY-2-MOUNT: logflash: online
2020 Feb 12 06:05:48 M_VDC %BOOTVAR-5-NEIGHBOR_UPDATE_AUTOCOPY: auto-copy supported by neighbor supervisor, starting...
2020 Feb 12 06:10:01 M_VDC %BOOTVAR-5-AUTOCOPY_FAILED: Autocopy of file /bootflash/n7700-s2-kickstart.8.3.2.bin to standby failed. Read-only file system (Error-id: 0x807b001e)
2020 Feb 12 06:10:03 M_VDC %BOOTVAR-5-AUTOCOPY_FAILED: Autocopy of file /bootflash/n7700-s2-dk9.8.3.2.bin to standby failed. Read-only file system (Error-id: 0x807b001e)
2020 Feb 12 06:37:35 M_VDC %PLATFORM-2-PFM_MODULE_RESET: Manual restart of Module 3 from Command Line Interface
2020 Feb 12 06:37:38 M_VDC %PLATFORM-2-MOD_REMOVE: Module 3 removed (Serial number )
2020 Feb 12 06:48:05 M_VDC %SYSMGR-2-STANDBY_BOOT_FAILED: Standby supervisor failed to boot up.
2020 Feb 12 06:54:58 M_VDC %USBHSD-STANDBY-2-MOUNT: slot0: online
2020 Feb 12 06:54:58 M_VDC %USBHSD-STANDBY-2-MOUNT: logflash: online
2020 Feb 12 06:55:04 M_VDC %BOOTVAR-5-NEIGHBOR_UPDATE_AUTOCOPY: auto-copy supported by neighbor supervisor, starting...
2020 Feb 12 06:56:20 M_VDC %PLATFORM-1-PFM_ALERT: Disabling ejector based shutdown on sup in slot 3
2020 Feb 12 06:58:21 M_VDC %MODULE-5-STANDBY_SUP_OK: Supervisor 3 is standby
2020 Feb 12 06:58:22 M_VDC %PLATFORM-1-PFM_ALERT: Enabling ejector based shutdown on sup in slot 4

Thanks

Mark Malone · ‎02-13-2020

Hi
If it has really crashed, you are going to need TAC on a Nexus 7706. It should have generated a core file, as in the example below. Unfortunately, the log snippet above does not show how the failure started or what caused it.

From there you can collect the file and pass it to TAC, who have a system that can read it and pinpoint whether a hardware or a software fault caused the crash. These tools are not available to the public.

When a process has an unexpected restart or failure, Cisco NX-OS saves a core file that contains details about the event. The content of a core file is useful for Cisco TAC engineers and software developers to diagnose the process failure. The core files should be copied and attached to the TAC case. The following commands determine whether there are any core files and copy them to a remote destination. This example uses SCP, but other transport protocols such as SFTP, FTP or TFTP can be used.

n7000# show cores
VDC No Module-num Process-name PID Core-create-time
------ ---------- ------------ --- ----------------
1      8          acltcam      285 Oct 27 09:32

n7000# copy core://8/285 scp://username@x.x.x.x/acltcam-core
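
If there are several core files, the copy commands can be generated from saved `show cores` output instead of being typed by hand. The following is only a rough Python sketch under stated assumptions: it assumes the five-column layout of the example above (VDC, module, process name, PID, creation time), and the helper names `parse_show_cores` and `copy_commands` are made up for illustration. Newer releases print an extra Instance column, as in the output posted later in this thread, so the pattern would need adjusting there.

```python
import re

# Sample text taken from the example above (5-column layout is an assumption).
SAMPLE = """\
VDC No Module-num Process-name PID Core-create-time
------ ---------- ------------ --- ----------------
1 8 acltcam 285 Oct 27 09:32
"""

# vdc, module, process name, pid, then the rest of the line as creation time.
ROW = re.compile(r"^\s*(\d+)\s+(\d+)\s+(\S+)\s+(\d+)\s+(.+?)\s*$")


def parse_show_cores(text):
    """Return one dict per core-file row; header and separator lines are skipped."""
    rows = []
    for line in text.splitlines():
        m = ROW.match(line)
        if m:
            vdc, module, process, pid, created = m.groups()
            rows.append({
                "vdc": int(vdc),
                "module": int(module),
                "process": process,
                "pid": int(pid),
                "created": created,
            })
    return rows


def copy_commands(rows, dest="scp://username@x.x.x.x"):
    """Build 'copy core://<module>/<pid> ...' lines in the style shown above."""
    return [
        f"copy core://{r['module']}/{r['pid']} {dest}/{r['process']}-core"
        for r in rows
    ]


if __name__ == "__main__":
    for cmd in copy_commands(parse_show_cores(SAMPLE)):
        print(cmd)
```

An empty table (header and separator only) simply yields no commands, which is the "no core files" case.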

If you don't have support, you could check the last reset reason in show version; it may show something. If there are more syslog files, please post them and I can take a look.

Last reset at 366317 usecs after Thu Jun 20 12:26:03 2019
Reason: Reset Requested by CLI command reload
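
When checking this across several captures, the "Last reset" block can be pulled out of saved show version text with a few lines of Python. This is a minimal sketch, not a supported tool: the function name `last_reset` is hypothetical, and it assumes the block looks like the two lines above, or like the variant without a timestamp where "Last reset" sits alone on its line and "Reason:" follows.

```python
import re


def last_reset(show_version_text):
    """Return {'when': ..., 'reason': ...} from a 'Last reset' block, or None."""
    lines = show_version_text.splitlines()
    for i, line in enumerate(lines):
        stripped = line.strip()
        if not stripped.startswith("Last reset"):
            continue
        info = {"when": None, "reason": None}
        # Timestamp is only present in the "Last reset at N usecs after <date>" form.
        m = re.search(r"usecs after (.+)$", stripped)
        if m:
            info["when"] = m.group(1)
        # Look at the next few lines for the "Reason:" field.
        for follow in lines[i + 1:i + 5]:
            key, sep, value = follow.partition(":")
            if sep and key.strip() == "Reason":
                info["reason"] = value.strip()
                break
        return info
    return None


if __name__ == "__main__":
    sample = (
        "Last reset at 366317 usecs after Thu Jun 20 12:26:03 2019\n"
        "Reason: Reset Requested by CLI command reload\n"
    )
    print(last_reset(sample))
```

A reason of "Unknown" or a missing timestamp is returned as-is, so it is easy to spot the captures where the platform did not record why the supervisor reset.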

Quintin.Mayo · ‎02-13-2020

M_VDC# show cores
VDC Module Instance Process-name PID Date(Year-Month-Day Time)
--- ------ -------- --------------- -------- -------------------------
M_VDC#
M_VDC# show ver
Cisco Nexus Operating System (NX-OS) Software
TAC support: http://www.cisco.com/tac
Documents: http://www.cisco.com/en/US/products/ps9372/tsd_products_support_series_home.html
Copyright (c) 2002-2018, Cisco Systems, Inc. All rights reserved.
The copyrights to certain works contained in this software are
owned by other third parties and used and distributed under
license. Certain components of this software are licensed under
the GNU General Public License (GPL) version 2.0 or the GNU
Lesser General Public License (LGPL) Version 2.1. A copy of each
such license is available at
http://www.opensource.org/licenses/gpl-2.0.php and
http://www.opensource.org/licenses/lgpl-2.1.php

Software
BIOS: version 3.2.0
kickstart: version 8.3(2)
system: version 8.3(2)
BIOS compile time: 09/27/2018
kickstart image file is: bootflash:///n7700-s2-kickstart.8.3.2.bin
kickstart compile time: 11/30/2018 12:00:00 [12/14/2018 16:34:58]
system image file is: bootflash:///n7700-s2-dk9.8.3.2.bin
system compile time: 11/30/2018 12:00:00 [12/14/2018 18:03:21]

Hardware
cisco Nexus7700 C7706 (6 Slot) Chassis ("Supervisor Module-2")
Intel(R) Xeon(R) CPU with 32939272 kB of memory.
Processor Board ID JAE20220408

Device name: M_VDC
bootflash: 4014080 kB
slot0: 0 kB (expansion flash)

Kernel uptime is 1 day(s), 4 hour(s), 38 minute(s), 1 second(s)

Last reset
Reason: Unknown
System version: 8.3(2)
Service:

plugin
Core Plugin, Ethernet Plugin

Active Package(s)
M_VDC#

Quintin.Mayo · ‎02-13-2020

Hi,

There was nothing in show cores. I have attached the output of that command and of show version. When the issue happened, I performed a reload of the box, and the primary sup did not come back online. It failed over to the secondary, and I then had to do a manual reload of the primary sup. I also received the message “Multiple Major ISSU have been performed on this switch. We recommend doing a binary reload instead of upgrading”, followed by the prompt “Do you want to continue with the installation (y/n)? [n]”.

Mark Malone · ‎02-13-2020

Hi

Do you have support? I would really open a TAC case for this if possible. You could also try the command show system reset-reason.

There is also a free tool on the Cisco website called the CLI Analyzer. You can feed it a show tech from a device and it checks for problems or recent issues; I think the latest version supports NX-OS. If the whole system was rebooted you may not find the cause, as the reboot could have cleared the error, but it is worth a shot if you don't have TAC support.

marce1000 · ‎02-13-2020

- Could you also execute dir bootflash: on both supervisors, in order to check the integrity of the bootflash?

M.
