Nexus 7700 module reloading due to EOBC failure


Introduction

Cisco Nexus 7700 Switches deliver the highest-capacity 10, 40, and 100 Gigabit Ethernet ports in the industry, with up to 768 native 10-Gbps ports, 384 40-Gbps ports, or 192 100-Gbps ports. All Cisco Nexus 7000 Series chassis use a passive mid-plane architecture that provides physical connectors and copper traces for interconnecting the fabric modules and I/O modules for direct data transfer. All inter-module switching is performed by crossbar fabric ASICs on the individual I/O modules and fabric modules. The high port density of the Cisco Nexus 7700 Switches, combined with high-performance device virtualization and comprehensive reliability and availability features, makes consolidation of multiple switches possible.


Problem

One of the modules on a switch running NX-OS version 6.2(6a) reboots with the following messages:

2015 Mar 22 14:28:55 N7K-1 %MODULE-2-MOD_NOT_ALIVE: Module 1 not responding... resetting (Serial number: xxxx)
2015 Mar 22 14:29:13 N7K-1 %PLATFORM-2-MOD_DETECT: Module 1 detected (Serial number xxxx) Module-Type 1/10 Gbps Ethernet Module Model N77-F348XP-23
2015 Mar 22 14:29:13 N7K-1 %PLATFORM-2-MOD_PWRUP: Module 1 powered up (Serial number xxxx)
2015 Mar 22 14:29:13 N7K-1 %PLATFORM-5-MOD_STATUS: Module 1 current-status is MOD_STATUS_POWERED_UP
2015 Mar 22 14:30:50 N7K-1 %BIOS_DAEMON-SLOT1-5-BIOS_DAEMON_LC_PRI_BOOT:  System booted from Primary BIOS Flash
2015 Mar 22 14:32:33 N7K-1 %PLATFORM-5-MOD_STATUS: Module 1 current-status is MOD_STATUS_ONLINE/OK
2015 Mar 22 14:32:33 N7K-1 %MODULE-5-MOD_OK: Module 1 is online (Serial number: xxxx)
2015 Mar 22 14:32:33 N7K-1 %SYSMGR-SLOT1-5-MODULE_ONLINE: System Manager has received notification of local module becoming online.
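To quantify how long the module was out of service, the syslog timestamps can be compared. The following is a minimal Python sketch, assuming the syslog lines have been saved as text; the function names `parse_time` and `outage_seconds` are illustrative and not part of any Cisco tool.

```python
from datetime import datetime

# Two of the syslog lines shown above, copied verbatim.
SAMPLE = """\
2015 Mar 22 14:28:55 N7K-1 %MODULE-2-MOD_NOT_ALIVE: Module 1 not responding... resetting (Serial number: xxxx)
2015 Mar 22 14:32:33 N7K-1 %MODULE-5-MOD_OK: Module 1 is online (Serial number: xxxx)
"""


def parse_time(line):
    """Parse the leading 'YYYY Mon DD HH:MM:SS' timestamp of a syslog line."""
    stamp = " ".join(line.split()[:4])
    return datetime.strptime(stamp, "%Y %b %d %H:%M:%S")


def outage_seconds(log_text):
    """Seconds between the first MOD_NOT_ALIVE and the next MOD_OK message."""
    down = up = None
    for line in log_text.splitlines():
        if "%MODULE-2-MOD_NOT_ALIVE" in line and down is None:
            down = parse_time(line)
        elif "%MODULE-5-MOD_OK" in line and down is not None:
            up = parse_time(line)
            break
    if down is None or up is None:
        return None  # no complete reset/recovery pair found
    return (up - down).total_seconds()


print(outage_seconds(SAMPLE))
```

For the messages in this case, the gap between 14:28:55 and 14:32:33 is 218 seconds.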


Description

Gather the output of the following commands to understand the reason for the reload.

N7K-1# show logging onboard module 1 internal reset-reason

----------------------------
Module: 1 show clock
----------------------------
2015-04-10 10:42:44
        Last log in OBFL was written at time Fri Apr 10 09:54:19 2015

 

Reset Reason for this card:
        Image Version : 6.2(8a)
        Reset Reason (LCM): Line card not responding (60) at time Sun Mar 22 14:30:52 2015
        Reset Reason (SW): Unknown (0)
        Reset Reason (HW): System reset by active sup (by writing to PMFPGA regs) (100) at time Sun Mar 22 14:30:52 2015
        Last log in OBFL was written at time Sun Mar 22 14:22:32 2015


N7K-1# show system reset-reason module 1
*************** module reset reason (1) *************
Time stamp   : At 357350 usecs after Sun Mar 22 14:29:06 2015

Service name : Line card manager
Reset reason : Line card not responding => [Failures < MAX] : powercycle
Serial number: JAE181509B1
Error code   : NA


N7K-1# show module internal activity module 1
****************** module activity log ***************

— snip —

112) At 458025 usecs after Sun Mar 22 14:28:19 2015
    Warning on ports: 1-0 due to EOBC heartbeat failure in device 10
Device error code: 0xc0a0


113) At 457788 usecs after Sun Mar 22 14:28:19 2015
    Received MTS_OPC_LC_RUNTIME_DIAG_FROM_SUP from EOBC monitor kernel SAP

 

114) At 435569 usecs after Sun Mar 22 14:28:19 2015
    Warning on ports: 1-0 due to EOBC heartbeat failure in device 10
Device error code: 0xc0a0


115) At 435312 usecs after Sun Mar 22 14:28:19 2015
    Received MTS_OPC_LC_RUNTIME_DIAG_FROM_SUP from EOBC monitor kernel SAP

 

116) At 921833 usecs after Tue Mar 17 20:41:16 2015
    DEBUG: writing saved config (sw:153, node:258)
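The activity log is listed newest first and uses an "At N usecs after <date>" timestamp. When correlating these entries with syslog, it can help to convert them into ordinary timestamps. A minimal Python sketch follows, assuming the entries have been saved as text; `parse_stamp` is an illustrative helper name, not a Cisco utility.

```python
import re
from datetime import datetime, timedelta

# Matches e.g. "At 458025 usecs after Sun Mar 22 14:28:19 2015"
STAMP = re.compile(
    r"At (\d+) usecs after (\w{3} \w{3} +\d+ \d{2}:\d{2}:\d{2} \d{4})"
)


def parse_stamp(line):
    """Return a datetime with microsecond precision, or None if no match."""
    m = STAMP.search(line)
    if m is None:
        return None
    # Collapse repeated spaces (single-digit days may be space-padded).
    date_text = " ".join(m.group(2).split())
    base = datetime.strptime(date_text, "%a %b %d %H:%M:%S %Y")
    return base + timedelta(microseconds=int(m.group(1)))


# Entries 112 and 114 from the output above.
later = parse_stamp("112) At 458025 usecs after Sun Mar 22 14:28:19 2015")
earlier = parse_stamp("114) At 435569 usecs after Sun Mar 22 14:28:19 2015")
print(later - earlier)
```

Here the two EOBC heartbeat warnings are 22456 microseconds apart, both within the same second, roughly 36 seconds before the syslog reset message at 14:28:55.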


N7K-1# show module internal exceptionlog module 1
********* Exception info for module 1 ********

exception information --- exception instance 1 ----
Module Slot Number: 1
Device Id         : 10
Device Name       : eobc
Device Errorcode  : 0xc0a0314f
Device ID         : 10 (0x0a)
Device Instance   : 03 (0x03)
Dev Type (HW/SW)  : 01 (0x01)
ErrNum (devInfo)  : 79 (0x4f)
System Errorcode  : 0x4042004e EOBC heartbeat failure
Error Type        : Warning
PhyPortLayer      : Ethernet
Port(s) Affected  :
DSAP              : 0 (0x0)
UUID              : 0 (0x0)
Time              : Sun Mar 22 14:28:19 2015
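The exception log prints both the raw Device Errorcode (0xc0a0314f) and its decoded fields. The Python sketch below reproduces that decoding. The bit layout is an assumption inferred only from this one sample (it is not taken from Cisco documentation), and the meaning of the top four bits (0xc here) is unknown, so the `prefix` field name is hypothetical.

```python
def decode_device_errorcode(code):
    """Split a 32-bit Device Errorcode into fields.

    Assumed layout (inferred from the sample output, not documented):
      bits 31-28  unknown prefix
      bits 27-20  Device ID
      bits 19-12  Device Instance
      bits 11-8   Dev Type (HW/SW)
      bits  7-0   ErrNum (devInfo)
    """
    return {
        "prefix": (code >> 28) & 0xF,
        "device_id": (code >> 20) & 0xFF,
        "device_instance": (code >> 12) & 0xFF,
        "dev_type": (code >> 8) & 0xF,
        "err_num": code & 0xFF,
    }


fields = decode_device_errorcode(0xC0A0314F)
print(fields)
```

With this layout, the sample value decodes to Device ID 10, Device Instance 3, Dev Type 1, and ErrNum 79, which matches the fields printed by the switch. The shorter code 0xc0a0 seen in the activity log corresponds to the upper 16 bits of the same value.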


Resolution

EOBC stands for Ethernet Out-of-Band Channel. Regular keepalives, or 'heartbeats', are exchanged between the supervisor (sup) and the line cards over this channel. The error messages above indicate that heartbeats went missing between the supervisor and the line card. A single missed heartbeat is corrected automatically; however, if multiple heartbeats are lost in succession, the line card is reset (the supervisor powers the slot off and on again in an attempt to resolve the diagnostic issue).
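The heartbeat behavior described above can be illustrated with a toy model. This is only a sketch, assuming a reset is triggered after a fixed number of consecutive missed heartbeats; the function name `should_reset` and the threshold of 3 are hypothetical, and the actual NX-OS thresholds and timers are not stated in this article.

```python
def should_reset(heartbeats, max_missed):
    """Decide whether a line card would be reset in this toy model.

    heartbeats: iterable of bools, True meaning the heartbeat was answered.
    max_missed: hypothetical number of consecutive misses that triggers a reset.
    """
    missed = 0
    for answered in heartbeats:
        if answered:
            missed = 0  # a single miss recovers on the next good heartbeat
        else:
            missed += 1
            if missed >= max_missed:
                return True
    return False


# One isolated miss: no reset in this model.
print(should_reset([True, False, True, True], max_missed=3))
# Three misses in a row: reset.
print(should_reset([True, False, False, False], max_missed=3))
```

The point of the model is only that isolated misses are tolerated while a sustained loss of heartbeats leads to the power cycle seen in the logs.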

If the card remains stable after this one-time reset, the event can be considered a one-time (transient) issue.

You can also check the EOBC statistics with the following commands to see whether any error counters are increasing.

show hardware internal eobc stats
show hardware internal eobcsw stats
show hardware internal cpu-mac eobc stats
show hardware internal cpu-mac eobc events
show hardware capacity eobc
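To tell whether error counters are increasing, capture the output of these commands twice, some time apart, and compare the values. The output format of these commands is not shown in this article, so the Python sketch below assumes the counters have already been extracted into dictionaries by hand; the counter names `rx_errors`, `tx_drops`, and `rx_frames` are hypothetical placeholders, not actual NX-OS field names.

```python
def increasing_counters(before, after):
    """Return counters whose value grew between two snapshots, with the delta."""
    return {
        name: after[name] - before[name]
        for name in after
        if name in before and after[name] > before[name]
    }


# Hypothetical snapshots taken some minutes apart.
snapshot_1 = {"rx_errors": 12, "tx_drops": 0, "rx_frames": 1000}
snapshot_2 = {"rx_errors": 15, "tx_drops": 0, "rx_frames": 2500}

# Only look at the counters that represent errors or drops.
error_names = {"rx_errors", "tx_drops"}
growth = {
    name: delta
    for name, delta in increasing_counters(snapshot_1, snapshot_2).items()
    if name in error_names
}
print(growth)
```

A non-empty result between two snapshots would suggest the problem is ongoing rather than a one-time event.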


Related Information

Cisco Nexus 7700 Switches
cisco nexus 7706 Module x not responding resetting