cisco nexus 7706 Module x not responding... resetting

realmatrix
Level 1

Hello everybody,

We have two Nexus 7706 switches in our data center running NX-OS version 6.2(6a). One of them reported that module 1 was not responding and reset it.

2015 Mar 22 14:28:55 N7K-1 %MODULE-2-MOD_NOT_ALIVE: Module 1 not responding... resetting (Serial number: xxxx)
2015 Mar 22 14:29:13 N7K-1 %PLATFORM-2-MOD_DETECT: Module 1 detected (Serial number xxxx) Module-Type 1/10 Gbps Ethernet Module Model N77-F348XP-23
2015 Mar 22 14:29:13 N7K-1 %PLATFORM-2-MOD_PWRUP: Module 1 powered up (Serial number xxxx)
2015 Mar 22 14:29:13 N7K-1 %PLATFORM-5-MOD_STATUS: Module 1 current-status is MOD_STATUS_POWERED_UP
2015 Mar 22 14:30:50 N7K-1 %BIOS_DAEMON-SLOT1-5-BIOS_DAEMON_LC_PRI_BOOT:  System booted from Primary BIOS Flash
2015 Mar 22 14:32:09 N7K-1 %VDC_MGR-5-VDC_STATE_CHANGE: vdc 1 state changed to updating
2015 Mar 22 14:32:09 N7K-1 %VDC_MGR-5-VDC_STATE_CHANGE: vdc 1 state changed to active
2015 Mar 22 14:32:33 N7K-1 %PLATFORM-5-MOD_STATUS: Module 1 current-status is MOD_STATUS_ONLINE/OK
2015 Mar 22 14:32:33 N7K-1 %MODULE-5-MOD_OK: Module 1 is online (Serial number: xxxx)
2015 Mar 22 14:32:33 N7K-1 %SYSMGR-SLOT1-5-MODULE_ONLINE: System Manager has received notification of local module becoming online.
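To put a number on the outage, the time between the MOD_NOT_ALIVE and MOD_OK messages above can be computed from the syslog timestamps. This is a minimal sketch, assuming the "YYYY Mon DD HH:MM:SS" timestamp layout seen in these lines; the function names are illustrative and not part of any Cisco tool.

```python
from datetime import datetime

# Two of the syslog lines quoted above, copied verbatim.
LOG_LINES = [
    "2015 Mar 22 14:28:55 N7K-1 %MODULE-2-MOD_NOT_ALIVE: Module 1 not responding... resetting (Serial number: xxxx)",
    "2015 Mar 22 14:32:33 N7K-1 %MODULE-5-MOD_OK: Module 1 is online (Serial number: xxxx)",
]


def parse_timestamp(line):
    """Parse the leading 'YYYY Mon DD HH:MM:SS' fields of a syslog line."""
    stamp = " ".join(line.split()[:4])
    return datetime.strptime(stamp, "%Y %b %d %H:%M:%S")


def outage_seconds(lines):
    """Seconds between the first MOD_NOT_ALIVE and the next MOD_OK, or None."""
    down = None
    for line in lines:
        if "%MODULE-2-MOD_NOT_ALIVE" in line and down is None:
            down = parse_timestamp(line)
        elif "%MODULE-5-MOD_OK" in line and down is not None:
            return (parse_timestamp(line) - down).total_seconds()
    return None


print(outage_seconds(LOG_LINES))  # 218.0, i.e. 3 min 38 s
```

For the log above, the module was offline for roughly three and a half minutes.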

A few months ago we had the same symptom with module 6 instead of module 1.

Note that the vPC peer-link is connected over 2x10G interfaces on each of these two modules.

Is there any known issue with this NX-OS version 6.2(6a)?

Accepted Solutions

Hi,

 

I see that the module reloaded to recover from an EOBC heartbeat failure.

 


108) At 955385 usecs after Sun Mar 22 14:28:55 2015
    Sequence initiation: LC removal


109) At 955209 usecs after Sun Mar 22 14:28:55 2015
    Sending MTS_OPC_LC_STATUS_CHANGE to Registry


110) At 955206 usecs after Sun Mar 22 14:28:55 2015
    Received MTS_OPC_LCP_NOT_RESPONDING from Line card manager


111) At 955148 usecs after Sun Mar 22 14:28:55 2015
    Sending MTS_OPC_LCP_NOT_RESPONDING to Line card manager


112) At 458025 usecs after Sun Mar 22 14:28:19 2015
    Warning on ports: 1-0 due to EOBC heartbeat failure in device 10
Device error code: 0xc0a0


113) At 457788 usecs after Sun Mar 22 14:28:19 2015
    Received MTS_OPC_LC_RUNTIME_DIAG_FROM_SUP from EOBC monitor kernel SAP

 

114) At 435569 usecs after Sun Mar 22 14:28:19 2015
    Warning on ports: 1-0 due to EOBC heartbeat failure in device 10
Device error code: 0xc0a0


115) At 435312 usecs after Sun Mar 22 14:28:19 2015
    Received MTS_OPC_LC_RUNTIME_DIAG_FROM_SUP from EOBC monitor kernel SAP


N7K-1# show module internal exceptionlog module 1
********* Exception info for module 1 ********

exception information --- exception instance 1 ----
Module Slot Number: 1
Device Id         : 10
Device Name       : eobc
Device Errorcode  : 0xc0a0314f
Device ID         : 10 (0x0a)
Device Instance   : 03 (0x03)
Dev Type (HW/SW)  : 01 (0x01)
ErrNum (devInfo)  : 79 (0x4f)
System Errorcode  : 0x4042004e EOBC heartbeat failure
Error Type        : Warning
PhyPortLayer      : Ethernet
Port(s) Affected  :
DSAP              : 0 (0x0)
UUID              : 0 (0x0)
Time              : Sun Mar 22 14:28:19 2015
                    (Ticks: 550EC373 jiffies)

exception information --- exception instance 2 ----
Module Slot Number: 1
Device Id         : 10
Device Name       : eobc
Device Errorcode  : 0xc0a0314d
Device ID         : 10 (0x0a)
Device Instance   : 03 (0x03)
Dev Type (HW/SW)  : 01 (0x01)
ErrNum (devInfo)  : 77 (0x4d)
System Errorcode  : 0x4042004e EOBC heartbeat failure
Error Type        : Warning
PhyPortLayer      : Ethernet
Port(s) Affected  :
DSAP              : 0 (0x0)
UUID              : 0 (0x0)
Time              : Sun Mar 22 14:28:19 2015
                    (Ticks: 550EC373 jiffies)
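The decoded fields in the exception log appear to be packed into the Device Errorcode value. The sketch below splits the code into those fields; the bit layout is an assumption inferred only from the two samples above (0xc0a0314f and 0xc0a0314d), not from documented NX-OS behavior, and the function name is illustrative.

```python
def decode_device_errorcode(code):
    """Split a 32-bit Device Errorcode into the fields shown in the log.

    Assumed layout (inferred from the two samples, not documented):
      bits 20-27: Device ID
      bits 12-19: Device Instance
      bits  8-11: Dev Type (HW/SW)
      bits  0-7 : ErrNum
    """
    return {
        "device_id": (code >> 20) & 0xFF,
        "device_instance": (code >> 12) & 0xFF,
        "dev_type": (code >> 8) & 0xF,
        "err_num": code & 0xFF,
    }


print(decode_device_errorcode(0xC0A0314F))
# {'device_id': 10, 'device_instance': 3, 'dev_type': 1, 'err_num': 79}
```

Both exception instances decode to device 10 (eobc), instance 3, type 1, and differ only in ErrNum (79 and 77), which matches the output above.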

 

EOBC stands for Ethernet Out of Band Channel.

Regular keepalives, or 'heartbeats', are exchanged over it between the supervisor (sup) and the line cards. The error messages you received indicate that heartbeats between the sup and the line card went missing.

If a single heartbeat is missed, it is corrected automatically. However, if multiple heartbeats are lost at once, the line card is reset (the sup powers that slot off and back on in an attempt to resolve the diagnostic issue).
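The behavior described here can be pictured as a simple miss counter. The following is only an illustrative sketch, not NX-OS code: the class name, the threshold of 3, and the choice to count consecutive misses are all assumptions made for the example.

```python
class HeartbeatMonitor:
    """Toy model of a supervisor tracking heartbeats from one line card."""

    def __init__(self, miss_threshold=3):
        # miss_threshold is a hypothetical value; the real one is not
        # stated in this thread.
        self.miss_threshold = miss_threshold
        self.consecutive_misses = 0

    def record(self, heartbeat_received):
        """Return 'ok', 'warning' (isolated miss) or 'reset'."""
        if heartbeat_received:
            # A received heartbeat clears any earlier isolated miss.
            self.consecutive_misses = 0
            return "ok"
        self.consecutive_misses += 1
        if self.consecutive_misses >= self.miss_threshold:
            # Too many misses in a row: the slot would be power-cycled.
            self.consecutive_misses = 0
            return "reset"
        return "warning"


monitor = HeartbeatMonitor(miss_threshold=3)
events = [True, False, True, False, False, False]
print([monitor.record(e) for e in events])
# ['ok', 'warning', 'ok', 'warning', 'warning', 'reset']
```

In this model a single missed heartbeat only produces a warning, which lines up with the two "Warning" exception entries logged at 14:28:19 before the reset at 14:28:55.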

So if the card is stable after the reset, you can treat this as a one-time (transient) issue.

You can also check the EOBC statistics to see whether any error counters are increasing.

 

show hardware internal eobc stats

show hardware internal eobcsw stats

show hardware internal cpu-mac eobc stats

show hardware internal cpu-mac eobc events

show hardware capacity eobc
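To tell whether an error counter is actually increasing, take the same command output twice, some time apart, and compare the values. The helper below is a rough sketch that assumes the counters have already been extracted into dictionaries; the counter names used here are hypothetical placeholders, not the real field names printed by these commands.

```python
def increasing_counters(before, after):
    """Return {counter: delta} for counters that grew between two snapshots."""
    return {
        name: after[name] - before[name]
        for name in after
        if name in before and after[name] > before[name]
    }


# Hypothetical counter names and values, for illustration only.
snapshot_1 = {"rx_crc_errors": 0, "tx_errors": 2, "collisions": 0}
snapshot_2 = {"rx_crc_errors": 0, "tx_errors": 5, "collisions": 0}

print(increasing_counters(snapshot_1, snapshot_2))  # {'tx_errors': 3}
```

An empty result over a reasonable interval suggests the EOBC path is currently clean.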

 

Hope this helps. 

PS : Please grade all posts which are useful.

 

Thanks,

Madhu.

 

 

 


richbarb
Cisco Employee

Hi,

You might be encountering the bug: https://tools.cisco.com/bugsearch/bug/CSCui72592

I recommend you open a service request with TAC to make sure.

 

If it helps, please rate it.

Richard

Richard,

Thanks for the reply. A case is already open, but with our reseller rather than with Cisco directly.

I ran the show command mentioned in the suggested link, but in my case the software reset reason is "Unknown (0)". Here is the output:

----------------------------
Module: 1 show clock
----------------------------
2015-03-30 09:41:05
        Last log in OBFL was written at time Mon Mar 30 09:05:06 2015

 

Reset Reason for this card:
        Image Version : 6.2(8a)
        Reset Reason (LCM): Line card not responding (60) at time Sun Mar 22 14:30:52 2015
        Reset Reason (SW): Unknown (0)
        Reset Reason (HW): System reset by active sup (by writing to PMFPGA regs) (100) at time Sun Mar 22 14:30:52 2015
        Last log in OBFL was written at time Sun Mar 22 14:22:32 2015


I hope to get further information from our support team and will share it here.

 

Hi,

You can collect the output of the commands below and share it.

show logging onboard module 1 internal reset-reason

show system reset-reason module 1

show module internal activity module 1
show module internal exceptionlog module 1

 

Thanks,

Madhu

 

Hi Madhu,

So far I have had no feedback from our support, because the device list for the support contract is missing ;-( I'm so frustrated :@

Please find outputs requested in attached files.

Thx for your help

Mourad

Hi Madhu,

Thank you for your help. The statistics look good, without any MAC TX errors, collisions, drop events or bad packets/CRC errors.

I found a similar issue but with an M2 Module - we have an F2 Module - https://tools.cisco.com/bugsearch/bug/CSCup20959

 

It should be fixed in 6.2(10).

 

Thank you

Mourad
