Yogesh Ramdoss
Cisco Employee

Introduction


The purpose of this document is to provide a step-by-step process to determine whether an N7K-M132XP-12 or N7K-M132XP-12L module in a Nexus 7000 switch needs to be RMAed.

This document lists the symptoms and relevant troubleshooting steps in determining the health of the module.

Scenario 1: N7K-M132XP-12 or N7K-M132XP-12L “TestPortLoopback” diagnostic test failed

Symptoms:

When the diagnostic fails, the following syslog is observed:

2011 Dec 10 11:47:11 %DIAG_PORT_LB-2-PORTLOOPBACK_TEST_FAIL: Module:18 TestPortLoopback failed 10 consecutive times. Faulty module:Module 18 affected ports:23 Error:Loopback test failed. Packets lost on the LC at the Queueing engine ASIC

N7K2# show diagnostic result module 18

Current bootup diagnostic level: complete

Module 18: 10 Gbps Ethernet Module

Test results: (. = Pass, F = Fail, I = Incomplete,

U = Untested, A = Abort, E = Error disabled)

1) EOBCPortLoopback--------------> .

2) ASICRegisterCheck-------------> E

3) PrimaryBootROM----------------> .

4) SecondaryBootROM--------------> .

5) PortLoopback:

Port 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16

-----------------------------------------------------

U U I I I I I I U U I . I . I .

Port 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32

-----------------------------------------------------

U U . . U U E . U U I I I I I I

6) RewriteEngineLoopback:

Port 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16

-----------------------------------------------------

. . . . . . . . . . . . . . . .

Port 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32

-----------------------------------------------------

. . . . . . . . . . . . . . . .

N7K2# show module

Mod Ports Module-Type Model Status

--- ----- -------------------------------- ------------------ ------------

16 32 10 Gbps Ethernet Module N7K-M132XP-12 ok

17 32 10 Gbps Ethernet Module N7K-M132XP-12 ok

18 32 10 Gbps Ethernet Module N7K-M132XP-12 ok

<snip>

Mod Online Diag Status

--- ------------------

16 Fail

17 Pass

18 Fail

Checklist:

This is likely due to CSCtn81109 or CSCti95293.

To verify whether you have hit these bugs, perform the following checks:

(1) Check whether the NX-OS version matches an affected version listed in the DDTS. Both bugs are fixed and verified in release 5.2(4) and later.

(2) Determine when the diagnostic message was observed. “show logging” gives the timestamp of the diagnostic test failure; then check whether any CPU issue occurred around the same time. An overwhelmed CPU can sometimes cause the diagnostic port loopback test to fail. This is not definitive, but it is a good data point to collect.
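As a sketch of this check (standard NX-OS show commands; verify availability on your release), the CPU utilization history can be correlated with the diagnostic failure timestamp:

```
N7K# show logging logfile | include PORTLOOPBACK_TEST_FAIL   <<== timestamp of the diag failure
N7K# show processes cpu history                              <<== CPU utilization history
N7K# show system resources                                   <<== current CPU and memory load
```

A CPU spike coinciding with the failure timestamp supports the software-defect theory rather than a hardware fault.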

(3) Collect additional Logs:

  1. “tac-pac bootflash:tech.txt”  <<== This command may take several minutes to complete
  2. “show tech module 1”
  3. “show tech gold”
  4. “show hardware internal errors module 1 | diff -s” <<== execute this command a few times

(4) We can clear the diagnostic results and re-run the tests while the CPU is not overwhelmed:

  1. # show diagnostic result module 1
  2. # diagnostic clear result module all
  3. (config)# no diagnostic monitor module 1 test 5 (check the test number using "show diagnostic content module X")
  4. (config)# diagnostic monitor module 1 test 5
  5. # diagnostic start module 1 test 5
  6. # show diagnostic result module 1 test 5 (could take a few minutes before test completed)
  7. # show module internal exceptionlog module 1
  8. # show module internal event-history errors
  9. # show hardware internal errors module 1

If the module recovers and the diagnostic test passes, the failure is highly likely due to the DDTS defects mentioned above, as an actual hardware failure should fail diagnostics consistently.

If the module fails diagnostics consistently, please contact Cisco TAC for further analysis.

Scenario 2: M1 modules get reset and/or links flap

Symptoms:

2012 Jun 13 15:51:30 MDT Q93-7010-A %$ VDC-1 %$ %DIAG_PORT_LB-2-PORTLOOPBACK_TEST_FAIL: Module:3 TestPortLoopback failed 10 consecutive times. Faulty module: affected ports:3,5,7,11,13,15,19,21,23,27,29,31 Error:Loopback test failed. Packets lost on the LC at the MAC ASIC

2012 Jun 13 15:51:30 MDT Q93-7010-A %$ VDC-1 %$ %DIAG_PORT_LB-2-PORTLOOPBACK_TEST_FAIL: Module:3 TestPortLoopback failed 10 consecutive times. Faulty module: affected ports:4,6,8,12,14,16,20,22,24,26,28,30,32 Error:Loopback test failed. Packets lost on the LC at the Queueing engine ASIC

Checklist:

This is highly likely due to CSCtt43115.

It is NOT a hardware failure, and no RMA/EFA required.

Collect all the reported logs and the sequence of events that occurred:

- show tech detail

- show accounting log

- show logging

Please make sure the configuration (specifically SPAN) and symptoms match those described in the DDTS release-note enclosure.
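For example, the SPAN configuration can be reviewed with the following commands (a sketch using standard NX-OS CLI):

```
N7K# show monitor session all        <<== state and source/destination ports of each SPAN session
N7K# show running-config monitor     <<== SPAN-related configuration
```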

Note: This issue is applicable to all M1 module types.

Relevant Field Notice:

http://www.cisco.com/en/US/ts/fn/634/fn63495.html

Scenario 3: N7K-M132XP-12 device-based process crashes consistently

Symptoms:

The Nexus 7000 switch reports:

%SYSMGR-SLOT8-2-SERVICE_CRASHED: Service "lamira_usd" (PID 1944) hasn't caught signal 6 (core will be saved).

Here, Lamira is the L3/L4 forwarding engine.

The above message indicates that this ASIC has crashed and dumped the core.

Checklist:

In most cases, a lamira_usd crash is caused by a recurring TCAM parity error, and the fix is to RMA/EFA the card.

This checklist provides instructions on how to confirm that the lamira_usd crash was caused by multiple uncorrectable TCAM parity errors, which can lead to a fast resolution of the case.

(1) Collect the following logs by attaching to the module:

  1. # attach module <#>
  2. module-<#># show logging onboard exception-log
  3. module-<#># show logging onboard internal lamira

Note: We have to collect the exception-log from the onboard log; “show module internal exceptionlog module <mod>” will not suffice.

(2) Note the time lamira_usd crashed.

(3) Now check the onboard exception-log around the time of the LC crash (within 10-20 minutes):

// About 12 minutes before lamira_usd crashed

Exception Log Record : Tue Jul 10 04:05:40 2012 (840169 us)

Device Id : 81

Device Name : Lamira

Device Error Code : c5101210(H)

Device Error Type : ERR_TYPE_HW

Device Error Name : NULL

Device Instance : 1

Sys Error : Generic failure

Errtype : INFORMATIONAL

PhyPortLayer : Ethernet

Port(s) Affected :

Error Description : LM_INT_CL1_TCAM_B_PARITY_ERR <== uncorrectable parity error in TCAM B

DSAP : 211

UUID : 382

Time : Tue Jul 10 04:05:40 2012

(840168 usecs 4FFC0C84(H) jiffies)

(4) Compare the TCAM exception above with the exception-log record at the time of the crash:

Exception Log Record : Tue Jul 10 04:17:02 2012 (270399 us) <== 12 minutes after the TCAM parity error

Device Id : 34304

Device Name : 0x8600

Device Error Code : 7e010000(H)

Device Error Type : NULL

Device Error Name : NULL

Device Instance : 0

Sys Error : (null)

Errtype : CATASTROPHIC

PhyPortLayer : 0x0

Port(s) Affected :

Error Description : lamira_usd hap reset <== lamira crashed because the number of TCAM parity errors in TCAM B violated the HA Policy (HAP)

DSAP : 0

UUID : 16777216

Time : Tue Jul 10 04:17:02 2012

(270399 usecs 4FFC0F2E(H) jiffies)

(5) If the lamira crash was caused by a HAP reset due to multiple TCAM parity errors, the LC should be RMAed/EFAed. Otherwise, if lamira crashed for some other reason, continue your normal troubleshooting. The onboard lamira log (command #3 above) will help Cisco TAC root-cause it.

Scenario 4: All M1 modules fail specific diagnostic tests, such as the PortLoopback or RewriteEngineLoopback test

On the Nexus 7000, if there is an issue between the active supervisor engine and an xbar module, and diagnostic packets are dropped as a result, the supervisor engine may report diagnostic test failures for multiple or all ports on multiple or all modules.

This issue requires manual investigation to isolate the faulty supervisor engine.

The condition that caused the tests to go into the error-disabled state may be transient.

It is recommended to run the tests on demand and check whether the condition persists.

First, try to clear the ErrDisabled state of the test:

N7K# diagnostic clear result module 1 test ?
  <1-6>  Test ID(s)
  all    Select all

To run on-demand test:

N7K# diagnostic start module <mod#> test <test#>

To stop the test:

N7K# diagnostic stop module <mod#> test <test#>
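Putting the above together, a complete on-demand run might look like this (module 1 and test 5 are placeholders; confirm the test ID with "show diagnostic content module 1"):

```
N7K# show diagnostic content module 1             <<== find the test ID (e.g. PortLoopback)
N7K# diagnostic clear result module 1 test all    <<== clear the ErrDisabled state
N7K# diagnostic start module 1 test 5             <<== run the test on demand
N7K# show diagnostic result module 1 test 5       <<== may take a few minutes to complete
```

If the test passes after a re-run on every module, the original failures were most likely caused by the transient supervisor-to-xbar condition rather than by the modules themselves.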

The supervisor engine does not trigger a failover or reset as a corrective action to recover from this condition. An enhancement request has been filed to add such corrective action:

CSCth03474 - n7k/GOLD:Improve Fault Isolation of N7K-GOLD


Additional Information:

Software Advisory on CCO:

http://www.cisco.com/web/software/datacenter/N7K/nxos_521_sw_advisory.html

If you have any comments/feedback, please send email to Yogesh (yramdoss@cisco.com) and Vincent (vng@cisco.com)
