Introduction
The purpose of this document is to provide a step-by-step process to determine whether an N7K-M132XP-12 or N7K-M132XP-12L module in a Nexus 7000 switch needs to be RMAed or not.
This document lists the symptoms and relevant troubleshooting steps for determining the health of the module.
Scenario 1: N7K-M132XP-12 or N7K-M132XP-12L “TestPortLoopback” diagnostic test failed
Symptoms:
Diagnostic failure; the following syslog is observed:
2011 Dec 10 11:47:11 %DIAG_PORT_LB-2-PORTLOOPBACK_TEST_FAIL: Module:18 TestPortLoopback failed 10 consecutive times. Faulty module:Module 18 affected ports:23 Error:Loopback test failed. Packets lost on the LC at the Queueing engine ASIC
N7K2# show diagnostic result module 18
Current bootup diagnostic level: complete
Module 18: 10 Gbps Ethernet Module
Test results: (. = Pass, F = Fail, I = Incomplete,
U = Untested, A = Abort, E = Error disabled)
1) EOBCPortLoopback--------------> .
2) ASICRegisterCheck-------------> E
3) PrimaryBootROM----------------> .
4) SecondaryBootROM--------------> .
5) PortLoopback:
Port 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
-----------------------------------------------------
U U I I I I I I U U I . I . I .
Port 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32
-----------------------------------------------------
U U . . U U E . U U I I I I I I
6) RewriteEngineLoopback:
Port 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
-----------------------------------------------------
. . . . . . . . . . . . . . . .
Port 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32
-----------------------------------------------------
. . . . . . . . . . . . . . . .
N7K2# show module
Mod Ports Module-Type Model Status
--- ----- -------------------------------- ------------------ ------------
16 32 10 Gbps Ethernet Module N7K-M132XP-12 ok
17 32 10 Gbps Ethernet Module N7K-M132XP-12 ok
18 32 10 Gbps Ethernet Module N7K-M132XP-12 ok
<snip>
Mod Online Diag Status
--- ------------------
16 Fail
17 Pass
18 Fail
Checklist:
This is likely due to CSCtn81109 or CSCti95293.
To verify whether you are hitting these bugs, perform the following checks:
(1) Check whether the NX-OS version matches the versions in which the DDTSs were found. Both bugs are fixed and verified in 5.2(4) and later releases.
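As a quick check of the running release (a minimal sketch; the exact output fields vary by release):
N7K# show version                                     <== running NX-OS release on the supervisor
N7K# show module                                      <== per-module software ("Sw") and hardware ("Hw") versions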
(2) Determine when the diag message was observed; "show logging" gives the time stamp of the diag test failure. Then check whether any CPU issue occurred around the same time. When the CPU is overwhelmed, it can sometimes cause the diag port loopback test to fail; this is not definitive, but it is a good data point to collect.
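One way to correlate the failure time stamp with CPU load (suggested commands, not part of the original checklist):
N7K# show logging logfile | include PORTLOOPBACK_TEST_FAIL   <== time stamp(s) of the diag failure
N7K# show processes cpu history                              <== CPU utilization graphs for the last 60 seconds / 60 minutes / 72 hours
N7K# show processes cpu                                       <== current per-process CPU usage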
(3) Collect additional logs:
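The exact command list from the original article is not reproduced here; a typical GOLD-related collection (a suggested, non-exhaustive set, using module 18 from the example above) would be:
N7K# show tech-support gold
N7K# show diagnostic result module 18 detail
N7K# show module internal exceptionlog module 18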
(4) Clear the diagnostic results and re-run the tests while the CPU is not overwhelmed:
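Using the commands shown in Scenario 4, and assuming module 18 and test ID 5 (PortLoopback) from the output above:
N7K# diagnostic clear result module 18 test 5
N7K# diagnostic start module 18 test 5
N7K# show diagnostic result module 18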
If the module recovers and the diag test passes, this is highly likely due to the DDTSs mentioned above, as an actual hardware failure should fail diagnostics consistently.
If the module fails diag consistently, please contact Cisco TAC for further analysis.
Scenario 2: M1 modules get reset and/or links flap
Symptoms:
2012 Jun 13 15:51:30 MDT Q93-7010-A %$ VDC-1 %$ %DIAG_PORT_LB-2-PORTLOOPBACK_TEST_FAIL: Module:3 TestPortLoopback failed 10 consecutive times. Faulty module: affected ports:3,5,7,11,13,15,19,21,23,27,29,31 Error:Loopback test failed. Packets lost on the LC at the MAC ASIC
2012 Jun 13 15:51:30 MDT Q93-7010-A %$ VDC-1 %$ %DIAG_PORT_LB-2-PORTLOOPBACK_TEST_FAIL: Module:3 TestPortLoopback failed 10 consecutive times. Faulty module: affected ports:4,6,8,12,14,16,20,22,24,26,28,30,32 Error:Loopback test failed. Packets lost on the LC at the Queueing engine ASIC
Checklist:
This is highly likely due to CSCtt43115.
It is NOT a hardware failure, and no RMA/EFA is required.
Collect all the reported logs and the sequence of events that occurred:
- show tech detail
- show accounting log
- show logging
Please make sure the configuration (specifically SPAN) and symptoms match those mentioned in the DDTS release-note enclosure; a quick SPAN check is suggested after the note below.
Note: This issue is applicable to all M1 module types.
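To review the SPAN configuration referenced above (suggested commands, not part of the original checklist):
N7K# show monitor session all
N7K# show running-config monitor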
Relevant Field Notice:
http://www.cisco.com/en/US/ts/fn/634/fn63495.html
Scenario 3: N7K-M132XP-12 consistent device-based process crash
Symptoms:
The Nexus 7000 switch reports:
%SYSMGR-SLOT8-2-SERVICE_CRASHED: Service "lamira_usd" (PID 1944) hasn't caught signal 6 (core will be saved).
Here, Lamira is the L3/L4 forwarding engine.
The above message indicates that the lamira_usd process, the software driver for this ASIC, has crashed and dumped a core.
Checklist:
In most cases, a lamira_usd crash is caused by a recurring TCAM parity error, and the fix is to RMA/EFA the card.
This checklist provides instructions on how to confirm that the lamira_usd crash was caused by multiple uncorrectable TCAM parity errors, which can lead to a fast resolution of the case.
(1) Collect the following logs by attaching to the module:
Note: We have to collect the exception-log from the onboard log; "show module internal exceptionlog module <mod>" won't suffice.
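The full command list is not reproduced here; a minimal sketch of how the onboard exception-log is typically pulled (assuming module 8 from the syslog above; exact OBFL options can vary by NX-OS release):
N7K# show logging onboard module 8 exception-log       <== onboard (OBFL) exception-log for the module
N7K# show module internal exceptionlog module 8        <== for comparison only; not sufficient by itself
N7K# attach module 8                                   <== attach to the module console to gather module-level logs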
(2) Note the time lamira_usd crashed.
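The crash time can usually be taken from the syslog message itself or from the core listing (suggested commands):
N7K# show logging logfile | include SERVICE_CRASHED
N7K# show cores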
(3) Now check the onboard exception-log around the time of the LC crash (within 10-20 minutes):
// About 12 minutes before lamira_usd crashed
Exception Log Record : Tue Jul 10 04:05:40 2012 (840169 us)
Device Id : 81
Device Name : Lamira
Device Error Code : c5101210(H)
Device Error Type : ERR_TYPE_HW
Device Error Name : NULL
Device Instance : 1
Sys Error : Generic failure
Errtype : INFORMATIONAL
PhyPortLayer : Ethernet
Port(s) Affected :
Error Description : LM_INT_CL1_TCAM_B_PARITY_ERR <== uncorrectable parity error in TCAM B
DSAP : 211
UUID : 382
Time : Tue Jul 10 04:05:40 2012
(840168 usecs 4FFC0C84(H) jiffies)
(4) Compare the TCAM exception above with the exception log record at the time of the crash
Exception Log Record : Tue Jul 10 04:17:02 2012 (270399 us) <== 12 minutes after the TCAM parity error
Device Id : 34304
Device Name : 0x8600
Device Error Code : 7e010000(H)
Device Error Type : NULL
Device Error Name : NULL
Device Instance : 0
Sys Error : (null)
Errtype : CATASTROPHIC
PhyPortLayer : 0x0
Port(s) Affected :
Error Description : lamira_usd hap reset <== lamira crashed because the number of TCAM parity errors in TCAM B
violated the HA Policy(HAP)
DSAP : 0
UUID : 16777216
Time : Tue Jul 10 04:17:02 2012
(270399 usecs 4FFC0F2E(H) jiffies)
(5) If the lamira crash was caused by a HAP reset due to multiple TCAM parity errors, then the LC should be RMAed/EFAed. Otherwise, if lamira crashed for some other reason, continue your normal troubleshooting. The onboard lamira log collected in step (1) will help Cisco TAC root-cause it.
Scenario 4: All M1 modules fail specific diagnostic tests, such as the PortLoopback or RewriteEngineLoopback test
In a Nexus 7000, if there is any issue between the active supervisor engine and an xbar module that causes diagnostic packets to be dropped, the supervisor engine may report diagnostic test failures for multiple/all ports on multiple/all modules.
This issue requires manual investigation and isolation of the faulty supervisor engine.
The condition that caused the tests to go into the error-disabled state may be transient.
It is recommended to run the tests on demand and see whether the condition persists.
First, try to clear the ErrDisabled state of the test:
N7K# diagnostic clear result module 1 test ?
<1-6> Test ID(s)
all Select all
To run an on-demand test:
N7K# diagnostic start module <mod#> test <test#>
To stop the test:
N7K# diagnostic stop module <mod#> test <test#>
The supervisor engine does not trigger a failover or reset as a corrective action to recover from this condition. To request such corrective action, an enhancement request has been filed:
CSCth03474 - n7k/GOLD:Improve Fault Isolation of N7K-GOLD
Software Advisory on CCO:
http://www.cisco.com/web/software/datacenter/N7K/nxos_521_sw_advisory.html
If you have any comments/feedback, please send email to Yogesh (yramdoss@cisco.com) and Vincent (vng@cisco.com)