The purpose of this document is to provide a step-by-step process for determining whether an N7K-M132XP-12 or N7K-M132XP-12L module in a Nexus 7000 switch needs to be RMAed.
This document lists the symptoms and the troubleshooting steps relevant to determining the health of the module.
Scenario 1: N7K-M132XP-12 or N7K-M132XP-12L “TestPortLoopback” diagnostic test failed
When the diagnostic fails, the following syslog is observed:
2011 Dec 10 11:47:11 %DIAG_PORT_LB-2-PORTLOOPBACK_TEST_FAIL: Module:18 TestPortLoopback failed 10 consecutive times. Faulty module:Module 18 affected ports:23 Error:Loopback test failed. Packets lost on the LC at the Queueing engine ASIC
N7K2# show diagnostic result module 18
Current bootup diagnostic level: complete
Module 18: 10 Gbps Ethernet Module
Test results: (. = Pass, F = Fail, I = Incomplete,
[output truncated]
To verify whether you have hit these bugs, perform the following checks:
(1) Check whether the NX-OS version matches a version in which the bugs are present. Both bugs are fixed and verified in release 5.2(4) and later.
(2) Determine when the diagnostic message was observed; "show log" gives the timestamp of the diagnostic test failure. Then check whether any CPU issue occurred around the same time. An overwhelmed CPU can sometimes cause the diagnostic port loopback test to fail; this is not definitive, but it is a good data point to collect.
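As a rough illustration (not a Cisco-provided tool), the timestamp correlation in step (2) can be automated once the logfile has been saved off-box. The `correlate` helper below is hypothetical; the diag-failure pattern follows the syslog format shown above, and matching any line containing "CPU" is a crude heuristic for CPU-related messages:

```python
import re
from datetime import datetime, timedelta

# Match the leading timestamp of a diag-failure syslog line
# (format as in the example above: "2011 Dec 10 11:47:11 %DIAG_PORT_LB-...").
DIAG_PAT = re.compile(r"^(\d{4} \w{3} +\d+ \d{2}:\d{2}:\d{2}).*PORTLOOPBACK_TEST_FAIL")
# Crude heuristic: any syslog line mentioning "CPU" (an assumption, not
# an exhaustive list of CPU-related message types).
CPU_PAT = re.compile(r"^(\d{4} \w{3} +\d+ \d{2}:\d{2}:\d{2}).*CPU")

def parse_ts(s):
    return datetime.strptime(s, "%Y %b %d %H:%M:%S")

def correlate(log_lines, window_minutes=5):
    """Return diag-failure timestamps that have a CPU-related message nearby."""
    diag = [parse_ts(m.group(1)) for l in log_lines if (m := DIAG_PAT.match(l))]
    cpu = [parse_ts(m.group(1)) for l in log_lines if (m := CPU_PAT.match(l))]
    win = timedelta(minutes=window_minutes)
    return [d for d in diag if any(abs(d - c) <= win for c in cpu)]
```

If `correlate` returns the failure timestamps, a CPU event occurred within the window and the failure may be a false positive rather than a hardware fault.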
(3) Collect additional logs:
"tac-pac bootflash:tech.txt" <<== this command may take several minutes to complete
"show tech module 1"
"show tech gold"
"show hardware internal errors module 1 | diff -s" <<== execute this command a few times
(4) Clear the diagnostic results and re-run the tests while the CPU is not overwhelmed:
# show diagnostic result module 1
# diagnostic clear result module all
(config)# no diagnostic monitor module 1 test 5 (check the test number using "show diagnostic content module X")
(config)# diagnostic monitor module 1 test 5
# diagnostic start module 1 test 5
# show diagnostic result module 1 test 5 (it may take a few minutes for the test to complete)
# show module internal exceptionlog module 1
# show module internal event-history errors
# show hardware internal errors module 1
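The pass/fail comparison across re-runs in step (4) amounts to simple set logic: a port that fails in every run points at hardware, while a port that fails only once points at a transient condition. A minimal sketch, assuming the per-port result string (e.g. ". . F .") has already been extracted from the "show diagnostic result" output; the helper names are our own:

```python
def failed_ports(result_line):
    """Return 1-based port numbers marked 'F' (Fail) in a result string."""
    tokens = result_line.split()
    return [i + 1 for i, t in enumerate(tokens) if t == "F"]

def is_consistent_failure(runs):
    """True if at least one port fails in every run (suggests real hardware fault).

    `runs` is a list of result strings from successive on-demand tests.
    """
    fail_sets = [set(failed_ports(r)) for r in runs]
    common = set.intersection(*fail_sets) if fail_sets else set()
    return bool(common), sorted(common)
```

A `(True, [ports])` result across several clean re-runs supports the hardware-failure interpretation described below.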
If the module recovers and the diagnostic test passes, this is most likely due to the DDTSs mentioned above, since an actual hardware failure should fail the diagnostic consistently.
If the module fails the diagnostic consistently, please contact Cisco TAC for further analysis.
Scenario 2: M1 modules get reset and/or links flap
2012 Jun 13 15:51:30 MDT Q93-7010-A %$ VDC-1 %$ %DIAG_PORT_LB-2-PORTLOOPBACK_TEST_FAIL: Module:3 TestPortLoopback failed 10 consecutive times. Faulty module: affected ports:3,5,7,11,13,15,19,21,23,27,29,31 Error:Loopback test failed. Packets lost on the LC at the MAC ASIC
2012 Jun 13 15:51:30 MDT Q93-7010-A %$ VDC-1 %$ %DIAG_PORT_LB-2-PORTLOOPBACK_TEST_FAIL: Module:3 TestPortLoopback failed 10 consecutive times. Faulty module: affected ports:4,6,8,12,14,16,20,22,24,26,28,30,32 Error:Loopback test failed. Packets lost on the LC at the Queueing engine ASIC
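For reference, the module number, affected ports, and failing ASIC can be pulled out of these syslog lines mechanically, which helps when many modules are reporting at once. A minimal sketch based on the message format shown above; the regex and function name are our own, not part of any Cisco tooling:

```python
import re

# Extract module, affected ports, and failing ASIC from a
# PORTLOOPBACK_TEST_FAIL syslog line (format as in the examples above).
MSG = re.compile(
    r"Module:(\d+) TestPortLoopback failed.*"
    r"affected ports:([\d,]+) Error:.*at the (.+ ASIC)"
)

def parse_failure(line):
    """Return {'module', 'ports', 'asic'} for a matching line, else None."""
    m = MSG.search(line)
    if not m:
        return None
    module, ports, asic = m.groups()
    return {
        "module": int(module),
        "ports": [int(p) for p in ports.split(",")],
        "asic": asic,
    }
```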
Error Description : LM_INT_CL1_TCAM_B_PARITY_ERR <== uncorrectable parity error in TCAM B
DSAP : 211
UUID : 382
Time : Tue Jul 10 04:05:40 2012
(840168 usecs 4FFC0C84(H) jiffies)
(4) Compare the TCAM exception above with the exception log record at the time of the crash
Exception Log Record : Tue Jul 10 04:17:02 2012 (270399 us) <== 12 minutes after the TCAM parity error
Device Id : 34304
Device Name : 0x8600
Device Error Code : 7e010000(H)
Device Error Type : NULL
Device Error Name : NULL
Device Instance : 0
Sys Error : (null)
Errtype : CATASTROPHIC
PhyPortLayer : 0x0
Port(s) Affected :
Error Description : lamira_usd hap reset <== lamira crashed because the number of TCAM parity errors in TCAM B violated the HA policy (HAP)
DSAP : 0
UUID : 16777216
Time : Tue Jul 10 04:17:02 2012
(270399 usecs 4FFC0F2E(H) jiffies)
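The roughly 12-minute gap between the TCAM parity error and the HAP reset is the key data point in this comparison: the crash must follow the parity error closely to be treated as related. A small sketch of that timestamp check; the 30-minute threshold is an assumption for illustration, not a Cisco-documented value:

```python
from datetime import datetime

# Exception-log timestamp format, e.g. "Tue Jul 10 04:05:40 2012".
FMT = "%a %b %d %H:%M:%S %Y"

def crash_follows_parity(parity_ts, crash_ts, max_minutes=30):
    """True if the crash occurred after the parity error, within the window."""
    t1 = datetime.strptime(parity_ts, FMT)
    t2 = datetime.strptime(crash_ts, FMT)
    delta_minutes = (t2 - t1).total_seconds() / 60
    return 0 <= delta_minutes <= max_minutes
```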
(5) If the lamira crash was caused by a HAP reset due to multiple TCAM parity errors, the line card should be RMA'd/EFA'd. Otherwise, if lamira crashed for some other reason, continue your normal troubleshooting. The onboard lamira log (command #2) will help Cisco TAC root-cause it.
Scenario 4: All M1 modules fail specific diagnostic tests, such as the PortLoopback or RewriteEngineLoopback test
On the Nexus 7000, if there is an issue between the active supervisor engine and a crossbar (xbar) module, and diagnostic packets are dropped as a result, the supervisor engine may report diagnostic test failures for multiple or all ports on multiple or all modules.
This issue requires manual investigation to isolate the faulty supervisor engine.
The condition that caused the tests to go into the errdisabled state may be transient.
It is recommended to run the tests on demand and see whether the condition persists.
First, try to clear the ErrDisabled state of the test:
N7K# diagnostic clear result module 1 test ? <1-6> Test ID(s) all Select all
To run on-demand test:
N7K# diagnostic start module <mod#> test <test#>
To stop the test:
N7K# diagnostic stop module <mod#> test <test#>
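The clear/start/poll sequence above can also be scripted for repeated re-testing. A minimal sketch with the device I/O injected as a callable so it is transport-agnostic; the function and parameter names are hypothetical, and the command strings follow the CLI shown above:

```python
import time

def rerun_diag(send_command, module, test, wait_s=5, polls=3):
    """Clear, restart, and poll one GOLD test; return the final result text.

    `send_command` is any callable that sends one CLI line to the switch
    and returns its output (SSH library, console wrapper, lab stub, ...).
    """
    send_command(f"diagnostic clear result module {module} test {test}")
    send_command(f"diagnostic start module {module} test {test}")
    out = ""
    for _ in range(polls):
        time.sleep(wait_s)  # give the test time to run before checking
        out = send_command(f"show diagnostic result module {module} test {test}")
        if "Incomplete" not in out:  # crude completion check (an assumption)
            break
    return out
```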
As a corrective action, the supervisor engine does not trigger a failover or reset to recover from this condition. An enhancement request has been filed to request corrective action:
CSCth03474 - n7k/GOLD: Improve Fault Isolation of N7K-GOLD