This document will help resolve fabric issues reported on the Cisco Nexus 7000 platform.
It covers the most common types of fabric CRC errors. Troubleshooting fabric CRCs requires collecting data, analyzing it, and then performing a process of elimination to isolate the most likely failing component.
The “General CRC Troubleshooting Guidelines” section below establishes a general framework for troubleshooting these issues. The case study sections then provide examples of how a similar problem can be worked through. Finally, the “monitoring fabric CRCs” section describes an alternative way to detect and monitor fabric CRCs.
II. Fabric CRC detection overview:
Legend:
Stage1 (S1), Stage2 (S2) and Stage3 (S3) are the three stages of the Nexus7000 fabric.
Octopus is the Queue Engine
Santa Cruz (SC) is the Fabric ASIC
Instance 1 and 2 are the two Santa Cruz instances on the XBAR.
The above is an overview of the components involved when a packet traverses the fabric. To keep it simple, this document considers only one XBAR. Please keep in mind that most Nexus 7000 switches have three or more XBARs installed.
Assuming a unidirectional flow from Module #1 to Module #2, the ingress Octopus-1 on Mod 1 performs error checking on packets it receives from the south, and the egress Octopus-1 on Mod 2 on packets it receives from the north. If a CRC error is detected in stage 3, the problem could have occurred in stage 1 or stage 2 as well, since no CRC check is done in those stages. So the devices involved in the path are the ingress Octopus, the chassis, the crossbar fabric, and the egress Octopus.
In M1/Fab1 architecture, CRCs are detected only on the egress linecard (S3).
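To make that concrete, the short sketch below (plain Python, not a Cisco tool; the function name and wording are illustrative) lists everything in the path that is suspect when an egress Octopus reports a fabric CRC error on an M1/Fab1 system.

# Minimal sketch: components that are suspect when a fabric CRC error is
# detected at stage 3 (egress Octopus) on an M1/Fab1 system, since S1/S2
# perform no CRC check there. Names and wording are illustrative.

def suspects_for_s3_crc(ingress_mod, egress_mod, xbar_slot=None):
    """Return the components in the fabric path that could have corrupted the packet."""
    return [
        f"ingress Octopus on module {ingress_mod}",
        "chassis / midplane",
        (f"fabric ASIC (Santa Cruz) on XBAR {xbar_slot}" if xbar_slot
         else "any installed XBAR (the XBAR was not identified)"),
        f"egress Octopus on module {egress_mod}",
    ]

if __name__ == "__main__":
    # Example matching the sample syslog below: egress mod 1, ingress mod 15, XBAR 1
    for component in suspects_for_s3_crc(ingress_mod=15, egress_mod=1, xbar_slot=1):
        print(component)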
Sample Error Message:
%OC_USD-SLOT1-2-RF_CRC: OC1 received packets with CRC error from MOD 15 through XBAR slot 1/inst 1
The above message is reported by module 1, indicating that it received packets with a bad CRC from module 15 via XBAR 1/instance 1.
III. Understanding different Fabric CRC errors:
(1) CRC error with single source module, receive module, and XBAR instance
%OC_USD-SLOT1-2-RF_CRC: OC1 received packets with CRC error from MOD 15 through XBAR slot 1/inst 1
This means that the module in slot 1 detected a CRC error on packets coming from module 15 through XBAR 1/instance 1. Going forward, we will refer to the module the CRC errors were coming from as the ingress module (mod 15 in this case) and the module that reported the problem as the egress module (mod 1). XBAR #1 is the crossbar the packet was received through. There are two fabric instances per XBAR, so in this case module 1 detected CRC errors coming from module 15 through XBAR 1, instance 1.
(2) CRC error with single source module, receive module, but no XBAR instance
%OC_USD-SLOT4-2-RF_CRC: OC2 received packets with CRC error from MOD 1
In this message, module 4 reported a CRC error coming from module 1. You will notice that the XBAR information is missing. Why? The system was unable to determine which XBAR the packet traversed. There are several possible reasons, but the two most common are: first, the information in the fabric header of the packet may have been corrupt, so the XBAR it came through could not be determined; second, the XBAR that was traversed may have been removed from the system after the error incremented, so it was not reported in the hourly syslog message.
(3) CRC error with no receive module
%OC_USD-2-RF_CRC: OC1 received packets with CRC error from MOD 16 through XBAR slot 1/inst 1
Here, some device detected a CRC error coming from module 16 through XBAR 1. There is, however, no receiving slot in the message. Why? When the supervisor (SUP) detects a CRC error coming from the fabric, the slot information is not logged. So when you see no slot information, the SUP detected the problem. Does this mean the SUP is bad? Not necessarily. Just as when a module reports the problem, multiple components could have caused it: module 16, the chassis (not as likely), XBAR 1, or the SUP.
(4) CRC error with multiple possible source modules
%OC_USD-SLOT6-2-RF_CRC: OC2 received packets with CRC error from MOD 11 or 12 or 14 or 15 or 16 or 17 or 18
The source module is determined by identifying the ingress Octopus that sourced the bad packet. However, the driver that raises the interrupt to log this error message does not always know which ingress Octopus the bad packet originated from, because some of the bits used to represent the ingress Octopus are unused. If the system determines that multiple modules might have these unused bits set, it has to assume that any one of them could be the source, and as a result all of those modules are included in the error message. In this example, the system determined that module 13 could not have this conflict because those bits are not used on it, so it was not logged as a potential source.
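When many of these messages accumulate, it helps to pull the fields apart programmatically. The sketch below is plain Python; the regular expression is inferred from the sample messages quoted in this document, not from an official message-format specification, so treat it as an assumption and adjust it to your own logs.

import re

# Regex inferred from the RF_CRC samples shown above (an assumption, not an
# official format). It tolerates the variants described in (1)-(4): missing
# SLOT (supervisor-reported), missing XBAR info, multiple candidate source
# modules, and multiple XBAR slot/instance pairs.
MSG_RE = re.compile(
    r"%OC_USD(?:-SLOT(?P<slot>\d+))?-2-RF_CRC: "
    r"(?:(?P<octopus>OC\d+) received packets with )?CRC error from MOD "
    r"(?P<mods>\d+(?: or \d+)*)"
    r"(?: through XBAR (?P<xbars>.*))?"
)

def parse_rf_crc(line):
    """Return a dict describing one RF_CRC syslog line, or None if it does not match."""
    m = MSG_RE.search(line)
    if not m:
        return None
    return {
        # None -> the supervisor detected the error (no SLOT in the mnemonic)
        "egress_slot": int(m.group("slot")) if m.group("slot") else None,
        "egress_octopus": m.group("octopus"),
        # one or more candidate ingress modules ("MOD 11 or 12 or 14 ...")
        "ingress_mods": [int(x) for x in re.findall(r"\d+", m.group("mods"))],
        # zero or more (xbar_slot, instance) pairs; empty -> XBAR unknown
        "xbars": [(int(s), int(i)) for s, i in
                  re.findall(r"slot (\d+)/inst (\d+)", m.group("xbars") or "")],
    }

if __name__ == "__main__":
    sample = ("%OC_USD-SLOT1-2-RF_CRC: OC1 received packets with CRC error "
              "from MOD 15 through XBAR slot 1/inst 1")
    print(parse_rf_crc(sample))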
IV. Fabric CRC Troubleshooting approach:
Newer linecards (M2) and fabric modules (FAB2) detect CRCs in S1, S2, or S3, making it much easier to isolate the faulty component.
Investigating in detail and finding patterns in the failures and log messages will help isolate the faulty component.
Some of the questions to ask:
1. Is a single ingress (source) module common to all of the error messages?
2. Is a single XBAR/instance common to all of the error messages?
3. Is a single egress module reporting all of the errors, or are many modules reporting them?
4. When did the errors start, and was any hardware recently inserted, reseated, or replaced?
Answers to the above questions should allow you to approach troubleshooting from an angle that is more likely to lead to faster resolution.
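One quick way to answer these questions is to tally the fields across every RF_CRC message in the log and look for the single common factor. A minimal sketch, reusing the parse_rf_crc() sketch from section III (the module and file names here are placeholders):

from collections import Counter

# Assumes the parse_rf_crc() sketch from section III was saved as
# rf_crc_parse.py; the module and log file names are placeholders.
from rf_crc_parse import parse_rf_crc

def summarize(log_path):
    """Count how often each egress slot, candidate ingress module, and
    XBAR slot/instance appears, to expose the single common factor."""
    egress, ingress, xbars = Counter(), Counter(), Counter()
    with open(log_path) as fh:
        for line in fh:
            rec = parse_rf_crc(line)
            if rec is None:
                continue
            egress[rec["egress_slot"]] += 1
            for mod in rec["ingress_mods"]:
                ingress[mod] += 1
            for xb in rec["xbars"]:
                xbars[xb] += 1
    print("errors per egress slot :", egress.most_common())
    print("errors per ingress mod :", ingress.most_common())
    print("errors per XBAR/inst   :", xbars.most_common())

if __name__ == "__main__":
    summarize("switch_syslog.txt")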
V. General CRC Troubleshooting Guidelines:
(1) Faulty ingress module sending corrupt packets into the fabric
Logs:
%OC_USD-SLOT1-2-RF_CRC: OC2 received packets with CRC error from MOD 7
%OC_USD-SLOT3-2-RF_CRC: OC2 received packets with CRC error from MOD 7
%OC_USD-SLOT1-2-RF_CRC: OC2 received packets with CRC error from MOD 7
%OC_USD-SLOT3-2-RF_CRC: OC2 received packets with CRC error from MOD 7
%OC_USD-SLOT1-2-RF_CRC: OC2 received packets with CRC error from MOD 7
%OC_USD-SLOT3-2-RF_CRC: OC2 received packets with CRC error from MOD 7
Problem:
For the last few hours, CRC errors have been seen on modules 1 and 3, coming from module 7 and module 7 only.
Most likely cause of the problem:
1. There is a bad or mis-seated XBAR corrupting packets coming from module 7
2. Module 7 is bad or mis-seated
Process to isolate the faulty component:
If you have three XBARs installed, this gives you N+1 redundancy. Therefore, you should be able to power them off one at a time (never more than one powered off at any given time) with only minimal impact, to see if the problem goes away.
N7K(config)# poweroff xbar 1
<monitor>
N7K(config)# no poweroff xbar 1
N7K(config)# poweroff xbar 2
<monitor>
N7K(config)# no poweroff xbar 2
N7K(config)# poweroff xbar 3
<monitor>
N7K(config)# no poweroff xbar 3
In this particular case study, shutting down the XBARs did not resolve the problem.
As two modules (mod 1 and 3) are reporting the CRC errors, it is unlikely that they are the cause. Our next step, then, is to reseat module 7 (the ingress module), because it is the most likely faulty component. Mis-seated linecards can cause this problem, and it is recommended to reseat the module before replacing it.
After reseating module 7 and monitoring, we still find that CRC errors are incrementing on the fabric. A Cisco TAC case should be opened at this point (it can always be opened earlier) to replace/EFA module 7, since the reseat did not resolve the problem.
In our case study, replacing module 7 stopped the fabric CRC error messages and the packet loss the customer was seeing.
(2) Mis-seated XBAR injecting corrupt packets
Logs:
%OC_USD-SLOT11-2-RF_CRC: CRC error from MOD 12 through XBAR slot 3/inst 1
%OC_USD-SLOT12-2-RF_CRC: CRC error from MOD 12 through XBAR slot 3/inst 1
%OC_USD-SLOT13-2-RF_CRC: CRC error from MOD 12 through XBAR slot 3/inst 1
%OC_USD-SLOT15-2-RF_CRC: CRC error from MOD 12 through XBAR slot 3/inst 1
%OC_USD-SLOT2-2-RF_CRC: CRC error from MOD 12 through XBAR slot 3/inst 1
%OC_USD-SLOT4-2-RF_CRC: CRC error from MOD 12 through XBAR slot 3/inst 1
%OC_USD-SLOT5-2-RF_CRC: CRC error from MOD 12 through XBAR slot 3/inst 1
%OC_USD-SLOT6-2-RF_CRC: CRC error from MOD 12 through XBAR slot 3/inst 1
%OC_USD-SLOT7-2-RF_CRC: CRC error from MOD 12 through XBAR slot 3/inst 1
%OC_USD-SLOT8-2-RF_CRC: CRC error from MOD 12 through XBAR slot 3/inst 1
Problem:
Multiple modules are reporting CRC errors coming from module 12 through XBAR 3.
Most likely cause of the problem:
1. XBAR 3 is bad or mis-seated
2. Module 12 is mis-seated or faulty
Process to isolate the faulty component:
1. Shutdown XBAR 3 and monitor
2. Reseat the ingress module 12 and monitor
3. Replace module 12 and monitor
In our case, we shut down XBAR 3 using the procedure described in the first case study and monitored for further errors. The errors ceased while XBAR 3 was shut down. At this point, XBAR 3 was reseated, taking care to ensure that no pins on the midplane were bent and that the module was properly inserted. After re-enabling XBAR 3, the problem never reoccurred. This problem can be attributed to a mis-seated XBAR module.
(3) Faulty Egress module corrupts packets from the Fabric
Logs:
%OC_USD-SLOT6-2-RF_CRC: OC1 received packets with CRC error from MOD 1 or 2 or 7 or 13 or 17 through XBAR slot 1/inst 1 and slot 2/inst 1 and slot 3/inst 1
%OC_USD-SLOT6-2-RF_CRC: OC2 received packets with CRC error from MOD 1 or 2 or 3 or 7 or 15 or 17 through XBAR slot 2/inst 1 and slot 3/inst 1
%OC_USD-SLOT6-2-RF_CRC: OC1 received packets with CRC error from MOD 1 or 2 or 5 or 7 or 16 or 17 through XBAR slot 1/inst 1 and slot 2/inst 1 and slot 3/inst 1
Problem:
Module 6 is reporting packets with CRC errors being received from multiple linecards and XBARs
Most likely cause of the problem:
Module 6 is mis-seated or bad
Process to isolate the faulty component:
Module 6 is the most likely cause of the fault because it is the one common element in all of the error messages: of all the modules listed, module 6 is the one that shows up consistently. Therefore, we first reseat module 6 to see if that resolves the issue before replacing it.
In our case, we reseated module 6 but the errors persisted, so the next step is to open a TAC case to have module 6 replaced. After module 6 was replaced, the errors were no longer reported.
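The three case studies boil down to asking which element is the single common factor in the messages. The rough heuristic below (plain Python, illustrative only; it is a first guess and no substitute for the isolation steps described above) captures that reasoning.

def likely_culprit(egress_slots, ingress_mods, xbars):
    """Rough first guess: which element is the single common factor?
    Arguments are the sets of values seen across the RF_CRC messages."""
    if len(xbars) == 1 and len(egress_slots) > 1:
        # Case study 2 pattern: many reporters, the same XBAR in every message
        return f"suspect XBAR {next(iter(xbars))}: power off to confirm, then reseat"
    if len(ingress_mods) == 1 and len(egress_slots) > 1:
        # Case study 1 pattern: many reporters, a single source module
        return f"suspect ingress module {next(iter(ingress_mods))}: reseat, then replace"
    if len(egress_slots) == 1 and len(ingress_mods) > 1:
        # Case study 3 pattern: one reporter blaming many sources and XBARs
        return f"suspect egress module {next(iter(egress_slots))}: reseat, then replace"
    return "no single common factor: keep collecting data and engage Cisco TAC"

if __name__ == "__main__":
    # Example matching case study 3: only module 6 reports, blaming many modules/XBARs
    print(likely_culprit(egress_slots={6},
                         ingress_mods={1, 2, 3, 5, 7, 13, 15, 16, 17},
                         xbars={(1, 1), (2, 1), (3, 1)}))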
VII. Troubleshooting Commands:
Some of the commands used to troubleshoot/debug:
show clock
show module xbar
show hardware fabric-utilization detail
show hardware fabric-utilization detail timestamp
show hardware internal xbar-driver all event-history errors
show hardware internal xbar-driver all event-history msgs
show system internal xbar-client internal event-history msgs
show system internal xbar all
show module internal event-history xbar 1
show module internal activity xbar 1
show module internal event-history xbar 2
show module internal activity xbar 2
show module internal event-history xbar 3
show module internal activity xbar 3
show module internal event-history xbar 4
show module internal activity xbar 4
show module internal event-history xbar 5
show module internal activity xbar 5
show logging onboard internal xbar
show logging onboard internal octopus
show tech detail
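If the errors are intermittent, it can help to capture a subset of these commands on a schedule so TAC gets timestamped before/after snapshots. A minimal sketch, assuming SSH access to the switch and the third-party netmiko library; the hostname, credentials, interval, and file name are placeholders.

import time
from netmiko import ConnectHandler  # third-party library: pip install netmiko

# Placeholder connection details; replace with your own.
DEVICE = {
    "device_type": "cisco_nxos",
    "host": "n7k-mgmt.example.com",
    "username": "admin",
    "password": "CHANGE_ME",
}

# A subset of the troubleshooting commands listed above.
COMMANDS = [
    "show clock",
    "show module xbar",
    "show hardware fabric-utilization detail",
    "show logging onboard internal xbar",
    "show logging onboard internal octopus",
]

def collect_once(conn, out_path):
    """Append one timestamped snapshot of each command's output to a text file."""
    with open(out_path, "a") as fh:
        for cmd in COMMANDS:
            fh.write(f"\n===== {time.strftime('%Y-%m-%d %H:%M:%S')}  {cmd} =====\n")
            fh.write(conn.send_command(cmd))
            fh.write("\n")

if __name__ == "__main__":
    conn = ConnectHandler(**DEVICE)
    try:
        for _ in range(12):       # e.g. one snapshot every 30 minutes for 6 hours
            collect_once(conn, "n7k_fabric_crc_snapshots.txt")
            time.sleep(1800)
    finally:
        conn.disconnect()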
For comments and feedback, please email Yogesh at yramdoss@cisco.com