Nexus7000: Understanding and Troubleshooting Fabric CRC Errors


I. Introduction

This document helps resolve fabric CRC issues reported on the Cisco Nexus 7000 platform.

This document covers the most common types of fabric CRC errors. Troubleshooting fabric CRCs requires collecting data, analyzing it, and then performing a process of elimination to isolate the most likely failing component.

The "General CRC troubleshooting guidelines" section below establishes a general framework for troubleshooting these issues. The case study sections then provide examples of how a similar problem could be troubleshot. Finally, the "Monitoring fabric CRCs" section describes an alternative way to detect and monitor fabric CRCs.

II. Fabric CRC detection overview:

High-level diagram of Nexus 7018 Fabric with M1 linecards:

[Figure: xbar.png]


Legend:

Stage1 (S1), Stage2 (S2) and Stage3 (S3) are the three stages of the Nexus7000 fabric.

Octopus is the Queue Engine

Santa Cruz (SC) is the Fabric ASIC

Instance 1 and 2 are the two Santa Cruz instances on the XBAR.

The above is an overview of the components involved when a packet traverses the fabric. To keep it simple, this document considers only one XBAR. Keep in mind that most Nexus 7000 switches have three or more XBARs installed.

Assuming a unidirectional flow from Module #1 to Module #2, the ingress Octopus-1 on Mod 1 performs error checking on packets it receives from the south, and the egress Octopus-1 on Mod 2 on packets it receives from the north. If a CRC error is detected in stage 3, the corruption could have happened in stage 1 or stage 2 as well, since no CRC check is done in those stages. So the devices involved in the path are the ingress Octopus, the chassis, the crossbar fabric, and the egress Octopus.

In M1/Fab1 architecture, CRCs are detected only on the egress linecard (S3).

Sample Error Message:

%OC_USD-SLOT1-2-RF_CRC: OC1 received packets with CRC error from MOD 15 through XBAR slot 1/inst 1

The above message is reported by module 1, indicating that it received packets with a bad CRC from module 15 via XBAR 1/instance 1.
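The fields in these messages can be pulled apart programmatically when you need to sift through many of them. Below is a minimal Python sketch (the field names are our own, not an official Cisco schema) that parses the sample format shown in this document:

```python
import re

# Regex built against the sample RF_CRC messages in this document; the
# -SLOT and XBAR parts are optional, matching the variants discussed below.
PATTERN = re.compile(
    r"%OC_USD(?:-SLOT(?P<egress_slot>\d+))?-2-RF_CRC: "
    r"(?P<octopus>OC\d+) received packets with CRC error from MOD (?P<src_mod>\d+)"
    r"(?: through XBAR slot (?P<xbar_slot>\d+)/inst (?P<xbar_inst>\d+))?"
)

def parse_rf_crc(line):
    """Return a dict of fields, or None if the line is not an RF_CRC message."""
    m = PATTERN.search(line)
    return m.groupdict() if m else None

msg = ("%OC_USD-SLOT1-2-RF_CRC: OC1 received packets with CRC error "
       "from MOD 15 through XBAR slot 1/inst 1")
print(parse_rf_crc(msg))
# egress_slot '1', octopus 'OC1', src_mod '15', xbar_slot '1', xbar_inst '1'
```

Missing optional parts (no SLOT, no XBAR info) come back as None, which mirrors the message variants described in the next section.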

III. Understanding different Fabric CRC errors:

(1) CRC error with single source module, receive module, and XBAR instance

%OC_USD-SLOT1-2-RF_CRC: OC1 received packets with CRC error from MOD 15 through XBAR slot 1/inst 1

This means that the module in slot 1 detected a CRC error coming from module 15 through XBAR 1/instance 1. Going forward, we will refer to the module the CRC errors were coming from as the ingress module (module 15 in this case) and the module that reported the problem as the egress module (module 1). XBAR 1 is the crossbar the packet was received through. There are two instances per XBAR, so in this case module 1 detected CRC errors coming from module 15 through XBAR 1, instance 1.

(2) CRC error with single source module, receive module, but no XBAR instance

%OC_USD-SLOT4-2-RF_CRC: OC2 received packets with CRC error from MOD 1

In this message, module 4 reported CRC errors coming from module 1. You will notice that the XBAR info is missing. Why? The system was unable to ascertain which XBAR the packet traversed. There are many possible reasons, but the two most common are: first, the information in the fabric header of the packet could be corrupt, so the XBAR cannot be determined; second, the XBAR that was traversed may have been removed from the system after the error counter incremented, so it was not reported in the hourly syslog message.

(3) CRC error with no receive module

%OC_USD-2-RF_CRC: OC1 received packets with CRC error from MOD 16 through XBAR slot 1/inst 1

Here, some device detected a CRC error from module 16 through XBAR 1. There is, however, no receiving module. Why? When the SUP (supervisor) detects a CRC error coming from the fabric, the slot info is not logged. So when you see no slot info, the SUP detected the problem. Does this mean the SUP is bad? Not necessarily; just as with a module reporting the problem, multiple components could have caused it: module 16, the chassis (less likely), XBAR 1, or the SUP.

(4) CRC error with multiple possible source modules

%OC_USD-SLOT6-2-RF_CRC: OC2 received packets with CRC error from MOD 11 or 12 or 14 or 15 or 16 or 17 or 18

The source module is gleaned by identifying the ingress Octopus that sourced the bad packet. The driver that raises the interrupt to log this error message does not always know the ingress Octopus the bad packet originated from, because some of the bits that would identify the ingress Octopus are not used. If the system determines that multiple modules might have these unused bits turned on, it has to assume any one of them could be the source, and as a result all of those modules are included in the error message. In this example, the system found that module 13 could not have this conflict, so it was not logged as a potential source.
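When a message lists several candidate modules like this, the "or" list can be split out for further correlation. A short Python sketch, assuming the format matches the sample above:

```python
import re

def candidate_sources(line):
    """Return the list of candidate ingress modules named in an RF_CRC message."""
    # Capture everything in the "MOD 11 or 12 or 14 ..." list; for a
    # single-source message this degenerates to a one-element list.
    m = re.search(r"from MOD ([\d ]+(?:or [\d ]+)*)", line)
    if not m:
        return []
    return [int(tok) for tok in m.group(1).split() if tok.isdigit()]

msg = ("%OC_USD-SLOT6-2-RF_CRC: OC2 received packets with CRC error "
       "from MOD 11 or 12 or 14 or 15 or 16 or 17 or 18")
print(candidate_sources(msg))  # [11, 12, 14, 15, 16, 17, 18]
```

Note that module 13 is absent from the returned list, matching the explanation above.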

IV. Fabric CRC Troubleshooting approach:

Newer linecards (M2) and fabric modules (FAB2) detect CRCs in S1, S2, or S3, making it much easier to isolate the faulty component.

Investigating in detail and finding a pattern in the failures and log messages will help isolate the faulty component.

Some of the questions to ask:

  • Was the error message a one-time event, or have multiple CRC error messages been logged?
  • How frequently are the CRC error messages being logged? Do we see them every hour, once a day, once a month, etc.?
  • Are the CRC errors ALL coming from the same ingress module?
  • Are the CRC errors ALL reported on the same egress module?
  • Are the CRC errors coming from multiple ingress modules AND reported on multiple egress modules?
  • If multiple modules are reporting CRC errors, is there a common source module or XBAR module?

Answers to the above questions should allow you to approach troubleshooting from the angle most likely to lead to a fast resolution.
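Several of these questions boil down to counting which modules appear in the messages. A minimal Python sketch of that tally, using naive string splitting tuned to the sample format in this document (adapt it to your real syslog export):

```python
from collections import Counter

# Sample messages in the format shown earlier in this document.
logs = [
    "%OC_USD-SLOT1-2-RF_CRC: OC2 received packets with CRC error from MOD 7",
    "%OC_USD-SLOT3-2-RF_CRC: OC2 received packets with CRC error from MOD 7",
    "%OC_USD-SLOT1-2-RF_CRC: OC2 received packets with CRC error from MOD 7",
]

# Egress module: the slot number embedded in "-SLOTn-".
egress = Counter(line.split("-SLOT")[1].split("-")[0] for line in logs)
# Ingress module: the number after "from MOD".
ingress = Counter(line.split("from MOD ")[1].split()[0] for line in logs)

print("egress slots:", egress)   # which modules report the errors
print("ingress mods:", ingress)  # which module the errors come from
```

A single ingress module dominating the tally (as here) points the investigation at that module or the path from it, which is exactly the situation in the first case study below.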

V. General CRC Troubleshooting Guidelines:


  1. Find the common modules (including XBARs) that are reported in the Fabric CRC error messages.
  2. Build a theory and test it. That is, after finding the common modules, pick the most likely cause of the problem and shut it down (in the case of an XBAR), move it to a known good slot, reseat it, or replace it, while monitoring to see if the problem goes away.
  3. Shut down, reseat, or replace modules one at a time. This makes it easier to isolate the faulty part.
  4. When you shut down, move, reseat, or replace a part, look for any changes in the problem's symptoms. You may have to revise your action plan as you learn more from each step taken.
  5. If multiple parts have been replaced and the problem still persists: (1) the new parts could be bad, so test them in a working slot; (2) multiple XBARs could be bad; (3) a bad chassis slot could be the cause.

VI. Case Studies:

(1) Ingress module corrupting the packets

Logs:

%OC_USD-SLOT1-2-RF_CRC: OC2 received packets with CRC error from MOD 7

%OC_USD-SLOT3-2-RF_CRC: OC2 received packets with CRC error from MOD 7

%OC_USD-SLOT1-2-RF_CRC: OC2 received packets with CRC error from MOD 7

%OC_USD-SLOT3-2-RF_CRC: OC2 received packets with CRC error from MOD 7

%OC_USD-SLOT1-2-RF_CRC: OC2 received packets with CRC error from MOD 7

%OC_USD-SLOT3-2-RF_CRC: OC2 received packets with CRC error from MOD 7

Problem:

For the last few hours, CRC errors have been seen on modules 1 and 3, coming from module 7 and only module 7.

Most likely cause of the problem:

1.  There is a bad or mis-seated XBAR corrupting packets coming from module 7

2.  Module 7 is bad or mis-seated

Process to isolate the faulty component:


  1. Shut down the XBARs one by one, monitoring to see if the problem goes away
  2. Reseat ingress module 7 and monitor
  3. Replace module 7 and monitor

If you have three XBARs installed, this gives you N+1 redundancy. Therefore, you should be able to shut them down one at a time (never more than one shut at any given time) with only minimal impact, to see if the problem goes away.

N7K(config)# poweroff xbar 1

<monitor>

N7K(config)# no poweroff xbar 1

N7K(config)# poweroff xbar 2

<monitor>

N7K(config)# no poweroff xbar 2

N7K(config)# poweroff xbar 3

<monitor>

N7K(config)# no poweroff xbar 3

In this particular case study, shutting down the XBARs did not resolve the problem.

As there are two modules reporting CRC errors, it is unlikely that the reporting modules (mod 1 & 3) are themselves the cause. Our next step then should be to reseat module 7 (the ingress module), because it is the most likely faulty component. Mis-seated linecards can cause this problem, and it is recommended to reseat a module before replacing it.

After reseating module 7 and monitoring, we still find that CRC errors are incrementing on the fabric. A Cisco TAC case should be opened at this point (though it can always be opened earlier) to replace/EFA module 7, since the reseat did not resolve the problem.

In our case study, replacing module 7 stopped the fabric CRC error messages and the packet loss our customer was seeing.

(2)  Mis-seated XBAR injecting corrupt packets

Logs:

%OC_USD-SLOT11-2-RF_CRC: CRC error from MOD 12 through XBAR slot 3/inst 1

%OC_USD-SLOT12-2-RF_CRC: CRC error from MOD 12 through XBAR slot 3/inst 1

%OC_USD-SLOT13-2-RF_CRC: CRC error from MOD 12 through XBAR slot 3/inst 1

%OC_USD-SLOT15-2-RF_CRC: CRC error from MOD 12 through XBAR slot 3/inst 1

%OC_USD-SLOT2-2-RF_CRC: CRC error from MOD 12 through XBAR slot 3/inst 1

%OC_USD-SLOT4-2-RF_CRC: CRC error from MOD 12 through XBAR slot 3/inst 1

%OC_USD-SLOT5-2-RF_CRC: CRC error from MOD 12 through XBAR slot 3/inst 1

%OC_USD-SLOT6-2-RF_CRC: CRC error from MOD 12 through XBAR slot 3/inst 1

%OC_USD-SLOT7-2-RF_CRC: CRC error from MOD 12 through XBAR slot 3/inst 1

%OC_USD-SLOT8-2-RF_CRC: CRC error from MOD 12 through XBAR slot 3/inst 1

Problem: 

Multiple modules are reporting CRC errors from module 12 going through XBAR 3.

Most likely cause of the problem:

1.  XBAR 3 is bad or mis-seated

2.  Module 12 is mis-seated or faulty

Process to isolate the faulty component:

1.  Shut down XBAR 3 and monitor

2.  Reseat ingress module 12 and monitor

3.  Replace module 12 and monitor

In our case, we shut down XBAR 3 using the procedure described in the first case study and monitored for further errors. The errors ceased when XBAR 3 was shut down. At this point XBAR 3 was reseated, taking care to ensure that no pins were bent on the midplane and that the module was properly inserted. After re-enabling XBAR 3, the problem never reoccurred. This problem can be attributed to a mis-seated XBAR module.

(3) Faulty Egress module corrupts packets from the Fabric

Logs:

%OC_USD-SLOT6-2-RF_CRC:   OC1 received packets with CRC error from MOD 1 or 2 or 7 or 13 or 17   through XBAR  slot 1/inst 1 and slot 2/inst 1 and slot 3/inst 1

%OC_USD-SLOT6-2-RF_CRC:   OC2 received packets with CRC error from MOD 1 or 2 or 3 or 7 or 15 or   17 through XBAR slot 2/inst 1 and slot 3/inst 1

%OC_USD-SLOT6-2-RF_CRC:   OC1 received packets with CRC error from MOD 1 or 2 or 5 or 7 or 16 or   17 through XBAR slot 1/inst 1 and slot 2/inst 1 and slot 3/inst 1

Problem:

Module 6 is reporting packets with CRC errors being received from multiple linecards and XBARs

Most likely cause of the problem:

Module 6 is mis-seated or bad

Process to isolate the faulty component:


  1. Reseat module 6 and monitor
  2. Replace module 6 and monitor

Module 6 is the most likely cause of the fault because it is the one module common to all the error messages: of everything listed, module 6 is the component that shows up most consistently. Therefore, we try reseating module 6 to see if that resolves the issue before replacing it.

In our case, we reseated module 6 but the errors persisted, so the next step is to open a TAC case to have module 6 replaced. After module 6 was replaced, the errors were no longer reported.
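The "common module" reasoning above can be sketched with plain set operations. The data below mirrors the case-study messages; in practice these tuples would come from parsing the syslog lines:

```python
# Each tuple: (reporting slot, candidate source modules from the "or" list).
msgs = [
    (6, {1, 2, 7, 13, 17}),
    (6, {1, 2, 3, 7, 15, 17}),
    (6, {1, 2, 5, 7, 16, 17}),
]

reporters = {slot for slot, _ in msgs}
common_sources = set.intersection(*(srcs for _, srcs in msgs))

# Module 6 appears as the reporter in every message, while the candidate
# source lists are uncertain guesses that only partially overlap.
print("reporting slots:", reporters)                # {6}
print("common candidate sources:", common_sources)  # {1, 2, 7, 17}
```

A single reporting slot combined with broad, shifting source lists is the signature of a faulty egress module, as this case study concludes.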

VII. Troubleshooting Commands:

Some of the commands used to troubleshoot/debug:

show clock

show module xbar

show hardware fabric-utilization detail 

show hardware fabric-utilization detail timestamp

show hardware internal xbar-driver all event-history errors

show hardware internal xbar-driver all event-history msgs

show system internal xbar-client internal event-history msgs

show system internal xbar all

show module internal event-history xbar 1

show module internal activity xbar 1

show module internal event-history xbar 2

show module internal activity xbar 2

show module internal event-history xbar 3

show module internal activity xbar 3

show module internal event-history xbar 4

show module internal activity xbar 4

show module internal event-history xbar 5

show module internal activity xbar 5

show logging onboard internal xbar

show logging onboard internal octopus

show tech detail

For comments and feedback, please email Yogesh at yramdoss@cisco.com