Nexus7000: Understanding and Troubleshooting Fabric CRC Errors


I. Introduction

This document helps resolve fabric CRC issues reported on the Cisco Nexus 7000 platform.

This document covers the most common types of fabric CRC errors. Troubleshooting fabric CRCs requires collecting data, analyzing it, and then performing a process of elimination to isolate the most likely failing component.

The "General CRC troubleshooting guidelines" section below establishes a general framework for troubleshooting these issues. The case study sections then provide examples of how a similar problem could be troubleshot. Finally, the "Monitoring fabric CRCs" section describes an alternative way to detect and monitor fabric CRCs.

II. Fabric CRC detection overview:

High-level diagram of Nexus 7018 Fabric with M1 linecards:

[Figure: xbar.png]


Legend:

Stage1 (S1), Stage2 (S2) and Stage3 (S3) are the three stages of the Nexus7000 fabric.

Octopus is the Queue Engine

Santa Cruz (SC) is the Fabric ASIC

Instance 1 and 2 are the two Santa Cruz instances on the XBAR.

The above is an overview of the components involved when a packet traverses the fabric. To keep it simple, this document considers only one XBAR. Keep in mind that most Nexus 7000 switches have three or more XBARs installed.

Assuming a unidirectional flow from Module #1 to Module #2, the ingress Octopus-1 on Mod 1 performs error checking on packets it receives from the south, and the egress Octopus-1 on Mod 2 on packets it receives from the north. If a CRC error is detected in stage 3, the corruption could have happened in stage 1 or stage 2 as well, since no CRC check is done in those stages. So the devices involved in the path are the ingress Octopus, the chassis, the crossbar fabric, and the egress Octopus.

In M1/Fab1 architecture, CRCs are detected only on the egress linecard (S3).

Sample Error Message:

%OC_USD-SLOT1-2-RF_CRC: OC1 received packets with CRC error from MOD 15 through XBAR slot 1/inst 1

The above message is reported by module 1, indicating that it received packets with a bad CRC from module 15 via XBAR 1/instance 1.
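The fields in these messages can be pulled apart programmatically when you need to sift through many of them. Below is a minimal Python sketch (the field names are our own, not an official Cisco schema) that parses the sample format shown in this document:

```python
import re

# Regex built against the sample RF_CRC messages in this document; the
# -SLOT and XBAR parts are optional, matching the variants discussed below.
PATTERN = re.compile(
    r"%OC_USD(?:-SLOT(?P<egress_slot>\d+))?-2-RF_CRC: "
    r"(?P<octopus>OC\d+) received packets with CRC error from MOD (?P<src_mod>\d+)"
    r"(?: through XBAR slot (?P<xbar_slot>\d+)/inst (?P<xbar_inst>\d+))?"
)

def parse_rf_crc(line):
    """Return a dict of fields, or None if the line is not an RF_CRC message."""
    m = PATTERN.search(line)
    return m.groupdict() if m else None

msg = ("%OC_USD-SLOT1-2-RF_CRC: OC1 received packets with CRC error "
       "from MOD 15 through XBAR slot 1/inst 1")
print(parse_rf_crc(msg))
# egress_slot '1', octopus 'OC1', src_mod '15', xbar_slot '1', xbar_inst '1'
```

Missing optional parts (no SLOT, no XBAR info) come back as None, which mirrors the message variants described in the next section.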

III. Understanding different Fabric CRC errors:

(1) CRC error with single source module, receive module, and XBAR instance

%OC_USD-SLOT1-2-RF_CRC: OC1 received packets with CRC error from MOD 15 through XBAR slot 1/inst 1

This means that the module in slot 1 detected a CRC error coming from module 15 through XBAR 1/instance 1. Going forward, we will refer to the module the CRC errors were coming from as the ingress module (module 15 in this case) and the module that reported the problem as the egress module (module 1). XBAR 1 is the crossbar the packet was received through. There are two instances per XBAR, so in this case module 1 detected CRC errors coming from module 15 through XBAR 1, instance 1.

(2) CRC error with single source module, receive module, but no XBAR instance

%OC_USD-SLOT4-2-RF_CRC: OC2 received packets with CRC error from MOD 1

In this message, module 4 reported CRC errors coming from module 1. You will notice that the XBAR info is missing. Why? The system was unable to ascertain which XBAR the packet traversed. There are many possible reasons, but the two most common are: first, the information in the fabric header of the packet could be corrupt, so the XBAR cannot be determined; second, the XBAR that was traversed may have been removed from the system after the error counter incremented, so it was not reported in the hourly syslog message.

(3) CRC error with no receive module

%OC_USD-2-RF_CRC: OC1 received packets with CRC error from MOD 16 through XBAR slot 1/inst 1

Here, some device detected a CRC error from module 16 through XBAR 1. There is, however, no receiving module. Why? When the SUP (supervisor) detects a CRC error coming from the fabric, the slot info is not logged. So when you see no slot info, the SUP detected the problem. Does this mean the SUP is bad? Not necessarily; just as with a module reporting the problem, multiple components could have caused it: module 16, the chassis (less likely), XBAR 1, or the SUP.

(4) CRC error with multiple possible source modules

%OC_USD-SLOT6-2-RF_CRC: OC2 received packets with CRC error from MOD 11 or 12 or 14 or 15 or 16 or 17 or 18

The source module is gleaned by identifying the ingress Octopus that sourced the bad packet. The driver that raises the interrupt to log this error message does not always know the ingress Octopus the bad packet originated from, because some of the bits that would identify the ingress Octopus are not used. If the system determines that multiple modules might have these unused bits turned on, it has to assume any one of them could be the source, and as a result all of those modules are included in the error message. In this example, the system found that module 13 could not have this conflict, so it was not logged as a potential source.
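When a message lists several candidate modules like this, the "or" list can be split out for further correlation. A short Python sketch, assuming the format matches the sample above:

```python
import re

def candidate_sources(line):
    """Return the list of candidate ingress modules named in an RF_CRC message."""
    # Capture everything in the "MOD 11 or 12 or 14 ..." list; for a
    # single-source message this degenerates to a one-element list.
    m = re.search(r"from MOD ([\d ]+(?:or [\d ]+)*)", line)
    if not m:
        return []
    return [int(tok) for tok in m.group(1).split() if tok.isdigit()]

msg = ("%OC_USD-SLOT6-2-RF_CRC: OC2 received packets with CRC error "
       "from MOD 11 or 12 or 14 or 15 or 16 or 17 or 18")
print(candidate_sources(msg))  # [11, 12, 14, 15, 16, 17, 18]
```

Note that module 13 is absent from the returned list, matching the explanation above.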

IV. Fabric CRC Troubleshooting approach:

Newer linecards (M2) and fabric modules (FAB2) detect CRCs in S1, S2, or S3, making it much easier to isolate the faulty component.

Investigating in detail and finding a pattern in the failures and log messages will help isolate the faulty component.

Some of the questions to ask:

  • Was the error message a one-time event, or have multiple CRC error messages been logged?
  • How frequently are the CRC error messages being logged? Do we see them every hour, once a day, once a month, etc.?
  • Are the CRC errors ALL coming from the same ingress module?
  • Are the CRC errors ALL reported on the same egress module?
  • Are the CRC errors coming from multiple ingress modules AND reported on multiple egress modules?
  • If multiple modules are reporting CRC errors, is there a common source module or XBAR module?

Answers to the above questions should allow you to approach troubleshooting from the angle most likely to lead to a fast resolution.
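Several of these questions boil down to counting which modules appear in the messages. A minimal Python sketch of that tally, using naive string splitting tuned to the sample format in this document (adapt it to your real syslog export):

```python
from collections import Counter

# Sample messages in the format shown earlier in this document.
logs = [
    "%OC_USD-SLOT1-2-RF_CRC: OC2 received packets with CRC error from MOD 7",
    "%OC_USD-SLOT3-2-RF_CRC: OC2 received packets with CRC error from MOD 7",
    "%OC_USD-SLOT1-2-RF_CRC: OC2 received packets with CRC error from MOD 7",
]

# Egress module: the slot number embedded in "-SLOTn-".
egress = Counter(line.split("-SLOT")[1].split("-")[0] for line in logs)
# Ingress module: the number after "from MOD".
ingress = Counter(line.split("from MOD ")[1].split()[0] for line in logs)

print("egress slots:", egress)   # which modules report the errors
print("ingress mods:", ingress)  # which module the errors come from
```

A single ingress module dominating the tally (as here) points the investigation at that module or the path from it, which is exactly the situation in the first case study below.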

V. General CRC Troubleshooting Guidelines:


  1. Find the common modules (including XBARs) that are reported in the Fabric CRC error messages.
  2. Build a theory and test it. That is, after finding the common modules, pick the most likely cause of the problem and shut it down (in the case of an XBAR), move it to a known good slot, reseat it, or replace it, while monitoring to see if the problem goes away.
  3. Shut down, reseat, or replace modules one at a time. This makes it easier to isolate the faulty part.
  4. When you shut down, move, reseat, or replace a part, look for any changes in the problem's symptoms. You may have to revise your action plan as you learn more from each step taken.
  5. If multiple parts have been replaced and the problem still persists: (1) the new parts could be bad, so test them in a working slot; (2) multiple XBARs could be bad; (3) a bad chassis slot could be the cause.

VI. Case Studies:

(1) Ingress module corrupting the packets

Logs:

%OC_USD-SLOT1-2-RF_CRC: OC2 received packets with CRC error from MOD 7

%OC_USD-SLOT3-2-RF_CRC: OC2 received packets with CRC error from MOD 7

%OC_USD-SLOT1-2-RF_CRC: OC2 received packets with CRC error from MOD 7

%OC_USD-SLOT3-2-RF_CRC: OC2 received packets with CRC error from MOD 7

%OC_USD-SLOT1-2-RF_CRC: OC2 received packets with CRC error from MOD 7

%OC_USD-SLOT3-2-RF_CRC: OC2 received packets with CRC error from MOD 7

Problem:

For the last few hours, CRC errors have been seen on modules 1 and 3, coming from module 7 and only module 7.

Most likely cause of the problem:

1.  There is a bad or mis-seated XBAR corrupting packets coming from module 7

2.  Module 7 is bad or mis-seated

Process to isolate the faulty component:


  1. Shut down the XBARs one by one, monitoring to see if the problem goes away
  2. Reseat ingress module 7 and monitor
  3. Replace module 7 and monitor

If you have three XBARs installed, this gives you N+1 redundancy. Therefore, you should be able to shut them down one at a time (never more than one shut at any given time) with only minimal impact, to see if the problem goes away.

N7K(config)# poweroff xbar 1

<monitor>

N7K(config)# no poweroff xbar 1

N7K(config)# poweroff xbar 2

<monitor>

N7K(config)# no poweroff xbar 2

N7K(config)# poweroff xbar 3

<monitor>

N7K(config)# no poweroff xbar 3

In this particular case study, shutting down the XBARs did not resolve the problem.

As there are two modules reporting CRC errors, it is unlikely that the reporting modules (mod 1 & 3) are themselves the cause. Our next step then should be to reseat module 7 (the ingress module), because it is the most likely faulty component. Mis-seated linecards can cause this problem, and it is recommended to reseat a module before replacing it.

After reseating module 7 and monitoring, we still find that CRC errors are incrementing on the fabric. A Cisco TAC case should be opened at this point (though it can always be opened earlier) to replace/EFA module 7, since the reseat did not resolve the problem.

In our case study, replacing module 7 stopped the fabric CRC error messages and the packet loss our customer was seeing.

(2)  Mis-seated XBAR injecting corrupt packets

Logs:

%OC_USD-SLOT11-2-RF_CRC: CRC error from MOD 12 through XBAR slot 3/inst 1

%OC_USD-SLOT12-2-RF_CRC: CRC error from MOD 12 through XBAR slot 3/inst 1

%OC_USD-SLOT13-2-RF_CRC: CRC error from MOD 12 through XBAR slot 3/inst 1

%OC_USD-SLOT15-2-RF_CRC: CRC error from MOD 12 through XBAR slot 3/inst 1

%OC_USD-SLOT2-2-RF_CRC: CRC error from MOD 12 through XBAR slot 3/inst 1

%OC_USD-SLOT4-2-RF_CRC: CRC error from MOD 12 through XBAR slot 3/inst 1

%OC_USD-SLOT5-2-RF_CRC: CRC error from MOD 12 through XBAR slot 3/inst 1

%OC_USD-SLOT6-2-RF_CRC: CRC error from MOD 12 through XBAR slot 3/inst 1

%OC_USD-SLOT7-2-RF_CRC: CRC error from MOD 12 through XBAR slot 3/inst 1

%OC_USD-SLOT8-2-RF_CRC: CRC error from MOD 12 through XBAR slot 3/inst 1

Problem: 

Multiple modules are reporting CRC errors from module 12 going through XBAR 3.

Most likely cause of the problem:

1.  XBAR 3 is bad or mis-seated

2.  Module 12 is mis-seated or faulty

Process to isolate the faulty component:

1.  Shut down XBAR 3 and monitor

2.  Reseat ingress module 12 and monitor

3.  Replace module 12 and monitor

In our case, we shut down XBAR 3 using the procedure described in the first case study and monitored for further errors. The errors ceased when XBAR 3 was shut down. At this point XBAR 3 was reseated, taking care to ensure that no pins were bent on the midplane and that the module was properly inserted. After re-enabling XBAR 3, the problem never reoccurred. This problem can be attributed to a mis-seated XBAR module.

(3) Faulty Egress module corrupts packets from the Fabric

Logs:

%OC_USD-SLOT6-2-RF_CRC:   OC1 received packets with CRC error from MOD 1 or 2 or 7 or 13 or 17   through XBAR  slot 1/inst 1 and slot 2/inst 1 and slot 3/inst 1

%OC_USD-SLOT6-2-RF_CRC:   OC2 received packets with CRC error from MOD 1 or 2 or 3 or 7 or 15 or   17 through XBAR slot 2/inst 1 and slot 3/inst 1

%OC_USD-SLOT6-2-RF_CRC:   OC1 received packets with CRC error from MOD 1 or 2 or 5 or 7 or 16 or   17 through XBAR slot 1/inst 1 and slot 2/inst 1 and slot 3/inst 1

Problem:

Module 6 is reporting packets with CRC errors being received from multiple linecards and XBARs

Most likely cause of the problem:

Module 6 is mis-seated or bad

Process to isolate the faulty component:


  1. Reseat module 6 and monitor
  2. Replace module 6 and monitor

Module 6 is the most likely cause of the fault because it is the one module common to all the error messages: of everything listed, module 6 is the component that shows up most consistently. Therefore, we try reseating module 6 to see if that resolves the issue before replacing it.

In our case, we reseated module 6 but the errors persisted, so the next step is to open a TAC case to have module 6 replaced. After module 6 was replaced, the errors were no longer reported.
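The "common module" reasoning above can be sketched with plain set operations. The data below mirrors the case-study messages; in practice these tuples would come from parsing the syslog lines:

```python
# Each tuple: (reporting slot, candidate source modules from the "or" list).
msgs = [
    (6, {1, 2, 7, 13, 17}),
    (6, {1, 2, 3, 7, 15, 17}),
    (6, {1, 2, 5, 7, 16, 17}),
]

reporters = {slot for slot, _ in msgs}
common_sources = set.intersection(*(srcs for _, srcs in msgs))

# Module 6 appears as the reporter in every message, while the candidate
# source lists are uncertain guesses that only partially overlap.
print("reporting slots:", reporters)                # {6}
print("common candidate sources:", common_sources)  # {1, 2, 7, 17}
```

A single reporting slot combined with broad, shifting source lists is the signature of a faulty egress module, as this case study concludes.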

VII. Troubleshooting Commands:

Some of the commands used to troubleshoot/debug:

show clock

show module xbar

show hardware fabric-utilization detail 

show hardware fabric-utilization detail timestamp

show hardware internal xbar-driver all event-history errors

show hardware internal xbar-driver all event-history msgs

show system internal xbar-client internal event-history msgs

show system internal xbar all

show module internal event-history xbar 1

show module internal activity xbar 1

show module internal event-history xbar 2

show module internal activity xbar 2

show module internal event-history xbar 3

show module internal activity xbar 3

show module internal event-history xbar 4

show module internal activity xbar 4

show module internal event-history xbar 5

show module internal activity xbar 5

show logging onboard internal xbar

show logging onboard internal octopus

show tech detail

For comments and feedback, please email Yogesh at yramdoss@cisco.com