osman
Cisco Employee

Categories of ASIC errors 

  • Single Bit Errors
  • Multiple Bit Errors
  • Parity Errors
  • CRC Errors
  • Generic Errors
  • Barrier Errors
  • Unexpected Errors
  • Link Errors
  • OOR Thresh
  • BP Errors
  • IO Errors
  • Ucode Errors
  • Config Errors
  • Indirect Errors
  • ASIC Reset Errors
  • Non Error

 

Error Threshold and Recovery

The error threshold and the recovery mechanism are defined by the ASIC type as well as by the type of error.

Triggers

ASIC Reset

  • The ASIC is reset after a predefined number of errors.
  • Predefined thresholds define when an ASIC is reset.
  • The following syslog is seen, depending on the type of ASIC and the type of error:

CIH-2-ASIC_ERROR_HARD_RESET  XXX error occurred causing halt

Linecard Reload

  • In some cases, an ASIC reset does not cure the problem and the ASIC continues to reset.
  • ASIC reset thresholds define how many ASIC resets per unit of time are tolerated before the LC is reloaded.
  • Syslog: ASIC_ERROR_REQUEST_RELOAD_BOARD (device is not recovered from fault and reload is requested).

Linecard Shutdown

  • If an LC reload does not fix the issue, the LC is shut down.
  • The LC reload threshold is defined by admin configuration, for example:

hw-module reset daily threshold 5 location all

hw-module reset hourly threshold 2 location all

<1-10>   number of resets after which the card will be placed in IN-RESET state

  nolimit  disable checking of reset threshold limit (default threshold limit is 5 for one hour, 8 for one day)

 

  • Syslog: SHUTDOWN_ON_MULTIPLE_UNGRACEFUL_RELOAD

 

Examples

The card is reset on the 6th occurrence of the ASIC Hard_Reset:
LC/1/5/CPU0:Oct 19 13:35:43.548 : ingressq[225]: %PLATFORM-CIH-2-ASIC_ERROR_HARD_RESET : ingressq[0]: A mbe error has occurred causing  halt. 0x130f000c  
LC/1/5/CPU0:Oct 19 13:36:30.877 : ingressq[225]: %PLATFORM-CIH-2-ASIC_ERROR_HARD_RESET : ingressq[0]: A mbe error has occurred causing  halt. 0x130f000c  
LC/1/5/CPU0:Oct 19 13:53:02.170 : ingressq[225]: %PLATFORM-CIH-2-ASIC_ERROR_HARD_RESET : ingressq[0]: A mbe error has occurred causing  halt. 0x130f000c  
LC/1/5/CPU0:Oct 19 13:56:52.929 : ingressq[225]: %PLATFORM-CIH-2-ASIC_ERROR_HARD_RESET : ingressq[0]: A mbe error has occurred causing  halt. 0x130f000c  
LC/1/5/CPU0:Oct 19 13:58:03.238 : ingressq[225]: %PLATFORM-CIH-2-ASIC_ERROR_HARD_RESET : ingressq[0]: A mbe error has occurred causing  halt. 0x130f000c  
LC/1/5/CPU0:Oct 19 14:01:17.599 : ingressq[225]: %PLATFORM-CIH-1-ASIC_ERROR_REQUEST_RELOAD_BOARD : ingressq[0]: device is not recovered from fault - and reload is requested.
LC/1/5/CPU0:Oct 19 14:01:17.619 : ingressq[225]: %PLATFORM-CIH-2-ASIC_ERROR_HARD_RESET : ingressq[0]: A mbe error has occurred causing  halt. 0x130f000c  
LC/1/5/CPU0:Oct 19 14:01:22.775 : sysmgr[82]: %OS-SYSMGR-2-MANAGED_REBOOT : reboot to be managed by process (platform_mgr_common) reason (ASIC seal instance 0 in critical alarm)
LC/1/5/CPU0:Oct 19 14:04:23.607 : sysmgr[82]: %OS-SYSMGR-5-NOTICE : Card is COLD started
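
When reviewing a long log capture, it can help to count the CIH events per node before reading the messages one by one. The following is a minimal sketch, assuming syslog lines in the same layout as the example above; the regular expression and the function name `summarize_cih_events` are illustrative and are not part of any Cisco tool.

```python
import re
from collections import Counter

# Assumed line layout (taken from the example above):
#   <node>:<Mon DD HH:MM:SS.mmm> : <process>[<pid>]: %PLATFORM-CIH-<sev>-<MNEMONIC> : ...
CIH_PATTERN = re.compile(
    r"^(?P<node>\S+?):(?P<timestamp>\w{3}\s+\d+\s+[\d:.]+)\s+:\s+"
    r"(?P<process>\w+)\[\d+\]:\s+"
    r"%PLATFORM-CIH-(?P<severity>\d)-(?P<mnemonic>\w+)\s+:"
)


def summarize_cih_events(lines):
    """Count PLATFORM-CIH messages per (node, mnemonic); other lines are skipped."""
    counts = Counter()
    for line in lines:
        match = CIH_PATTERN.match(line.strip())
        if match:
            counts[(match["node"], match["mnemonic"])] += 1
    return counts


if __name__ == "__main__":
    import sys

    # Usage: python summarize_cih.py < show_logging_output.txt
    for (node, mnemonic), count in sorted(summarize_cih_events(sys.stdin).items()):
        print(f"{node}  {mnemonic}  {count}")
```

Run against the example above, such a summary would make it easy to see how many ASIC_ERROR_HARD_RESET messages preceded the ASIC_ERROR_REQUEST_RELOAD_BOARD message on a given line card.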

 

Enhancements

 

  • CSCuv58131  : Bringdown Fabric board on hitting FGID SRAM parity error
    • This is specific for the Fabric boards.
    • Fixed in 5.3.3 (SMU available for 5.3.1)
  • CSCuw99811  : Reload taiko MSC on hitting QDRAM MBE on Seal ASIC.
    • Fixed in 5.3.3

 

Scope of ASIC Errors and Recovery 

Single event 

  • Transient errors that may not persist for a long duration. The least impactful recovery mechanism (memory scrubbing or hard reset) is recommended for recovering from such errors.

Repeated Errors

  • There are cases where the single event recovery mechanism fails to recover the system. If the error count reaches a threshold within a well-defined time window, a more impactful recovery action is initiated (e.g. PON reset, board reload).
  • The repeated error threshold is monitored per second and per day (e.g. for MBE errors, the threshold is 5/sec or 20/day).

Configurable System Level Reset Thresholds

  • Five hard resets or PON resets of the ASIC per day result in a card reload
  • Five card reloads per day result in a card shutdown
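
The escalation from ASIC reset to card reload to card shutdown can be pictured as two counters over a rolling day. The sketch below is only an illustration of that idea, not the platform's implementation. The class `EscalationTracker` and its method names are hypothetical. It assumes that escalation happens as soon as the count reaches the threshold, and that the ASIC reset count starts afresh after a card reload; the article does not state either detail.

```python
from collections import deque

DAY_SECONDS = 24 * 60 * 60


class EscalationTracker:
    """Hypothetical model of the reset -> reload -> shutdown ladder for one card."""

    def __init__(self, resets_per_day=5, reloads_per_day=5):
        self.resets_per_day = resets_per_day
        self.reloads_per_day = reloads_per_day
        self.resets = deque()   # timestamps (seconds) of ASIC hard/PON resets
        self.reloads = deque()  # timestamps (seconds) of card reloads

    @staticmethod
    def _prune(events, now):
        # Forget events that fell out of the rolling 24-hour window.
        while events and now - events[0] >= DAY_SECONDS:
            events.popleft()

    def on_asic_reset(self, now):
        """Record an ASIC reset at time `now` and return the resulting action."""
        self.resets.append(now)
        self._prune(self.resets, now)
        if len(self.resets) < self.resets_per_day:
            return "asic_reset"
        # Assumption: the reset count restarts once the card is reloaded.
        self.resets.clear()
        self.reloads.append(now)
        self._prune(self.reloads, now)
        if len(self.reloads) < self.reloads_per_day:
            return "card_reload"
        return "card_shutdown"
```

With the default values, an ASIC that resets once a minute would reach `card_reload` on every fifth reset and `card_shutdown` on the fifth reload within the same day.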

 

System Error Thresholds

 

| Error Category | Threshold |
|---|---|
| SBE | 20/sec or 80/day |
| MBE | 5/sec or 20/day |
| Parity | 5/sec or 20/day |
| OOR | 1500/5min or 6000/day |
| BP | 1500/min or 1200/day |
| INDIRECT | 1500/5min or 6000/day |
| LINK Error | 20/day |
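
To make the "N per short window or M per day" wording concrete, here is a rough sliding-window sketch built from the thresholds listed above. It is an illustration only: the class `ThresholdMonitor` is hypothetical, "threshold reached" is assumed to mean that the count within the window is equal to or greater than the limit, and timestamps are assumed to arrive in increasing order. The BP values are copied as published.

```python
from collections import deque

DAY = 24 * 60 * 60

# (count, window in seconds) pairs copied from the table above.
THRESHOLDS = {
    "SBE": [(20, 1), (80, DAY)],
    "MBE": [(5, 1), (20, DAY)],
    "PARITY": [(5, 1), (20, DAY)],
    "OOR": [(1500, 300), (6000, DAY)],
    "BP": [(1500, 60), (1200, DAY)],
    "INDIRECT": [(1500, 300), (6000, DAY)],
    "LINK": [(20, DAY)],
}


class ThresholdMonitor:
    """Hypothetical tracker for one error category on one ASIC."""

    def __init__(self, category):
        self.limits = THRESHOLDS[category]
        self.longest = max(window for _, window in self.limits)
        self.events = deque()

    def record(self, timestamp):
        """Record an error at `timestamp` (seconds); return True if a threshold is reached."""
        self.events.append(timestamp)
        # Events older than the longest window can no longer count.
        while self.events and timestamp - self.events[0] >= self.longest:
            self.events.popleft()
        for count, window in self.limits:
            recent = sum(1 for t in self.events if timestamp - t < window)
            if recent >= count:
                return True
        return False
```

For example, five MBE errors recorded within the same second would make `record` return True on the fifth call, while MBE errors spaced ten seconds apart would only trip the daily limit on the twentieth call.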

 

Error Handling

 

| Error type | Behavior | Current S/W action (for single event) |
|---|---|---|
| SBE | Single-bit errors. Detected and corrected by h/w | Optional re-write to the impacted address |
| MBE | Multi-bit errors. Detected by h/w | ASIC reset |
| Parity | Parity errors. Detected by h/w | ASIC reset |
| Link Errors | External interface link errors. Detected by h/w | Link retrain, ASIC reset or card reload |
| Out of Resource | Internal/external resource (e.g. packet memory). Detected by h/w | ASIC reset or card reload |
| Indirect Error | Error due to a peer ASIC. Detected by h/w. Crosses card boundaries, impacting the system/network; worst when originating at IngressQ | Card reload |
| Misc | Config errors, BP errors etc. Detected by h/w | ASIC reset or card reload |

 

MBE/Parity Error Handling Changes (Release 6.1.1 or with a 5.3.3 SMU installed)

 

| Error type | Initial Recovery Attempt | Repeated Occurrence response |
|---|---|---|
| MBE | Hard Reset/PON Reset/None | Shut down the board on 2nd occurrence in 3 months. |
| Parity – General | Hard Reset/PON Reset/None | Shut down the board on 2nd occurrence in 3 months. |
| Parity – L2 TCAM | Scrub the location | Hard reset on threshold (2/sec or 8/day). Shut down the board if threshold reached second time in 24 hour window. |
| Parity – L3 TCAM | Scrub the location | Shut down the board on 2nd occurrence in 24 hour window. |
| PLL Loss of Lock | Current recovery: PON reset. New recovery: Error will be ignored per recommendation from vendor on Beluga/Pogo/Seal/Superstar ASICs. | N/A |
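
The "second occurrence within a window" rule for MBE, general parity and L3 TCAM parity errors can be sketched as a simple time comparison. This is a hedged illustration, not the shipped logic: the function `response_for` and the key names are hypothetical, and "3 months" is approximated as 90 days because the article does not define it more precisely. The L2 TCAM row is left out because it also depends on a rate threshold.

```python
from datetime import datetime, timedelta

# Assumption: "3 months" is treated as 90 days for illustration.
REPEAT_WINDOW = {
    "MBE": timedelta(days=90),
    "PARITY_GENERAL": timedelta(days=90),
    "PARITY_L3_TCAM": timedelta(hours=24),
}


def response_for(error_type, previous, current):
    """Return 'initial_recovery' or 'shutdown' for an error seen at `current`.

    `previous` is the time of the last occurrence of the same error type on the
    same board, or None if there has been none.
    """
    window = REPEAT_WINDOW[error_type]
    if previous is not None and current - previous <= window:
        return "shutdown"
    return "initial_recovery"


# Usage: a second MBE one month after the first falls inside the window.
first = datetime(2016, 1, 1)
second = datetime(2016, 2, 1)
action = response_for("MBE", first, second)
```

Here `action` evaluates to "shutdown", whereas the same two timestamps for `PARITY_L3_TCAM` would fall outside the 24-hour window and yield "initial_recovery".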

 

Comments
katsu0103
Level 1

Hi Osman,

Very informative content.

About "MBE/Parity Error Handling Changes (Release 6.1.1 or with a 5.3.3 SMU installed)", which SMU is it? Is it a specific one?

I will be handling Release 5.3.4 in the near future, so I want to catch up on the behavior.

Thank you,

Katsu
