CRS ASIC errors explained

Categories of ASIC errors 

  • Single Bit Errors
  • Multiple Bit Errors
  • Parity Errors
  • CRC Errors
  • Generic Errors
  • Barrier Errors
  • Unexpected Errors
  • Link Errors
  • OOR Thresh
  • BP Errors
  • IO Errors
  • Ucode Errors
  • Config Errors
  • Indirect Errors
  • ASIC Reset Errors
  • Non Error

 

Error Threshold and Recovery

The error threshold and the recovery mechanism both depend on the ASIC type and on the type of error.

Triggers

ASIC Reset

  • The ASIC is reset after a predefined number of errors.
  • Predefined thresholds determine when an ASIC resets.
  • The following syslogs are seen, depending on the type of ASIC and the type of error:

CIH-2-ASIC_ERROR_HARD_RESET  XXX error occurred causing halt

Linecard Reload

  • In some cases, an ASIC reset does not cure the problem and the ASIC continues to reset.
  • ASIC reset thresholds define how many ASIC resets per unit of time are tolerated before the LC is reloaded.
  • Syslog: ASIC_ERROR_REQUEST_RELOAD_BOARD: device is not recovered from fault and reload is requested.

Linecard Shutdown

  • If an LC reload does not fix the issue, the LC is shut down.
  • The LC reload threshold is defined by admin config, e.g.:

hw-module reset daily threshold 5 location all

hw-module reset hourly threshold 2 location all

<1-10>   number of resets after which the card will be placed in IN-RESET state

  nolimit  disable checking of reset threshold limit (default threshold limit is 5 for one hour, 8 for one day)

 

  • Syslog: SHUTDOWN_ON_MULTIPLE_UNGRACEFUL_RELOAD

 

Examples

The card is reset on the 6th occurrence of the ASIC Hard_Reset:
LC/1/5/CPU0:Oct 19 13:35:43.548 : ingressq[225]: %PLATFORM-CIH-2-ASIC_ERROR_HARD_RESET : ingressq[0]: A mbe error has occurred causing  halt. 0x130f000c  
LC/1/5/CPU0:Oct 19 13:36:30.877 : ingressq[225]: %PLATFORM-CIH-2-ASIC_ERROR_HARD_RESET : ingressq[0]: A mbe error has occurred causing  halt. 0x130f000c  
LC/1/5/CPU0:Oct 19 13:53:02.170 : ingressq[225]: %PLATFORM-CIH-2-ASIC_ERROR_HARD_RESET : ingressq[0]: A mbe error has occurred causing  halt. 0x130f000c  
LC/1/5/CPU0:Oct 19 13:56:52.929 : ingressq[225]: %PLATFORM-CIH-2-ASIC_ERROR_HARD_RESET : ingressq[0]: A mbe error has occurred causing  halt. 0x130f000c  
LC/1/5/CPU0:Oct 19 13:58:03.238 : ingressq[225]: %PLATFORM-CIH-2-ASIC_ERROR_HARD_RESET : ingressq[0]: A mbe error has occurred causing  halt. 0x130f000c  
LC/1/5/CPU0:Oct 19 14:01:17.599 : ingressq[225]: %PLATFORM-CIH-1-ASIC_ERROR_REQUEST_RELOAD_BOARD : ingressq[0]: device is not recovered from fault - and reload is requested.
LC/1/5/CPU0:Oct 19 14:01:17.619 : ingressq[225]: %PLATFORM-CIH-2-ASIC_ERROR_HARD_RESET : ingressq[0]: A mbe error has occurred causing  halt. 0x130f000c  
LC/1/5/CPU0:Oct 19 14:01:22.775 : sysmgr[82]: %OS-SYSMGR-2-MANAGED_REBOOT : reboot to be managed by process (platform_mgr_common) reason (ASIC seal instance 0 in critical alarm)
LC/1/5/CPU0:Oct 19 14:04:23.607 : sysmgr[82]: %OS-SYSMGR-5-NOTICE : Card is COLD started
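
When triaging a log like the one above, it can help to count the PLATFORM-CIH messages per node and ASIC instance. The following is a minimal Python sketch that assumes syslog lines in exactly the format shown above. The function name `summarize_cih_events` and the regular expression are illustrative and not part of any Cisco tooling.

```python
import re
from collections import Counter

# Captures the node, the PLATFORM-CIH message name, and the ASIC instance.
# The pattern is derived only from the sample lines shown above.
CIH_RE = re.compile(
    r"^(?P<node>\S+?):.*?%PLATFORM-CIH-\d-(?P<msg>[A-Z_]+) : (?P<asic>\w+\[\d+\]):"
)

def summarize_cih_events(lines):
    """Count PLATFORM-CIH messages per (node, ASIC instance, message name)."""
    counts = Counter()
    for line in lines:
        m = CIH_RE.search(line)
        if m:  # non-CIH lines (e.g. sysmgr messages) are skipped
            counts[(m["node"], m["asic"], m["msg"])] += 1
    return counts
```

Running it over the example above would show how many ASIC_ERROR_HARD_RESET messages preceded the ASIC_ERROR_REQUEST_RELOAD_BOARD message for `ingressq[0]`.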

 

Enhancements

 

  • CSCuv58131  : Bringdown Fabric board on hitting FGID SRAM parity error
    • This is specific to Fabric boards.
    • Fixed in 5.3.3 (SMU available for 5.3.1)
  • CSCuw99811  : Reload taiko MSC on hitting QDRAM MBE on Seal ASIC.
    • Fixed in 5.3.3

 

Scope of ASIC Errors and Recovery 

Single event 

  • Transient errors that may not persist for long. The least impactful recovery mechanism (memory scrubbing or hard reset) is recommended for such errors.

Repeated Errors

  • In some cases the single-event recovery mechanism fails to recover the system. If the error count reaches a threshold within a well-defined time window, a more impactful recovery action is initiated (e.g. PON reset or board reload).
  • The repeated-error threshold is monitored per second and per day (e.g. for MBE errors, the threshold is 5/sec or 20/day).

Configurable System Level Reset Thresholds

  • Five hard resets or PON resets of the ASIC per day result in a card reload
  • Five card reloads per day result in a card shutdown
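
The two-level escalation described above can be sketched as a pair of rolling counters. This is a minimal illustration, not the platform's implementation. It assumes a rolling 24-hour window, that the action fires on the Nth event, and that the reset counter starts over after a reload. The class and method names are hypothetical.

```python
from collections import deque

DAY = 24 * 60 * 60  # seconds

class EscalationTracker:
    """Rolling-window escalation: ASIC reset -> card reload -> card shutdown."""

    def __init__(self, resets_per_day=5, reloads_per_day=5):
        self.resets_per_day = resets_per_day
        self.reloads_per_day = reloads_per_day
        self._resets = deque()   # timestamps of ASIC hard/PON resets
        self._reloads = deque()  # timestamps of card reloads

    @staticmethod
    def _prune(events, now):
        # Drop events that are one day old or older.
        while events and now - events[0] >= DAY:
            events.popleft()

    def record_asic_reset(self, now):
        """Record one ASIC reset at `now` (seconds) and return the resulting action."""
        self._prune(self._resets, now)
        self._resets.append(now)
        if len(self._resets) < self.resets_per_day:
            return "asic_reset"
        # Reset threshold reached: escalate to a card reload (assumed to clear the reset count).
        self._resets.clear()
        self._prune(self._reloads, now)
        self._reloads.append(now)
        if len(self._reloads) < self.reloads_per_day:
            return "card_reload"
        return "card_shutdown"
```

With the default values, the fifth reset inside a day returns `card_reload`, and the fifth such reload inside a day returns `card_shutdown`.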

 

System Error Thresholds

 

| Error Category | Threshold |
|---|---|
| SBE | 20/sec or 80/day |
| MBE | 5/sec or 20/day |
| Parity | 5/sec or 20/day |
| OOR | 1500/5min or 6000/day |
| BP | 1500/min or 1200/day |
| INDIRECT | 1500/5min or 6000/day |
| LINK Error | 20/day |
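
To make the "short window or per day" thresholds above concrete, here is a minimal Python sketch of a counter that reports when either limit is reached. The values are copied as published in the table. The rolling-window interpretation, the `ErrorCounter` class, and the `THRESHOLDS` layout are assumptions made for illustration.

```python
from collections import deque

DAY = 86400  # seconds

# category -> (short-window count, short-window seconds, per-day count).
# None means the table lists only a per-day threshold.
THRESHOLDS = {
    "SBE": (20, 1, 80),
    "MBE": (5, 1, 20),
    "Parity": (5, 1, 20),
    "OOR": (1500, 300, 6000),
    "BP": (1500, 60, 1200),
    "INDIRECT": (1500, 300, 6000),
    "LINK": (None, None, 20),
}

class ErrorCounter:
    def __init__(self, category):
        self.short_count, self.short_secs, self.day_count = THRESHOLDS[category]
        self._events = deque()  # timestamps within the last day

    def record(self, now):
        """Record an error at `now` (seconds); return True if a threshold is reached."""
        self._events.append(now)
        while self._events and now - self._events[0] >= DAY:
            self._events.popleft()
        if len(self._events) >= self.day_count:
            return True
        if self.short_count is None:
            return False
        recent = sum(1 for t in self._events if now - t < self.short_secs)
        return recent >= self.short_count
```

For example, five MBE errors inside one second trip the short-window limit, while MBE errors spaced ten seconds apart trip only the per-day limit on the twentieth error.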

 

Error Handling

 

| Error type | Behavior | Current S/W action (for single event) |
|---|---|---|
| SBE | Single-bit errors. Detected and corrected by h/w | Optional re-write to the impacted address |
| MBE | Multi-bit errors. Detected by h/w | ASIC reset |
| Parity | Parity errors. Detected by h/w | ASIC reset |
| Link Errors | External interface link errors. Detected by h/w | Link retrain, ASIC reset or card reload |
| Out of Resource | Internal/external resource (e.g. packet memory). Detected by h/w | ASIC reset or card reload |
| Indirect Error | Error due to a peer ASIC. Detected by h/w. Crosses the card boundary, impacting the system/network; worst when originated at IngressQ | Card reload |
| Misc | Config errors, BP errors etc. Detected by h/w | ASIC reset or card reload |

 

MBE/Parity Error Handling Changes (Release 6.1.1 or with a 5.3.3 SMU installed)

 

| Error type | Initial Recovery Attempt | Repeated Occurrence response |
|---|---|---|
| MBE | Hard Reset/PON Reset/None | Shut down the board on 2nd occurrence in 3 months. |
| Parity – General | Hard Reset/PON Reset/None | Shut down the board on 2nd occurrence in 3 months. |
| Parity – L2 TCAM | Scrub the location | Hard reset on threshold (2/sec or 8/day). Shut down the board if the threshold is reached a second time in a 24-hour window. |
| Parity – L3 TCAM | Scrub the location | Shut down the board on 2nd occurrence in a 24-hour window. |
| PLL Loss of Lock | Current recovery: PON reset. New recovery: error will be ignored per recommendation from vendor on Beluga/Pogo/Seal/Superstar ASICs. | N/A |
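
The "second occurrence within a window" rule above can be sketched as follows. This is a simplified illustration under stated assumptions: "3 months" is approximated as 90 days, occurrences are tracked per error type only (not per ASIC or per board), and the class name, the dictionary keys, and the returned strings are hypothetical. The L2 TCAM row is omitted because it combines a rate threshold with the repeat rule.

```python
from datetime import datetime, timedelta

# Repeat windows per error type. 90 days stands in for "3 months";
# the exact window used by the software is not stated in the table.
WINDOWS = {
    "MBE": timedelta(days=90),
    "Parity-General": timedelta(days=90),
    "Parity-L3-TCAM": timedelta(hours=24),
}

class RepeatOccurrencePolicy:
    def __init__(self):
        self._last_seen = {}  # error type -> time of the previous occurrence

    def on_error(self, error_type, when):
        """Return the response for an error of `error_type` seen at datetime `when`."""
        window = WINDOWS[error_type]
        previous = self._last_seen.get(error_type)
        self._last_seen[error_type] = when
        if previous is not None and when - previous <= window:
            return "shutdown_board"    # 2nd occurrence inside the window
        return "initial_recovery"      # reset or scrub, per the table
```

A second MBE 30 days after the first would return `shutdown_board`, whereas one arriving 200 days later would be treated as a fresh first occurrence.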

 

Comments

Hi Osman,

Very informative content.

About "MBE/Parity Error Handling Changes (Release 6.1.1 or with a 5.3.3 SMU installed)": which SMU is it? Is it a specific one?

I will be working with Release 5.3.4 in the near future, so I want to catch up on the behavior.

Thank you,

Katsu
