on 08-15-2016 12:34 PM
Error thresholds and recovery mechanisms are defined by the ASIC type as well as by the type of error.
CIH-2-ASIC_ERROR_HARD_RESET XXX error occurred causing halt
hw-module reset daily threshold 5 location all
hw-module reset hourly threshold 2 location all
<1-10> number of resets after which the card will be placed in IN-RESET state
nolimit disable checking of reset threshold limit (default threshold limit is 5 for one hour, 8 for one day)
The card is reset on the 6th occurrence of the ASIC Hard_Reset:
LC/1/5/CPU0:Oct 19 13:35:43.548 : ingressq[225]: %PLATFORM-CIH-2-ASIC_ERROR_HARD_RESET : ingressq[0]: A mbe error has occurred causing halt. 0x130f000c
LC/1/5/CPU0:Oct 19 13:36:30.877 : ingressq[225]: %PLATFORM-CIH-2-ASIC_ERROR_HARD_RESET : ingressq[0]: A mbe error has occurred causing halt. 0x130f000c
LC/1/5/CPU0:Oct 19 13:53:02.170 : ingressq[225]: %PLATFORM-CIH-2-ASIC_ERROR_HARD_RESET : ingressq[0]: A mbe error has occurred causing halt. 0x130f000c
LC/1/5/CPU0:Oct 19 13:56:52.929 : ingressq[225]: %PLATFORM-CIH-2-ASIC_ERROR_HARD_RESET : ingressq[0]: A mbe error has occurred causing halt. 0x130f000c
LC/1/5/CPU0:Oct 19 13:58:03.238 : ingressq[225]: %PLATFORM-CIH-2-ASIC_ERROR_HARD_RESET : ingressq[0]: A mbe error has occurred causing halt. 0x130f000c
LC/1/5/CPU0:Oct 19 14:01:17.599 : ingressq[225]: %PLATFORM-CIH-1-ASIC_ERROR_REQUEST_RELOAD_BOARD : ingressq[0]: device is not recovered from fault - and reload is requested.
LC/1/5/CPU0:Oct 19 14:01:17.619 : ingressq[225]: %PLATFORM-CIH-2-ASIC_ERROR_HARD_RESET : ingressq[0]: A mbe error has occurred causing halt. 0x130f000c
LC/1/5/CPU0:Oct 19 14:01:22.775 : sysmgr[82]: %OS-SYSMGR-2-MANAGED_REBOOT : reboot to be managed by process (platform_mgr_common) reason (ASIC seal instance 0 in critical alarm)
LC/1/5/CPU0:Oct 19 14:04:23.607 : sysmgr[82]: %OS-SYSMGR-5-NOTICE : Card is COLD started
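To check how many hard resets preceded a reload in a log like the one above, the syslog lines can be parsed and counted per node. The following is a minimal Python sketch, not a Cisco tool. The regular expression is an assumption derived only from the line layout shown here, and the names `parse_line` and `count_hard_resets` are hypothetical. The log carries no year, so the sketch prepends a placeholder year before parsing the timestamp.

```python
import re
from collections import Counter
from datetime import datetime

# Assumed layout: <node>:<Mon DD HH:MM:SS.mmm> : <proc>[pid]: %<CODE> : <text>
LINE_RE = re.compile(
    r"^(?P<node>\S+):(?P<ts>\w{3} +\d+ \d{2}:\d{2}:\d{2}\.\d+) : "
    r"(?P<proc>\w+)\[\d+\]: %(?P<code>[A-Z0-9_-]+) : (?P<text>.*)$"
)

PLACEHOLDER_YEAR = "2000"  # assumption: the log has no year, any leap year works


def parse_line(line):
    """Return a dict for one syslog line, or None if the line does not match."""
    m = LINE_RE.match(line.strip())
    if m is None:
        return None
    ts = datetime.strptime(
        PLACEHOLDER_YEAR + " " + m.group("ts"), "%Y %b %d %H:%M:%S.%f"
    )
    return {
        "node": m.group("node"),
        "time": ts,
        "code": m.group("code"),
        "text": m.group("text"),
    }


def count_hard_resets(lines):
    """Count ASIC_ERROR_HARD_RESET messages per node."""
    counts = Counter()
    for line in lines:
        rec = parse_line(line)
        if rec and rec["code"].endswith("ASIC_ERROR_HARD_RESET"):
            counts[rec["node"]] += 1
    return counts
```

Fed the excerpt above, `count_hard_resets` would report the hard resets seen on LC/1/5/CPU0, which makes it easy to line them up against the reload request that follows.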
Single event
Repeated Errors
Configurable System Level Reset Thresholds
Error Category | Threshold
SBE | 20/sec or 80/day
MBE | 5/sec or 20/day
Parity | 5/sec or 20/day
OOR | 1500/5min or 6000/day
BP | 1500/min or 1200/day
INDIRECT | 1500/5min or 6000/day
LINK Error | 20/day
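The "N per short window or M per day" thresholds above can be modeled as a sliding-window counter. The sketch below is only an illustration of that idea and not the actual driver logic. The values are copied verbatim from the table, the `ErrorWindow` class is hypothetical, and it assumes a threshold counts as reached when the number of errors inside a window is greater than or equal to the limit (the table does not say whether the limit must be reached or exceeded).

```python
from collections import deque

DAY = 86400  # seconds

# (count, window in seconds) pairs, copied as-is from the table above.
THRESHOLDS = {
    "SBE": [(20, 1), (80, DAY)],
    "MBE": [(5, 1), (20, DAY)],
    "Parity": [(5, 1), (20, DAY)],
    "OOR": [(1500, 300), (6000, DAY)],
    "BP": [(1500, 60), (1200, DAY)],
    "INDIRECT": [(1500, 300), (6000, DAY)],
    "LINK Error": [(20, DAY)],
}


class ErrorWindow:
    """Track error timestamps for one error category on one ASIC."""

    def __init__(self, limits):
        self.limits = limits
        self.longest = max(window for _, window in limits)
        self.events = deque()

    def record(self, now):
        """Record one error at time `now` (seconds).

        Returns True if any (count, window) limit is reached.
        Assumption: "reached" means count >= limit inside the window.
        """
        self.events.append(now)
        # Drop events older than the longest window we care about.
        while self.events and now - self.events[0] >= self.longest:
            self.events.popleft()
        for count, window in self.limits:
            recent = sum(1 for t in self.events if now - t < window)
            if recent >= count:
                return True
        return False
```

For example, `ErrorWindow(THRESHOLDS["MBE"])` would return True on the fifth MBE recorded within one second, or on the twentieth recorded within a day, whichever comes first.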
Error type | Behavior | Current S/W action (for single event)
SBE | Single bit errors. Detected and corrected by h/w | Optional re-write to the impacted address
MBE | Multi-bit errors. Detected by h/w | ASIC reset
Parity | Parity errors. Detected by h/w | ASIC reset
Link Errors | External interface link errors. Detected by h/w | Link retrain, ASIC reset or card reload
Out of Resource | Internal/external resource (e.g. packet memory). Detected by h/w | ASIC reset or card reload
Indirect Error | Error due to peer ASIC. Detected by h/w. Crosses the card boundary, impacting System/Network, and worst when originated at IngressQ | Card reload
Misc | Config errors, BP errors etc. Detected by h/w | ASIC reset or card reload
Error type | Initial Recovery Attempt | Repeated Occurrence response
MBE | Hard Reset/PON Reset/None | Shut down the board on 2nd occurrence in 3 months.
Parity – General | Hard Reset/PON Reset/None | Shut down the board on 2nd occurrence in 3 months.
Parity – L2 TCAM | Scrub the location | Hard reset on threshold (2/sec or 8/day). Shut down the board if threshold reached second time in 24 hour window.
Parity – L3 TCAM | Scrub the location | Shut down the board on 2nd occurrence in 24 hour window.
PLL Loss of Lock |  | N/A
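The "shut down on 2nd occurrence within a window" rule in the table above can be expressed as a small decision function. This is a hedged sketch only: the function name and return strings are hypothetical, "3 months" is approximated as 90 days, and the L2 TCAM row is left out because its response depends on a rate threshold rather than on a simple second occurrence.

```python
from datetime import datetime, timedelta

# Repeat windows taken from the table above.
# Assumption: "3 months" is approximated as 90 days.
REPEAT_WINDOW = {
    "MBE": timedelta(days=90),
    "Parity - General": timedelta(days=90),
    "Parity - L3 TCAM": timedelta(hours=24),
}


def repeated_occurrence_response(error_type, previous, current):
    """Decide the response to an error seen at `current`.

    `previous` is the datetime of the last error of the same type on
    this board, or None if there was none. Returns "shutdown" when this
    is the second occurrence inside the repeat window, otherwise
    "initial-recovery" (hard reset, PON reset, or scrub per the table).
    """
    window = REPEAT_WINDOW[error_type]
    if previous is not None and current - previous <= window:
        return "shutdown"
    return "initial-recovery"
```

With these assumptions, a second MBE 30 days after the first would map to a board shutdown, while a second L3 TCAM parity error two days after the first would still get the initial recovery attempt.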
Hi Osman,
Very informative content.
Regarding "MBE/Parity Error Handling Changes (Release 6.1.1 or with a 5.3.3 SMU installed)": which SMU is it? Is it one specific SMU?
I will be working with Release 5.3.4 in the near future, so I want to catch up on the behavior.
Thank you,
Katsu