Solved: WS-C6509 crashed and rebooted automatically

Wen Yu Zhai · 08-26-2014

One of our WS-C6509 switches crashed and rebooted automatically. Can you help find the root cause? Thanks!

Cisco IOS Software, s72033_rp Software (s72033_rp-ADVENTERPRISEK9_WAN-M), Version 12.2(33)SXI8a, RELEASE SOFTWARE (fc1)

System image file is "disk0:s72033-adventerprisek9_wan-mz.122-33.SXI8a.bin"
Last reload reason: error - a Software forced crash, PC 0x42B037D8

Jul 27 01:08:50 BJ: %SYSTEM_CONTROLLER-3-ERROR: Error condition detected: TM_DATA_PARITY_ERROR
Jul 27 01:08:50 BJ: %SYSTEM_CONTROLLER-3-FATAL: An unrecoverable error has been detected. The system is being reset.

%Software-forced reload

Early Notification of crash condition..

01:08:50 BJ Sun Jul 27 2014: Breakpoint exception, CPU signal 23, PC = 0x42B037D8

--------------------------------------------------------------------
Possible software fault. Upon reccurence, please collect
crashinfo, "show tech" and contact Cisco Technical Support.
--------------------------------------------------------------------

-Traceback= 42B037D8 42B0132C 426CE1DC 42AF661C
$0 : 00000000, AT : 44EF0000, v0 : 46AD0000, v1 : 00000000
a0 : 47B05CE4, a1 : 0000FF00, a2 : 00000000, a3 : 00000000
t0 : 00000020, t1 : 3400FF01, t2 : 3400C100, t3 : FFFF00FF
t4 : 42AF6760, t5 : 50012358, t6 : 00000000, t7 : B4EB6DB5
s0 : 00000000, s1 : 44D40000, s2 : 44CB0000, s3 : 00000001
s4 : 44CB0000, s5 : 10020000, s6 : 00000068, s7 : 444A0000
t8 : 08028FEC, t9 : 00000000, k0 : 00000000, k1 : 00000000
gp : 44EED8E4, sp : 50012488, s8 : 00000001, ra : 42B0132C
EPC : 42B037D8, ErrorEPC : 94D877EE, SREG : 3400FF03
MDLO : 00000000, MDHI : 00000000, BadVaddr : 00000000
DATA_START : 0x448523D0
Cause 00000024 (Code 0x9): Breakpoint exception

========= Start of Crashinfo Collection (01:08:50 BJ Sun Jul 27 2014) ==========
For image:
Cisco IOS Software, s72033_rp Software (s72033_rp-ADVENTERPRISEK9_WAN-M), Version 12.2(33)SXI8a, RELEASE SOFTWARE (fc1)
Technical Support: http://www.cisco.com/techsupport
Copyright (c) 1986-2011 by Cisco Systems, Inc.

Compiled Sat 03-Dec-11 07:53 by prod_rel_team

InayathUlla Sharieff · 08-26-2014

Hi,

Please find the explanation below:

Jul 27 01:08:50 BJ: %SYSTEM_CONTROLLER-3-ERROR: Error condition detected: TM_DATA_PARITY_ERROR
Jul 27 01:08:50 BJ: %SYSTEM_CONTROLLER-3-FATAL: An unrecoverable error has been detected. The system is being reset.

%Software-forced reload

Early Notification of crash condition..

01:08:50 BJ Sun Jul 27 2014: Breakpoint exception, CPU signal 23, PC = 0x42B037D8

Explanation

The most common errors from the Mistral ASIC on the Multilayer Switch Feature Card (MSFC) are TM_DATA_PARITY_ERROR, SYSDRAM_PARITY_ERROR, SYSAD_PARITY_ERROR, and TM_NPP_PARITY_ERROR. The possible causes of these parity errors are random static discharge or other external factors.

Parity errors are of two kinds:
. Soft parity errors - these occur when an energy level within the chip (for example, a stored one or zero) changes. When the affected data is referenced by the CPU, the system either crashes or recovers. With a soft parity error there is no need to swap the board or any of its components, because these errors are generally Single Event Upsets (SEUs).

. Hard parity errors - these occur when there is a chip or board failure that causes data to be corrupted (not bad all or most of the time). In this case, you need to re-seat or replace the affected component, usually a memory chip swap or a board swap. We say that there is a hard parity error when we see multiple parity errors at the same address. There are more complicated cases which are harder to identify but, in general, if we see more than one parity error in a particular memory region in a relatively short period of time, this may be considered as a hard parity error.
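
To make the rule of thumb above concrete, here is a minimal Python sketch of how one might sort a history of parity events into "likely soft" or "possibly hard". It is only an illustration: the function name, the `(timestamp, address)` event format, the 30-day window, and the 4 KB region size are all assumptions chosen for the example, not values from Cisco documentation, and the log lines in this thread do not carry an address at all.

```python
from collections import defaultdict
from datetime import datetime, timedelta


def classify_parity_events(events, window=timedelta(days=30), region_bits=12):
    """Classify (timestamp, address) parity events (hypothetical format).

    Returns "none", "likely soft", or "possibly hard".
    window and region_bits are illustrative assumptions, not Cisco values.
    """
    events = sorted(events)
    if not events:
        return "none"

    seen_addresses = set()
    by_region = defaultdict(list)
    for ts, addr in events:
        # Multiple errors at exactly the same address point to a hard error.
        if addr in seen_addresses:
            return "possibly hard"
        seen_addresses.add(addr)
        by_region[addr >> region_bits].append(ts)

    # More than one error in the same region within a short period
    # may also be considered a hard error.
    for stamps in by_region.values():
        for earlier, later in zip(stamps, stamps[1:]):
            if later - earlier <= window:
                return "possibly hard"
    return "likely soft"
```

A single event, as in this case, comes out as "likely soft"; the result only changes if later events land on the same address or close together in the same region.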

As this is the first occurrence, this could be a transient issue. I suggest that we monitor the switch for 48 hours to ensure it is stable; if there is no recurrence, we can consider this a transient issue.
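
If the syslog is collected on a server, the monitoring can be partly automated. The sketch below is one possible way to do it in Python, assuming log lines in the same format as the two messages quoted above. The function names are made up for this example, the year has to be supplied because the timestamps do not include one, and `%b` month parsing assumes an English locale.

```python
import re
from datetime import datetime, timedelta

# Matches lines such as:
# Jul 27 01:08:50 BJ: %SYSTEM_CONTROLLER-3-ERROR: Error condition detected: TM_DATA_PARITY_ERROR
PARITY_RE = re.compile(
    r"^(?P<stamp>\w{3}\s+\d{1,2} \d{2}:\d{2}:\d{2}) \S+: "
    r"%SYSTEM_CONTROLLER-3-ERROR: Error condition detected: "
    r"(?P<kind>\w+_PARITY_ERROR)"
)


def parity_events(lines, year):
    """Return a list of (timestamp, error_kind) for parity error lines."""
    found = []
    for line in lines:
        match = PARITY_RE.match(line.strip())
        if match:
            stamp = datetime.strptime(
                f"{year} {match.group('stamp')}", "%Y %b %d %H:%M:%S"
            )
            found.append((stamp, match.group("kind")))
    return found


def recurred_within(events, first, hours=48):
    """True if any parity error was logged after `first` and within `hours`."""
    deadline = first + timedelta(hours=hours)
    return any(first < stamp <= deadline for stamp, _ in events)
```

Calling `parity_events(open("syslog.txt"), 2014)` (the file name is hypothetical) and then `recurred_within(events, events[0][0])` would indicate whether another parity error showed up inside the 48-hour window.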

Please let me know whether you have any questions or concerns with the analysis above.

Regards

Inayath

*Please rate if this info is helpful.

Wen Yu Zhai · 08-28-2014

Hi insharie,

The switch has been working fine since the reboot. Maybe it was a soft parity error. Thank you for your reply.