Urgent !!! Catalyst 6506-E crashed

jackson.ku
Level 3

Hi,

 

Recently our Catalyst 6506-E crashed and reloaded automatically.

Last reload reason : bus error at PC 0x428AE718, address 0x0

The IOS version is 12.2(33)SXJ6, and the crashinfo file is attached. Please help analyze it.

 

Many Thanks,

Jackson Ku

Accepted Solutions

Jackson,

After reviewing the show tech and crashinfo files, it looks like the RP crashed due to a parity error. A parity error occurs when a bit flips from 0 to 1 or vice versa. It can be caused by environmental factors that disturb the electrical state of the chip, such as background radiation (for example, neutrons from cosmic rays), electromagnetic interference (EMI), or electrostatic discharge (ESD). If this is the first occurrence, it is advisable to monitor the SUP for 48 hours to make sure it does not recur: most parity errors are transient, one-time events, whereas parity errors caused by faulty hardware usually recur within that time frame. If the SUP experiences another parity error, it is advisable to replace it, as that could indicate defective hardware.
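To illustrate the general idea of parity checking described above, here is a minimal Python sketch. It assumes a simple even-parity scheme over one 8-bit value; the actual parity scheme used by the supervisor hardware is not stated in this thread, and the function names are made up for the example.

```python
def even_parity_bit(word: int) -> int:
    """Return the bit that makes the total count of 1s (data + parity) even."""
    return bin(word).count("1") % 2


def parity_ok(word: int, stored_parity: int) -> bool:
    """True if the word still matches the parity bit stored alongside it."""
    return even_parity_bit(word) == stored_parity


original = 0b10110010               # value written to memory (four 1 bits)
parity = even_parity_bit(original)  # parity bit stored with the value
flipped = original ^ (1 << 5)       # a single bit flips while the value is at rest

print(parity_ok(original, parity))  # True
print(parity_ok(flipped, parity))   # False: the flip is detected
```

Parity of this kind only detects the error, it cannot correct it, and an even number of flipped bits in the same word goes unnoticed. That is why a detected parity error typically ends in a reset rather than a silent repair.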

As you know, the supervisor module has two processors, the RP and the SP. The RP (route processor) handles routing, whereas the SP (switch processor) is responsible for switching.

RP Crash info:
===========

 

Oct 17 14:24:00: %SYSTEM_CONTROLLER-3-ERROR: Error condition detected: TM_NPP_PARITY_ERROR
Oct 17 14:24:00: %SYSTEM_CONTROLLER-3-FATAL: An unrecoverable error has been detected. The system is being reset.

%Software-forced reload

 Early Notification of crash condition..

 14:24:00 TWN Fri Oct 17 2014: Breakpoint exception, CPU signal 23, PC = 0x428AE718
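If you want to scan a saved log or crashinfo file for messages like the ones above, a small script can help. The following is a rough sketch, assuming the usual `%FACILITY-SEVERITY-MNEMONIC: text` layout of IOS messages (with an optional sub-facility such as `SP`). The function name and regular expressions are illustrative, not part of any Cisco tool.

```python
import re

# %FACILITY[-SUBFACILITY]-SEVERITY-MNEMONIC: text
SYSLOG_RE = re.compile(
    r"%(?P<facility>[A-Z0-9_]+)(?:-(?P<sub>[A-Z]+))?"
    r"-(?P<severity>[0-7])-(?P<mnemonic>[A-Z0-9_]+): (?P<text>.*)"
)
# Any token ending in _PARITY_ERROR, e.g. TM_NPP_PARITY_ERROR
PARITY_RE = re.compile(r"\b[A-Z_]+_PARITY_ERROR\b")


def find_parity_errors(lines):
    """Return (facility, severity, error_name) for each message naming a parity error."""
    hits = []
    for line in lines:
        msg = SYSLOG_RE.search(line)
        if not msg:
            continue
        err = PARITY_RE.search(msg.group("text"))
        if err:
            hits.append((msg.group("facility"), int(msg.group("severity")), err.group(0)))
    return hits


sample = [
    "Oct 17 14:24:00: %SYSTEM_CONTROLLER-3-ERROR: Error condition detected: TM_NPP_PARITY_ERROR",
    "Oct 17 14:24:00: %SYSTEM_CONTROLLER-3-FATAL: An unrecoverable error has been detected. The system is being reset.",
]
print(find_parity_errors(sample))
# [('SYSTEM_CONTROLLER', 3, 'TM_NPP_PARITY_ERROR')]
```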


Explanation:

The most common errors from the Mistral ASIC on the Multilayer Switch Feature Card (MSFC) are TM_DATA_PARITY_ERROR, SYSDRAM_PARITY_ERROR,
SYSAD_PARITY_ERROR, and TM_NPP_PARITY_ERROR. The possible causes of these parity errors are random static discharge or other external factors.

Parity Errors are of two kinds:
- Soft parity errors: these occur when an energy level within the chip (for example, a one or a zero) changes. When the affected data is referenced by the CPU, the system either crashes or recovers. With a soft parity error there is no need to swap the board or any of its components, as these errors are generally Single Event Upsets (SEU).

- Hard parity errors: these occur when a chip or board failure causes data to be corrupted (though not necessarily all or most of the time). In this case, you need to re-seat or replace the affected component, usually by swapping a memory chip or the board. We call it a hard parity error when we see multiple parity errors at the same address. There are more complicated cases that are harder to identify but, in general, more than one parity error in a particular memory region within a relatively short period of time may be considered a hard parity error.
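The rule of thumb above (a repeat in the same region within a short period suggests a hard error) can be sketched in a few lines of Python. This is only an illustration: the 48-hour window is borrowed from the monitoring advice in this thread, the error name stands in for the memory region because the log does not show an address, and the second event in the example is hypothetical.

```python
from collections import defaultdict
from datetime import datetime, timedelta


def classify_parity_events(events, window=timedelta(hours=48)):
    """events: iterable of (timestamp, region). Returns {region: verdict}."""
    by_region = defaultdict(list)
    for stamp, region in events:
        by_region[region].append(stamp)

    verdict = {}
    for region, stamps in by_region.items():
        stamps.sort()
        # Any two consecutive events closer together than the window count as a repeat.
        repeated = any(b - a <= window for a, b in zip(stamps, stamps[1:]))
        verdict[region] = "hard-suspected" if repeated else "likely-soft"
    return verdict


# The single event from the crash log in this thread.
events = [(datetime(2014, 10, 17, 14, 24), "TM_NPP")]
print(classify_parity_events(events))   # {'TM_NPP': 'likely-soft'}

# Hypothetical second event the next morning, for illustration only.
repeat = events + [(datetime(2014, 10, 18, 9, 0), "TM_NPP")]
print(classify_parity_events(repeat))   # {'TM_NPP': 'hard-suspected'}
```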

Action plan:
=======
As this is the first occurrence, this could be a transient issue. I suggest we monitor for 48 hours to ensure the system is stable; if there is no recurrence, we can consider this a transient issue.

HTH

Regards

Inayath

**Please don't forget to rate if this info is helpful.

5 Replies

devils_advocate
Level 7

Have you got Smartnet?

If so, it may be best to go down this route and log something with TAC...

InayathUlla Sharieff
Cisco Employee

I don't see any attachments.

Kindly attach them again.

Hi, I have uploaded them again. Thanks a lot.

joleloves12
Level 1

SERIOUSLY NEED HELP !!

My two Catalyst 6506-E switches drop to ROMMON and will not boot any IOS image. This is the crashinfo collected.

cisco WS-C6506-E (R7000) processor (revision 1.2) with 458720K/65536K bytes of memory.
Processor board ID SAL1428MUNZ
SR71000 CPU at 600Mhz, Implementation 0x504, Rev 1.2, 512KB L2 Cache
Last reset from s/w reset
1 Virtual Ethernet interface
50 Gigabit Ethernet interfaces
16 Ten Gigabit Ethernet interfaces
1917K bytes of non-volatile configuration memory.
8192K bytes of packet buffer memory.

65536K bytes of Flash internal SIMM (Sector size 512K).
Missing file name
Logging of %SNMP-3-AUTHFAIL is enabled
Warning: sup-bootdisk:system does not exist.  Command retained.


Press RETURN to get started!


*May 31 00:14:51.119: % SNMP ID Persistence Error : Unable to open file : No such file or directory
*May 31 00:14:53.055: RP: Currently running ROMMON from S (Gold) region
*May 31 00:14:53.887: %SPANTREE-5-EXTENDED_SYSID: Extended SysId enabled for type vlan. The Bridge IDs of all active STP instances have been updated, which might change the spanning tree topology
000004: *May 31 00:15:00.439 UTC: %SYS-5-CONFIG_I: Configured from memory by console
000005: *May 31 00:15:03.847 UTC: %SYS-5-RESTART: System restarted --
Cisco IOS Software, s72033_rp Software (s72033_rp-ADVIPSERVICESK9_WAN-M), Version 12.2(33)SXI14, RELEASE SOFTWARE (fc2)
Technical Support: http://www.cisco.com/techsupport
Copyright (c) 1986-2014 by Cisco Systems, Inc.
Compiled Thu 04-Sep-14 00:38 by prod_rel_team
000006: *May 31 00:15:03.891 UTC: %SYS-6-LOGGINGHOST_STARTSTOP: Logging to host 10.10.140.5 port 514 started - CLI initiated
000007: *May 31 00:15:06.219 UTC: %SNMP-5-COLDSTART: SNMP agent on host DIST-6500-6E is undergoing a cold start
000008: *May 31 00:16:19.427: %SYS-SP-3-LOGGER_FLUSHED: System was paused for 00:00:00 to ensure console debugging output.
000009: *May 31 00:14:48.467: %SPANTREE-SP-5-EXTENDED_SYSID: Extended SysId enabled for type vlan. The Bridge IDs of all active STP instances have been updated, which might change the spanning tree topology
*May 31 00:14:48.483: SP: SP: Currently running ROMMON from S (Gold) region
000010: *May 31 00:14:49.223: %SCHED-SP-7-WATCH: Attempt to set uninitialized watched boolean (address 0). -Process= "<interrupt level>", ipl= 1, pid= 3
-Traceback= 408EF7E4 40DC31EC 40DC3218 40D98EE0 40D97D6C 40D94000 40FB405C 417B9B3C
000011: *May 31 00:15:02.867: %SYS-SP-5-RESTART: System restarted --
Cisco IOS Software, s72033_sp Software (s72033_sp-ADVIPSERVICESK9_WAN-M), Version 12.2(33)SXI14, RELEASE SOFTWARE (fc2)
Technical Support: http://www.cisco.com/techsupport
Copyright (c) 1986-2014 by Cisco Systems, Inc.
Compiled Thu 04-Sep-14 00:58 by prod_rel_team
000012: *May 31 00:15:03.971 UTC: %OIR-SP-6-INSPS: Power supply inserted in slot 1
000013: *May 31 00:15:04.079 UTC: %C6KPWR-SP-4-PSOK: power supply 1 turned on.
000014: *May 31 00:15:04.327 UTC: %OIR-SP-6-INSPS: Power supply inserted in slot 2
000015: *May 31 00:15:04.431 UTC: %C6KPWR-SP-4-PSOK: power supply 2 turned on.
000016: *May 31 00:15:04.535 UTC: %C6KPWR-SP-4-PSREDUNDANTBOTHSUPPLY: in power-redundancy mode, system is operating on both power supplies.
000017: *May 31 00:15:07.990 UTC: %C6KENV-SP-4-FANHPMODE: Fan-tray 1 is operating in high power mode
000018: *May 31 00:15:15.157 UTC: %FABRIC-SP-5-CLEAR_BLOCK: Clear block option is off for the fabric in slot 5.
000019: *May 31 00:15:15.257 UTC: %FABRIC-SP-5-FABRIC_MODULE_ACTIVE: The Switch Fabric Module in slot 5 became active.
000020: *May 31 00:15:16.710 UTC: %DIAG-SP-6-RUN_MINIMUM: Module 5: Running Minimal Diagnostics...

%Software-forced reload

 Early Notification of crash condition..

 00:15:30 UTC Thu May 31 2018: Breakpoint exception, CPU signal 23, PC = 0x42DB3234


--------------------------------------------------------------------
   Possible software fault. Upon reccurence, please collect
   crashinfo, "show tech" and contact Cisco Technical Support.
--------------------------------------------------------------------

-Traceback= 42DB3234 42DB0D74 42A103A8 42A103D4 4281FC24 4289FE80 4289FEDC 40993B48 40994A18 40994938 4099570C 4298AA48 4297C4AC 4297C6C8 42DA58E4
$0 : 00000000, AT : 44A60000, v0 : 442B0000, v1 : 00000000
a0 : 50C680CC, a1 : 0000F100, a2 : 00000000, a3 : 00000000
t0 : 00000020, t1 : 3400F101, t2 : 3400C100, t3 : FFFF00FF
t4 : 42DA60C0, t5 : 656D732C, t6 : 2E0A436F, t7 : 65642054
s0 : 00000000, s1 : 44950000, s2 : 504766C0, s3 : 0000001F
s4 : 504766C0, s5 : 5104FC98, s6 : 00000000, s7 : 08B177C8
t8 : 08028FEC, t9 : 00000000, k0 : 00000000, k1 : 00000000
gp : 44A63C0C, sp : 5000DBB0, s8 : 00000000, ra : 42DB0D74
EPC  : 42DB3234, ErrorEPC : CBBFD491, SREG     : 3400F103
MDLO : 00000000, MDHI     : 00000000, BadVaddr : 00000000
DATA_START : 0x4441B9D0
Cause 00000824 (Code 0x9): Breakpoint exception

Writing crashinfo to bootflash:crashinfo_20180531-001530-UTC

=== Flushing messages (00:15:30 UTC Thu May 31 2018) ===

Buffered messages:

*May 31 00:14:53.887: %SPANTREE-5-EXTENDED_SYSID: Extended SysId enabled for type vlan. The Bridge IDs of all active STP instances have been updated, which might change the spanning tree topology
000004: *May 31 00:15:00.439 UTC: %SYS-5-CONFIG_I: Configured from memory by console
000005: *May 31 00:15:03.847 UTC: %SYS-5-RESTART: System restarted --
Cisco IOS Software, s72033_rp Software (s72033_rp-ADVIPSERVICESK9_WAN-M), Version 12.2(33)SXI14, RELEASE SOFTWARE (fc2)
Technical Support: http://www.cisco.com/techsupport
Copyright (c) 1986-2014 by Cisco Systems, Inc.
Compiled Thu 04-Sep-14 00:38 by prod_rel_team
000007: *May 31 00:15:06.219 UTC: %SNMP-5-COLDSTART: SNMP agent on host DIST-6500-6E is undergoing a cold start
000008: *May 31 00:16:19.427: %SYS-SP-3-LOGGER_FLUSHED: System was paused for 00:00:00 to ensure console debugging output.
000009: *May 31 00:14:48.467: %SPANTREE-SP-5-EXTENDED_SYSID: Extended SysId enabled for type vlan. The Bridge IDs of all active STP instances have been updated, which might change the spanning tree topology
000011: *May 31 00:15:02.867: %SYS-SP-5-RESTART: System restarted --
000013: *May 31 00:15:04.079 UTC: %C6KPWR-SP-4-PSOK: power supply 1 turned on.
000015: *May 31 00:15:04.431 UTC: %C6KPWR-SP-4-PSOK: power supply 2 turned on.
000016: *May 31 00:15:04.535 UTC: %C6KPWR-SP-4-PSREDUNDANTBOTHSUPPLY: in power-redundancy mode, system is operating on both power supplies.
000017: *May 31 00:15:07.990 UTC: %C6KENV-SP-4-FANHPMODE: Fan-tray 1 is operating in high power mode
000018: *May 31 00:15:15.157 UTC: %FABRIC-SP-5-CLEAR_BLOCK: Clear block option is off for the fabric in slot 5.
000019: *May 31 00:15:15.257 UTC: %FABRIC-SP-5-FABRIC_MODULE_ACTIVE: The Switch Fabric Module in slot 5 became active.
000021: *May 31 00:15:29.395 UTC: %DIAG-SP-3-MAJOR: Module 5: Online Diagnostics detected a Major Error. Please use 'show diagnostic result <target>' to see test results.
000022: *May 31 00:15:29.399 UTC: %CONST_DIAG-SP-3-BOOTUP_TEST_FAIL: Module 5: TestFibDevices failed
000023: *May 31 00:15:29.399 UTC: %CONST_DIAG-SP-3-BOOTUP_TEST_FAIL: Module 5: TestIPv4FibShortcut failed
000024: *May 31 00:15:29.399 UTC: %CONST_DIAG-SP-3-BOOTUP_TEST_FAIL: Module 5: TestIPv6FibShortcut failed
000025: *May 31 00:15:29.399 UTC: %CONST_DIAG-SP-3-BOOTUP_TEST_FAIL: Module 5: TestMPLSFibShortcut failed
000026: *May 31 00:15:29.399 UTC: %CONST_DIAG-SP-3-BOOTUP_TEST_FAIL: Module 5: TestNATFibShortcut failed
Queued messages:
000028: *May 31 00:15:30.023 UTC: %SYS-3-LOGGER_FLUSHING: System pausing to ensure console debugging output.

000021: *May 31 00:15:29.395 UTC: %DIAG-SP-3-MAJOR: Module 5: Online Diagnostics detected a Major Error. Please use 'show diagnostic result <target>' to see test results.
000022: *May 31 00:15:29.399 UTC: %CONST_DIAG-SP-3-BOOTUP_TEST_FAIL: Module 5: TestFibDevices failed
000023: *May 31 00:15:29.399 UTC: %CONST_DIAG-SP-3-BOOTUP_TEST_FAIL: Module 5: TestIPv4FibShortcut failed
000024: *May 31 00:15:29.399 UTC: %CONST_DIAG-SP-3-BOOTUP_TEST_FAIL: Module 5: TestIPv6FibShortcut failed
000025: *May 31 00:15:29.399 UTC: %CONST_DIAG-SP-3-BOOTUP_TEST_FAIL: Module 5: TestMPLSFibShortcut failed
000026: *May 31 00:15:29.399 UTC: %CONST_DIAG-SP-3-BOOTUP_TEST_FAIL: Module 5: TestNATFibShortcut failed
000027: *May 31 00:15:29.755 UTC: %HA_EM-6-LOG: Mandatory.go_bootup.tcl: GOLD EEM TCL policy for  boot up diagnostic
000028: *May 31 00:15:30.003 UTC: %CPU_MONITOR-3-PEER_EXCEPTION: CPU_MONITOR peer has failed due to exception , reset by [5/0]
*** System received a Software forced crash ***
signal= 0x17, code= 0x24, context= 0x46639064
  PC = 0x42da611c, SP = 0x44948868, RA = 0x413eaef0
  Cause Reg = 0x00003820, Status Reg = 0x34008002
rommon 1 > dir bootflash:

This operation is not permitted after send-break.
rommon 2 > boot
Please reset before booting
rommon 3 > reset

System Bootstrap, Version 12.2(17r)SX6, RELEASE SOFTWARE (fc1)
Technical Support: http://www.cisco.com/techsupport
Copyright (c) 2009 by cisco Systems, Inc.
Cat6k-Sup720/RP platform with 524288 Kbytes of main memory

System Bootstrap, Version 8.5(3)
Copyright (c) 1994-2008 by cisco Systems, Inc.

Testing lower main memory - data equals address
Testing lower main memory - checkerboard
Testing lower main memory - inverse checkerboard
Clearing lower memory for cache initialization
Clearing bss
Clearing autoboot state machine
melody_present_reg: 1st read w/ 0x5555
melody_present_reg: 2nd read w/ 0xaaaa, reversed: 0x5555
Bootdisk adapter is detected, enabling bootdisk access...
Reprogramming CS1 w/ Melody value...

Reading monitor variables from NVRAM
Reset reason for CPU board 0xffff , BaseBoard 0x200ffff, display 0x0System Reset by Power On.

Enabling interrupts
Initializing TLB
Initializing cache
Initializing required TLB entries
Initializing main memory
Sizing NVRAM
Initializing PCMCIA controller
Exiting init
Cat6k-Sup720/SP processor with 524288 Kbytes of main memory

rommon 1 >
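The register dump in the crashinfo above prints `Cause 00000824 (Code 0x9): Breakpoint exception`. On MIPS CPUs the exception code sits in bits 6:2 of the Cause register, so the value can be checked by hand. Below is a small Python sketch; the helper name and the abbreviated code table are illustrative only.

```python
# Standard MIPS exception codes (ExcCode field); only the common ones are listed.
EXC_CODES = {
    0: "Int (interrupt)",
    4: "AdEL (address error, load/fetch)",
    5: "AdES (address error, store)",
    6: "IBE (bus error, instruction fetch)",
    7: "DBE (bus error, data access)",
    8: "Sys (syscall)",
    9: "Bp (breakpoint)",
    10: "RI (reserved instruction)",
}


def decode_cause(cause: int):
    """Extract ExcCode (bits 6:2) from a MIPS Cause register value."""
    code = (cause >> 2) & 0x1F
    return code, EXC_CODES.get(code, "not listed here")


print(decode_cause(0x00000824))  # (9, 'Bp (breakpoint)')
```

The result agrees with the `Code 0x9` that IOS printed. Likewise, `signal= 0x17` in the ROMMON output is decimal 23, the same value as `CPU signal 23` in the earlier "Breakpoint exception" line.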
