Solved: Sup 720 MSFC3 (HOT) Reload in 6509 Chassis

rpastrana · ‎03-09-2013

Hello, we have a backup Sup 720 with two Gigabit Ethernet links in a port channel to another chassis. Suddenly UDLD detected an error and put the interface into err-disable, but the err-disable state did not actually bring the interface down. That created a switching loop, and then our Supervisor reloaded. I'd like to know what could have caused this reload, in case someone has had the same issue. In my opinion it could have been the switching loop, but I've also run the show tech through the Output Interpreter and it might have been a bug. The only one that could match IOS version 12.2(33)SXH is this one:

http://tools.cisco.com/Support/BugToolKit/search/getBugDetails.do?method=fetchBugDetails&bugId=CSCtj95352&from=summary

I guess next time we'll disable err-disable and recover the link manually. Apart from that, if anyone has ever had this issue, what could have made the Sup crash and reload?

Kind regards.

InayathUlla Sharieff · ‎03-10-2013

Hi,

Could you please provide the show tech and the crashinfo file from the switch?

Regards

Inayath.

rpastrana · ‎03-11-2013

Here is a link to download both logs:

https://dl.dropbox.com/u/53867682/logs.rar

Thank you.

InayathUlla Sharieff · ‎03-11-2013

Hi Pastrana,

I just decoded the traceback in the crashinfo file, and I see that the device crashed because of a parity error.

Cache error detected!

CP0_ECC (reg 26/0): 0x000000FC

CP0_CACHERI (reg 27/0): 0x20000000

CP0_CAUSE (reg 13/0): 0x00000800

Real cache error detected. System will be halted.

Error: Primary instr cache, fields: data,

Actual physical addr 0x00000000,

virtual address is imprecise.

Imprecise Data Parity Error

21:50:17 GMT+1 Thu Mar 7 2013: Interrupt exception, CPU signal 20, PC = 0x420DA8A4
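
For anyone who wants to pull these values out of a crashinfo file with a script, here is a minimal Python sketch. It assumes the register lines look exactly like the excerpt above; the function names and the `sample` text are hypothetical, not part of any Cisco tool. The pattern also accepts a letter O in place of the zero in "CP0", since pasted logs sometimes carry that substitution.

```python
import re

# Matches lines such as "CP0_ECC (reg 26/0): 0x000000FC".
# "CP[0O]" tolerates a letter O pasted in place of the zero.
REGISTER_LINE = re.compile(
    r"^\s*(CP[0O]_\w+)\s*\(reg\s+(\d+)/(\d+)\)\s*:\s*(0x[0-9A-Fa-f]+)"
)


def parse_cache_error_registers(text):
    """Return {register_name: int_value} for every register line in text."""
    registers = {}
    for line in text.splitlines():
        match = REGISTER_LINE.match(line)
        if match:
            # Normalize a "CPO_" prefix to "CP0_" so lookups are consistent.
            name = "CP0_" + match.group(1)[4:]
            registers[name] = int(match.group(4), 16)
    return registers


def is_cache_parity_crash(text):
    """Hypothetical helper: True if the marker lines quoted in this thread appear."""
    return "Cache error detected!" in text and "Parity Error" in text


# Sample text copied from the excerpt in this post.
sample = """Cache error detected!
CP0_ECC (reg 26/0): 0x000000FC
CP0_CACHERI (reg 27/0): 0x20000000
CP0_CAUSE (reg 13/0): 0x00000800
Real cache error detected. System will be halted.
Imprecise Data Parity Error
"""

if __name__ == "__main__":
    for name, value in parse_cache_error_registers(sample).items():
        print(f"{name} = {value:#010x}")
    print("cache parity crash:", is_cache_parity_crash(sample))
```

This only extracts the raw values; it does not decode what the individual register bits mean.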

Explanation and Action plan:

====================

In most cases, a parity error is a transient (soft) event, and the system recovers by itself after the reset.

These are the two kinds of parity errors:

Soft parity errors

These errors occur when an energy level within the chip (for example, a one or a zero) changes. When the CPU references the affected data, the system either crashes (if the error is in an area that is not recoverable) or recovers by restarting the affected subsystem (for example, a CyBus complex restarts if the error was in the packet memory (MEMD)). In the case of a soft parity error, there is no need to swap the board or any of the components.

Hard parity errors

These errors occur when there is a chip or board failure that corrupts data. In this case, you need to re-seat or replace the affected component, which usually involves a memory chip swap or a board swap. There is a hard parity error when multiple parity errors occur at the same address. There are more complicated cases that are harder to identify. In general, if you see more than one parity error in a particular memory region in a relatively short period, you can consider it to be a hard parity error.
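
The rule of thumb above can be sketched in Python as follows. This is only an illustration: the 30-day window is an assumption (no exact figure is given for "a relatively short period"), and the function name and region labels are hypothetical.

```python
from datetime import datetime, timedelta


def classify_parity_events(events, window=timedelta(days=30)):
    """Classify each memory region as 'soft' or 'hard'.

    events: iterable of (timestamp, region) tuples.
    A region is 'hard' when two of its events fall within `window`
    of each other; the 30-day default is an assumed value.
    """
    by_region = {}
    for timestamp, region in events:
        by_region.setdefault(region, []).append(timestamp)

    verdicts = {}
    for region, stamps in by_region.items():
        stamps.sort()
        # Compare each event with the next one in time order.
        repeated = any(
            later - earlier <= window
            for earlier, later in zip(stamps, stamps[1:])
        )
        verdicts[region] = "hard" if repeated else "soft"
    return verdicts


if __name__ == "__main__":
    # The single crash reported in this thread.
    events = [(datetime(2013, 3, 7, 21, 50, 17), "primary instr cache")]
    print(classify_parity_events(events))
    # → {'primary instr cache': 'soft'}
```

With only one event on record, the crash in this thread comes out as a soft error; a second event in the same region inside the window would flip it to hard.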

Suggestion:

Studies have shown that soft parity errors are 10 to 100 times more frequent than hard parity errors. Therefore, Cisco highly recommends that you wait until a hard parity error is confirmed before you replace the Supervisor. This greatly reduces the impact on your network.

To learn more about parity errors, please check the following CCO documentation:

https://www.cisco.com/en/US/products/hw/routers/ps341/products_tech_note09186a0080094793.html

HTH

Regards

Inayath

*Please rate useful posts.

rpastrana · ‎03-12-2013

Thank you very much for your answer.