Re: ACE Crash due to SRAM Parity

sumaiyausa · ‎11-23-2009

Hi Experts,

One of my ACE modules running A2(1.6a) has crashed due to an SRAM parity error.

ACE20Admin#show version
Software Version A2(1.6a)

last boot reason: NP 1 Failed : SRAM Parity Error Chan 2

I would like to know whether this is a software bug or whether a hardware replacement is needed.

Thanks in advance.

Regards,

Sum.

Gilles Dufour · ‎11-23-2009

Sum,

A single SRAM parity error does not justify an RMA.

Unfortunately, SRAMs are very sensitive to light, dust, radiation, shock, temperature, and so on, so it is possible to get an SRAM parity error on a healthy ACE.

Only if you see repeated errors on the same blade is it an indication that there is a hardware problem.

Gilles.

inayathulla1 · ‎11-23-2009

Hi Gilles,
I have the same issue, and when I researched it I found a bug that has been fixed in the 2.0 version.
Bug: CSCsv52331. Bug details: ACE crashes with SRAM parity error : source OCM ME

This bug was resolved in the A2(2.1) release.
Resolved Caveats:

CSCsv52331—The ACE becomes unresponsive due to an SRAM parity error. Workaround: None.

What is your opinion on this?

Thanks in Advance.

Regards,

Inayath.

Gilles Dufour · ‎11-24-2009

Yes, this is a particular case where we tried to access an address that does not actually exist.

There is not really a parity error, but it was detected as such on the assumption that the pointer got corrupted in SRAM.

Anyway, when you do get an ACE crash (especially SRAM parity errors), it is strongly advised to open a service request with the TAC.

We can then determine whether this is software or hardware. And if it is a real parity error, we do keep track of them to see if there is a "bad" trend.

If we do not get all SRAM parity errors reported to us, we can't detect that there is a problem in the field.

Thanks.

Gilles.

JOHN WAITE · ‎02-18-2010

We had the same issue. Our standby ACE rebooted a couple of nights ago with this SRAM Parity error.

We opened a TAC case and this is the reply we got,

The SRAM parity error presented in the core file is not due to a software issue.
The issue is the result of a "bit-flip" within the SRAM itself which can occur as a
result of environmental conditions. This "bit-flip" is rectified by a simple reboot of
the system, which would occur with the generation of the core file. Cisco internal
testing and customer experience has shown that these types of issues can occur
with very low frequency, but do not require an RMA of the device.
If there are multiple instances of this issue on the same module, a proactive RMA/EFA
of the device would be in order.

ACE is susceptible to this because of the way it uses SRAM to store control information
and packet data as opposed to scratch-pad storage. Almost any 1-bit flip will be detected as a
parity error. Cisco has recognized the issue and is taking action to ensure this will not be
an issue on the next generation of the ACE module. The next generation module design
and timeline is currently under review.
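
For anyone wondering why "almost any 1-bit flip will be detected as a parity error", here is a minimal Python sketch of a single even-parity bit protecting a stored word. This is only an illustration of the general technique, not the ACE's actual SRAM protection scheme: the 32-bit width, the sample value, and the names `parity_bit`, `store` and `check` are all assumptions made up for the example.

```python
WIDTH = 32  # assumed word width, for illustration only


def parity_bit(word: int) -> int:
    """Return 1 if the word has an odd number of set bits, else 0 (even parity)."""
    return bin(word).count("1") % 2


def store(word: int) -> tuple[int, int]:
    """Model a write: keep the word together with its parity bit."""
    return word, parity_bit(word)


def check(word: int, stored_parity: int) -> bool:
    """Model a read: True if the word still matches its stored parity bit."""
    return parity_bit(word) == stored_parity


original = 0xDEADBEEF  # arbitrary sample value
data, p = store(original)

# Flip each single bit in turn: every flip changes the bit count by one,
# so the parity check fails every time.
single_flips_detected = all(not check(data ^ (1 << i), p) for i in range(WIDTH))

# Flip two bits at once: the parity is unchanged, so the check passes
# even though the data is corrupted.
double_flip_detected = not check(data ^ 0b11, p)

print(single_flips_detected)  # True
print(double_flip_detected)   # False
```

The point of the sketch is that parity can only detect the error, not correct it, which is consistent with the reply above saying the bit-flip is cleared by a reboot.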

We are running A2 2.3 code.

Everyone can derive their own opinion from that response. My take is that it sounds like a hardware design issue. It certainly does not give us the "warm and fuzzies" we've come to expect from Cisco.

Gilles Dufour · ‎02-22-2010

This is the problem with SRAM memory.

All equipment makers face the same issue with this type of memory.

This is the reason why we are working on a way to get rid of this type of memory.

G.

marciobaesse · ‎04-05-2012

Hi guys,

My error is : last boot reason: NP 2 Failed : SRAM Parity Error Chan 3

The issue is the result of a "bit-flip" within the SRAM itself which can occur as a result of environmental conditions. This "bit-flip" is rectified by a simple reboot of the system, which would occur with the generation of the core file. Cisco internal testing and customer experience has shown that these types of issues can occur with very low frequency, but do not require an RMA of the device.

Regards,

Marcio Baesse

Jorge Bejarano · ‎04-05-2012

Hardware designers and developers in general have identified this issue with SRAM memory, which might be triggered by environmental conditions. The way SRAM works makes it susceptible to these issues. Cisco is highly focused on this and we are working on it. This behavior may also be linked to some software defects, but if you have experienced this issue before and you are running A2 2.3, then the recommendation is to proceed with a replacement, since the device hardware might be affected. This issue occurs with a very low frequency.

J.

marciobaesse · ‎04-05-2012

I received an update from Cisco, and we will monitor this ACE module.

If the problem reappears, we will upgrade to A2(3.3).

tks,

Marcio Baesse

csimmons · ‎04-09-2012

Our ACE20 running version A2(3.3) reloaded with "NP 1 failed : NP Control Store Parity Error" on 3/28.

Per TAC we hit the following bug id: CSCsz65679

http://tools.cisco.com/Support/BugToolKit/search/getBugDetails.do?method=fetchBugDetails&bugId=CSCsz65679
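
Since the advice throughout this thread is to watch for repeats on the same module, it can help to pull the fields out of the boot reason when collecting them. Below is a rough Python sketch that parses the three boot reason strings quoted in this thread. The pattern is inferred only from those three samples and is an assumption, not a documented Cisco format, and `parse_boot_reason` is a hypothetical helper name.

```python
import re

# Pattern inferred from the boot reason strings quoted in this thread
# (an assumption, not an official format). "Chan N" is optional.
BOOT_REASON = re.compile(
    r"NP\s+(?P<np>\d+)\s+failed\s*:\s*(?P<error>.+?)(?:\s+Chan\s+(?P<chan>\d+))?\s*$",
    re.IGNORECASE,
)


def parse_boot_reason(line: str):
    """Return a dict with the NP number, error text and channel, or None if no match."""
    m = BOOT_REASON.search(line)
    if m is None:
        return None
    chan = m.group("chan")
    return {
        "np": int(m.group("np")),
        "error": m.group("error"),
        "channel": int(chan) if chan is not None else None,
    }


samples = [
    "last boot reason: NP 1 Failed : SRAM Parity Error Chan 2",
    "last boot reason: NP 2 Failed : SRAM Parity Error Chan 3",
    "NP 1 failed : NP Control Store Parity Error",
]

for s in samples:
    print(parse_boot_reason(s))
```

Keeping the NP number and channel per crash makes it easier to tell TAC whether repeated errors hit the same place on the same blade.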

sogleedy41x · ‎09-07-2013

Here is the response from Cisco for my issue; hopefully it can shed some light:

Problem Description

As I understand it so far, the issue is that the ACE20 module in slot 9 of the chassis has crashed three times at varying intervals, and the cause of the module failure is a hard parity error.

There is a well-known defect documented for crashes / unexpected reloads because of parity errors.

tools.cisco.com/Support/BugToolKit/search/getBugDetails.do?method=fetchBugDetails&bugId=CSCsz65679


Symptom: The ACE Module crashed unexpectedly with an NP Control Store Parity Error, which can be due to hardware. Conditions: Normal operations. Workaround: None. Monitor the ACE Module, and if this reoccurs an RMA should be considered.

Explanation:

The SRAM parity error presented in the core file is not due to a software issue. The issue is the result of a "bit-flip" within the SRAM itself which can occur as a result of environmental conditions. This "bit-flip" is rectified by a simple reboot of the system, which would occur with the generation of the core file. Cisco internal testing and customer experience has shown that these types of issues can occur with very low frequency, but do not require an RMA of the device.

ACE is susceptible to this because of the way it uses SRAM to store control information and packet data as opposed to scratch-pad storage. Almost any 1-bit flip will be detected as a parity error.

CSCtc53046 is a partial software workaround which mitigates hardware-generated SRAM parity errors by reducing the amount of access to the SRAM due to the collection of the interface statistics. It is highly recommended that you upgrade to A2(3.3) or later to both lower the overall rate of SRAM parity errors and ensure failover occurs appropriately.

SRAM errors are expected to occur at a frequency of approximately one per year per ACE module. If a particular module experiences a significantly higher failure rate and is running A2(3.3) or later, then a proactive RMA would be in order.

Suggestion:

1. Since you are already running A2(3.2), I suggest that you first upgrade to A2(3.3) and then monitor whether the device crashes again.

2. If the same happens again, we should RMA the module.
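
The response above quotes roughly one SRAM error per year per ACE module and says a "significantly higher" rate on A2(3.3) or later warrants a proactive RMA, but it does not define "significantly". As a rough sketch only, one way to put a number on it is to treat the errors as a Poisson process at the quoted rate and ask how likely the observed crash count would be on a healthy module. The Poisson model, the 5% cutoff, the six-month window, and the function names are all my own assumptions, not Cisco's criteria.

```python
import math

EXPECTED_PER_YEAR = 1.0  # approximate figure quoted in the Cisco response above


def prob_at_least(k: int, years: float, rate: float = EXPECTED_PER_YEAR) -> float:
    """Poisson probability of seeing k or more errors in `years` at `rate` per year."""
    lam = rate * years
    below = sum(math.exp(-lam) * lam**i / math.factorial(i) for i in range(k))
    return 1.0 - below


def looks_excessive(crashes: int, years: float, threshold: float = 0.05) -> bool:
    """True if this many crashes would be unlikely at the quoted baseline rate.

    The threshold is an arbitrary assumption for illustration.
    """
    return prob_at_least(crashes, years) < threshold


# Hypothetical example: three crashes within six months on one module.
print(looks_excessive(3, 0.5))  # True
# One crash in a year is in line with the quoted rate.
print(looks_excessive(1, 1.0))  # False
```

This is only a sanity check to bring to the TAC case; the actual RMA decision is theirs and also depends on the module already running A2(3.3) or later.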