ACE modules reloaded

aslamta123 · ‎07-12-2011

Hi Experts,

We had an issue with our data centre ACE modules. Both the primary and DR ACE modules restarted within about 16 hours of each other.

Unfortunately, syslog was not configured on the ACE, and the local log buffer was cleared by the restart.

The current ACE software version is A2(3.2). The modules' uptime was around 300 days.

Here are the logs from the 6509 switches during the restarts:

Primary DC 6509-1 .

Jul 10 18:52:05.383 WAT: %SVCLC-5-FWTRUNK: Firewalled VLANs configured on trunks

.Jul 10 18:56:47.291 WAT: %SNMP-5-MODULETRAP: Module 9 [Down] Trap

Jul 10 18:56:47.127 WAT: %OIR-SP-3-PWRCYCLE: Card in module 9, is being power-cycled off (Reset - Module Reloaded During Download)

Jul 10 18:56:47.271 WAT: %C6KPWR-SP-4-DISABLED: power to module in slot 9 set off (Reset - Module Reloaded During Download)

Jul 10 18:57:00.951 WAT: %OIR-SP-3-PWRCYCLE: Card in module 9, is being power-cycled off (Module not responding to Keep Alive polling)

Jul 10 18:57:00.951 WAT: %C6KPWR-SP-4-DISABLED: power to module in slot 9 set off (Module not responding to Keep Alive polling)

Jul 10 19:01:57.172 WAT: %DIAG-SP-6-RUN_MINIMUM: Module 9: Running Minimal Diagnostics...

.Jul 10 19:01:59.256 WAT: %SNMP-5-MODULETRAP: Module 9 [Up] Trap

Jul 10 19:01:58.700 WAT: %DIAG-SP-6-DIAG_OK: Module 9: Passed Online Diagnostics

Jul 10 19:01:59.256 WAT: %OIR-SP-6-INSCARD: Card inserted in slot 9, interfaces are now online

.Jul 10 19:02:04.548 WAT: %SVCLC-5-FWTRUNK: Firewalled VLANs configured on trunks

DR DC 6509-1 .

Jul 11 09:42:05.759: %LINK-5-CHANGED: Interface TenGigabitEthernet9/1, changed state to administratively down .

Jul 11 09:42:05.763: %SNMP-5-MODULETRAP: Module 9 [Down] Trap

.Jul 11 09:42:05.763: %LINEPROTO-5-UPDOWN: Line protocol on Interface TenGigabitEthernet9/1, changed state to down

Jul 11 09:42:05.599: %OIR-SP-3-PWRCYCLE: Card in module 9, is being power-cycled off (Reset - Module Reloaded During Download)

Jul 11 09:42:05.747: %C6KPWR-SP-4-DISABLED: power to module in slot 9 set off (Reset - Module Reloaded During Download)

Jul 11 09:42:05.767: %LINK-SP-5-CHANGED: Interface TenGigabitEthernet9/1, changed state to administratively down

Jul 11 09:42:05.771: %LINEPROTO-SP-5-UPDOWN: Line protocol on Interface TenGigabitEthernet9/1, changed state to down .

Jul 11 09:42:14.535: %SVCLC-5-SVCLCNTP: Could not update clock on the module 9, rc is -1

Jul 11 09:42:19.395: %OIR-SP-3-PWRCYCLE: Card in module 9, is being power-cycled off (Module not responding to Keep Alive polling)

Jul 11 09:42:19.395: %C6KPWR-SP-4-DISABLED: power to module in slot 9 set off (Module not responding to Keep Alive polling)

Jul 11 09:47:15.819: %DIAG-SP-6-RUN_MINIMUM: Module 9: Running Minimal Diagnostics... .

Jul 11 09:47:19.871: %MLS_RATE-4-DISABLING: The global switching mode is now 'truncated'. Disabling the Layer2 Rate Limiters. .

Jul 11 09:47:19.903: %SNMP-5-MODULETRAP: Module 9 [Up] Trap

Jul 11 09:47:19.633: %DIAG-SP-6-DIAG_OK: Module 9: Passed Online Diagnostics

Jul 11 09:47:19.905: %OIR-SP-6-INSCARD: Card inserted in slot 9, interfaces are now online .

Jul 11 09:47:21.079: %LINK-5-CHANGED: Interface TenGigabitEthernet9/1, changed state to administratively down

Jul 11 09:47:20.912: %LINK-SP-3-UPDOWN: Interface TenGigabitEthernet9/1, changed state to down

Jul 11 09:47:21.080: %LINK-SP-5-CHANGED: Interface TenGigabitEthernet9/1, changed state to administratively down

.Jul 11 09:47:25.039: %SVCLC-5-FWTRUNK: Firewalled VLANs configured on trunks

.Jul 11 09:47:25.047: %LINEPROTO-5-UPDOWN: Line protocol on Interface TenGigabitEthernet9/1, changed state to up

Jul 11 09:47:24.520: %LINK-SP-3-UPDOWN: Interface TenGigabitEthernet9/1, changed state to down

Jul 11 09:47:25.056: %LINK-SP-3-UPDOWN: Interface TenGigabitEthernet9/1, changed state to up

Jul 11 09:47:25.060: %LINEPROTO-SP-5-UPDOWN: Line protocol on Interface TenGigabitEthernet9/1, changed state to up

Please let me know whether anyone has faced this issue before, or whether it is a known bug.
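The switch log lines above are not in strict time order (for example, the [Down] trap at 18:56:47.291 is printed before the power-cycle message at 18:56:47.127). A minimal Python sketch for sorting such lines and measuring how long the module was down is shown here. The names `parse`, `outage` and `SAMPLE` are made up for illustration, the year is assumed to be 2011 because the log lines do not carry one, and month names are assumed to parse in an English locale.

```python
import re
from datetime import datetime, timedelta

# Matches lines such as
# ".Jul 10 18:56:47.291 WAT: %SNMP-5-MODULETRAP: Module 9 [Down] Trap".
# The leading "." and the time zone are both optional.
LOG_RE = re.compile(
    r"^\.?(?P<ts>[A-Z][a-z]{2} +\d{1,2} \d{2}:\d{2}:\d{2}\.\d{3})"
    r"(?: [A-Z]{2,5})?: %(?P<mnemonic>[A-Z0-9_-]+): (?P<text>.*)$"
)

def parse(lines, year=2011):
    """Return (timestamp, mnemonic, text) tuples sorted by timestamp."""
    events = []
    for line in lines:
        m = LOG_RE.match(line.strip())
        if not m:
            continue  # skip headings and anything that is not a log line
        ts = datetime.strptime(f"{year} {m['ts']}", "%Y %b %d %H:%M:%S.%f")
        events.append((ts, m["mnemonic"], m["text"]))
    return sorted(events, key=lambda e: e[0])

def outage(events) -> timedelta:
    """Time from the first PWRCYCLE message to the next INSCARD message."""
    down = next(ts for ts, mn, _ in events if mn.endswith("PWRCYCLE"))
    up = next(ts for ts, mn, _ in events
              if mn.endswith("INSCARD") and ts > down)
    return up - down

# A few lines copied from the primary DC log above.
SAMPLE = [
    "Jul 10 18:52:05.383 WAT: %SVCLC-5-FWTRUNK: Firewalled VLANs configured on trunks",
    ".Jul 10 18:56:47.291 WAT: %SNMP-5-MODULETRAP: Module 9 [Down] Trap",
    "Jul 10 18:56:47.127 WAT: %OIR-SP-3-PWRCYCLE: Card in module 9, is being power-cycled off (Reset - Module Reloaded During Download)",
    ".Jul 10 19:01:59.256 WAT: %SNMP-5-MODULETRAP: Module 9 [Up] Trap",
    "Jul 10 19:01:59.256 WAT: %OIR-SP-6-INSCARD: Card inserted in slot 9, interfaces are now online",
]

events = parse(SAMPLE)
for ts, mnemonic, text in events:
    print(ts.time(), mnemonic, text)
print("module down for", outage(events))
```

Sorting by timestamp makes it easier to line the switch-side events up against anything later recovered from the ACE itself, such as core file timestamps.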

Marko Leopold · ‎07-13-2011

Do you have any events in the output of "show diagnostic events module 9"?

aslamta123 · ‎07-13-2011

Hi Marko,

Thanks for the reply. Here is the show diagnostic events output:

REC-DCD-6509-1#sh diagnostic events module 9
Diagnostic events (storage for 500 events, 44 events recorded)
Event Type (ET): I - Info, W - Warning, E - Error

Time Stamp         ET [Card] Event Message
------------------ -- ------ --------------------------------------------------
09/03 02:57:19.510 I [9]    Diagnostics Passed
07/10 18:56:48.035 E [9]    TestAsicSync Failed
07/10 19:01:58.696 I [9]    Diagnostics Passed

rgds

Aslam

Ahmad Basel Jaber · ‎07-13-2011

Hi Aslam,

Was a core dump file generated by the reload? Use the "dir core:" command to check, and compare the file timestamps with the time of the reload.

If core files exist, copy them off the box using the "copy core:" command, then open a TAC case and attach them for further analysis.

If no core files were generated, you do not have much information to determine the root cause. In that case, make sure syslog is enabled on your ACE so that you can collect the information needed to troubleshoot the issue in the future.

Best regards,

Ahmad

Cesar Roque · ‎07-15-2011

Hello Aslam,

What is the last reboot reason on your ACE?

You can check that with the show version command.

--------------------- Cesar R ANS Team

Marko Leopold · ‎07-13-2011

Hello Aslam!

We had the same ASIC failure with a reboot on our ACE modules before (see my CSC thread #2037058), and support advised us to open a TAC case and RMA the module. We did this because it was a production environment. Maybe you can check this and wait to see whether the failure happens again.

Cheers,

Marko

Jorge Bejarano · ‎07-15-2011

Hello,

As César said, it would be good to check the show version command to determine the reason; it will give you the exact cause of the reboot. If it says "unknown", please bear in mind that there are some bugs considered "silent bugs" which might trigger this type of behavior. Depending on the reason for the reboot, you may need either an upgrade or a hardware replacement.

Hope this helps!!!

Jorge

aslamta123 · ‎07-18-2011

Hi all, thanks for the help. I got the reason from the show version output:

last boot reason: NP 1 Failed : SRAM Parity Error Chan 3

I also got the TAC comment on the SRAM parity error:

The SRAM parity error presented in the core file is not due to a software issue.
The issue is the result of a "bit-flip" within the SRAM itself which can occur as a
result of environmental conditions. This "bit-flip" is rectified by a simple reboot of
the system, which would occur with the generation of the core file. Cisco internal
testing and customer experience has shown that these types of issues can occur
with very low frequency, but do not require an RMA of the device.
If there are multiple instances of this issue on the same module, a proactive RMA/EFA
of the device would be in order.

ACE is susceptible to this because of the way it uses SRAM to store control information
and packet data as opposed to scratch-pad storage. Almost any 1-bit flip will be detected as a
parity error. Cisco has recognized the issue and is taking action to ensure this will not be
an issue on the next generation of the ACE module. The next generation module design
and timeline is currently under review.
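To illustrate why a single flipped bit shows up as a parity error, here is a minimal Python sketch of generic even parity over one stored word. It is not the ACE's actual hardware scheme, which the TAC comment does not describe; the function names and the 8-bit example word are assumptions made up for the illustration.

```python
def parity_bit(word: int) -> int:
    """Even-parity bit: 1 if the word has an odd number of 1 bits, else 0."""
    return bin(word).count("1") % 2

def store(word: int) -> tuple[int, int]:
    """Simulate writing a word to memory together with its parity bit."""
    return word, parity_bit(word)

def check(word: int, stored_parity: int) -> bool:
    """Simulate a read: True if the word still matches its stored parity."""
    return parity_bit(word) == stored_parity

word, p = store(0b10110010)   # hypothetical 8-bit word
print("clean read ok:", check(word, p))

# Flip each bit in turn: every single-bit flip changes the parity,
# so every one of them is reported as an error.
for bit in range(8):
    flipped = word ^ (1 << bit)
    print(f"bit {bit} flipped, error detected:", not check(flipped, p))

# Parity can only detect the error, not say which bit flipped, so it
# cannot correct it in place. A flip of two bits in the same word would
# leave the parity unchanged and go unnoticed by a single parity bit.
double = word ^ 0b11
print("double flip detected:", not check(double, p))
```

Because parity only detects the flip and cannot repair it, the usual recovery is to rewrite the memory contents, which is consistent with TAC saying a reboot rectifies the condition.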

Thanks again for the help.

Aslam