07-12-2011 10:26 PM
Hi Experts,
We had an issue with our data centre ACE modules. Both the primary and DR ACE modules restarted about 16 hours apart.
Unfortunately, syslog was not configured on the ACE, and the local log was cleared by the restart.
The current ACE software version is A2(3.2). The modules' uptime was around 300 days.
Here are the logs from the 6509 switches during the restarts:
Primary DC 6509-1
.Jul 10 18:52:05.383 WAT: %SVCLC-5-FWTRUNK: Firewalled VLANs configured on trunks
.Jul 10 18:56:47.291 WAT: %SNMP-5-MODULETRAP: Module 9 [Down] Trap
Jul 10 18:56:47.127 WAT: %OIR-SP-3-PWRCYCLE: Card in module 9, is being power-cycled off (Reset - Module Reloaded During Download)
Jul 10 18:56:47.271 WAT: %C6KPWR-SP-4-DISABLED: power to module in slot 9 set off (Reset - Module Reloaded During Download)
Jul 10 18:57:00.951 WAT: %OIR-SP-3-PWRCYCLE: Card in module 9, is being power-cycled off (Module not responding to Keep Alive polling)
Jul 10 18:57:00.951 WAT: %C6KPWR-SP-4-DISABLED: power to module in slot 9 set off (Module not responding to Keep Alive polling)
Jul 10 19:01:57.172 WAT: %DIAG-SP-6-RUN_MINIMUM: Module 9: Running Minimal Diagnostics...
.Jul 10 19:01:59.256 WAT: %SNMP-5-MODULETRAP: Module 9 [Up] Trap
Jul 10 19:01:58.700 WAT: %DIAG-SP-6-DIAG_OK: Module 9: Passed Online Diagnostics
Jul 10 19:01:59.256 WAT: %OIR-SP-6-INSCARD: Card inserted in slot 9, interfaces are now online
.Jul 10 19:02:04.548 WAT: %SVCLC-5-FWTRUNK: Firewalled VLANs configured on trunks
DR DC 6509-1
.Jul 11 09:42:05.759: %LINK-5-CHANGED: Interface TenGigabitEthernet9/1, changed state to administratively down
.Jul 11 09:42:05.763: %SNMP-5-MODULETRAP: Module 9 [Down] Trap
.Jul 11 09:42:05.763: %LINEPROTO-5-UPDOWN: Line protocol on Interface TenGigabitEthernet9/1, changed state to down
Jul 11 09:42:05.599: %OIR-SP-3-PWRCYCLE: Card in module 9, is being power-cycled off (Reset - Module Reloaded During Download)
Jul 11 09:42:05.747: %C6KPWR-SP-4-DISABLED: power to module in slot 9 set off (Reset - Module Reloaded During Download)
Jul 11 09:42:05.767: %LINK-SP-5-CHANGED: Interface TenGigabitEthernet9/1, changed state to administratively down
Jul 11 09:42:05.771: %LINEPROTO-SP-5-UPDOWN: Line protocol on Interface TenGigabitEthernet9/1, changed state to down
.Jul 11 09:42:14.535: %SVCLC-5-SVCLCNTP: Could not update clock on the module 9, rc is -1
Jul 11 09:42:19.395: %OIR-SP-3-PWRCYCLE: Card in module 9, is being power-cycled off (Module not responding to Keep Alive polling)
Jul 11 09:42:19.395: %C6KPWR-SP-4-DISABLED: power to module in slot 9 set off (Module not responding to Keep Alive polling)
Jul 11 09:47:15.819: %DIAG-SP-6-RUN_MINIMUM: Module 9: Running Minimal Diagnostics...
.Jul 11 09:47:19.871: %MLS_RATE-4-DISABLING: The global switching mode is now 'truncated'. Disabling the Layer2 Rate Limiters.
.Jul 11 09:47:19.903: %SNMP-5-MODULETRAP: Module 9 [Up] Trap
Jul 11 09:47:19.633: %DIAG-SP-6-DIAG_OK: Module 9: Passed Online Diagnostics
Jul 11 09:47:19.905: %OIR-SP-6-INSCARD: Card inserted in slot 9, interfaces are now online
.Jul 11 09:47:21.079: %LINK-5-CHANGED: Interface TenGigabitEthernet9/1, changed state to administratively down
Jul 11 09:47:20.912: %LINK-SP-3-UPDOWN: Interface TenGigabitEthernet9/1, changed state to down
Jul 11 09:47:21.080: %LINK-SP-5-CHANGED: Interface TenGigabitEthernet9/1, changed state to administratively down
.Jul 11 09:47:25.039: %SVCLC-5-FWTRUNK: Firewalled VLANs configured on trunks
.Jul 11 09:47:25.047: %LINEPROTO-5-UPDOWN: Line protocol on Interface TenGigabitEthernet9/1, changed state to up
Jul 11 09:47:24.520: %LINK-SP-3-UPDOWN: Interface TenGigabitEthernet9/1, changed state to down
Jul 11 09:47:25.056: %LINK-SP-3-UPDOWN: Interface TenGigabitEthernet9/1, changed state to up
Jul 11 09:47:25.060: %LINEPROTO-SP-5-UPDOWN: Line protocol on Interface TenGigabitEthernet9/1, changed state to up
Please let me know if anyone has faced this issue before, or whether it is a known bug.
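For anyone who wants to pull the power-cycle reasons out of switch logs like the ones above, here is a minimal Python sketch. The regular expression is only an assumption based on the %OIR-SP-3-PWRCYCLE lines pasted in this post; the function name power_cycle_events is hypothetical, and other IOS versions may format these messages differently.

```python
import re

# Assumed line shape, taken from the log lines in this post, e.g.:
# "Jul 10 18:56:47.127 WAT: %OIR-SP-3-PWRCYCLE: Card in module 9,
#  is being power-cycled off (Reset - Module Reloaded During Download)"
# A leading "." and a timezone token are both treated as optional.
PWRCYCLE = re.compile(
    r"^\.?(?P<ts>\w{3}\s+\d{1,2} \d{2}:\d{2}:\d{2}\.\d{3})(?: (?P<tz>\w+))?: "
    r"%OIR-SP-3-PWRCYCLE: Card in module (?P<module>\d+), "
    r"is being power-cycled off \((?P<reason>.*)\)"
)


def power_cycle_events(log_text):
    """Return (timestamp, module, reason) for each power-cycle message."""
    events = []
    for line in log_text.splitlines():
        match = PWRCYCLE.match(line.strip())
        if match:
            events.append(
                (match.group("ts"), int(match.group("module")), match.group("reason"))
            )
    return events
```

Feeding it the saved log text returns one tuple per power-cycle message, which makes it easier to line up the reset reasons on the two switches.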
07-13-2011 12:21 AM
Do you have any events in the "show diagnostic events module 9"?
07-13-2011 12:58 AM
Hi Mark,
Thanks for the reply. Here is the show diagnostic output:
REC-DCD-6509-1#sh diagnostic events module 9
Diagnostic events (storage for 500 events, 44 events recorded)
Event Type (ET): I - Info, W - Warning, E - Error
Time Stamp         ET [Card] Event Message
------------------ -- ------ --------------------------------------------------
09/03 02:57:19.510 I  [9]    Diagnostics Passed
07/10 18:56:48.035 E  [9]    TestAsicSync Failed
07/10 19:01:58.696 I  [9]    Diagnostics Passed
rgds
Aslam
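As a rough aid for longer event lists, the error rows can be filtered out of saved "show diagnostic events" output with a few lines of Python. This is only a sketch: the row layout is assumed from the three rows pasted above, and the function name diagnostic_errors is hypothetical.

```python
import re

# Assumed row layout, based on the output in this post:
# "<MM/DD HH:MM:SS.mmm> <I|W|E> [<card>] <message>"
EVENT = re.compile(
    r"^(?P<ts>\d{2}/\d{2} \d{2}:\d{2}:\d{2}\.\d{3})\s+"
    r"(?P<et>[IWE])\s+\[(?P<card>\d+)\]\s+(?P<msg>.+)$"
)


def diagnostic_errors(output):
    """Return (timestamp, card, message) for rows whose event type is E."""
    errors = []
    for line in output.splitlines():
        match = EVENT.match(line.strip())
        if match and match.group("et") == "E":
            errors.append(
                (match.group("ts"), int(match.group("card")), match.group("msg").strip())
            )
    return errors
```

Header and separator lines do not match the pattern, so they are skipped without special handling.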
07-13-2011 02:31 AM
Hi Aslam,
Were any core dump files generated by the reload? Use the "dir core:" command to check, and compare the timestamps with the reload time.
If you have any, copy them off the box using the "copy core:" command, then open a TAC case and attach them for further analysis.
If no core files were generated, you don't have much information to determine the root cause. Make sure syslog is enabled on your ACE so you can collect the information needed to troubleshoot the issue in the future.
Best regards,
Ahmad
07-15-2011 07:57 PM
Hello Aslam,
What is the last reboot reason on your ACE?
You can check that with the show version command.
07-13-2011 02:44 AM
Hello Aslam!
We had the same ASIC failure with a reboot on our ACE modules before (see my CSC thread #2037058), and support advised us to open a TAC case and RMA the module. We did this because it was a production environment. Maybe you can check it and wait to see whether the failure happens again.
Cheers,
Marko
07-15-2011 10:18 PM
Hello,
As César said, it would be good to check the show version output to determine the reason for the reboot; it will tell you the exact cause. If it says "unknown", bear in mind that some bugs, considered "silent bugs", can trigger this type of behavior. Depending on the reboot reason, you may need either an upgrade or a hardware replacement.
Hope this helps!!!
Jorge
07-18-2011 12:52 AM
Hi all, thanks for the help. I got the reason from the show version output:
last boot reason: NP 1 Failed : SRAM Parity Error Chan 3
I also got TAC's comment on the SRAM parity error:
The SRAM parity error presented in the core file is not due to a software issue.
The issue is the result of a "bit-flip" within the SRAM itself which can occur as a
result of environmental conditions. This "bit-flip" is rectified by a simple reboot of
the system, which would occur with the generation of the core file. Cisco internal
testing and customer experience have shown that these types of issues can occur
with very low frequency, but do not require an RMA of the device.
If there are multiple instances of this issue on the same module, a proactive RMA/EFA
of the device would be in order.
ACE is susceptible to this because of the way it uses SRAM to store control information
and packet data as opposed to scratch-pad storage. Almost any 1-bit flip will be detected as a
parity error. Cisco has recognized the issue and is taking action to ensure this will not be
an issue on the next generation of the ACE module. The next generation module design
and timeline is currently under review.
Thanks again for the help.
Aslam
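To illustrate the "1-bit flip is detected as a parity error" point from the TAC note, here is a small generic Python sketch of a single even-parity bit. It is not the ACE's actual SRAM protection scheme, which the note does not describe; the function names and the 8-bit example word are assumptions for illustration only.

```python
def parity_bit(word):
    """Even-parity bit: 1 when the word contains an odd number of 1 bits."""
    return bin(word).count("1") % 2


def store(word):
    """Model a memory write: keep the word together with its parity bit."""
    return word, parity_bit(word)


def check(word, stored_parity):
    """Model a memory read: True when the word still matches its parity bit."""
    return parity_bit(word) == stored_parity


def flip_bit(word, position):
    """Simulate a bit-flip at the given bit position."""
    return word ^ (1 << position)


# Hypothetical 8-bit word, chosen only for the demonstration.
original, parity = store(0b10110010)
corrupted = flip_bit(original, 3)
print(check(original, parity))   # intact word passes the check
print(check(corrupted, parity))  # single-bit flip fails the check
```

A single parity bit can only detect the flip, not correct it or say which bit changed, which fits the note's description of the module recovering through a reboot rather than repairing the data in place.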