04-09-2014 03:31 PM - edited 03-01-2019 11:37 AM
Our UCS suffered a major meltdown today, with both fabric interconnects deciding to reboot. As far as I know, I have no HA policy set that would cause a fabric interconnect to reboot on its own. The other fabric interconnect shows a similar error. My VAR is troubleshooting this, but I wanted to get some other brains on it because the impact was total. VMware could not fail over properly, and domain controllers, SQL, pretty much everything was pooched. I have attached the first couple hundred events from UCS. The VAR thinks I hit a bug and should just upgrade.
cucs-1-A(nxos)# show system reset-reason
----- reset reason for Supervisor-module 1 (from Supervisor in slot 1) ---
1) At 633300 usecs after Wed Apr 9 08:53:29 2014
Reason: Reset triggered due to HA policy of Reset
Service: monitor hap reset
Version: 5.0(3)N2(2.1w)
Software
BIOS: version 3.5.0
loader: version N/A
kickstart: version 5.0(3)N2(2.1w)
system: version 5.0(3)N2(2.1w)
power-seq: Module 1: version v1.0
Module 3: version v2.0
uC: version v1.2.0.1
SFP uC: Module 1: v1.0.0.0
BIOS compile time: 02/03/2011
kickstart image file is: bootflash:/installables/switch/ucs-6100-k9-kickstart.5.0.3.N2.2.1w.bin
kickstart compile time: 2/3/2012 18:00:00 [02/03/2012 18:15:13]
system image file is: bootflash:/installables/switch/ucs-6100-k9-system.5.0.3.N2.2.1w.bin
system compile time: 2/3/2012 18:00:00 [02/03/2012 20:16:06]
Hardware
cisco UCS 6248 Series Fabric Interconnect ("O2 32X10GE/Modular Universal Platform Supervisor")
Intel(R) Xeon(R) CPU with 16622556 kB of memory.
Processor Board ID FOC15485QRG
Device name: hrk-cucs-1-A
bootflash: 29535848 kB
Kernel uptime is 0 day(s), 6 hour(s), 35 minute(s), 12 second(s)
Last reset at 633300 usecs after Wed Apr 9 08:53:29 2014
Reason: Reset triggered due to HA policy of Reset
System version: 5.0(3)N2(2.1w)
Service: monitor hap reset
plugin
Core Plugin, Ethernet Plugin, Fc Plugin, Virtualization Plugin
04-09-2014 03:34 PM
Note that the UCS, including the fabric interconnects and the attached Nexus 5548s, has run without issue for two years.
04-09-2014 11:04 PM
Hi Mmedwid,
I believe your fabric interconnects are hitting the following defect: CSCug20103
6100 Reset after repeated span show command (HAP reset)
https://tools.cisco.com/bugsearch/bug/CSCug20103
The fix for this issue is to upgrade your UCS firmware to 2.0(5) or 2.1(2a).
Let me know if you have any questions and don't forget to rate useful posts.
04-10-2014 04:05 AM
The release notes for 2.0(5a) list CSCua91672:
http://www.cisco.com/c/en/us/td/docs/unified_computing/ucs/release/notes/OL_25363.html
I am almost never on the fabric interconnects executing CLI commands, but a memory leak from some other cause would certainly be a possibility. It is troubling that a high-availability process can cause a double reset within its own high-availability architecture.
04-10-2014 04:20 AM
04-10-2014 08:36 AM
Were there any core files generated?
You can check the memory usage for this process by running the following command and monitoring the memory values over time. The value should not keep increasing.
'show monitor internal mem-stats detail | include ETH_SPAN_MEM_show'
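One way to act on this is to note the reported value at intervals and check whether it only ever goes up. Below is a minimal Python sketch of that check. It assumes you copy the numbers by hand, since the output format of the command is not shown in this thread and the sketch does not try to parse it. The function name `is_steadily_increasing` and the sample readings are hypothetical.

```python
def is_steadily_increasing(samples, min_samples=4):
    """Return True if every reading is strictly larger than the one before it.

    samples: values read off the command output at successive times
             (hand-collected; hypothetical input).
    min_samples: require at least this many readings before flagging anything.
    """
    if len(samples) < min_samples:
        return False
    return all(later > earlier for earlier, later in zip(samples, samples[1:]))


# Made-up readings for illustration only; not taken from a real device.
growing = [1200, 1350, 1510, 1700, 1920]
fluctuating = [1200, 1350, 1210, 1340, 1205]

print(is_steadily_increasing(growing))
print(is_steadily_increasing(fluctuating))
```

A value that rises and falls is normal allocation churn; one that climbs on every reading over hours or days is worth including in a TAC case.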
09-09-2015 04:25 AM
This just happened to one of my clusters after being online for exactly a year, although they have been rebooted at least once due to a FW update.
Reason: Reset triggered due to HA policy of Reset
System version: 5.2(3)N2(2.23g)
Service: fwm hap reset
Memory looks fine. We have over 14 clusters. Anything I should be looking for?
09-10-2015 09:51 AM
Hi Robert,
Were any cores generated in UCSM?
From the reset reason that you provided, it doesn't look like it was triggered by the same service:
Service: fwm hap reset
I don't think this is the same issue.
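If you need to make the same comparison across many clusters, the `Service:` field can be pulled out of saved `show system reset-reason` output with a short script. This is a rough Python sketch that assumes the output looks like the excerpt in the first post; `parse_reset_reasons` is a hypothetical helper, not a Cisco tool.

```python
import re


def parse_reset_reasons(text):
    """Parse 'show system reset-reason' text into a list of dicts.

    Assumes the layout seen earlier in this thread: a numbered 'N) At ...'
    line followed by 'Reason:', 'Service:' and 'Version:' lines.
    """
    entries = []
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        start = re.match(r"^\d+\)\s+At\s+(.*)$", line)
        if start:
            current = {"when": start.group(1)}
            entries.append(current)
            continue
        field = re.match(r"^(Reason|Service|Version):\s*(.*)$", line)
        if field and current is not None:
            current[field.group(1).lower()] = field.group(2)
    return entries


# Excerpt copied from the first post in this thread.
sample = """\
----- reset reason for Supervisor-module 1 (from Supervisor in slot 1) ---
1) At 633300 usecs after Wed Apr 9 08:53:29 2014
Reason: Reset triggered due to HA policy of Reset
Service: monitor hap reset
Version: 5.0(3)N2(2.1w)
"""

entries = parse_reset_reasons(sample)
first_service = entries[0]["service"]
later_service = "fwm hap reset"  # service reported in the 2015 post

print(first_service)
print(first_service == later_service)
```

Both resets share the same `Reason:` line, so the `Service:` line is the part that tells the two cases apart.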