mmedwid · ‎04-09-2014

UCS suffered a major meltdown today, with both fabric interconnects deciding to reboot. As far as I know, I have no HA policy set that would cause a fabric interconnect to reboot on its own. A similar error is on the other fabric interconnect. My VAR is troubleshooting this, but I wanted to get some other brains on it because the impact was total. VMware could not fail over properly, and domain controllers, SQL, pretty much everything was pooched. I have attached the first couple hundred events from UCS. The VAR thinks I hit a bug and should just upgrade.

cucs-1-A(nxos)# show system reset-reason
----- reset reason for Supervisor-module 1 (from Supervisor in slot 1) ---
1) At 633300 usecs after Wed Apr 9 08:53:29 2014
Reason: Reset triggered due to HA policy of Reset
Service: monitor hap reset
Version: 5.0(3)N2(2.1w)

Software
BIOS: version 3.5.0
loader: version N/A
kickstart: version 5.0(3)N2(2.1w)
system: version 5.0(3)N2(2.1w)
power-seq: Module 1: version v1.0
Module 3: version v2.0
uC: version v1.2.0.1
SFP uC: Module 1: v1.0.0.0
BIOS compile time: 02/03/2011
kickstart image file is: bootflash:/installables/switch/ucs-6100-k9-kickstart.5.0.3.N2.2.1w.bin
kickstart compile time: 2/3/2012 18:00:00 [02/03/2012 18:15:13]
system image file is: bootflash:/installables/switch/ucs-6100-k9-system.5.0.3.N2.2.1w.bin
system compile time: 2/3/2012 18:00:00 [02/03/2012 20:16:06]

Hardware
cisco UCS 6248 Series Fabric Interconnect ("O2 32X10GE/Modular Universal Platform Supervisor")
Intel(R) Xeon(R) CPU with 16622556 kB of memory.
Processor Board ID FOC15485QRG

Device name: hrk-cucs-1-A
bootflash: 29535848 kB

Kernel uptime is 0 day(s), 6 hour(s), 35 minute(s), 12 second(s)

Last reset at 633300 usecs after Wed Apr 9 08:53:29 2014

Reason: Reset triggered due to HA policy of Reset
System version: 5.0(3)N2(2.1w)
Service: monitor hap reset

plugin
Core Plugin, Ethernet Plugin, Fc Plugin, Virtualization Plugin

mmedwid · ‎04-09-2014

Note that the UCS, including the fabric interconnects and the attached Nexus 5548s, has run without issue for two years.

Manuel Velasco · ‎04-09-2014

Hi Mmedwid,

I believe your fabric interconnects are hitting the following defect: CSCug20103

6100 Reset after repeated span show command (HAP reset)

https://tools.cisco.com/bugsearch/bug/CSCug20103

The fix for this issue is to upgrade your UCS firmware to 2.0(5) or 2.1(2a).

Let me know if you have any questions and don't forget to rate useful posts.

mmedwid · ‎04-10-2014

The release notes for 2.0(5a) show:

CSCua91672

The fcoe_mgr hap reset will no longer cause FI reboot.

http://www.cisco.com/c/en/us/td/docs/unified_computing/ucs/release/notes/OL_25363.html

I am really never on the Fabric Interconnects to execute any CLI commands, but a memory leak for some other reason would certainly be a possibility. The fact that its own High Availability process can cause a double reset in its own High Availability architecture is troubling.

mmedwid · ‎04-10-2014

Checking monitoring of the UCS itself (which I believe is a representation of the active Fabric Interconnect) shows that memory usage over the last year never exceeded 20%.

Manuel Velasco · ‎04-10-2014

Were there any core files generated?

You can check the memory usage for this process by running the following command and monitoring the memory values. The value should not keep increasing over time.

'show monitor internal mem-stats detail | include ETH_SPAN_MEM_show'
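As a rough sketch of what "should not keep increasing" means in practice, assume you note the value down at regular intervals. The exact output format of the command is not shown in this thread, so the sketch below takes plain integers rather than parsing CLI text. The function name and the sample numbers are made up for illustration.

```python
def keeps_increasing(samples, min_samples=4):
    """Return True if every sample is strictly larger than the one before it.

    `samples` is a list of integers collected by hand at regular intervals
    (hypothetical input; this does not parse real CLI output).
    Fewer than `min_samples` readings is treated as not enough evidence.
    """
    if len(samples) < min_samples:
        return False
    return all(later > earlier for earlier, later in zip(samples, samples[1:]))


# Made-up readings: the first series fluctuates, the second only ever grows.
print(keeps_increasing([1200, 1200, 1210, 1205, 1200]))
print(keeps_increasing([1200, 1350, 1500, 1720, 1990]))
```

A series that only ever grows between readings is the pattern worth reporting; one that goes up and down is normal allocation and release.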

robertdemay · ‎09-09-2015

This just happened to one of my clusters after being online for exactly a year, although the fabric interconnects have been rebooted at least once due to a FW update.

Reason: Reset triggered due to HA policy of Reset

System version: 5.2(3)N2(2.23g)

Service: fwm hap reset

Memory looks fine. We have over 14 clusters. Anything I should be looking for?

Manuel Velasco · ‎09-10-2015

Hi Robert,

Were any cores generated in UCSM?

From the reset reason that you provided, it doesn't look like it was triggered by the same service:

Service: fwm hap reset

I don't think this is the same issue.
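If you want to compare reset reasons across many clusters, a minimal sketch along these lines could pull out the `Reason`, `Service`, and version fields from pasted output. It assumes the text looks like the two snippets quoted in this thread. The function name is hypothetical and this is not a Cisco tool.

```python
import re

# Matches lines such as "Service: fwm hap reset" or "System version: 5.2(3)N2(2.23g)".
FIELD_RE = re.compile(r"\s*(Reason|Service|Version|System version):\s*(.+?)\s*$")


def parse_reset_reason(text):
    """Return a dict with the first Reason, Service, and Version found in `text`."""
    fields = {}
    for line in text.splitlines():
        match = FIELD_RE.match(line)
        if match:
            key = "Version" if match.group(1) == "System version" else match.group(1)
            fields.setdefault(key, match.group(2))  # keep the most recent reset only
    return fields


# The two reset reasons quoted earlier in this thread.
first = """Reason: Reset triggered due to HA policy of Reset
Service: monitor hap reset
Version: 5.0(3)N2(2.1w)"""
second = """Reason: Reset triggered due to HA policy of Reset
System version: 5.2(3)N2(2.23g)
Service: fwm hap reset"""

a, b = parse_reset_reason(first), parse_reset_reason(second)
print(a["Service"], "|", b["Service"])
print("same service:", a["Service"] == b["Service"])
```

The `Reason` line is identical in both cases, so the `Service` line is the one that distinguishes which process triggered the HA policy reset.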

Fabric Interconnects inexplicably rebooted today