Fabric Interconnects inexplicably rebooted today

mmedwid
Level 3

UCS suffered a major meltdown today when both fabric interconnects decided to reboot.  As far as I know, I have no HA policy set that would cause a fabric interconnect to reboot on its own.  The output below is from fabric interconnect A; a similar error appears on the other fabric interconnect.  My VAR is troubleshooting this, but I wanted to get some other brains on it because the impact was total.  VMware could not fail over properly, and domain controllers, SQL, pretty much everything was pooched.  I have attached the first couple hundred events from UCS.  VAR thinks I hit a bug and should just upgrade.

cucs-1-A(nxos)# show system reset-reason
----- reset reason for Supervisor-module 1 (from Supervisor in slot 1) ---
1) At 633300 usecs after Wed Apr  9 08:53:29 2014
    Reason: Reset triggered due to HA policy of Reset
    Service: monitor hap reset
    Version: 5.0(3)N2(2.1w)

 

 

Software
  BIOS:      version 3.5.0
  loader:    version N/A
  kickstart: version 5.0(3)N2(2.1w)
  system:    version 5.0(3)N2(2.1w)
  power-seq: Module 1: version v1.0
             Module 3: version v2.0
  uC:        version v1.2.0.1
  SFP uC:    Module 1: v1.0.0.0
  BIOS compile time:       02/03/2011
  kickstart image file is: bootflash:/installables/switch/ucs-6100-k9-kickstart.5.0.3.N2.2.1w.bin
  kickstart compile time:  2/3/2012 18:00:00 [02/03/2012 18:15:13]
  system image file is:    bootflash:/installables/switch/ucs-6100-k9-system.5.0.3.N2.2.1w.bin
  system compile time:     2/3/2012 18:00:00 [02/03/2012 20:16:06]


Hardware
  cisco UCS 6248 Series Fabric Interconnect ("O2 32X10GE/Modular Universal Platform Supervisor")
  Intel(R) Xeon(R) CPU         with 16622556 kB of memory.
  Processor Board ID FOC15485QRG

  Device name: hrk-cucs-1-A
  bootflash:   29535848 kB

Kernel uptime is 0 day(s), 6 hour(s), 35 minute(s), 12 second(s)

Last reset at 633300 usecs after  Wed Apr  9 08:53:29 2014

  Reason: Reset triggered due to HA policy of Reset
  System version: 5.0(3)N2(2.1w)
  Service: monitor hap reset

plugin
  Core Plugin, Ethernet Plugin, Fc Plugin, Virtualization Plugin

 

 


mmedwid
Level 3

Note that the UCS, including the fabric interconnects and the attached Nexus 5548s, has run without issue for two years.

Hi Mmedwid,

 

I believe your fabric interconnects are hitting the following defect: CSCug20103

6100 Reset after repeated span show command (HAP reset)

https://tools.cisco.com/bugsearch/bug/CSCug20103


The fix for this issue is to upgrade your UCS firmware to 2.0(5) or 2.1(2a).

 

Let me know if you have any questions and don't forget to rate useful posts.

The release notes for 2.0(5a) show:

CSCua91672

The fcoe_mgr hap reset will no longer cause FI reboot.

 

http://www.cisco.com/c/en/us/td/docs/unified_computing/ucs/release/notes/OL_25363.html

I am almost never on the Fabric Interconnects executing CLI commands, but a memory leak from some other cause is certainly a possibility. It is troubling that the platform's own High Availability process can reset both members of its High Availability pair.

Our monitoring of the UCS itself (which I believe reflects the active Fabric Interconnect) shows that memory usage never exceeded 20% over the last year.

Were there any core files generated?

 

You can check the memory usage of this process by running the following command repeatedly and monitoring the memory values. The value should not keep increasing over time.

'show monitor internal mem-stats detail | include ETH_SPAN_MEM_show'
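
Once you have noted the value from several runs of that command, the "should not keep increasing" check can be automated. Below is a minimal Python sketch of that check only. It assumes you have already collected the numbers yourself (the output format of the command is not shown in this thread, so no parsing is attempted); the function name `is_steadily_increasing` and the sample numbers are hypothetical.

```python
def is_steadily_increasing(samples, min_samples=4):
    """Return True if every sample is strictly greater than the one before it.

    samples: memory values noted from successive runs of the command,
             oldest first (hypothetical data, collected by hand or by script).
    min_samples: require a few readings before calling it a trend.
    """
    if len(samples) < min_samples:
        return False
    return all(later > earlier for earlier, later in zip(samples, samples[1:]))


# Hypothetical readings: the first series only ever grows, the second fluctuates.
growing = [1200, 1350, 1500, 1720, 1980]
fluctuating = [1200, 1350, 1210, 1340, 1205]

print(is_steadily_increasing(growing))
print(is_steadily_increasing(fluctuating))
```

A value that rises and falls is normal allocation churn; a value that only ever rises across many samples is the pattern worth reporting to TAC.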

robertdemay
Level 1

This just happened to one of my clusters after being online for exactly a year, although the FIs have been rebooted at least once during that time due to a FW update.

Reason: Reset triggered due to HA policy of Reset

  System version: 5.2(3)N2(2.23g)

  Service: fwm hap reset

 

Memory looks fine. We have over 14 clusters.  Anything I should be looking for?

Hi Robert,

 

Were any cores generated in UCSM?

 

From the reset reason that you provided, it doesn't look like it was triggered by the same service:

Service: fwm hap reset

I don't think this is the same issue.
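
With 14+ clusters, comparing reset reasons by eye gets tedious. Here is a rough Python sketch that pulls the `Reason`, `Service` and version fields out of pasted `show system reset-reason` text and compares the service names. The two sample strings are copied from the outputs posted in this thread; the function name `parse_reset_reason` is hypothetical, and the sketch assumes the fields appear as `Name: value` lines as they do above.

```python
import re

# Matches lines such as "  Service: fwm hap reset" or "  System version: 5.2(3)N2(2.23g)"
FIELD_RE = re.compile(r"\s*(Reason|Service|Version|System version):\s*(.+?)\s*$")


def parse_reset_reason(text):
    """Return a dict of the first Reason/Service/version fields found in the text."""
    fields = {}
    for line in text.splitlines():
        match = FIELD_RE.match(line)
        if match:
            # Keep only the first (most recent) occurrence of each field.
            fields.setdefault(match.group(1).lower(), match.group(2))
    return fields


# Reset reason from the original post.
FIRST = """
    Reason: Reset triggered due to HA policy of Reset
    Service: monitor hap reset
    Version: 5.0(3)N2(2.1w)
"""

# Reset reason from the later reply.
SECOND = """
Reason: Reset triggered due to HA policy of Reset
  System version: 5.2(3)N2(2.23g)
  Service: fwm hap reset
"""

first = parse_reset_reason(FIRST)
second = parse_reset_reason(SECOND)
same_service = first["service"] == second["service"]
print(first["service"], "|", second["service"], "| same service:", same_service)
```

The same `Reason` line with a different `Service` points to a different process crashing, which is why the two cases are likely separate defects.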
