
Fabric Interconnects hung with "Switchover in Progress"

We had a scheduled maintenance window this afternoon that offlined both network switches connected to the management interfaces of our UCS 6248 fabric interconnects (version 2.2(1c)). There was no impact to the production uplinks/portchannels. When connectivity was restored a few minutes later, we logged into the UCS fabric interconnects, and observed that they were in "switchover in progress" mode:

amsp01fi01-A(local-mgmt)# show cluster extended-state
Cluster Id: 0x7c837140c47711e3-0xb265002a6a99d481

Start time: Mon Apr 28 19:45:32 2014
Last election time: Wed May 28 17:17:37 2014

A: UP, PRIMARY, (Management services: SWITCHOVER IN PROGRESS)
B: UP, SUBORDINATE, (Management services: SWITCHOVER IN PROGRESS)

A: memb state UP, lead state PRIMARY, mgmt services state: INVALID
B: memb state UP, lead state SUBORDINATE, mgmt services state: INVALID
   heartbeat state PRIMARY_OK

INTERNAL NETWORK INTERFACES:
eth1, UP
eth2, UP

HA NOT READY
Management services: switchover in progress on local Fabric Interconnect
Detailed state of the device selected for HA storage:
Chassis 2, serial: FOX1749GX66, state: active
Chassis 10, serial: FOX1751GX3C, state: active
Chassis 12, serial: FOX1750RGLJ, state: active

Unfortunately, they've been in this mode for more than six hours now, with no apparent change in state. I've tried to force the issue via the "cluster lead a" and "cluster primary a" commands, but it appears that the cluster command is disabled:

amsp01fi01-A# connect local-mgmt
Cisco Nexus Operating System (NX-OS) Software
TAC support: http://www.cisco.com/tac
Copyright (c) 2009, Cisco Systems, Inc. All rights reserved.
The copyrights to certain works contained in this software are
owned by other third parties and used and distributed under
license. Certain components of this software are licensed under
the GNU General Public License (GPL) version 2.0 or the GNU
Lesser General Public License (LGPL) Version 2.1. A copy of each
such license is available at
http://www.opensource.org/licenses/gpl-2.0.php and
http://www.opensource.org/licenses/lgpl-2.1.php

amsp01fi01-A(local-mgmt)# cluster lead a
                          ^
% Invalid Command at '^' marker
amsp01fi01-A(local-mgmt)#

Fortunately, production services are not impacted; however, we cannot log in to UCSM via the web GUI, nor can we make any configuration changes. I'm hoping someone can suggest a minimally disruptive resolution.

Cheers,

Paul

1 Accepted Solution

Jason West
Level 1

We had the same issue and opened a TAC case. All we had to do was stop the service and restart it on both UCS fabric interconnects. Here is how we did it:

pmon stop

pmon start


8 Replies

jomartin
Cisco Employee

Hi. Unfortunately, you seem to be hitting defect CSCuh92027 (FI Switchover stuck after pulling out management cable from primary FI). The good news is that this issue is already resolved in the latest release, 2.2(2c).

As for a workaround: according to the Release Note Enclosure, restarting the Data Management Engine (DME) on the primary Fabric Interconnect will correct the issue. That being said, this is not something to take lightly. The DME is the "brains" of UCSM. I would not recommend restarting it during the middle of the day. We don't expect anything to happen, but it is better to be safe than sorry.

My suggestion is to open a case with TAC and do this with them. Ideally you'll have a backup of your UCS in case something goes badly wrong (not that I think it will, I'm just being extremely cautious).

Just encountered this issue - pmon stop/start fixed it and saved me from opening a TAC case.

Hi George - Did the pmon stop command cause an outage or disruption?

This does not appear to disrupt production services, and takes only a few seconds to stop and start.  In system scope. 

It will only interrupt the management plane, not the data plane.

I ran the pmon commands and they did not cause an outage.

Just run them on the subordinate FI first, then on the primary.

John Hibbs
Cisco Employee

Hi All,

TAC Server-Virtualization lead here. I really want to add some extra info on this topic.

While performing a pmon stop/start on the FI is at times needed as a workaround to a defect, it should only be done under direction from TAC. The expected behavior is that this should only impact the management plane and not the data plane. The reality is that if you have to run these commands, you're already in an unexpected and most likely unknown situation. It is very possible that whatever caused this situation to begin with could be made worse by running these commands. It is possible that one of the many processes does not respawn properly, and you could then be facing a full outage. This is why TAC needs to first investigate what's causing this, so we can avoid any potential outages.

Ultimately, do not run pmon stop/start unless TAC specifically tells you to, or it is the specific workaround to a known defect. If you think you need to run this command, call TAC first. Let us figure out the root cause of the process issue so we can fix it rather than just work around it.

John
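
For anyone who wants to watch for this condition from a script rather than by eye, here is a rough sketch that parses the text of "show cluster extended-state" as pasted in the original post and flags a stuck switchover. It only reads text that has already been captured; how you collect the output is up to you. The function names (parse_extended_state, is_stuck) are made up for illustration, and the "HA READY" string for the healthy case is an assumption, since only "HA NOT READY" appears in this thread.

```python
import re

# Matches summary lines such as:
#   "A: UP, PRIMARY, (Management services: SWITCHOVER IN PROGRESS)"
# It deliberately does not match the "A: memb state UP, ..." detail lines.
MEMBER_RE = re.compile(
    r"^\s*([AB]):\s*(\w+),\s*(\w+),\s*\(Management services:\s*([^)]+)\)"
)


def parse_extended_state(text):
    """Parse captured CLI text into per-fabric state plus HA readiness.

    Hypothetical helper for illustration; not part of any Cisco tooling.
    """
    members = {}
    ha_ready = None  # None means no HA line was found in the text
    for line in text.splitlines():
        match = MEMBER_RE.match(line)
        if match:
            fabric, memb, lead, mgmt = match.groups()
            members[fabric] = {"member": memb, "lead": lead, "mgmt": mgmt.strip()}
        elif line.strip() == "HA NOT READY":
            ha_ready = False
        elif line.strip() == "HA READY":  # assumed healthy-state string
            ha_ready = True
    return {"members": members, "ha_ready": ha_ready}


def is_stuck(state):
    """True if any fabric reports management services mid-switchover."""
    return any(
        m["mgmt"] == "SWITCHOVER IN PROGRESS" for m in state["members"].values()
    )


# Excerpt of the output from the original post.
SAMPLE = """\
A: UP, PRIMARY, (Management services: SWITCHOVER IN PROGRESS)
B: UP, SUBORDINATE, (Management services: SWITCHOVER IN PROGRESS)
A: memb state UP, lead state PRIMARY, mgmt services state: INVALID
B: memb state UP, lead state SUBORDINATE, mgmt services state: INVALID
HA NOT READY
"""

if __name__ == "__main__":
    state = parse_extended_state(SAMPLE)
    print(is_stuck(state), state["ha_ready"])
```

A single "switchover in progress" reading is normal during a real failover, so a check like this is only meaningful if the state persists across repeated polls (hours, in the original post). It detects the condition only; per the advice above, what to do about it is a conversation with TAC.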
