UCSM / HA broken after upgrade 2.0.1q -> 2.0.5c

Dusan Slivon · 06-28-2013

Hi,

I decided to upgrade UCS from version 2.0.1q to the brand-new 2.0.5c (on a 6120XP).

After disabling Call Home, I activated the new version of UCSM (adapters and CIMC will be updated later using a host firmware package).

Then something nasty happened... FIA suddenly rebooted. After the reboot I found that HA was broken and the management process(?) svc_sam_dme had crashed.

Fortunately, FIB took over from FIA and connectivity is OK, but ...

Then I forced FIB to be the primary node (cluster force primary) and rebooted FIA (twice).

FIA is now OK, but FIB has gone bad (after/while FIA rebooted).

And again, the problem is the same process.

SERVICE NAME         STATE     RETRY(MAX)  EXITCODE  SIGNAL  CORE
------------         -----     ----------  --------  ------  ----
svc_sam_controller   running   0(4)        0         0       no
svc_sam_dme          failed    7(4)        0         6       yes

...
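For anyone collecting this table from several FIs, here is a minimal Python sketch that picks out the services that are not running. It assumes the column layout shown in the header above (name, state, retry(max), exit code, signal, core) and uses only the two rows quoted in this post; `parse_services` and `failed_services` are hypothetical helper names, not part of any Cisco tool.

```python
import re

# Rows copied from the post; the column layout is an assumption based on the header.
SAMPLE = """\
SERVICE NAME STATE RETRY(MAX) EXITCODE SIGNAL CORE
------------ ----- ---------- -------- ------ ----
svc_sam_controller running 0(4) 0 0 no
svc_sam_dme failed 7(4) 0 6 yes
"""

# One data row: name, state, retry(max), exit code, signal, core flag.
ROW = re.compile(
    r"^(?P<name>\S+)\s+(?P<state>\S+)\s+(?P<retry>\d+)\((?P<max>\d+)\)\s+"
    r"(?P<exitcode>\d+)\s+(?P<signal>\d+)\s+(?P<core>yes|no)$"
)


def parse_services(text):
    """Return one dict per data row; header, separator and blank lines are skipped."""
    rows = []
    for line in text.splitlines():
        m = ROW.match(line.strip())
        if m is None:
            continue
        rows.append({
            "name": m.group("name"),
            "state": m.group("state"),
            "retry": int(m.group("retry")),
            "max_retry": int(m.group("max")),
            "exitcode": int(m.group("exitcode")),
            # Signal 6 is conventionally SIGABRT on Linux.
            "signal": int(m.group("signal")),
            "core": m.group("core") == "yes",
        })
    return rows


def failed_services(text):
    """Names of services whose state is anything other than 'running'."""
    return [r["name"] for r in parse_services(text) if r["state"] != "running"]


if __name__ == "__main__":
    for row in parse_services(SAMPLE):
        if row["state"] != "running":
            print(f'{row["name"]}: {row["state"]}, '
                  f'retries {row["retry"]}/{row["max_retry"]}, '
                  f'signal {row["signal"]}, core dumped: {row["core"]}')
```

On the rows above this flags only svc_sam_dme, which has used 7 retries against a maximum of 4 and left a core file.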

Now the status is:

UCS-A(local-mgmt)# show cluster extended-state
Cluster Id: 0x6af25c32d88b11e0-0x9b66547fee02d304

Start time: Fri Jun 28 17:51:42 2013
Last election time: Fri Jun 28 17:51:53 2013

A: UP, INAPPLICABLE
B: UP, INAPPLICABLE, (Management services: DOWN), FORCED-PRIMARY

A: memb state UP, lead state INAPPLICABLE, mgmt services state: UP
B: memb state UP, lead state INAPPLICABLE, mgmt services state: DOWN
heartbeat state PRIMARY_OK

INTERNAL NETWORK INTERFACES:
eth1, UP
eth2, UP

HA NOT READY
Management services are unresponsive on peer Fabric Interconnect
No device connected to this Fabric Interconnect

UCS-B(local-mgmt)# show cluster extended-state
Cluster Id: 0x6af25c32d88b11e0-0x9b66547fee02d304

Start time: Fri Jun 28 17:19:22 2013
Last election time: Fri Jun 28 19:51:51 2013

B: UP, INAPPLICABLE, (Management services: DOWN), FORCED-PRIMARY
A: UP, SUBORDINATE

B: memb state UP, lead state INAPPLICABLE, mgmt services state: DOWN
A: memb state UP, lead state SUBORDINATE, mgmt services state: UP
heartbeat state PRIMARY_OK

INTERNAL NETWORK INTERFACES:
eth1, UP
eth2, UP

HA NOT READY
Management services are unresponsive on local Fabric Interconnect
No device connected to this Fabric Interconnect
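The two outputs above are easier to compare once they are reduced to a few fields. The following Python sketch is one way to do that, assuming the line formats are exactly as pasted here (other releases may print them differently); `summarize` is a hypothetical helper name and the sample text is an abridged copy of the FIB output.

```python
import re

# Abridged copy of the FIB output from this post.
SAMPLE = """\
Cluster Id: 0x6af25c32d88b11e0-0x9b66547fee02d304

B: UP, INAPPLICABLE, (Management services: DOWN), FORCED-PRIMARY
A: UP, SUBORDINATE

B: memb state UP, lead state INAPPLICABLE, mgmt services state: DOWN
A: memb state UP, lead state SUBORDINATE, mgmt services state: UP
heartbeat state PRIMARY_OK

HA NOT READY
Management services are unresponsive on local Fabric Interconnect
No device connected to this Fabric Interconnect
"""

# Detailed per-FI line, e.g. "A: memb state UP, lead state SUBORDINATE, ..."
MEMBER = re.compile(
    r"^(?P<fi>[AB]): memb state (?P<memb>\w+), lead state (?P<lead>\w+), "
    r"mgmt services state: (?P<mgmt>\w+)$"
)


def summarize(text):
    """Extract per-FI states, the HA NOT READY banner, and the reasons listed under it."""
    lines = [ln.strip() for ln in text.splitlines()]
    members = {}
    ha_not_ready = False
    reasons = []
    for i, line in enumerate(lines):
        m = MEMBER.match(line)
        if m:
            members[m.group("fi")] = {
                "memb": m.group("memb"),
                "lead": m.group("lead"),
                "mgmt": m.group("mgmt"),
            }
        elif line == "HA NOT READY":
            ha_not_ready = True
            # Treat the non-blank lines directly after the banner as reasons.
            j = i + 1
            while j < len(lines) and lines[j] != "":
                reasons.append(lines[j])
                j += 1
    return {"members": members, "ha_not_ready": ha_not_ready, "reasons": reasons}


if __name__ == "__main__":
    info = summarize(SAMPLE)
    for fi, state in sorted(info["members"].items()):
        print(fi, state)
    if info["ha_not_ready"]:
        print("HA NOT READY:")
        for reason in info["reasons"]:
            print("  -", reason)
```

Run against the FIB output, this shows B with management services DOWN and lead state INAPPLICABLE, A as SUBORDINATE, and the two reasons printed under the HA NOT READY banner.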

Now I can't force FIA to become primary so that I can safely reboot FIB.

UCS-A(local-mgmt)# cluster force primary

Cluster Id: 0x6af25c32d88b11e0-0x9b66547fee02d304

request failed: cannot accept force command when election has successfully completed

UCS-B(local-mgmt)# cluster lead a

Cluster Id: 0x6af25c32d88b11e0-0x9b66547fee02d304

request failed: client not initialized

And I'm afraid of rebooting FIB now. I would probably lose blade connectivity.

What happens if I reboot the B interconnect (while it is forced primary) when the eth1/eth2 links are UP and the heartbeat is OK too?

Is there any way to manually start the "svc_sam_dme" process?

Both FIs are running the same version:

UCS-B(local-mgmt)# show version

System version: 2.0(5c)

Looks like a serious bug to me.

Any ideas?

Thanks.

Keny Perez · 02-16-2014

Dusan,

I am not sure if you edited this thread or something, but this seems to be something you posted 7 months ago. Do you still need help with this situation?

-Kenny

dtaylor1978 · 01-29-2016

I am having a similar issue. Could someone help me out with this? I can't call TAC.

Keny Perez · 01-29-2016

Console in to the "Inapplicable" FI and, if you are hard down, try rebooting the FI.

If possible, before the reboot, get the output of the show cluster extended-state command and attach it to this thread.

-Kenny

Dusan Slivon · 01-30-2016

Sorry guys, I forgot to update this old thread :(

I contacted support, and after a one-hour phone call we managed to solve this.

The cause of the problem was that I had uploaded a new (standalone) Capability Catalog BEFORE the FW upgrade (or was it just UCSM?), and there was a minor inconsistency between the bundled catalog and the newly uploaded one.

The upgrade process didn't check compatibility (this was fixed later), and after the reboot a few of the system processes were down, resulting in broken HA.

We had to manually remove the catalog from the FI filesystem and deal with the failing processes using some undocumented system/debug/admin shell. Sorry, I don't remember the details, but it was all done remotely.

Later, the support team confirmed this as an official bug, and it was fixed in later versions.

I'm pretty sure you can find it somewhere in the release notes from about 3 years ago (2.0.5d or above).

Can you try upgrading to a newer version?
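To check quickly whether a given FI is at or above a particular release, a small Python sketch like the one below can compare version strings. It assumes the two spellings used in this thread, `2.0(5c)` as printed by show version and `2.0.5c` as written informally; the 2.0(5d) threshold is only the approximate recollection mentioned above, not a confirmed fix release, and `parse_version` and `at_least` are hypothetical helper names.

```python
import re

# Accepts "2.0(5c)" (show version style) and "2.0.5c" (informal style).
VERSION = re.compile(r"^(\d+)\.(\d+)[.(](\d+)([a-z]*)\)?$")


def parse_version(text):
    """Turn a version string into a tuple that sorts in release order."""
    m = VERSION.match(text.strip())
    if m is None:
        raise ValueError(f"unrecognized version string: {text!r}")
    major, minor, patch, letter = m.groups()
    return (int(major), int(minor), int(patch), letter)


def at_least(version, minimum):
    """True if `version` is the same as or newer than `minimum`."""
    return parse_version(version) >= parse_version(minimum)


if __name__ == "__main__":
    running = "2.0(5c)"    # from "System version: 2.0(5c)" in this thread
    threshold = "2.0(5d)"  # assumption: taken from the "2.0.5d or above" remark
    print(running, "is at least", threshold, "->", at_least(running, threshold))
```

The letter suffix is compared as a plain string, which is enough for single-letter suffixes such as c and d.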