06-28-2013 02:44 PM - edited 03-01-2019 11:06 AM
Hi,
I decided to upgrade UCS from version 2.0.1q to the new 2.0.5c (on 6120XP).
After disabling Call Home, I activated the new version of UCSM (the adapters and CIMC will be updated later using a host firmware package).
Then something nasty happened: FIA suddenly rebooted. After the reboot I found that HA was broken and the management process(?) svc_sam_dme had crashed.
Fortunately, FIB took over from FIA and connectivity is OK, but ...
Then I forced FIB to be the primary node (cluster force primary) and rebooted FIA (twice).
FIA is now OK, but FIB has gone bad (during or after the FIA reboots).
And again, the problem is the same process.
SERVICE NAME         STATE     RETRY(MAX)  EXITCODE  SIGNAL  CORE
------------         -----     ----------  --------  ------  ----
svc_sam_controller   running   0(4)        0         0       no
svc_sam_dme          failed    7(4)        0         6       yes
...
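For anyone who wants to check this table from a script rather than by eye, here is a minimal Python sketch. It assumes each row has the seven whitespace-separated columns shown above; the function names and the embedded sample text are hypothetical and not part of any UCS tooling.

```python
import re

# Sample rows copied from the service table above.
SAMPLE = """\
svc_sam_controller running 0(4) 0 0 no
svc_sam_dme failed 7(4) 0 6 yes
"""

# name, state, retry(max), exitcode, signal, core
ROW = re.compile(r"^(\S+)\s+(\S+)\s+(\d+)\((\d+)\)\s+(\d+)\s+(\d+)\s+(\S+)$")

def parse_services(text):
    """Return one dict per table row; header and separator lines are skipped."""
    rows = []
    for line in text.splitlines():
        m = ROW.match(line.strip())
        if not m:
            continue
        name, state, retry, max_retry, exitcode, signal, core = m.groups()
        rows.append({
            "name": name,
            "state": state,
            "retry": int(retry),
            "max_retry": int(max_retry),
            "exitcode": int(exitcode),
            "signal": int(signal),  # on Linux, signal 6 is SIGABRT
            "core": core == "yes",
        })
    return rows

def failed_services(rows):
    """Names of services that are not in the 'running' state."""
    return [r["name"] for r in rows if r["state"] != "running"]

if __name__ == "__main__":
    print(failed_services(parse_services(SAMPLE)))
```

With the rows above, only svc_sam_dme is reported, with a retry count above its maximum and a core dump present.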
Now the status is:
UCS-A(local-mgmt)# show cluster extended-state
Cluster Id: 0x6af25c32d88b11e0-0x9b66547fee02d304
Start time: Fri Jun 28 17:51:42 2013
Last election time: Fri Jun 28 17:51:53 2013
A: UP, INAPPLICABLE
B: UP, INAPPLICABLE, (Management services: DOWN), FORCED-PRIMARY
A: memb state UP, lead state INAPPLICABLE, mgmt services state: UP
B: memb state UP, lead state INAPPLICABLE, mgmt services state: DOWN
heartbeat state PRIMARY_OK
INTERNAL NETWORK INTERFACES:
eth1, UP
eth2, UP
HA NOT READY
Management services are unresponsive on peer Fabric Interconnect
No device connected to this Fabric Interconnect
UCS-B(local-mgmt)# show cluster extended-state
Cluster Id: 0x6af25c32d88b11e0-0x9b66547fee02d304
Start time: Fri Jun 28 17:19:22 2013
Last election time: Fri Jun 28 19:51:51 2013
B: UP, INAPPLICABLE, (Management services: DOWN), FORCED-PRIMARY
A: UP, SUBORDINATE
B: memb state UP, lead state INAPPLICABLE, mgmt services state: DOWN
A: memb state UP, lead state SUBORDINATE, mgmt services state: UP
heartbeat state PRIMARY_OK
INTERNAL NETWORK INTERFACES:
eth1, UP
eth2, UP
HA NOT READY
Management services are unresponsive on local Fabric Interconnect
No device connected to this Fabric Interconnect
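The same idea works for the cluster state. This is a rough Python sketch, assuming the per-node lines and the "HA NOT READY" line look exactly like the output pasted above; the function names are made up for illustration and the parser only understands those two kinds of lines.

```python
import re

# Sample lines copied from the FI-B output above.
SAMPLE = """\
B: UP, INAPPLICABLE, (Management services: DOWN), FORCED-PRIMARY
A: UP, SUBORDINATE
B: memb state UP, lead state INAPPLICABLE, mgmt services state: DOWN
A: memb state UP, lead state SUBORDINATE, mgmt services state: UP
heartbeat state PRIMARY_OK
HA NOT READY
Management services are unresponsive on local Fabric Interconnect
"""

NODE = re.compile(
    r"^([AB]): memb state (\w+), lead state (\w+), mgmt services state: (\w+)$"
)

def parse_extended_state(text):
    """Return (per-node state dict, True if 'HA NOT READY' was printed)."""
    nodes = {}
    ha_not_ready = False
    for raw in text.splitlines():
        line = raw.strip()
        m = NODE.match(line)
        if m:
            fabric, memb, lead, mgmt = m.groups()
            nodes[fabric] = {"memb": memb, "lead": lead, "mgmt": mgmt}
        elif line == "HA NOT READY":
            ha_not_ready = True
    return nodes, ha_not_ready

def nodes_with_mgmt_down(nodes):
    """Fabric IDs whose management services are not UP."""
    return sorted(f for f, s in nodes.items() if s["mgmt"] != "UP")

if __name__ == "__main__":
    nodes, not_ready = parse_extended_state(SAMPLE)
    print(nodes_with_mgmt_down(nodes), not_ready)
```

It only reports what the CLI already printed, so it is useful for logging or alerting, not for deciding whether a reboot is safe.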
Now I can't force FIA to become primary so that I can safely reboot FIB.
UCS-A(local-mgmt)# cluster force primary
Cluster Id: 0x6af25c32d88b11e0-0x9b66547fee02d304
request failed: cannot accept force command when election has successfully completed
UCS-B(local-mgmt)# cluster lead a
Cluster Id: 0x6af25c32d88b11e0-0x9b66547fee02d304
request failed: client not initialized
And I'm afraid to reboot FIB now. I would probably lose blade connectivity.
What happens if I reboot the B interconnect while it is forced primary, given that the eth1 and eth2 links are UP and the heartbeat is OK too?
Is there any way to manually start the "svc_sam_dme" process?
Both FIs are running the same version:
UCS-B(local-mgmt)# show version
System version: 2.0(5c)
Looks like a serious bug to me.
Any ideas?
Thanks.
02-16-2014 11:41 AM
Dusan,
I am not sure if you edited this thread or something, but this seems to be something you posted 7 MONTHS ago???? Do you really need help with this situation????
-Kenny
01-29-2016 03:43 AM
01-29-2016 11:58 AM
Console in to the "INAPPLICABLE" FI and, if you are hard down, try a reboot of the FI.
If possible, before the reboot, get the output of the show cluster ext command and attach it to this thread.
-Kenny
01-30-2016 11:51 AM
Sorry guys, I forgot to update this old thread :(
I contacted support, and after a one-hour phone call we managed to solve this.
The cause was that I had uploaded a new (standalone) Capability Catalog BEFORE the FW upgrade (or was it just UCSM?), and there was a minor inconsistency between the bundled catalog and the newly uploaded one.
The upgrade process didn't check compatibility (this was fixed later), and after the reboot a few of the system processes were down, resulting in broken HA.
We had to manually remove the catalog from the FI filesystem and deal with the failing processes using some undocumented system/debug/admin shell. Sorry, I don't remember the details, but it was all done remotely.
Later, the support team confirmed this as an official bug, and it was fixed in later versions.
I'm pretty sure you can find it somewhere in the release notes from about 3 years ago (2.0.5d or above).
Can you try upgrading to a newer version?