Today I tested the HA functionality.
We have two FIs and two chassis in a stretched cluster. When I switched off the power of chassis 2 and the primary FI-B, the subordinate FI-A didn't switch to primary. UCSM was not reachable because the primary FI was down.
Chassis 1 and FI-A were online. I had to connect to the FI-A CLI and force primary. After I did that and FI-A became primary, UCSM was online again.
Can someone explain to me what happened and how I can fix this?
I read something about quorum chassis: if there is an even number of chassis, as in my case (2 chassis), one chassis will not be designated as a quorum chassis. So only an odd number of quorum chassis ever participates in the HA cluster.
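To make the odd-number rule concrete, here is a tiny Python sketch. It only encodes the rule as described above (one chassis is left out when the total is even); the function name is made up for illustration, and it says nothing about which chassis UCS actually leaves out.

```python
def quorum_chassis_count(total_chassis: int) -> int:
    """Return how many chassis would take part in the quorum.

    Assumption (taken from the description above, not from Cisco code):
    with an even number of chassis exactly one is left out, so the
    quorum size is always odd.
    """
    if total_chassis <= 0:
        return 0
    if total_chassis % 2 == 1:
        return total_chassis
    return total_chassis - 1


for n in (1, 2, 3, 4):
    print(n, "chassis ->", quorum_chassis_count(n), "quorum chassis")
```

With two chassis this gives a single quorum chassis, which is the situation in my setup.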
If I show the cluster extended-state of my FI, both chassis are listed as active. So are both of them quorum chassis? Where can I find out which chassis' SEEPROM is used for HA?
Thanks and kind regards
Which version of UCS?
Before the test, did you run "show cluster state" and "show cluster extended-state" in the CLI on each FI?
And after the power-off, the same on the surviving FI?
Also, do you have a management interface monitoring policy?
Management Interfaces Monitoring Policy
This policy defines how the mgmt0 Ethernet interface on the fabric interconnect should be monitored. If Cisco UCS detects a management interface failure, a failure report is generated. If the configured number of failure reports is reached, the system assumes that the management interface is unavailable and generates a fault. By default, the management interfaces monitoring policy is disabled.
If the affected management interface belongs to a fabric interconnect which is the managing instance, Cisco UCS confirms that the subordinate fabric interconnect's status is up, that there are no current failure reports logged against it, and then modifies the managing instance for the endpoints.
If the affected fabric interconnect is currently the primary inside of a high availability setup, a failover of the management plane is triggered. The data plane is not affected by this failover.
You can set the following properties related to monitoring the management interface:
• Type of mechanism used to monitor the management interface.
• Interval at which the management interface's status is monitored.
• Maximum number of monitoring attempts that can fail before the system assumes that the management is unavailable and generates a fault message.
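The counting behaviour described in that excerpt can be sketched roughly as follows. This is only an illustration in Python, not how UCS Manager implements it: the class and method names are invented, the reset of the counter after a successful check is an assumption, and the detection-time estimate ignores per-attempt timeouts.

```python
from dataclasses import dataclass


@dataclass
class MgmtInterfaceMonitor:
    """Toy model of a management interface monitoring policy."""

    max_failures: int        # failed attempts tolerated before a fault is raised
    interval_seconds: int    # time between two monitoring attempts
    failure_reports: int = 0
    fault_raised: bool = False

    def record_check(self, reachable: bool) -> bool:
        """Record one monitoring attempt and return whether a fault is active."""
        if reachable:
            # Assumption: a successful check clears earlier failure reports.
            self.failure_reports = 0
            self.fault_raised = False
        else:
            self.failure_reports += 1
            if self.failure_reports >= self.max_failures:
                self.fault_raised = True
        return self.fault_raised

    def approx_detection_seconds(self) -> int:
        """Rough time until a fault: failures needed times the interval."""
        return self.max_failures * self.interval_seconds


monitor = MgmtInterfaceMonitor(max_failures=3, interval_seconds=90)
for ok in (True, False, False, False):
    print("reachable:", ok, "-> fault:", monitor.record_check(ok))
print("approx. detection time:", monitor.approx_detection_seconds(), "s")
```

With example values of 3 failures and a 90 second interval, the fault only appears on the third consecutive failed check, so a management-plane failover triggered by this policy would take several minutes rather than seconds.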
I am using version 2.2.1(c).
Mgmt Int. Monitoring Policy is enabled; Ping Gateway, 90 sec, 3 faults.
I tested again with several scenarios and logged the cluster states.
HA works if chassis 2 stays online, or if it is shut down before the primary FI loses power. HA fails only if the primary FI and chassis 2 lose power at the same time.
So I think that, as long as both chassis are online, chassis 2 is the only quorum chassis. If the quorum chassis and the primary FI fail at the same time, HA does not work, because the cluster state then shows chassis 2 as: state active with errors.
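My hypothesis can be written down as a small Python model. This is purely a sketch of what I think is happening, not Cisco's actual algorithm: the majority rule, the function name, and the idea that the quorum moves to chassis 1 once chassis 2 has been shut down cleanly are all assumptions on my side.

```python
def can_take_over(quorum: set, readable: set) -> bool:
    """Assumed rule: the subordinate FI may become primary only if it can
    still read a majority of the quorum chassis."""
    if not quorum:
        return False
    return len(quorum & readable) > len(quorum) / 2


# (description, quorum chassis at failure time, chassis FI-A can still read)
scenarios = [
    ("primary FI fails, chassis 2 stays online", {2}, {1, 2}),
    # Assumption: after chassis 2 is shut down first, the quorum moves to chassis 1.
    ("chassis 2 shut down first, then primary FI fails", {1}, {1}),
    # No time to move the quorum, so the only quorum chassis is gone.
    ("primary FI and chassis 2 fail at the same time", {2}, {1}),
]

for description, quorum, readable in scenarios:
    result = "failover OK" if can_take_over(quorum, readable) else "no failover"
    print(f"{description}: {result}")
```

Under these assumptions only the third case ends without a failover, which matches what I saw in my tests.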
This must be a bug! I would immediately raise a TAC case.
I wouldn't be surprised if this special case has never been tested.
What is the probability that the primary FI AND the quorum chassis fail at the same time?
No excuse! You are 100% right. I hope this will not be a business-critical issue.