03-18-2014 10:03 AM - edited 03-01-2019 11:35 AM
Hi there,
today I tested the HA functionality.
We have two FIs and two chassis in a stretched cluster. When I switched off the power to chassis 2 and to the primary FI-B, the subordinate FI-A did not switch to primary. UCSM was not reachable, because the primary FI was down.
Chassis 1 and FI-A were still online. I had to connect to the FI-A CLI and force it to primary. After I did that and FI-A had become primary, UCSM was reachable again.
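For reference, the sequence on FI-A was roughly the following (the hostname in the prompt is just a placeholder):

UCS-A# connect local-mgmt
UCS-A(local-mgmt)# cluster force primary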
Can someone explain to me what happened and how I can fix this?
I read something about quorum chassis: if there is an even number of chassis, as in my case (2 chassis), one chassis will not be designated as a quorum chassis, so that only an odd number of quorum chassis participates in the HA cluster.
If I show the cluster extended-state on my FIs, both chassis are listed as active. So are both quorum devices? Where can I see which chassis's SEEPROM is used for HA?
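The part of the output I am referring to looks roughly like this (serial numbers redacted, not the complete output):

UCS-A# show cluster extended-state
...
A: UP, PRIMARY
B: UP, SUBORDINATE
...
HA READY
Detailed state of the device selected for HA storage:
Chassis 1, serial: xxxxxxxxxx, state: active
Chassis 2, serial: xxxxxxxxxx, state: active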
Thanks and kind regards
Danny
03-18-2014 10:20 AM
Which version of UCS?
Before the test, did you run the cluster-state commands in the CLI on each FI (see below)?
And after the power-off, the same on the surviving FI?
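I mean something like this (the prompt name is just an example):

UCS-A# show cluster state
UCS-A# show cluster extended-state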
Also, do you have a management interface monitoring policy configured?
-------------------------------------------------------------------------------------------
Management Interfaces Monitoring Policy
This policy defines how the mgmt0 Ethernet interface on the fabric interconnect should be monitored. If Cisco UCS detects a management interface failure, a failure report is generated. If the configured number of failure reports is reached, the system assumes that the management interface is unavailable and generates a fault. By default, the management interfaces monitoring policy is disabled.
If the affected management interface belongs to a fabric interconnect which is the managing instance, Cisco UCS confirms that the subordinate fabric interconnect's status is up, that there are no current failure reports logged against it, and then modifies the managing instance for the endpoints.
If the affected fabric interconnect is currently the primary inside of a high availability setup, a failover of the management plane is triggered. The data plane is not affected by this failover.
You can set the following properties related to monitoring the management interface:
• Type of mechanism used to monitor the management interface.
• Interval at which the management interface's status is monitored.
• Maximum number of monitoring attempts that can fail before the system assumes that the management interface is unavailable and generates a fault message.
03-19-2014 03:39 AM
I am using version 2.2(1c).
The Mgmt Interface Monitoring Policy is enabled: Ping Gateway, 90 sec, 3 faults.
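In the UCSM CLI, I believe those settings correspond to roughly the following (option names from memory, they may differ slightly between releases, so please check the CLI configuration guide for your version):

UCS-A# scope monitoring
UCS-A /monitoring # set mgmt-if-mon-policy admin-state enabled
UCS-A /monitoring # set mgmt-if-mon-policy monitor-mechanism ping-gateway
UCS-A /monitoring # set mgmt-if-mon-policy poll-interval 90
UCS-A /monitoring # set mgmt-if-mon-policy max-fail-reports 3
UCS-A /monitoring # commit-buffer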
...
I tested again with several scenarios and logged the cluster states.
HA works if chassis 2 is online, or if it is shut down before the primary FI loses power. Only if the primary FI and chassis 2 lose power at the same time does HA fail.
So I think that while both chassis are online, chassis 2 is the only quorum chassis. If the quorum chassis and the primary FI fail at the same time, HA does not work, because the cluster state then shows chassis 2 as 'state: active with errors'.
Right?
03-19-2014 07:22 AM
This must be a bug! I would immediately raise a TAC case.
I wouldn't be surprised if this special case hasn't been tested.
What is the probability that the primary FI AND the quorum chassis fail at the same time?
No excuse! You are 100% right. I hope this will not be a business-critical issue.