12-13-2019 09:08 PM - edited 12-13-2019 10:58 PM
Hi Community,
I recently tested a case where 3 APICs were discovered correctly, then we shut down the whole ACI fabric and powered on half of the leaves and spines with one APIC (node ID 2).
If I understood correctly, 3 APICs are only required for HA and to avoid a split-brain scenario (ACI configs are segregated into shards; each shard is copied into 3 replicas held by the 3 APICs, with one APIC being the master of a given shard). So, if there's only a single APIC, it would be the master of all shards and I should still be able to configure things.
But when half of the ACI fabric was powered on, I wasn't able to configure anything. The APIC status was "Data Layer Partially Diverged", and whenever I tried to configure anything, I got the error "The messaging layer was unable to deliver the stimulus".
So what is the point of the cluster then? It's a rare case, but if two of my APICs suddenly failed, I wouldn't be able to configure anything, even though the network still operates normally. The surviving APIC would just be waiting for its own failure as well.
The APIC model is APIC-SERVER-M3 (UCS C220-M5), with Leaves being YC-EX and GC-FXP, Spines being 9332C
12-14-2019 05:21 AM
Hi @tuanquangnguyen,
While the redundancy you describe is correct, you need a quorum (2 of the 3 APICs) to actually make changes. With one APIC (without that "majority"), the fabric is in Read Only mode: it is still able to pass traffic, but changes cannot be made. If you power on one more APIC along with the half of the fabric you already have up, you should be operational again.
That's why you see Multi-Pod designs with a "spare" APIC at the site that has only one production APIC, so that if you lose the site with two APICs, you can promote that "spare" APIC and get back the ability to make changes.
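To make the majority rule concrete, here is a minimal Python sketch. It is only an illustration of the quorum arithmetic, not APIC code: the function names `has_quorum` and `shard_mode` are hypothetical, and it assumes each shard has 3 replicas, as described in the original question.

```python
def has_quorum(replicas_up: int, replicas_total: int = 3) -> bool:
    """Return True when a strict majority of a shard's replicas is reachable."""
    return replicas_up > replicas_total // 2


def shard_mode(replicas_up: int, replicas_total: int = 3) -> str:
    """Hypothetical helper: a shard is writable only while it holds a majority."""
    return "read-write" if has_quorum(replicas_up, replicas_total) else "read-only"


# Walk through 0..3 reachable replicas of a 3-replica shard.
for up in range(4):
    print(f"{up} of 3 replicas reachable -> {shard_mode(up)}")
```

With 1 of 3 replicas reachable the sketch reports read-only, which matches the single-APIC test above; with 2 of 3 it reports read-write, which is why powering on a second APIC restores the ability to make changes.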
12-14-2019 11:18 PM
Hi @Claudia de Luna,
Thanks for your response. I also read up on this topic in a Cisco Live slide deck, and will try turning on two APICs instead of one next time.
12-14-2019 11:22 PM
Yes, good information there.
The Multi-Pod White Paper has an entire chapter on this, and it's good information as well.
https://www.cisco.com/c/en/us/solutions/collateral/data-center-virtualization/application-centric-infrastructure/white-paper-c11-737855.html#APICClusterDeploymentConsiderations
12-14-2019 12:06 PM - edited 12-14-2019 12:07 PM
One possible solution is to have 4 APICs: 3 active and 1 standby. When half of the fabric goes down (assuming one data centre), the 4th APIC can be promoted to active and the fabric will be fully functional.
That is not theory; it is a fully functioning real-life installation.