Multi-pod and APIC distribution

Antonio Macia
Level 3

Hi,

 

According to the Multi-Pod white paper, it is recommended to have 4 APICs in a two-site Multi-Pod topology, where each site has 2 APICs and one of them is a standby APIC. It also says: "It is important to reiterate the point that the standby node should be activated only when Pod1 is affected by a major downtime event." What are the real implications? My customer performs disaster recovery tests at the main datacenter that would require promoting the standby APIC in the secondary DC to active in order to obtain write access and apply network changes while in that state.

 

What are the implications (drawbacks) of promoting the standby APIC? What would be the procedure to bring the 4th APIC back to standby mode? 

 

Thanks.

Accepted Solution

Claudia de Luna
Spotlight

Hi @Antonio Macia,

It's good to see customers executing real failure scenarios! The repeated warnings around activating the standby APIC are about corrupting the cluster or the shards (the database "slices" that make up the configuration of your fabric). This can happen if you promote your standby APIC when your failure scenario really only took down the IPN (and not the actual data center with the two APICs): once the IPN is restored you are in a "split-brain" situation, which is even worse if changes have been made on one side or the other in the meantime.
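To make the split-brain risk concrete, here is a small illustrative sketch (not Cisco code, just the majority rule that APIC clustering is built on) showing why promoting the standby during an IPN-only failure is dangerous:

```python
# Illustrative sketch: APIC shard leadership requires a strict majority
# of the configured cluster size (3 here). Not an actual APIC API.
def has_quorum(active_in_partition: int, cluster_size: int = 3) -> bool:
    """A partition is read-write only if it holds a strict majority."""
    return active_in_partition > cluster_size // 2

# IPN down: Pod1 (APIC1, APIC2) and Pod2 (APIC3) can't see each other.
pod1, pod2 = 2, 1
print(has_quorum(pod1))   # True  -> Pod1 keeps read-write, as designed
print(has_quorum(pod2))   # False -> Pod2 is read-only: safe

# Now promote the Pod2 standby while Pod1 is actually still alive:
pod2 += 1
print(has_quorum(pod2))   # True -> BOTH partitions now believe they
                          # hold a writable majority: split brain
```

This is why the white paper insists the standby be activated only when Pod1 is genuinely hard down, not merely unreachable.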

What your client wants to do is certainly possible, but it must be done with care.

The only drawback of promoting your standby APIC is if you don't follow the process exactly. It's also a time-consuming process, and you need to make sure you always have CIMC access to all of your APICs or be ready to head into the DCs :D (you may be there already for the failure testing, but it's a good idea to confirm all of that is working nonetheless). Also, take a backup first, just in case.

 

  1. DC1, with APICs 1 and 2 of a 3-APIC cluster, is hard down.
  2. DC2, with APIC3 of the cluster plus a standby APIC intended to replace APIC2, is still active.
  3. You decommission DC1 APIC2, which gets wiped if all goes well.
  4. You promote (replace) the DC2 standby APIC to become the "new" APIC2 and restore your quorum at DC2.
  5. Eventually DC1 comes back up, so your cluster is the original DC1 APIC1, the new DC2 APIC2, and the original DC2 APIC3, with a wiped original DC1 APIC2.

At this point you can configure the original DC1 APIC2 as a standby so it gets a copy of the data. Then you basically repeat the process above: promote it back to being the original APIC2 (which wipes the "new" APIC2), and finally configure the original standby as a standby again.
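The membership changes in the steps above can be sketched as a simple state walk-through (again purely illustrative; the APIC names here are made up for the example, and the real work is done through the APIC GUI/CLI decommission and replace operations):

```python
# Illustrative sketch of cluster membership through the procedure.
# Slot numbers are APIC IDs; names are hypothetical labels.
cluster = {1: "DC1-APIC1", 2: "DC1-APIC2", 3: "DC2-APIC3"}
standby = "DC2-Standby"

# DC1 is hard down: decommission the unreachable APIC2 (it will be
# wiped when DC1 recovers) and replace it with the DC2 standby.
wiped = cluster[2]
cluster[2] = standby
print(list(cluster.values()))
# ['DC1-APIC1', 'DC2-Standby', 'DC2-APIC3']  -> quorum restored at DC2

# When DC1 returns, the wiped original APIC2 is configured as the new
# standby, and the same replace step later restores the original layout.
standby = wiped
cluster[2], standby = standby, cluster[2]
print(list(cluster.values()))
# ['DC1-APIC1', 'DC1-APIC2', 'DC2-APIC3']  -> back to the original cluster
```

The key point the sketch captures is that there is always exactly one active member per APIC ID, and every promotion implies a matching wipe-and-standby on the member it displaces.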

 

Here is a good write-up by Valter Popeskic:
https://howdoesinternetwork.com/2019/aci-multipod-enable-standby-apic


2 Replies

Thanks for your reply, Claudia. Valter's post clears things up; I now understand the procedure.

Regards.
