
ACI Upgrade Preparation Best Practice and Troubleshooting


An ACI upgrade involves both the APIC software update and the switch update. The switch upgrade is usually very straightforward; however, the APIC upgrade can run into cluster issues. Here is the pre-check list we usually recommend customers work through before an upgrade is started.

Preparations for APICs Before Upgrade:

0. Clear all faults and overlapping VLAN blocks

Faults in the ACI fabric indicate invalid or conflicting policies, or even disconnected interfaces. Understand the trigger for each fault and clear them before kicking off the upgrade. Be aware that conflicting policies such as "encap already in use" or "Routed port is in L2 mode" can result in an unexpected outage, because a switch upgrade fetches all policies from the APIC from scratch and follows "first come, first served" behavior. As a result, the unexpected policies could very well win over the intended ones. A quick way to review outstanding faults from the CLI is shown below.
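As a rough starting point (a minimal sketch assuming default moquery output formatting), you can dump all active fault instances from any APIC and scan the codes and descriptions; the same information is available in the GUI under System > Faults:

apic1# moquery -c faultInst | egrep "^(code|severity|descr|dn)" | less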

Overlapping VLAN blocks across different VLAN pools can result in intermittent packet drops as well as spanning-tree loops caused by BPDU drops; please refer to this document: https://supportforums.cisco.com/t5/data-center-documents/overlap-vlan-pool-lead-intermittent-packet-drop-to-vpc-endpoints/ta-p/3211107 . The impact of overlapping VLAN pools can become more pronounced after an upgrade, since all policies are fetched from scratch. A quick way to list the configured VLAN blocks is shown below.
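One way to eyeball overlaps (again a sketch assuming default moquery formatting) is to list every encap block together with its range; each block's dn includes the VLAN pool it belongs to, so the same range appearing under different pools is a candidate for cleanup:

apic1# moquery -c fvnsEncapBlk | egrep "^(dn|from|to)"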

1. Make sure the upgrade path is supported. Data conversion is involved during the upgrade; following a supported upgrade path ensures the database is converted properly. A quick check of the currently running versions is sketched below.

***Very important: read the release notes of the target APIC version.
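Before validating the path, confirm the version each controller is currently running and compare it against the upgrade paths listed in the target release notes (commands assume a recent APIC CLI; the output layout varies by release):

apic1# show version
apic1# acidiag avread | grep -o "version=[^ ]*"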

2. Back up the APIC configuration to an external server. If we ever have to re-import the configuration, this backup is the only data from which we can restore the same configuration. If backup encryption is enabled, make sure the encryption key is saved; otherwise, none of the passwords, including the admin password, will be imported properly, and we will have to reset the admin password from the CLI (via local console or USB access). A quick sanity check of the exported archive is shown below.
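After the export completes, it is worth confirming on the external server that the archive actually landed and is not zero bytes (the directory and filename below are only placeholders; use whatever remote path your export policy points to):

backup-server$ ls -lh /srv/aci-backups/
backup-server$ tar -tzf /srv/aci-backups/<exported-config>.tar.gz | head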

3. Make sure the CIMC of each APIC is accessible. This is to avoid two risks:

a. CIMC 1.5(4e) has a memory leak defect which can prevent the impacted APIC (usually APIC2 and above) from kicking off the upgrade. It can also cause process crashes on APIC1 after the upgrade. You can tell the CIMC has reached this bad state if it becomes unreachable from both the GUI and SSH; it is very important to recover it by resetting the CIMC: disconnect the server's power cords, wait for 3 minutes, and reconnect them. Upgrading the CIMC before the APIC upgrade is highly recommended: https://www.cisco.com/c/en/us/td/docs/switches/datacenter/aci/apic/sw/recommended-release/b_Recommended_Cisco_ACI_Releases.html

b. Without CIMC access, we will not be able to reach the APIC console remotely if something goes wrong, so getting all of this access ready before the upgrade is critical. A basic reachability check is shown below.
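A minimal reachability check from a management host (the CIMC address below is a placeholder, and the user is typically admin) is simply to confirm each CIMC answers on its management IP and that you can log in; from the CIMC session you can also verify the running CIMC firmware and launch the KVM console if needed:

mgmt-host$ ping -c 3 <cimc-ip-of-apic1>
mgmt-host$ ssh admin@<cimc-ip-of-apic1>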

4. Make sure the appliance element (AE) process is not locked by the IPMI defect

We have seen a few cases where a CentOS defect (related to IPMI) locks the AE thread. AE (appliance element) is in charge of calling the upgrade utility (installer.py); if AE is locked, the upgrade will not kick in. We can confirm whether AE is impacted by the IPMI issue from the CLI:

 grep "ipmi" /var/log/dme/log/svc_ifc_ae.bin.log | tail -5

If there is no hit in the IPMI output, or the last IPMI query to the chassis is more than 10 seconds older than the current system time (obtained with date), you may want to reboot the APIC OS before triggering the upgrade. Please do not reboot two or more APICs at the same time.
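A minimal way to do the comparison (assuming the log lines carry timestamps in the usual APIC log format) is to print the newest IPMI entry next to the current time and check the gap by eye:

apic1# grep "ipmi" /var/log/dme/log/svc_ifc_ae.bin.log | tail -1
apic1# date

If the gap is well beyond 10 seconds, treat AE as stuck and plan a reboot of that APIC, one controller at a time.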

5. Make sure NTP servers are reachable

This avoids hitting a known issue which may leave APIC2 and APIC3 stuck in a waiting state. Details can be found in the troubleshooting case study below.
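A quick check (the exact command set varies by release, so treat this as a sketch) is to confirm each APIC sees its configured NTP servers as reachable and synchronized, and to repeat the check on every controller before starting:

apic1# show ntp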

6. Review the behavior changes in the new version and evaluate the potential impact. One example: if route control enforcement (for an L3Out) was turned on for OSPF before ACI version 2.0 (the option existed for BGP and was not greyed out for OSPF), it starts taking effect as soon as the leaf is upgraded to 2.0, so all OSPF routes are filtered by the L3Out, which can cause an outage. A quick way to audit this setting is shown below.
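To audit which L3Outs have import route control enforcement configured (a sketch assuming default moquery formatting; enforceRtctrl is the attribute on l3extOut that carries this setting), list each L3Out with its enforcement value and take a closer look at any entry that includes import:

apic1# moquery -c l3extOut | egrep "^(name|dn|enforceRtctrl)"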

7. Stage the upgrade in a lab before applying the change in production. It is always good to get familiar with the newer version by upgrading the lab first and running at least a minimal test of the applications.

 

Preparations for Switches Before Upgrade:

1. Place vPC/redundant pairs into different maintenance groups.

From a certain version onward, the APIC will not allow both members of a vPC pair to upgrade at the same time; still, it is best practice to put vPC pairs into different maintenance groups. Switches that back each other up without forming a vPC pair, such as border leaf switches, also need to be placed in different groups, so that only one member reboots while the other remains online. The sketch below shows one way to review the current grouping.
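To review how nodes are currently grouped (a sketch assuming default moquery output; maintenance groups are maintMaintGrp objects and their node ID ranges are fabricNodeBlk children), list the groups and the node blocks and confirm that the two members of each vPC or redundant pair land in different groups:

apic1# moquery -c maintMaintGrp | egrep "^(name|dn)"
apic1# moquery -c fabricNodeBlk | egrep "^(dn|from_|to_)"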

 

Troubleshooting Upgrade:

In case the upgrade fails and troubleshooting is required, always start with APIC1. If APIC1 has not finished upgrading, do not touch APIC2. If APIC1 is done but APIC2 has not completed, do not touch APIC3. Violating this rule can break the cluster database and force a cluster rebuild. The quick check below helps confirm where each APIC stands before deciding what to touch.
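One quick way to see which controllers have actually moved to the target version (using the same acidiag output relied on later in this article; the exact field layout varies by release, so adjust the grep as needed) is to pull the version strings recorded for the APICs:

apic1# acidiag avread | grep -o "version=[^ ]*"

Read the entries in order of APIC ID and only move on to the next controller once the previous one reports the new version.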

  

1. APIC2 or above stuck at 75% even though APIC1 has completed

This problem can happen because APIC1's upgraded version information has not been propagated to APIC2 or above. Be aware that svc_ifc_appliance_director is in charge of syncing version information between APICs and storing it in a framework that the upgrade utility (and other processes) can read.

First, make sure APIC1 can ping the rest of the APICs; this determines whether we need to troubleshoot from the leaf switch or continue on the APIC itself. If APIC1 cannot ping APIC2, you may want to call TAC to troubleshoot the switch. If APIC1 can ping APIC2, move on to the second step.
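The APICs talk to each other over the infra network, so the ping should target the peer's infra (TEP) address rather than its out-of-band address. The address below is only a placeholder; the real one can be read from the address field in the acidiag avread entry for that APIC:

apic1# acidiag avread | grep id=2
apic1# ping -c 3 <apic2-infra-address>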

Second, since the APICs can talk to each other, APIC1's version info should have been replicated to its peers but was somehow not accepted; the version info is identified by the timestamp that follows it. We can run the CLI below to confirm APIC1's version timestamp as seen from APIC1 itself and from APIC2, which is waiting at 75%.

apic1# acidiag avread | grep id=1 | cut -d ' ' -f20-21
version=2.0(2f) lm(t):1(2017-10-25T18:01:04.907+11:00)

apic1# acidiag avread | grep common= | cut -d ' ' -f2
common=2017-10-25T18:01:04.907+11:00

apic2# acidiag avread | grep id=1 | cut -d ' ' -f20-21
version=2.0(1m) lm(t):1(2017-10-25T18:20:04.907+11:00)

 

As shown above, on APIC2 the timestamp for APIC1's old version 2.0(1m) is even later than the timestamp for APIC1's new version 2.0(2f). This prevents APIC2 from accepting APIC1's newer version propagation, so the installer on APIC2 thinks APIC1 has not completed its upgrade yet. Instead of moving to the data-conversion stage, APIC2 keeps waiting for APIC1. There is a workaround which must be run from APIC1, and only when APIC1 has completed the upgrade successfully and booted into the new version; never run it from any APIC that is waiting at 75%, as that would make things much worse. Given the risk, I would suggest you call TAC instead of doing it yourself.

Comments
Contributor

Awesome write-up. I do agree the docs on Cisco's website are a little lacking. This document is concise and I really like the examples.

Beginner

Thanks. 

Enthusiast

Thank you for this write-up!

Beginner

For Multi-Pod deployments: please update the article to also reflect the pod restriction; the APIC will not perform a parallel upgrade of leafs that are in different pods if they are inside the same maintenance group.

 

pod1-ifc2# moquery -c maintUpgStatusCont
Total Objects shown: 1

# maint.UpgStatusCont
childAction :
dn : maintupgstatuscont
lcOwn : local
modTs : 2017-12-12T11:24:10.710-08:00
rn : maintupgstatuscont
schedulerOperQualStr : Node: 301, Policy: mt-pod-2, Check constraint: Is any other pod currently upgrading?, Result: fail, Details: Node: 301(pod: 2) cannot be upgraded, as node: 202(otherPod: 1) is upgrading. Rejecting upgrade request, node to retry periodically
schedulerTick : 182657
status :
uid : 0

 

 

It is also a good idea to check disk space on the APICs:

 

apic# df -h

 

and clean up unused data (show techs, old firmware, etc).

 

Regards,

Vladimir

Beginner

awesome write up Welkin!

 

 

thanks

george g

Cisco Employee
Thank you, George. Glad to hear from you…
Beginner

Very useful write up, appreciated ....
