3134 Views · 5 Helpful · 6 Replies

APIC with different version fails to join cluster

DennisX
Cisco Employee

We have three APICs (C220 M4) running software version 3.0(1k) and one spine with ACI firmware version n9000-13.0(2k).

We also just got two new 93180YC-FX leaves with firmware version n9000-14.1(2g).

 

After the cabling work was completed, the first step was to upgrade one APIC from version 3.0(1k) to 4.1(2x) and then initialize that upgraded controller, and so far so good. But initializing the second APIC, still on version 3.0(1k), failed. I realized this was probably caused by the version mismatch, so I reset the second APIC and upgraded it to 4.1(2x), but it still failed. I ran the commands below and tried to re-join the cluster, but no luck.

acidiag touch clean   # wipe the APIC configuration on next boot
acidiag touch setup   # re-run the initial setup dialog on next boot
acidiag reboot        # reboot the controller
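
For reference, after the reboot and before trying to re-join, it can help to confirm what the node itself reports; a minimal check, assuming console or rescue-user access (the acidiag subcommand set varies slightly by release):

acidiag version      # version installed on this controller
acidiag avread       # this node's own view of the cluster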

Right now we can see an error on the APIC web console, shown below:

1 object found:
infraWiNode
dn	 topology/pod-1/node-1/av/node-2 
addr	10.0.0.2
adminSt	in-service
annotation	
apicMode	active
chassis	d50e0a8c-9017-11eb-b389-31dba2daa96e
childAction	
cntrlSbstState	approved
extMngdBy	
failoverStatus	idle
health	data-layer-synchronization-in-progress
id	2
lcOwn	local
mbSn	FCH2209V11K
modTs	2021-03-31T06:21:40.672+00:00
monPolDn	 uni/fabric/monfab-default 
mutnTs	2021-03-28T22:50:30.158+00:00
name	
nameAlias	
nodeName	APIC2
operSt	unavailable
podId	1
routableIpAddr	0.0.0.0
status	
targetMbSn	
uid	0
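
For reference, the same infraWiNode records can also be pulled from the APIC CLI; a small sketch using the standard moquery tool (the grep filter is just a hypothetical convenience):

moquery -c infraWiNode                                        # all cluster-member records
moquery -c infraWiNode | egrep 'dn|operSt|health|chassis'     # narrow to the interesting attributes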

Logging in to the second APIC gives the error below:

REST Endpoint user authorization datastore is not initialized - Check Fabric Membership Status of this fabric node

Logging in to apic2 with rescue-user and running avread:

hgh-apic2# avread
Cluster:
-------------------------------------------------------------------------
fabricDomainName        HGH_ACI_Fabric1
discoveryMode           PERMISSIVE
clusterSize             3
version                 4.1(2x)
drrMode                 OFF
operSize                2

APICs:
-------------------------------------------------------------------------
                    APIC 1                  APIC 2                  
version                                   4.1(2x)                 
address           0.0.0.0                 10.0.0.2                
oobAddress        0.0.0.0                 10.224.139.8/24         
routableAddress   0.0.0.0                 0.0.0.0                 
tepAddress        0.0.0.0                 10.0.0.0/16             
podId             0                       1                       
chassisId          -.-                    cdac449e-.-793b244b     
cntrlSbst_serial  (UNDEFINED,)            (APPROVED,FCH2209V11K)  
active            NO (zeroTime)           YES                     
flags             c---                    cra-                    
health            1                       112                     
hgh-apic2# 
zsh: timeout

Logging in to apic1 and running avread:

hgh-apic1# avread        
Cluster:
-------------------------------------------------------------------------
fabricDomainName        HGH_ACI_Fabric1
discoveryMode           PERMISSIVE
clusterSize             3
version                 4.1(2x)
drrMode                 OFF
operSize                2

APICs:
-------------------------------------------------------------------------
                    APIC 1                  APIC 2                  
version           4.1(2x)                 3.0(1k)                 
address           10.0.0.1                10.0.0.2                
oobAddress        10.224.139.7/24         10.224.139.8/24         
routableAddress   0.0.0.0                 0.0.0.0                 
tepAddress        10.0.0.0/16             10.0.0.0/16             
podId             1                       1                       
chassisId         58290ef2-.-ee1255af     d50e0a8c-.-a2daa96e     
cntrlSbst_serial  (APPROVED,FCH2209V11G)  (APPROVED,FCH2209V11K)  
active            YES                     NO                      
flags             cra-                    cra-                    
health            255                     2                       
hgh-apic1# 
zsh: timeout
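
Side by side, the two captures disagree about APIC2: APIC1 still has it registered with version 3.0(1k) and chassisId d50e0a8c-.-a2daa96e, while APIC2 itself reports 4.1(2x) and chassisId cdac449e-.-793b244b. A hypothetical quick way to pull just those rows on each controller:

avread | egrep 'version|chassisId'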

Any ideas?


6 Replies

Sergiu.Daniluk
VIP Alumni

Hi @DennisX 

It would be easier to simply reset both APICs and the intermediary leaf and start from scratch.

Also, do not connect APIC3 to the fabric yet, or simply shut down its leaf interface until the upgrade is done, to avoid the same type of situation.

 

Stay safe,

Sergiu

Hi @Sergiu, thank you for your reply, well noted. I believe resetting all components and upgrading the firmware to the same version before joining the cluster would fix the issue.

 

But suppose this happened in a production environment: one APIC breaks, we bring in a new APIC running a lower version to join the existing cluster, and it fails to join before we realize we have to upgrade its firmware first.

 

So I just want to know whether we have any other choice to fix it in that scenario.

Hi @DennisX 

Assuming it is a production environment, one should never attach a new APIC to the fabric before verifying the running ACI and CIMC versions.

Anyway, from what I see, there is a chassisId mismatch between the current value (the one seen in the avread output on APIC2) and the value saved on APIC1 (its avread output). The solution could potentially be to clear it from APIC1; however, that should be done with TAC supervision only, and only under very special circumstances.

But again, because the scenario presented is not supported, the TAC feedback may well be to clean-reload the APIC and restore the config from backup - hopefully there is a backup saved externally and not only on the APIC.
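
For reference, an on-demand configuration snapshot can be triggered from the APIC itself. A minimal sketch using the configExportP class and the local icurl tool; the policy name here is made up, and for a copy stored externally the policy would additionally reference a remote path (fileRemotePath), which is omitted in this sketch:

# run from an APIC bash shell; triggers a one-off config snapshot
icurl -X POST 'http://localhost:7777/api/node/mo/uni/fabric/configexp-adhocBackup.json' \
  -d '{"configExportP":{"attributes":{"name":"adhocBackup","format":"json","snapshot":"true","adminSt":"triggered"}}}'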

 

Stay safe,

Sergiu

Hi Sergiu,

Thanks for the clarification, really appreciate it.

will go through reset process.

Ironically enough, I ran into a similar situation, but with APIC1 running 5.0 and APIC2 running 4.2.

After some thought, I have found a potential solution, BUT only if you are at the very beginning with the cluster. The solution would be (a rough command sketch follows the list):

1. Reduce the target cluster size from 3 to 1 on APIC1.

2. Touch-clean APIC2.

3. To ease the installation of the newer version on APIC2, set it up as a new standalone cluster, then upgrade it from its GUI.

4. Touch-clean APIC2 again.

5. Bring it back into the initial cluster.
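
A rough sketch of what steps 2-4 look like on APIC2 from the console; the cluster-size change in step 1 and the re-admission in step 5 are done from APIC1, e.g. under System > Controllers > Cluster as Seen by this Node (exact menu paths may vary by release):

# on APIC2: wipe the config and re-run the setup dialog (steps 2 and 4)
acidiag touch clean
acidiag touch setup
acidiag reboot
# after the reboot, answer the setup script either as a temporary
# standalone fabric (step 3, so it can be upgraded from its own GUI)
# or with the original fabric values (step 5), then verify with avread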

 

But again, the process is so cumbersome that one should be more careful about the APIC version before attaching it to the fabric.

 

Stay safe,

Sergiu

Hi Sergiu,

I appreciate your answer! To me it was a bit tough, so I cleaned all the APICs/spines/leaves last week and upgraded them one by one from the CIMC console.
After that I re-initialized all of them, and so far everything is fine.
