2nd and 3rd APICs not joining Cluster on new Fabric

colin.lynch
Level 4

Hi, I have just built a new ACI fabric on version 4.2(3j). I configured the first APIC and registered all the nodes OK. I then built the 2nd and 3rd APICs with an identical Fabric Name, Fabric ID, Pod ID and Infra VLAN; however, the 2nd and 3rd APICs do not join the cluster.

 

When I try to log in to one of these APICs I get the message:

 

'REST Endpoint user authorization datastore is not initialized check fabric membership status of this fabric node'

 

Thinking I must have made a typo when setting up the 1st APIC, I then reset all 3 APICs:

'acidiag touch clean'

'acidiag touch setup'

'acidiag reboot'

 

And on all fabric nodes:

'setup-clean-config.sh'
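
Each switch was then reloaded so it comes back with factory defaults and re-enters discovery (commands only, prompts and output omitted; as far as I know this is the standard clean sequence, and leaf101 is just an example hostname):

leaf101# setup-clean-config.sh
leaf101# reload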

 

I then rebuilt the fabric, accepting all APIC defaults and specifying only the infra VLAN (3930).

Cluster configuration ...

Enter the fabric name: ACI Fabric1

Fabric ID: 1

Number of controllers: 3

Controller name apic1

POD ID: 1

Controller ID: 1

TEP address pool: 10.0.0.0/16

Infra VLAN ID: 3930

Multicast address pool: 225.0.0.0/15

 

I again registered all nodes OK, then built the 2nd APIC, again accepting all defaults:

Cluster configuration ...

Enter the fabric name: ACI Fabric1

Fabric ID: 1

Number of controllers: 3

Controller name apic2

POD ID: 1

Controller ID: 2

TEP address pool: 10.0.0.0/16

Infra VLAN ID: 3930

Multicast address pool: 225.0.0.0/15

 

And again the 2nd APIC did not join the cluster, and it displays the below message when logging into its GUI:

 

'REST Endpoint user authorization datastore is not initialized check fabric membership status of this fabric node'

 

I'm sure the fabric settings I typed in match on both APICs, and there is no whitespace in the fabric name, etc.

 

Any ideas?

Thanks in advance.

Colin

 

acidiag avread output from both APICs:

 

APIC1

apic1# acidiag avread

Local appliance ID=1 ADDRESS=10.0.0.1 TEP ADDRESS=10.0.0.0/16 ROUTABLE IP ADDRESS=0.0.0.0 CHASSIS_ID=7a8eb93a-e144-11ea-879b-65cd6756013c

Cluster of 1 lm(t):1(zeroTime) appliances (out of targeted 3 lm(t):1(2020-08-18T11:23:18.960+00:00)) with FABRIC_DOMAIN name=ACI Fabric1 set to version=4.2(3j) lm(t):1(2020-08-18T11:23:29.944+00:00); discoveryMode=PERMISSIVE lm(t):0(zeroTime); drrMode=OFF lm(t):0(zeroTime); kafkaMode=OFF lm(t):0(zeroTime)

        appliance id=1  address=10.0.0.1 lm(t):1(2020-08-18T11:21:50.220+00:00) tep address=10.0.0.0/16 lm(t):1(2020-08-18T11:21:50.220+00:00) routable address=0.0.0.0 lm(t):1(zeroTime) oob address=43.218.45.14/24 lm(t):1(2020-08-18T11:21:57.415+00:00) version=4.2(3j) lm(t):1(2020-08-18T11:21:57.902+00:00) chassisId=7a8eb93a-e144-11ea-879b-65cd6756013c lm(t):1(2020-08-18T11:21:57.902+00:00) capabilities=0X3EEFFFFFFFFF--0X2020--0X1 lm(t):1(2020-08-18T11:28:27.760+00:00) rK=(stable,present,0X206173722D687373) lm(t):1(2020-08-18T11:21:57.421+00:00) aK=(stable,present,0X206173722D687373) lm(t):1(2020-08-18T11:21:57.421+00:00) oobrK=(stable,present,0X206173722D687373) lm(t):1(2020-08-18T11:21:57.421+00:00) oobaK=(stable,present,0X206173722D687373) lm(t):1(2020-08-18T11:21:57.421+00:00) cntrlSbst=(APPROVED, WZP24231391) lm(t):1(2020-08-18T11:21:57.902+00:00) (targetMbSn= lm(t):0(zeroTime), failoverStatus=0 lm(t):0(zeroTime)) podId=1 lm(t):1(2020-08-18T11:21:50.220+00:00) commissioned=YES lm(t):1(zeroTime) registered=YES lm(t):1(2020-08-18T11:21:50.220+00:00) standby=NO lm(t):1(2020-08-18T11:21:50.220+00:00) DRR=NO lm(t):0(zeroTime) apicX=NO lm(t):1(2020-08-18T11:21:50.220+00:00) virtual=NO lm(t):1(2020-08-18T11:21:50.220+00:00) active=YES(2020-08-18T11:21:50.220+00:00) health=(applnc:255 lm(t):1(2020-08-18T11:23:04.601+00:00) svc's)

---------------------------------------------

*******Additional elements outside of cluster*******

        appliance id=2  address=10.0.0.2 lm(t):104(2020-08-18T11:51:03.608+00:00) tep address=0.0.0.0 lm(t):0(zeroTime) routable address=0.0.0.0 lm(t):0(zeroTime) oob address=0.0.0.0 lm(t):0(zeroTime) version= lm(t):0(zeroTime) chassisId=98c35844-e148-11ea-b955-d5bd81fad0b6 lm(t):104(2020-08-18T11:51:03.608+00:00) capabilities=0XFFFFFFF--0X2020--0 lm(t):0(zeroTime) rK=(stable,absent,0) lm(t):0(zeroTime) aK=(stable,absent,0) lm(t):0(zeroTime) oobrK=(stable,absent,0) lm(t):0(zeroTime) oobaK=(stable,absent,0) lm(t):0(zeroTime) cntrlSbst=(DO_SOMETHING, WZP242312SB) lm(t):104(2020-08-18T11:51:03.608+00:00) (targetMbSn= lm(t):0(zeroTime), failoverStatus=0 lm(t):0(zeroTime)) podId=1 lm(t):104(2020-08-18T11:51:03.608+00:00) commissioned=NO lm(t):1(zeroTime) registered=NO lm(t):0(zeroTime) standby=NO lm(t):104(2020-08-18T11:51:03.608+00:00) DRR=NO lm(t):0(zeroTime) apicX=NO lm(t):104(2020-08-18T11:51:03.608+00:00) virtual=NO lm(t):0(zeroTime) active=NO(zeroTime) health=(applnc:1 lm(t):0(zeroTime))

---------------------------------------------

clusterTime=<diff=1 common=2020-08-18T11:57:27.006+00:00 local=2020-08-18T11:57:27.005+00:00 pF=<displForm=0 offsSt=0 offsVlu=0 lm(t):1(2020-08-18T11:23:19.028+00:00)>>

---------------------------------------------

 

apic1#

 

APIC2

apic2 login: rescue-user

********************************************************************************

     Fabric discovery in progress, show commands are not fully functional

     Logout and Login after discovery to continue to use show commands.

********************************************************************************

apic2# acidiag avread

Local appliance ID=2 ADDRESS=10.0.0.2 TEP ADDRESS=10.0.0.0/16 ROUTABLE IP ADDRESS=0.0.0.0 CHASSIS_ID=98c35844-e148-11ea-b955-d5bd81fad0b6

Cluster of 2 lm(t):2(zeroTime) appliances (out of targeted 3 lm(t):2(zeroTime)) with FABRIC_DOMAIN name=ACI Fabric1 set to version=4.2(3j) lm(t):2(zeroTime); discoveryMode=PERMISSIVE lm(t):0(zeroTime); drrMode=OFF lm(t):0(zeroTime); kafkaMode=OFF lm(t):0(zeroTime)

        appliance id=1  address=10.0.0.1 lm(t):2(2020-08-18T11:51:16.046+00:00) tep address=0.0.0.0 lm(t):0(zeroTime) routable address=0.0.0.0 lm(t):0(zeroTime) oob address=0.0.0.0 lm(t):0(zeroTime) version= lm(t):0(zeroTime) chassisId=7a8eb93a-e144-11ea-879b-65cd6756013c lm(t):2(2020-08-18T11:51:16.045+00:00) capabilities=0XFFFFFFF--0X2020--0 lm(t):0(zeroTime) rK=(stable,absent,0) lm(t):0(zeroTime) aK=(stable,absent,0) lm(t):0(zeroTime) oobrK=(stable,absent,0) lm(t):0(zeroTime) oobaK=(stable,absent,0) lm(t):0(zeroTime) cntrlSbst=(UNDEFINED, ) lm(t):0(zeroTime) (targetMbSn= lm(t):0(zeroTime), failoverStatus=0 lm(t):0(zeroTime)) podId=0 lm(t):0(zeroTime) commissioned=YES lm(t):2(zeroTime) registered=YES lm(t):2(2020-08-18T11:51:16.901+00:00) standby=NO lm(t):0(zeroTime) DRR=NO lm(t):0(zeroTime) apicX=NO lm(t):0(zeroTime) virtual=NO lm(t):0(zeroTime) active=NO(zeroTime) health=(applnc:1 lm(t):0(zeroTime))

        appliance id=2  address=10.0.0.2 lm(t):2(2020-08-18T11:51:04.892+00:00) tep address=10.0.0.0/16 lm(t):2(2020-08-18T11:51:04.892+00:00) routable address=0.0.0.0 lm(t):2(zeroTime) oob address=43.218.45.15/24 lm(t):2(2020-08-18T11:51:10.698+00:00) version=4.2(3j) lm(t):2(2020-08-18T11:51:10.793+00:00) chassisId=98c35844-e148-11ea-b955-d5bd81fad0b6 lm(t):2(2020-08-18T11:51:10.793+00:00) capabilities=0X3EEFFFFFFFFF--0X2020--0X2 lm(t):2(2020-08-18T11:51:10.793+00:00) rK=(stable,present,0X206173722D687373) lm(t):2(2020-08-18T11:51:10.704+00:00) aK=(stable,present,0X206173722D687373) lm(t):2(2020-08-18T11:51:10.704+00:00) oobrK=(stable,present,0X206173722D687373) lm(t):2(2020-08-18T11:51:10.704+00:00) oobaK=(stable,present,0X206173722D687373) lm(t):2(2020-08-18T11:51:10.704+00:00) cntrlSbst=(APPROVED, WZP242312SB) lm(t):2(2020-08-18T11:51:10.793+00:00) (targetMbSn= lm(t):0(zeroTime), failoverStatus=0 lm(t):0(zeroTime)) podId=1 lm(t):2(2020-08-18T11:51:04.892+00:00) commissioned=YES lm(t):2(zeroTime) registered=YES lm(t):2(2020-08-18T11:51:04.892+00:00) standby=NO lm(t):2(2020-08-18T11:51:04.892+00:00) DRR=NO lm(t):0(zeroTime) apicX=NO lm(t):2(2020-08-18T11:51:04.892+00:00) virtual=NO lm(t):2(2020-08-18T11:51:04.892+00:00) active=YES(2020-08-18T11:51:04.892+00:00) health=(applnc:112 lm(t):2(2020-08-18T11:51:35.795+00:00) svc's)

---------------------------------------------

clusterTime=<diff=0 common=2020-08-18T11:55:36.619+00:00 local=2020-08-18T11:55:36.619+00:00 pF=<displForm=1 offsSt=0 offsVlu=0 lm(t):0(zeroTime)>>

---------------------------------------------

 

apic2#

5 Replies

Sergiu.Daniluk
VIP Alumni

Hi @colin.lynch 

You can verify that the setup script values match using these two commands:

APIC# cat /data/data_admin/sam_exported.config
APIC# avread

*Note that the second command is "avread" and not "acidiag avread" (the output of both contains the same info, but the newer command is cleaner and easier to read).
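
One extra check, since stray whitespace in the fabric name is easy to miss: cat the file with non-printing characters visible on each APIC and eyeball the fabric-related lines (sketch only; I'm assuming the field names contain the word "fabric", so adjust the grep to whatever the file actually shows):

apic1# cat -A /data/data_admin/sam_exported.config | grep -i fabric
apic2# cat -A /data/data_admin/sam_exported.config | grep -i fabric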

There is also this command, available in 4.2, for troubleshooting APIC clustering issues:

apic1# acidiag cluster

Would you mind sharing the outputs of these commands?

 

Stay safe,

Sergiu

Thanks for the quick response, and useful commands @Sergiu.Daniluk 

 

I have attached a side-by-side screenshot of the following outputs:

APIC# cat /data/data_admin/sam_exported.config
APIC# avread

 

 

apic1# acidiag cluster

Outputs below

APIC1

EUGBCHESVR01-APIC-ACI# acidiag cluster
Admin password:

Product-name = APIC-SERVER-M3
Serial-number = WZP24231391
Running...

Operational cluster size and target cluster size are not identical
Checking Core Generation: OK
Checking Wiring and UUID: switch(103) reports apic(2) has wireIssue: ctrlr-uuid-mismatch; switch(104) reports apic(2) has wireIssue: ctrlr-uuid-mismatch
Checking AD Processes: Running
Checking All Apics in Commission State: OK
Checking All Apics in Active State: OK
Checking Fabric Nodes: OK
Checking Apic Fully-Fit: OK
Checking Shard Convergence: OK
Ping OOB IPs:
APIC-1: 43.218.45.14 - OK
Ping Infra IPs:
APIC-1: 10.0.0.1 - OK
Checking APIC Versions: Same (4.2(3j))
Checking SSL: OK

Done!

EUGBCHESVR01-APIC-ACI#

 

APIC2

apic2# acidiag cluster
Admin password:

Product-name = APIC-SERVER-M3
Serial-number = WZP242312SB
Running...

Operational cluster size and target cluster size are not identical
Checking Core Generation: OK
Checking Wiring and UUID: Password Required!
Checking AD Processes: Running
Checking All Apics in Commission State: OK
Checking All Apics in Active State: Inactive Apics: IFC-1
Checking Fabric Nodes: OK
Checking Apic Fully-Fit: Not Fully Fit Apics: IFC-1 IFC-2
Checking Shard Convergence: Password Required!
Ping OOB IPs:
APIC-1: Decommissioned. Invalid IP
APIC-2: 43.218.45.15 - OK
Ping Infra IPs:
APIC-1: 10.0.0.1 - Ping failed
APIC-2: 10.0.0.2 - OK
Checking APIC Versions: Cluster Version:4.2(3j) Imcompatible Apics: IFC-1()
Checking SSL: OK

 

Looking at the above outputs, it seems APIC2 cannot reach APIC1.

It mentions that leaf 103 and leaf 104 are detecting a 'wire issue' on APIC 2, but I have confirmed that APIC2 is connected to the fabric correctly (ports 1 and 3); this checks out in the LLDP tables on both leafs. APIC 1 is connected identically to leafs 101/102.
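
For reference, the check on the leafs was just the standard LLDP neighbor listing, looking for APIC2 on the expected ports (example from leaf103; port numbers will obviously vary with cabling):

leaf103# show lldp neighbors | grep -i apic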

 

In the meantime I'll see if APIC 1 can ping all the other TEPs on the leafs/spines over the infra VLAN, and the same from APIC 2, as that seems to be the main issue.
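
If it helps anyone else, the plan is a simple ping loop from the APIC bash shell over the infra sub-interface (rough sketch; bond0.3930 matches my infra VLAN, and the TEP addresses are placeholders I'll take from the Fabric Membership page):

apic1# bash
admin@apic1:~> for tep in <leaf101-tep> <leaf102-tep> <spine201-tep>; do ping -c 2 -I bond0.3930 $tep; done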

 

Regards

Colin

Hi @colin.lynch 

From the outputs you shared, I could see this issue:

Checking Wiring and UUID: switch(103) reports apic(2) has wireIssue: ctrlr-uuid-mismatch; switch(104) reports apic(2) has wireIssue: ctrlr-uuid-mismatch

This is not a wiring issue; it is actually a UUID mismatch.

If you look at the avread output, you will see that APIC1 reports APIC2's chassisId as 98c35844-.-blabla, while the same output on APIC2 shows a different local chassisId - 327d4a22-.-blabla.

APIC1's UUID is the same on both chassis. Can you verify the same output on APIC3? It could be that the APIC IDs are overlapping.
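
A quick way to compare, rather than reading the full dump, is to grep the chassisId fields out of avread on each controller and confirm that the value APIC2 reports for itself matches what APIC1 has learned for appliance id=2 (rough example only):

apic1# acidiag avread | grep -o 'chassisId=[0-9a-f-]*'
apic2# acidiag avread | grep -o 'chassisId=[0-9a-f-]*'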

 

Stay safe,

Sergiu

Hi @Sergiu.Daniluk 

 

The issue is now resolved; I had to contact Cisco TAC as I was right out of ideas.

I was hitting bug CSCvu62127, where some new APICs were shipped with invalid certificates.

https://bst.cloudapps.cisco.com/bugsearch/bug/CSCvu62127

 

Before:

EUGBCHESVR01-APIC-ACI# acidiag verifyapic

openssl_check: certificate details

subject= serialNumber=PID:APIC-SERVER-M3 SN:WZP242406QN,CN=WZP242406QN

issuer= CN=Cisco Manufacturing CA,O=Cisco Systems

notBefore=Jul 30 07:25:43 2020 GMT

notAfter=May 14 20:25:41 2029 GMT

openssl_check: passed

ssh_check: passed

all_checks: passed

 

After:

EUGBCHESVR01-APIC-ACI# acidiag verifyapic

openssl_check: certificate details

subject= CN=WZP242406QN,serialNumber=PID:APIC-SERVER-M3 SN:WZP242406QN

issuer= CN=Cisco Manufacturing CA,O=Cisco Systems

notBefore=Aug 21 11:52:40 2020 GMT

notAfter=May 14 20:25:41 2029 GMT

openssl_check: passed

ssh_check: passed

all_checks: passed

 

Also, we used the below command to verify:

 

  • moquery -c faultInst -f 'fault.Inst.code=="F3031"'

Total Objects shown: 1

 

# fault.Inst

code             : F3031

ack              : no

annotation       :

cause            : cert-invalid
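
For anyone who prefers checking over the REST API, the same fault class can be queried locally with icurl (equivalent to the moquery above, as far as I can tell):

apic1# icurl 'http://localhost:7777/api/class/faultInst.json?query-target-filter=eq(faultInst.code,"F3031")'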

 

Thanks for the help.

Regards

Colin

Glad to hear that the issue is resolved :-)

 

Cheers,

Sergiu
