Solved: Factory reset a fabric that failed discovery

Timothy ACI · ‎09-04-2020

Hi community,

I got my hands on a new ACI fabric (used for development first, it might become production later). Before the first discovery, I configured all APICs, then used the Leaf/Spine serial numbers to automate the fabric node addition via the API. I did this so that when I plugged the cables in, the fabric nodes would be discovered and provision themselves. I plugged APIC 1, 2 and 3 into the Leaf, in that order. I have done this for previous fabrics and it worked every time. But this time, I forgot to check the Faults tab.
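For readers unfamiliar with this kind of pre-registration, here is a minimal sketch of what such a call can look like. It is not the poster's actual script. It assumes the standard ACI REST object `fabricNodeIdentP` posted under `uni/controller/nodeidentpol`; the serial number, node ID, node name, cookie file and APIC hostname are all hypothetical placeholders.

```shell
# Build the JSON body that registers one switch by serial number.
# Arguments (all example values are hypothetical): serial, node ID, node name.
node_payload() {
  printf '{"fabricNodeIdentP":{"attributes":{"serial":"%s","nodeId":"%s","name":"%s"}}}\n' "$1" "$2" "$3"
}

# Sketch of the POST itself, left commented out because it needs a live APIC
# and an authenticated session cookie (apic.example.com is a placeholder):
# curl -sk -b cookie.txt -X POST \
#   "https://apic.example.com/api/node/mo/uni/controller/nodeidentpol.json" \
#   -d "$(node_payload FDO12345ABC 101 leaf101)"
```

Keeping the payload builder separate from the `curl` call makes it easy to loop over a list of serial numbers and to inspect the JSON before anything is sent.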

Things went wrong. APIC 1 saw all nodes as Inactive (TEP IPs were assigned from the pool, though). When I ran show discoveryissues on the directly connected Leaf, I could see that the DHCP, infra VLAN, time and SSL checks succeeded, but LLDP reported ctrlr-uuid-mismatch. I could also see that the Leaf had changed its hostname to something other than (none), but the admin credentials from APIC 1 weren't pushed to it (so I still logged in as admin with no password). I couldn't log in to APIC 2 or 3, apparently because they had not been discovered.

I tried to factory reset APIC 1 and the directly connected Leaf. On the second discovery attempt, I did it manually in the Web GUI and could see the Leaf, but when I added it to the fabric it was still Inactive. When I checked the Faults tab, it showed F3031 - cert-invalid (probably due to the APIC's cert, but this is the APIC's default manufacturer cert).

I have to wait until the Service Contract starts before I can open a TAC case. In the meantime, how would I proceed to factory reset the other two APICs? I tried using the virtual KVM on CIMC, but it wouldn't let me log in either.

Thanks a lot.

Sergiu.Daniluk · ‎09-06-2020

Hi @Timothy ACI

First, verify that datetime is consistent on all APICs and Leaf/Spine switches.

Second, verify that the certificates are valid:

On APIC: acidiag verifyapic

On Leafs: cd /securedata/ssl && openssl x509 -noout -subject -in server.crt

For your reference: https://quickview.cloudapps.cisco.com/quickview/bug/CSCva68310

Correct pattern:
/serialNumber=PID:<PID> SN:<Serial number>/CN=<Serial number>

Incorrect Pattern:
/CN=<Serial number>/serialNumber=PID:<PID> SN:<Serial number>
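The comparison between the two patterns can be scripted. The function below is only a sketch, and `classify_subject` is a hypothetical helper name. It assumes the subject is printed in the slash-separated form shown above; newer OpenSSL releases may print a comma-separated form, which this sketch reports as "unknown".

```shell
# Classify an X.509 subject line by the order of its serialNumber and CN fields.
# Assumption: slash-separated subject format, as in the patterns quoted above.
classify_subject() {
  case "$1" in
    */serialNumber=PID:*SN:*/CN=*) echo "correct" ;;    # serialNumber first, then CN
    */CN=*/serialNumber=PID:*SN:*) echo "incorrect" ;;  # CN first, then serialNumber
    *) echo "unknown" ;;                                # any other layout
  esac
}
```

On a leaf it could be called as `classify_subject "$(openssl x509 -noout -subject -in /securedata/ssl/server.crt)"`.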

Stay safe,

Sergiu

Hector Gustavo Serrano Gutierrez · ‎09-06-2020

Hello @Timothy ACI,

Just to add to Sergiu's accurate response: you can try to log in to APIC-2 and APIC-3 with the username rescue-user. This should work only because these APICs have not been able to form a cluster with APIC-1 due to the cert issue.

Going back to the original issue, you need to engage Cisco TAC urgently to assist you in generating new certs for your APICs (first check whether all of them are impacted, following Sergiu's suggestion).

Regards.

Timothy ACI · ‎09-06-2020

Thanks for your responses @Sergiu.Daniluk and @Hector Gustavo Serrano Gutierrez,

I was able to log in to the other APICs using rescue-user. All leaves and spines have the correct pattern for the certificate. The APIC, though, appears to pass all the checks in acidiag verifyapic, yet the GUI still shows F3031. Upon further investigation, I could see that its certificate has the wrong pattern (if that applies to the APIC as well?).

Hector Gustavo Serrano Gutierrez · ‎09-07-2020

Hello @Timothy ACI,

I'm glad you can now log in to APICs 2 & 3. You need to verify on those whether the cert fault is also seen, using moquery -c faultInst -f 'fault.Inst.code=="F3031"'

If so, check the cert pattern as you are already doing.

This problem is with the APIC controllers and not the switches. You will need to engage Cisco TAC for support in generating new certs for the impacted APIC controllers.

Regards.

Timothy ACI · ‎09-12-2020

Thanks Hector for your response.

TAC just regenerated the certificates for all 3 APICs. They said there's probably a bug affecting my batch of APIC-SERVER-L3.

Sergiu.Daniluk · ‎09-16-2020

I will just leave this one here as it is related to the topic:

https://www.cisco.com/c/en/us/support/docs/field-notices/705/fn70594.html

Cheers,

Sergiu