
DNAC Node-2 not joining DNAC Node-1 Cluster

While re-imaging and trying to join DNAC Node-2 to the DNAC Node-1 cluster, the join fails with the error below. Has anyone seen a similar issue? TAC suggested re-imaging, which we have done multiple times, but the issue still exists.

Software Version = 2.3.5.5

Hardware Model = 44-core appliance: Cisco part number DN2-HW-APL

The error log shows the following:

post_reboot : Fix DNS Nameservers to use node-local DNS for addon nodes
post_reboot : Fix netplan config file to drop un-necessary dns on each nodes

IntraCluster IP Address = 192.168.123.0/24

Node-1 IP = 192.168.123.11

Node-2 IP = 192.168.123.12

Node-3 IP = 192.168.123.13

ERROR:etcd.client:Request to server https://192.168.123.11:4001 failed: MaxRetryError(u'HTTPSConnectionPool(host=u\'192.168.123.11\', port=4001): Max retries exceeded with url: /v2/keys/maglev/config/node-192.168.123.11?sorted=true&recursive=true (Caused by ReadTimeoutError("HTTPSConnectionPool(host=u\'192.168.123.11\', port=4001): Read timed out. (read timeout=1)",))',)
WARNING:root:[Attempt 3] Connection to etcd failed due to MaxRetryError(u'HTTPSConnectionPool(host=u\'192.168.123.11\', port=4001): Max retries exceeded with url: /v2/keys/maglev/config/node-192.168.123.11?sorted=true&recursive=true (Caused by ReadTimeoutError("HTTPSConnectionPool(host=u\'192.168.123.11\', port=4001): Read timed out. (read timeout=1)",))',). Retrying in 4 seconds...
WARNING:urllib3.connectionpool:Retrying (Retry(total=2, connect=None, read=None, redirect=None, status=None)) after connection broken by 'ReadTimeoutError("HTTPSConnectionPool(host=u'192.168.123.11', port=4001): Read timed out. (read timeout=1)",)': /v2/keys/maglev/config/node-192.168.123.11?sorted=true&recursive=true

15 Replies

marce1000
VIP

 

    - Strange, according to https://s55ma.radioamater.si/ this may be related to (low-level) networking settings;
      have a look at /etc/netplan/vlans.yaml
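For anyone comparing the two nodes, here is a minimal sketch (not an official procedure) that dumps the VLAN and nameserver entries from that file so the output can be diffed between node 1 and node 2. It assumes PyYAML is available on the node and that the file follows the usual netplan schema, which may differ on the appliance:

# Sketch: print VLAN and DNS nameserver entries from the netplan file
# mentioned above, so the output can be compared across nodes.
# Assumes PyYAML is installed and a standard netplan layout (an assumption).
import yaml

NETPLAN_FILE = "/etc/netplan/vlans.yaml"  # path taken from the suggestion above

with open(NETPLAN_FILE) as f:
    config = yaml.safe_load(f)

vlans = config.get("network", {}).get("vlans", {})
for name, settings in vlans.items():
    print(f"VLAN interface: {name}")
    print(f"  id:          {settings.get('id')}")
    print(f"  link:        {settings.get('link')}")
    print(f"  addresses:   {settings.get('addresses')}")
    print(f"  nameservers: {settings.get('nameservers')}")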

 M.



-- Each morning when I wake up and look into the mirror I always say ' Why am I so brilliant ? '
    The mirror then always responds to me with ' The only thing that exceeds your brilliance is your beauty! '

Torbjørn
Spotlight

I would guess that something is wrong with the ETCD service on node 1, or alternatively that something has gone wrong that makes the requested records too large to be returned within the timeout. That assumes the network between your nodes is within spec.
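One rough way to test that from node 2 is the sketch below (not an official tool; the URL, port, and query parameters are taken from the error log above, and the 30-second timeout is simply chosen to be much larger than the failing client's 1 second):

# Sketch: repeat the etcd read from the error log with a longer timeout,
# to see whether node 1 answers slowly or not at all.
# URL and port come from the error log above; certificate verification is
# disabled purely for this connectivity test.
import requests

URL = "https://192.168.123.11:4001/v2/keys/maglev/config/node-192.168.123.11"

try:
    resp = requests.get(
        URL,
        params={"sorted": "true", "recursive": "true"},
        timeout=30,      # the failing client gave up after 1 second
        verify=False,    # skip certificate checks for this test only
    )
    print(resp.status_code, len(resp.content), "bytes returned")
except requests.exceptions.RequestException as exc:
    print("etcd request failed:", exc)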

Have you reimaged node 1 as well?

Happy to help! Please mark as helpful/solution if applicable.
Get in touch: https://torbjorn.dev

 

  @Torbjørn  - Following up on that, I would also like to mention https://bst.cloudapps.cisco.com/bugsearch/bug/CSCvt70976
                      Not sure if there is any relevance; it mentions an older version, but there are no Known Fixed Releases either.

 M.



-- Each morning when I wake up and look into the mirror I always say ' Why am I so brilliant ? '
    The mirror then always responds to me with ' The only thing that exceeds your brilliance is your beauty! '

Yes, we re-imaged node-1 as well.

Our DNAC enterprise and cluster links are connected to Nexus ACI switches, with vPC configured on the switch side and LACP on the DNAC end.

VLANs are different for the enterprise and cluster interfaces.

Then I hope TAC will provide you with a better solution than re-imaging. This sounds like a bug or some cluster-state error to me.

Happy to help! Please mark as helpful/solution if applicable.
Get in touch: https://torbjorn.dev

Yes @Torbjørn, we raised a TAC case. The TAC engineer has not been able to help here. We have now requested BU team support to check this issue. Let's see...

olivier vigeant
Level 1

Hi,

I had a similar issue recently. The cluster was already set up, but I had to reimage and set everything up again.

In my case, I proceeded with the VD erase using CIMC, but rebooted only node 1 to proceed with the reimage. Then I rebooted node 2 and faced a similar issue after node 2 joined node 1.
I had the same behavior on two different clusters and "solved" it the same way each time: make sure to erase the VDs AND reboot all three nodes before starting to reimage node 1.

Hi @olivier vigeant, can you share where that option is in CIMC? Can you please help?

Hi Akhil,

You can find the whole process in the Installation Guide. Search for the chapter "Reinitialize the Virtual Drives" and follow each step; note that VD0 is not erased in the same way as VD1 and VD2.

Cisco DNA Center Second-Generation Appliance Installation Guide, Release 2.3.5

Regards

Hi @olivier vigeant ,

Thanks for your reply.

Actually, we already did this with the help of a TAC engineer, but our problem is not solved.

And did you reboot all three nodes after deleting the virtual drives?
Keep nodes 2 and 3 at the step where you choose your USB drive, and proceed to maglev-config with node 1.

Hi Olivier,

We already did this.

maflesch
Cisco Employee

Why is LACP being used between the ACI leaf and the Catalyst Center nodes? Unless we are utilizing both primary and secondary NICs on the DN2 appliances for NIC bonding, we shouldn't be trying to connect via LACP. 

If the intent is to utilize NIC bonding with Catalyst Center over the primary and secondary NICs, are we trying to set up with Layer 2 NIC bonding or Layer 3?

As for your initial error messages, that is because ETCD cannot form quorum after the operation to add another node starts. ETCD must have quorum once more than one member is part of the cluster. If cluster communication is failing due to the ACI fabric, then ETCD won't form quorum; you will get the error messages showing communication breaking over port 4001, and everything will be down.
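If it helps anyone rule the fabric in or out, a rough check of intra-cluster reachability on port 4001 could look like the sketch below (the IPs and port are taken from this thread; it only measures TCP connect time, not etcd health itself, and does not replace any official diagnostics):

# Sketch: time a plain TCP connection to port 4001 on each cluster member,
# to see whether intra-cluster traffic is being dropped or delayed.
# IPs and port are taken from the thread above.
import socket
import time

NODES = ["192.168.123.11", "192.168.123.12", "192.168.123.13"]
ETCD_PORT = 4001

for ip in NODES:
    start = time.monotonic()
    try:
        with socket.create_connection((ip, ETCD_PORT), timeout=5):
            elapsed_ms = (time.monotonic() - start) * 1000
            print(f"{ip}:{ETCD_PORT} reachable in {elapsed_ms:.1f} ms")
    except OSError as exc:
        print(f"{ip}:{ETCD_PORT} unreachable: {exc}")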

Hi @maflesch,

We have checked the connectivity between Node-2 and Node-1, and we have not observed any reachability issues between the nodes. Additionally, we have thoroughly inspected the ACI switch side and have not found any issues. We have also confirmed this with the Cisco ACI TAC engineer, who has validated our findings. No firewalls in between the nodes.