Re: ACI endpoints periodically not reachable

Lamont Bullock · ‎02-07-2025

Hi all. My organization's VMware team has hypervisors distributed all over our ACI fabric. All the hypervisors have a management network that allows them to communicate with vCenter. Recently (about 2 weeks ago) all the hypervisors started to periodically disconnect from vCenter for random periods of time and reconnect later. During their blackout periods the hypervisors are not reachable from within the ACI fabric or from hosts external to the fabric. Not all hypervisors are unreachable at the same time but there is no clear pattern as to which ones lose connectivity. Two hypervisors connected to the same pair of Leaf switches can behave differently at the same time. Other networks in ACI that are shared with the vCenter cluster are not experiencing random disconnects; only the network supporting hypervisor-to-vCenter communication is affected.

Sorry I am not able to post any CLI output as the environment is air gapped and we are prohibited from moving the data from that network to this one.

I can describe to you our setup:

Each hypervisor has two 100G NICs connected to two Leaf switches.
All the EPGs for VMware networks (such as Fault Tolerance, vMotion, Storage and Management) and VMs are pushed over the pair of links using AAEPs.

ACI does not have a VMM domain set up. All the internal virtual networking is handled by the vCenter virtual distributed switches.

The Bridge Domain does have 4 subnets under it, and one of the subnets is the one having the problem. Each subnet has its own EPG. The problem subnet is designated as the primary subnet.
We do have other Bridge Domains with multiple subnets, each with its own EPG, set up the same way this one is, and they are not exhibiting the problem.

While the endpoints are not reachable on the network, the Endpoint Tracker is still able to locate them.

From the Leaf switch the hypervisor is attached to, I cannot ping the hypervisor when it's in the unreachable state, but I am able to ping it when it's working nominally.

Has anyone in the community run across something like this and found a solution, or have any ideas on how to begin troubleshooting it?

Thanks for any help you can provide!!

Lamont

AshSe · ‎02-11-2025

Hi @Lamont Bullock

To better understand your scenario, may I ask you a few questions:

Frequency and Pattern of Disconnects:
1. How often do the disconnects occur? Is there a specific time interval or pattern (e.g., every few hours, during peak traffic, etc.)?
2. How long do the disconnects typically last before the hypervisors reconnect to vCenter?
3. Are the disconnects happening during specific events, such as backups, vMotion, or other high-traffic activities?
Scope of the Issue:
1. Are all hypervisors affected at some point, or are there specific hypervisors that are more prone to disconnects?
2. Are the disconnects isolated to a specific Leaf switch pair, or do they occur across multiple Leaf switches in the fabric?
3. Are there any other devices or endpoints in the same EPG/subnet as the hypervisors that are experiencing similar connectivity issues?
Impact on Other Networks:
1. You mentioned that other networks (e.g., Fault Tolerance, vMotion, Storage) are not experiencing issues. Are these networks on the same Bridge Domain or a different one?
2. Are there any shared resources (e.g., firewalls, load balancers, or external routers) that could be impacting only the management network?
Endpoint Behavior:
1. When the hypervisors are unreachable, does the ACI Endpoint Tracker show them as "Learned" on the correct Leaf switch and port?
2. Are there any signs of endpoint flapping (frequent re-learning of the endpoint) in the ACI fabric during the disconnects?
3. Are there any duplicate IP or MAC address conflicts in the fabric that could be causing intermittent connectivity issues?
Bridge Domain and Subnet Configuration:
1. Is the "Primary Subnet" flag enabled for the problem subnet? If so, is it necessary for this setup, or could it be causing unintended behavior?
2. Are the subnets in the Bridge Domain configured with "Unicast Routing" enabled or disabled? If enabled, is there a specific external L3Out or routing configuration tied to this Bridge Domain?
3. Are there any overlapping IP subnets between this Bridge Domain and other parts of the network?
Leaf Switch Behavior:
1. When the hypervisors are unreachable, are there any relevant logs or faults on the Leaf switches they are connected to (e.g., interface errors, drops, or policy misconfigurations)?
2. Are the Leaf switch interfaces configured with the correct AEP and policy groups (e.g., VLAN encapsulation, MTU, etc.)?
3. Are there any signs of high CPU or memory utilization on the Leaf switches during the disconnects?
vCenter and Hypervisor Configuration:
1. Are the hypervisors configured with static IPs for the management network, or are they using DHCP? If DHCP, could there be lease renewal issues?
2. Are there any recent changes to the vCenter configuration, such as updates, patches, or changes to the virtual distributed switch (VDS) settings?
3. Are the hypervisors running the same version of ESXi, or are there version mismatches that could be contributing to the issue?
External Connectivity:
1. Are there any firewalls, ACLs, or contracts in the ACI fabric that could intermittently block traffic to/from the management network?
2. Is there any external routing or NAT involved for the management network? If so, could there be issues with ARP or routing table updates?
Recent Changes:
1. Were there any changes made to the ACI fabric, vCenter, or hypervisor configuration around the time the issue started (e.g., firmware upgrades, policy changes, new subnets added)?
2. Were there any changes to the physical network, such as cabling, switch replacements, or new devices added to the fabric?
Monitoring and Troubleshooting:
1. Have you checked the ACI fabric logs (e.g., faults, events, audit logs) for any anomalies or errors related to the affected hypervisors or EPGs?
2. Have you reviewed the vCenter logs for any errors or warnings during the disconnect periods?
3. Have you performed packet captures on the Leaf switch interfaces or the hypervisors to identify any dropped or malformed packets?
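
If packet captures are hard to arrange, a plain reachability log can still help answer the frequency and duration questions above. Below is a minimal sketch, assuming you already have timestamped up/down probe results per hypervisor from whatever tool is permitted in your environment. The `Sample` record and the host names are hypothetical and only for illustration; nothing here reads from ACI or vCenter.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List, Tuple


@dataclass(frozen=True)
class Sample:
    """One probe result (hypothetical record layout)."""
    host: str
    timestamp: float  # seconds, any consistent clock
    reachable: bool


def outage_windows(samples: Iterable[Sample]) -> List[Tuple[str, float, float]]:
    """Collapse consecutive failed probes into (host, start, end) windows.

    'end' is the first successful probe after the outage, or the last
    sample seen for that host if it was still down when logging stopped.
    """
    windows: List[Tuple[str, float, float]] = []
    down_since: Dict[str, float] = {}
    last_seen: Dict[str, float] = {}
    for s in sorted(samples, key=lambda s: (s.host, s.timestamp)):
        if not s.reachable:
            # Remember only the first failure of a run.
            down_since.setdefault(s.host, s.timestamp)
        elif s.host in down_since:
            windows.append((s.host, down_since.pop(s.host), s.timestamp))
        last_seen[s.host] = s.timestamp
    # Hosts still down at the end of the log.
    for host, start in down_since.items():
        windows.append((host, start, last_seen[host]))
    return sorted(windows, key=lambda w: (w[1], w[0]))


if __name__ == "__main__":
    log = [
        Sample("esx-a", 0, True), Sample("esx-a", 60, False),
        Sample("esx-a", 120, False), Sample("esx-a", 180, True),
        Sample("esx-b", 0, True), Sample("esx-b", 120, False),
    ]
    for host, start, end in outage_windows(log):
        print(f"{host}: down from t={start} to t={end}")
```

Lining up these windows across hosts should show whether outages overlap in time or cluster on particular Leaf pairs.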

"During their blackout periods the hypervisors are not reachable from within the ACI fabric or from hosts external to the fabric. Not all hypervisors are unreachable at the same time but there is no clear pattern as to which ones lose connectivity."

From your explanation, it seems there is no VMM integration. Can you shed some light on the connectivity between the hypervisors and the ACI fabric? Are they part of the same management network?

It would help if you could draw and share connectivity diagrams.

Best Wishes to do the best!

AshSe

Lamont Bullock · ‎02-12-2025

AshSe,

Thank you for responding to my post. I will try to answer your questions as thoroughly as possible. Please see my answer next to each question.

To better understand your scenario, may I ask you a few questions:

Frequency and Pattern of Disconnects:

How often do the disconnects occur? Is there a specific time interval or pattern (e.g., every few hours, during peak traffic, etc.)? The disconnects are happening all day long.
How long do the disconnects typically last before the hypervisors reconnect to vCenter? The disconnect times are random. Some hypervisors disconnect for a short period and then reconnect, while others stay disconnected for as long as 30 minutes. I think in some cases our VM team is able to manually initiate a reconnect and it succeeds, but other times it does not.
Are the disconnects happening during specific events, such as backups, vMotion, or other high-traffic activities? No specific activities can be identified. They are occurring throughout the day.
Scope of the Issue:

Are all hypervisors affected at some point, or are there specific hypervisors that are more prone to disconnects? The hypervisors that are experiencing the disconnects seem to be the ones we migrated from an older tenant to the current one. The hypervisors that were built in the new tenancy don't experience the disconnects.
Are the disconnects isolated to a specific Leaf switch pair, or do they occur across multiple Leaf switches in the fabric? Leaf switches spread throughout the fabric are having the issue. The Leaf switches with hypervisors built in the new tenancy are not having the problem. All the hypervisors share the same Bridge Domains, EPGs and policies too.
Are there any other devices or endpoints in the same EPG/subnet as the hypervisors that are experiencing similar connectivity issues? My understanding is that only the hypervisor management network is impacted by the disconnects. As a secondary symptom, the storage networks the hypervisors mount to are showing some odd behavior, but it was explained to me that this is because the hypervisors are disconnecting from vCenter.
Impact on Other Networks:

You mentioned that other networks (e.g., Fault Tolerance, vMotion, Storage) are not experiencing issues. Are these networks on the same Bridge Domain or a different one? The Fault Tolerance, vMotion and Storage networks are not experiencing periods of unreachability. They are all on different Bridge Domains. One thing of note is that each of these Bridge Domains has multiple subnets associated with it. For example, the VM Management Bridge Domain has 4 subnets and each subnet has an EPG associated with it. Fault Tolerance also has 1 BD and 4 EPGs. This is because early on we had different clusters of hypervisors, and we wanted to consolidate them into one BD and EPG. We grouped the EPGs for the different clusters into a single BD so we can re-IP them into the primary BD. It's the Primary VM Management subnet that is having the disconnect issues.
Are there any shared resources (e.g., firewalls, load balancers, or external routers) that could be impacting only the management network? No
Endpoint Behavior:

When the hypervisors are unreachable, does the ACI Endpoint Tracker show them as "Learned" on the correct Leaf switch and port? Yes, but they are not pingable from inside or outside the ACI fabric.
Are there any signs of endpoint flapping (frequent re-learning of the endpoint) in the ACI fabric during the disconnects? I haven't checked for this, but I will now. Thanks for the tip.
Are there any duplicate IP or MAC address conflicts in the fabric that could be causing intermittent connectivity issues? I haven't checked for this, but I will now. Thanks for the tip.
Bridge Domain and Subnet Configuration:

Is the "Primary Subnet" flag enabled for the problem subnet? If so, is it necessary for this setup, or could it be causing unintended behavior? The Primary subnet box is checked. The VM team remotely kickstarts hypervisors using this network, and since there are multiple subnets associated with the BD we have to check the box for DHCP relay to work. Come to think of it, it may have been when we checked this box that connections started to flake in and out. We will test this theory. Thanks again.
Are the subnets in the Bridge Domain configured with "Unicast Routing" enabled or disabled? If enabled, is there a specific external L3Out or routing configuration tied to this Bridge Domain? Unicast routing is enabled and the BD is associated with an L3Out.
Are there any overlapping IP subnets between this Bridge Domain and other parts of the network? No
Leaf Switch Behavior:

When the hypervisors are unreachable, are there any relevant logs or faults on the Leaf switches they are connected to (e.g., interface errors, drops, or policy misconfigurations)? Nothing we have been able to see.
Are the Leaf switch interfaces configured with the correct AEP and policy groups (e.g., VLAN encapsulation, MTU, etc.)? Yes, we triple-checked. The AEPs are pushing a lot of EPGs, so we checked that the encapsulations, trunk mode and EPG names are the same. We checked for duplicate encapsulations and names too.
Are there any signs of high CPU or memory utilization on the Leaf switches during the disconnects? No.
vCenter and Hypervisor Configuration:

Are the hypervisors configured with static IPs for the management network, or are they using DHCP? If DHCP, could there be lease renewal issues? The Hypervisors have static IPs after being kickstarted.
Are there any recent changes to the vCenter configuration, such as updates, patches, or changes to the virtual distributed switch (VDS) settings? I'll check with the VM team on this. Things were stable for a while after our tenancy migration and then one day things went haywire. I thought all the HVs were patched right after the migration but I'll check to see if another update was rolled out.
Are the hypervisors running the same version of ESXi, or are there version mismatches that could be contributing to the issue? Yes, all running the same version.
External Connectivity:

Are there any firewalls, ACLs, or contracts in the ACI fabric that could intermittently block traffic to/from the management network? No, all EPGs are using the same contracts.
Is there any external routing or NAT involved for the management network? If so, could there be issues with ARP or routing table updates? No
Recent Changes:

Were there any changes made to the ACI fabric, vCenter, or hypervisor configuration around the time the issue started (e.g., firmware upgrades, policy changes, new subnets added)? From what I can tell, no network changes were made, except that the Primary subnet flag may have been added to resolve a DHCP relay issue around that time. So I will check into this further.
Were there any changes to the physical network, such as cabling, switch replacements, or new devices added to the fabric? Not when the problem started. Changes have been made since.
Monitoring and Troubleshooting:

Have you checked the ACI fabric logs (e.g., faults, events, audit logs) for any anomalies or errors related to the affected hypervisors or EPGs? I didn't see any ACI logs that stuck out as being related to this problem. There are a lot of faults that existed since the migration that I am still trying to understand and resolve but they were present before this problem occurred.
Have you reviewed the vCenter logs for any errors or warnings during the disconnect periods? The VM team was going to look into those. I will have to follow up.
Have you performed packet captures on the Leaf switch interfaces or the hypervisors to identify any dropped or malformed packets? No. Unfortunately packet captures are prohibited in this environment without approval from way way way high up. Time and time again we are able to get to the bottom of things with a packet capture, but it is such a pain to get approval.

Thanks again for the great questions and tips. Your questions have given me some points to look into. Have a good night.

Lamont

Vendan · ‎02-27-2025

I am also encountering the same issue. If you happen to find a solution, please let me know. Thank you.

meryjane97132 · ‎02-27-2025

This looks like a possible issue with endpoint learning or an ARP/ND timeout mismatch. Have you checked the aging timers on the bridge domain and whether dynamic endpoint movement is causing inconsistencies? It might be worth looking into contract enforcement as well.
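
To make the endpoint movement check concrete, here is a rough sketch that counts how often each MAC changes location in a list of learn events. It assumes you can transcribe or export the events into simple records yourself. The `Learn` tuple, the "leaf/port" location labels, and the threshold of 3 are all hypothetical and not an ACI data format or an ACI default.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, NamedTuple


class Learn(NamedTuple):
    """One endpoint learn event (hypothetical record layout)."""
    timestamp: float
    mac: str
    location: str  # illustrative label such as "leaf101/eth1/5"


def count_moves(events: Iterable[Learn]) -> Dict[str, int]:
    """Count location changes per MAC, in timestamp order."""
    moves: Dict[str, int] = defaultdict(int)
    last_location: Dict[str, str] = {}
    for e in sorted(events, key=lambda e: e.timestamp):
        prev = last_location.get(e.mac)
        if prev is not None and prev != e.location:
            moves[e.mac] += 1
        last_location[e.mac] = e.location
    return dict(moves)


def flapping(events: Iterable[Learn], threshold: int = 3) -> List[str]:
    """Return MACs that moved at least `threshold` times (arbitrary cutoff)."""
    return sorted(m for m, n in count_moves(events).items() if n >= threshold)


if __name__ == "__main__":
    events = [
        Learn(1, "aa:aa", "leaf101/eth1/5"),
        Learn(2, "aa:aa", "leaf102/eth1/5"),
        Learn(3, "aa:aa", "leaf101/eth1/5"),
        Learn(4, "aa:aa", "leaf102/eth1/5"),
        Learn(1, "bb:bb", "leaf103/eth1/7"),
        Learn(5, "bb:bb", "leaf103/eth1/7"),
    ]
    print(flapping(events))
```

A MAC that bounces between two unrelated locations would also hint at a duplicate address rather than a genuine move.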