Endpoint communication breaks randomly between a small group of EPs

sdavids5670 · ‎08-17-2021

A user reached out to explain that one system reported losing connectivity to a couple of other systems. When he logs into the system that generated the alarm (via RDP) and runs pings and traces to the other systems, they work. Ten minutes later, they stop working again. All three systems are connected to the ACI fabric, but the one system (the one he RDPs into) communicates inter-pod with the other two.

All of the systems are virtual machines. All of the systems sit behind UCS FIs, which are connected to leaf switches via vPC (the FIs run in end-host mode, I believe). When he runs his traceroute and it works, I notice that the first two hops that come back are the same gateway address. When communication breaks and he runs the traceroute again, the first hop comes back and then nothing else after that.

The first question I have is why the same gateway address returns on the first two hops of the trace. That seems kind of weird, and I didn't expect to see it. Does it have something to do with pervasive gateway? The next question is what kind of commands, if any, I can start running at the CLI of the spines, leaves, or IPN switches that could tell me (historically) what might have happened to a particular endpoint (movement of an endpoint, an endpoint leaving the fabric, an endpoint joining the fabric, etc.). Is there something I can look at in the ACI GUI?

The problem I have described is intermittent and has been going on for a while. The team that looks at pcaps to investigate these types of application issues has had a hard time tracking it down because, by the time they get involved, the problem has cleared up. I'm very green with ACI and understand it at a 30,000 ft level, but I'm pretty lost on the operational, down-in-the-weeds aspects, so any suggestions would be appreciated.
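Since the problem clears before anyone can capture it, one option is to leave a small probe running on the system that raises the alarm and record exactly when reachability flips. The following is only a minimal sketch under stated assumptions: the function names (`ping_once`, `state_changes`, `watch`) and the address `192.0.2.10` are hypothetical, and the exit code of the system `ping` is treated as a rough reachability signal.

```python
import subprocess
import sys
import time
from datetime import datetime, timezone


def ping_once(host, timeout_s=5):
    """Send one ICMP echo with the system ping; True if it exited with 0.

    Assumption: exit code 0 means "reply received". On Windows an
    "unreachable" reply can also exit with 0, so treat this as a rough signal.
    """
    count_flag = "-n" if sys.platform.startswith("win") else "-c"
    try:
        result = subprocess.run(
            ["ping", count_flag, "1", host],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout_s,
        )
    except (subprocess.TimeoutExpired, OSError):
        return False
    return result.returncode == 0


def state_changes(samples):
    """Reduce (timestamp, reachable) samples to the points where the state flips."""
    changes = []
    previous = None
    for stamp, reachable in samples:
        if reachable != previous:
            changes.append((stamp, reachable))
            previous = reachable
    return changes


def watch(host, probe=ping_once, interval_s=5, rounds=3, sleep=time.sleep):
    """Probe `host` `rounds` times and return only the state transitions."""
    samples = []
    for _ in range(rounds):
        stamp = datetime.now(timezone.utc).isoformat()
        samples.append((stamp, probe(host)))
        sleep(interval_s)
    return state_changes(samples)
```

Calling something like `watch("192.0.2.10", rounds=720)` would cover roughly an hour at the default interval, and the returned timestamps give the pcap team, or whoever checks endpoint history on the fabric, a narrow window to look at.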

Thanks

Sergiu.Daniluk · ‎08-17-2021

Hi @sdavids5670

When he logs into the system that generated the alarm (via RDP) and runs pings and traces to the other systems, they work. Ten minutes later, they stop working again.

Sounds like a "silent host" type of problem. How is the BD configured for the problematic devices? Try changing it to ARP Flooding enabled and L2 Unknown Unicast set to Flood.

Stay safe,

Sergiu

sdavids5670 · ‎08-18-2021

Sergiu,

Thanks for the help. Here are the settings on the BD in which the problem systems reside:

L2 Unknown Unicast = "Flood"

ARP Flooding = Checked/Enabled

These are the BD settings for the system the user RDPs into (the one from which traceroute and ping are run to troubleshoot), i.e., the one that detects the loss of communication with the other two:

L2 Unknown Unicast = "Hardware Proxy"

ARP Flooding = Unchecked/Disabled
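For anyone comparing these settings across more than a couple of BDs, the check can be sketched as below. This is only an illustration under assumptions: the BD names are made up, the field names are modeled loosely on the `unkMacUcastAct` and `arpFlood` attributes of the APIC `fvBD` class, and "hardware proxy with ARP flooding disabled" is used as a rough heuristic for a silent-host-prone BD, not as a definitive diagnosis.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BridgeDomain:
    name: str               # hypothetical label, not a name from this thread
    unk_mac_ucast_act: str  # "flood" or "proxy" (hardware proxy)
    arp_flood: bool         # True when ARP Flooding is checked


def silent_host_risk(bd):
    """Heuristic: hardware proxy plus no ARP flooding relies on the fabric
    already knowing the endpoint, which a silent host may not satisfy."""
    return bd.unk_mac_ucast_act == "proxy" and not bd.arp_flood


# The two BDs described above, with placeholder names.
bds = [
    BridgeDomain("bd-problem-systems", "flood", True),
    BridgeDomain("bd-rdp-system", "proxy", False),
]

flagged = [bd.name for bd in bds if silent_host_risk(bd)]
print(flagged)
```

With the values from this post, only the BD of the system used for RDP matches the heuristic; the BD of the two problem systems already has the flooding settings that were suggested.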

Regards,

Steven