L2 Bridge Domain failure during upgrade


Can anyone help me understand why a recent upgrade of our ACI platform caused an outage on an ESXi system?


Infrastructure setup (see attached diagram)


The ACI platform has NetApp storage connected to one pair of leaf switches and an ESXi host connected to another pair.

The connectivity between the two is via an L2 bridge domain (unicast routing disabled) with flooding enabled.


Both the NetApp and the ESXi host are multi-homed, but NOT in a vPC (they were migrated from a brownfield environment).

Also, the NetApp is configured to normally use its connection to Leaf 3, and the ESXi host to use its connection to Leaf 1.


All switches are Gen2 EX


During a recent upgrade of the platform, Spine 1, Leaf 1 and Leaf 3 were put into the same maintenance group (ODD switches) and were upgraded first.

During the upgrade, the ESXi host lost connectivity to the NetApp for approximately 14 minutes.


I know that both the NetApp and the ESXi host send gratuitous ARPs, which in a non-ACI network would flood the changes end to end, but how does this process work in an ACI L2 bridge domain?


I also know that in ACI the L2 bridge domain distributes MAC endpoint information via the bridge-domain-specific multicast tree, and that the leaf switches age out the MACs they have learnt based on the aging timers for local and remote endpoints (900 and 300 seconds respectively).


So when the ODD switches reloaded as part of the upgrade, why did the traffic not move to Leaf 2 and Leaf 4?


My thoughts so far:


I am thinking that for some reason the local MAC endpoints were not updated by the gratuitous ARP and the 14-minute outage was due to the local MAC endpoint aging timer. But why was this entry not updated by the gratuitous ARP and distributed over the multicast tree to the remaining leafs?
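As a rough sanity check on the timer theory, the sketch below compares the observed outage with the two aging timers quoted above. It is only arithmetic on the figures in this post (the 14 minutes is approximate), and the function and variable names are made up for illustration; nothing here is read from the fabric.

```python
# Values quoted in this post; the outage duration is approximate.
LOCAL_EP_AGING_S = 900    # local endpoint aging timer (seconds)
REMOTE_EP_AGING_S = 300   # remote endpoint aging timer (seconds)
OUTAGE_MIN = 14           # observed outage (minutes, approximate)


def compare_outage_to_timers(outage_min, local_s, remote_s):
    """Return the outage in seconds and its offset from each aging timer."""
    outage_s = outage_min * 60
    return {
        "outage_s": outage_s,
        "delta_vs_local_s": outage_s - local_s,
        "delta_vs_remote_s": outage_s - remote_s,
    }


result = compare_outage_to_timers(OUTAGE_MIN, LOCAL_EP_AGING_S, REMOTE_EP_AGING_S)
print(result)
```

On these numbers the outage is about 60 seconds shorter than the local timer and 540 seconds longer than the remote one, so the arithmetic alone neither confirms nor rules out the local aging timer as the cause.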



Alternatively, the local MAC endpoints were not updated by the gratuitous ARP and the 14-minute outage was simply due to the reload time of the leafs.



I have now seen the option in the bridge domain to "Clear Remote MAC Entries", which is currently set to off. But the online help says it is for vPCs, so I am not sure whether it is relevant to my setup or whether it would even work during a leaf reload.



I have also considered that when the ODD switches went down, their fabric IS-IS connections should have gone down in approximately 30 seconds.

This in turn, I assume, should have taken down the remote endpoint tunnels on the remaining leafs, as they would no longer have a valid route for the tunnel. At that point I assume the leaf would flush the endpoints associated with the tunnel, and new connections could then be set up via flooding.
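To make that assumed sequence concrete, here is a toy Python model of it. It only encodes the behaviour I am assuming (tunnel goes down, the endpoints learnt via it are flushed, the next frame is flooded); it is not ACI code and says nothing about what the fabric really does. The `LeafModel` class, the leaf names and the MAC address are all hypothetical.

```python
class LeafModel:
    """Toy model of a leaf's remote endpoint table. Not ACI code."""

    def __init__(self):
        # MAC address -> name of the leaf at the far end of the tunnel
        self.remote_endpoints = {}

    def learn_remote(self, mac, via_leaf):
        """Record that `mac` is reachable through the tunnel to `via_leaf`."""
        self.remote_endpoints[mac] = via_leaf

    def tunnel_down(self, leaf):
        """Assumed behaviour: flush every remote endpoint learnt via `leaf`."""
        self.remote_endpoints = {
            mac: via
            for mac, via in self.remote_endpoints.items()
            if via != leaf
        }

    def forward(self, dst_mac):
        """Unicast if the MAC is known, otherwise flood in the bridge domain."""
        via = self.remote_endpoints.get(dst_mac)
        return f"unicast via {via}" if via else "flood in bridge domain"


NETAPP_MAC = "00:00:5e:00:53:01"  # hypothetical MAC, for illustration only

leaf2 = LeafModel()
leaf2.learn_remote(NETAPP_MAC, "leaf3")
before = leaf2.forward(NETAPP_MAC)   # entry still points at Leaf 3
leaf2.tunnel_down("leaf3")           # Leaf 3 reloads, tunnel assumed down
after = leaf2.forward(NETAPP_MAC)    # entry flushed, so the frame is flooded
print(before, "->", after)
```

If the fabric behaved like this model, the stale entry would disappear as soon as the tunnel went down, and the outage would have been much shorter than 14 minutes. That is why I suspect the flush did not happen the way I assumed.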


Any insight into this problem would be greatly appreciated.


Thanks in advance




