L2 Bridge Domain failure during upgrade

TJPETERS · ‎06-11-2019

Hi,

Can anyone help me understand why a recent upgrade of our ACI platform caused an outage on an ESXi system?

Infrastructure setup (see attached diagram)

The ACI platform has NetApp storage connected to one pair of leaf switches and an ESXi host to another pair.

The connectivity between the two is via an L2 bridge domain (Unicast routing disabled) with flooding enabled.

Both the NetApp and ESXi hosts are multi-homed, but NOT in a vPC (they were migrated from a brownfield environment).

Also, the NetApp is configured to normally use its connection to Leaf 3 and the ESXi to use its connection to Leaf 1.

All switches are Gen2 EX

During a recent upgrade of the platform, Spine 1, Leaf 1 and Leaf 3 were put into the same maintenance group (ODD switches) and were upgraded first.

During the upgrade the ESXi host lost connectivity to the NetApp for approximately 14 minutes.

I know that both the NetApp and the ESXi send gratuitous ARPs, which in a non-ACI network would flood the changes end to end, but how does this process work in an ACI L2 Bridge Domain?

I also know that in ACI the L2 Bridge Domain distributes MAC endpoint information via the Bridge Domain specific multicast tree, and that the leaf switches age out the MACs they have learnt based upon the aging timers for local and remote endpoints (900 and 300 seconds respectively).

So when the ODD switches reloaded as part of the upgrade, why did the traffic not move to leaf switches 2 and 4?

So far

1.

I am thinking that for some reason the Local MAC Endpoints were not updated by the gratuitous ARP and the 14 minute outage was due to the Local MAC Endpoint aging timer, but why was this Entry not updated by the gratuitous ARP and distributed over the multicast tree to the remaining leafs?

1.1

I am thinking that for some reason the Local MAC Endpoints were not updated by the gratuitous ARP and the 14 minute outage was due to the reload time of the leafs.

2.

I have now seen the option in the Bridge Domain to "Clear Remote MAC Entries", which is currently set to off. But the online help says it is for vPCs, so I am not sure whether it is relevant to my setup or whether it would even work during a leaf reload.

3.

I have also considered that when the ODD switches went down, their FABRIC ISIS connections should have gone down in approx 30 seconds.

This in turn, I assume, should have taken down the Remote Endpoint tunnels on the remaining leafs, as the remote leaf would no longer have a valid route for the tunnel. At that point I assume the leaf would flush the endpoints associated with the tunnel, and new connections could then be set up via flooding.
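The timer-based hypotheses above can be sanity-checked with simple arithmetic. The sketch below only compares the figures quoted in this post (a roughly 14 minute outage, 900 and 300 second aging timers, and roughly 30 seconds for fabric IS-IS); the variable names are illustrative and it does not model actual ACI behaviour.

```python
# Figures quoted in the post; names are illustrative only.
OUTAGE_S = 14 * 60  # observed outage, approximately 14 minutes

timers_s = {
    "local endpoint aging": 900,   # local endpoint aging timer
    "remote endpoint aging": 300,  # remote endpoint aging timer
    "fabric IS-IS down": 30,       # approximate IS-IS detection time
}

# Signed gap between the observed outage and each candidate timer.
gaps = {name: OUTAGE_S - timer for name, timer in timers_s.items()}

for name, gap in gaps.items():
    print(f"{name}: timer={timers_s[name]}s, outage - timer = {gap:+d}s")
```

On these numbers the outage (840 seconds) is within about a minute of the local endpoint aging timer and far longer than the other two, which is why hypothesis 1 looks plausible. Since the 14 minutes is itself approximate, it does not rule out hypothesis 1.1 (leaf reload time).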

Any insight to this problem would be greatly appreciated.

Thanks in advance

Trev

skaraim · ‎07-25-2019

Hi,

did you enable GARP based detection?

https://www.cisco.com/c/en/us/solutions/collateral/data-center-virtualization/application-centric-infrastructure/white-paper-c11-739989.html#_Toc529820931

stcorry · ‎07-25-2019

Hello!

Firstly, if this was an urgent issue or if it was due to a bug, it's best to open a TAC case to determine root cause, so that we can track the reason in our system and possibly fix it if needed.

Secondly, can you show all the settings on the BD? A screenshot of both the Policy > General, and Policy > L3 Configurations tabs.

You are right: in a non-ACI network, these packets should be flooded. In ACI, with the correct bridge domain settings, the fabric will also flood end to end, with no proxying, across all ports that are up and in the BD, essentially removing any special behavior.

Are both of these hosts configured in the same EPG? If so, can you please post the configuration of the EPG(s) as well.

What was the hosts' behavior at this point?

Were you still able to access the ESX host or NetApp host through OOB to see whether their failover was working properly?

What is the host failover configuration, since you mention that they are not in a vPC? Did you see any faults or anything else for the IP or MAC associated with the EP?

Did you see if the EPs still showed up in their expected EPG(s)?

TJPETERS · ‎07-25-2019

Hi stcorry,

Firstly thanks for replying.

This was not particularly urgent and I was not sure if it was just my understanding of how ACI L2 Bridge Domains function, so I did not think it warranted a TAC case.

Also, I knew I probably didn't have enough information to really justify opening a TAC case, as you will see below.

The main reason for the post was to see if there was something glaringly obvious to someone else: that I may have configured things incorrectly or misunderstood BUM traffic in an L2 BD.

Or maybe I was too optimistic in trying to upgrade all of our ODD leafs and spines at the same time.

I am planning to do another upgrade, and if I experience a failure again, I will raise a TAC case for it.

But before attempting another upgrade, I would like to apply any changes that come out of this post, as they may resolve the issue previously experienced, and to work out in advance what information I would need to capture should the same issue occur.

I have attached a PDF, with both the BD and EPG screenshots, as the hosts are in the same EPG.

RE:- What was the hosts' behavior at this point?

As far as we can tell, both the NetApp and the ESXi host failed over to their alternate uplinks (on different leaf switches), but the ESXi could not mount its NFS storage.

RE:- Were you able to still access the ESX host or NetApp host through an OOB to see their failover was working properly?

This was not attempted, as the change was done out of hours and the NetApp and VMware engineers were not part of the change, though they did look at logs the following working day.

RE:- What is the configuration of the host failover configuration since you mention that they are not in a vPC. Did you see any faults or anything for the IP or MAC associated with the EP?

As per the original post and diagram, the hosts are dual-homed to different leaf switches and are operating in standby mode, with failover based upon link failure (as I understand it).

From the logs, the NetApp and VMware engineers believe their equipment failed over as expected.

RE:- Did you see if the EPs still showed up in their expected EPG(s)?

I did spot some EPs disappear from another L2 BD (a different VLAN to the same NetApp and ESXi hosts), but this could have been the individual VMs failing on the ESXi host, which had lost its storage connectivity.

By this point in my original change, the switches had probably rebooted and the issues had disappeared, so further analysis was limited to the VM and ESXi logs.

regards

Trev

stcorry · ‎07-25-2019

You could enable GARP based detection on the BD under L3 Configurations, like the other poster mentioned. You did mention that the EPs GARP when they fail over. Is it possible to test with a single ESX host prior to the upgrade?
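For anyone who prefers to script this rather than use the GUI, here is a minimal sketch of what the change might look like as an APIC REST payload. It assumes the bridge domain class is `fvBD` and that GARP based detection corresponds to the attribute `epMoveDetectMode` with value `garp`; please verify both with the APIC API Inspector before relying on it. The BD name `BD_NFS` is hypothetical, and the sketch only builds and prints the JSON; it does not contact an APIC.

```python
import json


def bd_garp_payload(bd_name: str) -> dict:
    """Build a JSON body that would set GARP based detection on a BD.

    Assumption: class fvBD, attribute epMoveDetectMode="garp".
    Verify against your APIC version before use.
    """
    return {
        "fvBD": {
            "attributes": {
                "name": bd_name,              # hypothetical BD name
                "epMoveDetectMode": "garp",   # assumed attribute/value
            }
        }
    }


payload = bd_garp_payload("BD_NFS")
print(json.dumps(payload, indent=2))
```

A body like this would normally be posted under the tenant that owns the BD; the exact URL and authentication steps are left out here because they depend on your environment.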

I would also say to use the ACI App Enhanced Endpoint Tracker to get a better idea of what's happening with the EPs during the failover and when they time out, etc.

TJPETERS · ‎07-26-2019

Hi stcorry,

I did not think the GARP option applied in this case, as I understood it to be for cases where the MAC address changes while remaining on the same link. Here the MAC is instead being moved as is to a different link on a different leaf.

I will add the GARP option to my next upgrade anyway, see what happens, and report back. Note this may take a few weeks to get done.

regards

Trev

pille1234 · ‎07-27-2019

Hi!

Do you have "Graceful Maintenance" enabled for the leaf maintenance group by chance?

Regards

telliott2 · ‎07-27-2019

I had a similar issue. I was able to fix it by turning on port tracking. It is under System > System Settings. What I found was that, when rebooting, the ports on the leafs would come up and the ESXi host would start sending traffic before the switch was ready to receive it. Port tracking delays this so the switch has time to come online.

TJPETERS · ‎07-30-2019

Thanks for the suggestions of Graceful Maintenance and port tracking.

I know these are not enabled. I will look at adding them for my next maintenance window.

Trev