Nexus 93180YC VMware vMotion timeouts

Olof Wiking · 04-10-2017

Hi

We recently installed our second HPE BLC7000 blade server chassis. We have six ESXi G9 blade servers that we want to balance between the two chassis. We have moved one of the servers to the new chassis and it works as it should. But when we vMotion a virtual machine between the chassis, its network traffic is disrupted, sometimes for up to 3 minutes. vMotion of virtual machines within the same HPE BLC7000 chassis works without problems.

Each of the 2 HPE BLC7000 chassis has 2 Virtual Connect FlexFabric switches, which are connected to 2 Cisco Nexus 93180YC switches through port-channels running vPC.

When we vMotion a virtual machine from an ESXi host in one chassis to the other, we see that it takes time before the MAC address of that virtual machine is updated to the correct port-channel on the Nexus switches. Our guess is that we have an ARP table that needs to be updated.
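To see where each Nexus currently places a VM's MAC, the MAC tables of both vPC peers can be compared. Below is a rough sketch of that comparison, assuming the tables have already been collected into plain dictionaries. The MAC addresses, VLAN and port-channel names are made up for illustration, parsing the real switch output is left out, and it assumes every host sits behind a vPC port-channel (so both peers should agree).

```python
def find_mac_mismatches(sw1_table, sw2_table):
    """Return entries where the two vPC peers disagree about a MAC.

    Each table maps (vlan, mac) -> interface name. Assumption: every MAC
    lives behind a vPC port-channel, so both peers should show the same
    port-channel for it.
    """
    mismatches = []
    for key in sorted(set(sw1_table) | set(sw2_table)):
        port1 = sw1_table.get(key)  # None means "not learned on SW1"
        port2 = sw2_table.get(key)  # None means "not learned on SW2"
        if port1 != port2:
            mismatches.append((key, port1, port2))
    return mismatches


# Hypothetical snapshots taken right after a vMotion.
sw1 = {(100, "0050.56aa.bbcc"): "Po12", (100, "0050.56aa.bbdd"): "Po12"}
sw2 = {(100, "0050.56aa.bbcc"): "Po11", (100, "0050.56aa.bbdd"): "Po12"}

for (vlan, mac), p1, p2 in find_mac_mismatches(sw1, sw2):
    print(f"VLAN {vlan} {mac}: SW1={p1} SW2={p2}")
```

Running this repeatedly during a vMotion would show how long one peer keeps pointing at the old port-channel.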

In VMware we have "Notify switches" set to "Yes"; it was the default as far as we know. The ESXi host should therefore send a gratuitous ARP or RARP to update the ARP data on the nearest switch. Could it be, though, that the nearest switch in this case is the Virtual Connect FlexFabric and that it doesn't update the Nexus switches?
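For reference, here is a minimal sketch of what such a notification frame could look like. It assumes the notification is a broadcast RARP request whose source MAC is the VM's MAC, which is our understanding of how "Notify switches" behaves. The MAC address is a made-up example and padding to the minimum Ethernet frame size is omitted.

```python
import struct

BROADCAST = bytes.fromhex("ffffffffffff")
ETHERTYPE_RARP = 0x8035  # Reverse ARP
RARP_REQUEST = 3         # "request reverse" opcode


def build_rarp_notify(vm_mac: str) -> bytes:
    """Build a broadcast RARP request sourced from the VM's MAC address."""
    mac = bytes.fromhex(vm_mac.replace(":", ""))
    if len(mac) != 6:
        raise ValueError("expected a 6-byte MAC address")
    # Ethernet header: destination, source, EtherType.
    eth_header = BROADCAST + mac + struct.pack("!H", ETHERTYPE_RARP)
    rarp_body = struct.pack(
        "!HHBBH6s4s6s4s",
        1,               # hardware type: Ethernet
        0x0800,          # protocol type: IPv4
        6, 4,            # hardware / protocol address lengths
        RARP_REQUEST,
        mac, bytes(4),   # sender MAC, sender IP 0.0.0.0
        mac, bytes(4),   # target MAC, target IP 0.0.0.0
    )
    return eth_header + rarp_body


frame = build_rarp_notify("00:50:56:aa:bb:cc")  # hypothetical VM MAC
print(len(frame), frame[6:12].hex(":"))
```

The point is that a switch learns the VM's new location from the source MAC of this frame, so every switch in the path has to see it and be willing to relearn the address.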

We have read that there has been an old bug in Cisco Nexus switches regarding this and vPC. Our Nexus switches are running version 7.0(3)I5(1).

I know other companies that have a configuration similar to ours, e.g. running HPE C7000 chassis with HPE Gen9 servers as ESXi hosts and Nexus switches, and they don't have any issue with this.

Below is an example of one of the port-channels from the Nexus switch to the HPE BLC7000 chassis.

interface port-channel12
  description BLC7000 VC1
  switchport
  switchport mode trunk
  vpc 12

Thanks.

Olof

jdevyor · 09-18-2017

Hi. Having a similar issue. Did you ever get this resolved?

Wes Austin · 09-18-2017

Can you confirm the NIC teaming preference being used on the vSwitch?

I have experience with Cisco UCS, not HP, but based on the described behavior, this could be the culprit (not selecting the correct NIC teaming load-balancing algorithm).

NIC teaming in ESXi and ESX (1004088)

Rick1776 · 10-03-2017

Are you using LACP? Also, which hashing algorithm are you using on the switch?

https://www.cisco.com/c/en/us/td/docs/switches/datacenter/nexus9000/sw/7-x/interfaces/configuration/guide/b_Cisco_Nexus_9000_Series_NX-OS_Interfaces_Configuration_Guide_7x/b_Cisco_Nexus_9000_Series_NX-OS_Interfaces_Configuration_Guide_7x_chapter_0111...

Balaji Rajan · 10-07-2019

We observed similar issues with an ESXi cluster on Dell hosts connected to Nexus 9000 93180YC-EX switches (normal trunk ports on switches that are configured as a vPC pair).

ESXi uses a distributed switch.

During vMotion between hosts in this cluster, the MAC address is learned only by the directly connected switch; the vPC peer switch doesn't learn it. If the MAC was previously learned via the second peer switch, the old entry in the MAC table is not overwritten (the learning path is not updated by the RARP received via the peer-link).

The same behavior occurs when doing a vMotion to another VM cluster.

On the very same switches, vMotion on the second VM cluster works perfectly: the MAC address is learned on the directly connected switch and also on the peer switch.

vMotion of Windows and Linux guests shows the same behavior. We captured RARP packets from Cluster A and Cluster B; the packets received on SW1 and SW2 are nearly identical.

Issue combination:

NXOS: 7.0(3)I6(2)
Cisco Nexus9000 93180YC-EX
VMware vSphere ESXi 6.5

Still investigating this issue!!!

Packet Herder7 · 10-03-2017

Look into configuring static ARP resolution on the devices and see whether that helps.

Rick1776 · 10-03-2017

Cool, please respond to the post if that works.
Have a great day.

Rob R. · 01-02-2018

Having this issue as well.

Rick1776 · 01-04-2018

Any news on the fix you tried to apply?

sschwiet · 03-09-2018

The issue may be caused by MAC address learning on that VLAN being disabled for 120 seconds due to too many MAC moves.

Try configuring "mac address-table notification mac-move" and "logging level l2fm 5", then running "show logging log" to see if there is MAC address flapping in your network.
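To make the flapping idea concrete, here is a small sketch that counts how often a MAC changes ports within a time window. The event tuples are invented for illustration and do not reflect the actual NX-OS log format, and the 60-second window is an arbitrary assumption, not the switch's real threshold.

```python
from collections import defaultdict


def count_mac_moves(events, window_seconds=60):
    """Return, per MAC, the largest number of port changes in any window.

    `events` is a list of (timestamp_seconds, mac, port) sorted by time.
    MACs that never change port are left out of the result.
    """
    last_port = {}
    move_times = defaultdict(list)
    for ts, mac, port in events:
        # A "move" is the same MAC showing up on a different port.
        if mac in last_port and last_port[mac] != port:
            move_times[mac].append(ts)
        last_port[mac] = port

    worst = {}
    for mac, times in move_times.items():
        best, start = 0, 0
        for end, t in enumerate(times):
            # Shrink the window until it spans less than window_seconds.
            while t - times[start] >= window_seconds:
                start += 1
            best = max(best, end - start + 1)
        worst[mac] = best
    return worst


# Hypothetical observations: one MAC bouncing between two port-channels.
events = [
    (0, "0050.56aa.bbcc", "Po11"),
    (5, "0050.56aa.bbcc", "Po12"),
    (9, "0050.56aa.bbcc", "Po11"),
    (14, "0050.56aa.bbcc", "Po12"),
    (20, "0050.56aa.bbdd", "Po12"),
]
print(count_mac_moves(events))
```

A MAC with a high count in a short window is a candidate for the kind of flapping that could lead to learning being suspended on the VLAN.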