Hello,
I am not sure if this is the right forum for my question, since it is a mixture of Cisco and VMware. So if I should repost in another forum please let me know.
Anyway, here is the issue we are currently facing:
We have encountered strange behavior in our network environment while testing our new Cisco Catalyst 4500-X 10G switches. If you have a TCP session (e.g. dd | netcat) transferring data between two VMs on different ESX hosts, failover does not work on the receiving VM when a physical link fails.
The test setup is as follows:
     |    |---vmnic4 --- Switch 1 --- vmnic4---|    |
VM1--|ESX1|               |VSS|                |ESX2|--VM2
     |    |---vmnic5 --- Switch 2 --- vmnic5---|    |
Some more details:
- The switches are configured as a VSS cluster. Software release: Cisco IOS Software, IOS-XE Software, Catalyst 4500 L3 Switch Software (cat4500e-UNIVERSALK9-M), Version 03.07.00.E RELEASE SOFTWARE (fc4)
- there is no special configuration on the switchports, just plain trunk ports. No layer 3, only layer 2; traffic stays within the same VLAN:
interface TenGigabitEthernet2/1/10
switchport trunk allowed vlan 129,134
switchport mode trunk
logging event link-status
spanning-tree portfast trunk
spanning-tree bpdufilter enable
end
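(If it helps, the kind of check we can run on the VSS while the transfer is stalled would be something along these lines; the MAC address below is just a placeholder, not VM2's real one.)
! placeholder MAC for VM2's vmxnet3 NIC
show mac address-table address 0050.56aa.bbcc
show mac address-table interface TenGigabitEthernet2/1/10
show interfaces TenGigabitEthernet2/1/10 status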
- ESX1/2 are running the latest ESXi 5.5 build
- Both vmnics are active uplinks on a vSphere Distributed Switch in LBT (Load Based Teaming) mode
- PortGroups are configured with “Route based on physical NIC load”
- VM1 and VM2 are Linux guests, NICs are vmxnet3
Steps to reproduce:
1. Start listen mode on VM2 (netcat -l -p 12345 | dd of=/dev/null)
2. Start transfer on VM1 (dd if=/dev/zero bs=1M | netcat VM2 12345)
3. Check which link is used (in my example vmnic5 on ESX2; see the note after these steps for how we check this)
4. Disconnect vmnic5 of ESX2
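(Regarding step 3: one way to see which uplink the session is pinned to is the network view of esxtop on the receiving host, or esxcli; the world ID below is a placeholder.)
# on the ESXi host (ESX2 in this example)
esxtop                        # press 'n' for the network view; the TEAM-PNIC column shows which vmnic the VM port uses
# or with esxcli:
esxcli network vm list        # note the World ID of VM2
esxcli network vm port list -w <worldID>    # "Team Uplink" shows the active vmnic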
What you observe:
- the dd traffic dies immediately (see the capture sketch right after this list)
- if you kill dd on both machines (it does not end by itself) and start over, the new session works immediately
- if you ping VM2 from VM1 or from any other system, it works
- if you plug vmnic5 back in, the VM is failed over to vmnic5 again -> if you haven’t killed dd, the transfer continues
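(The capture mentioned above is nothing fancy; inside VM2 something like the following shows whether the sender's packets still reach the guest after the link pull - the interface name is just an example.)
# inside VM2 (receiving guest)
tcpdump -ni eth0 'tcp port 12345'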
Now test the other way round:
1. Start listen mode on VM2 (netcat -l -p 12345 | dd of=/dev/null)
2. Start transfer on VM1 (dd if=/dev/zero bs=1M | netcat VM2 12345)
3. Check which link is used (in my example vmnic4 on ESX1)
4. Disconnect vmnic4 of ESX1
What you observe:
- Failover works like a charm
- Failback works like a charm
Conclusion:
- Only the VM on the receiving side has a problem
- Only an already established TCP session misbehaves; new sessions work fine (as does ping, with no packets lost)
- If you use LACP instead of LBT on the uplinks, the problem is gone (see the sketch after this list)
- Same problem if you connect both VM links to Switch 1
- Same problem if you use a physical Linux box with LACP as the sending system
- I did the same test with two Catalyst 3750-X switches (no VSS cluster) and did not encounter the problem
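(For reference, the LACP variant is nothing special, just a multi-chassis EtherChannel on the VSS plus a matching LACP LAG on the vDS uplinks; the snippet below is only a rough sketch, channel and interface numbers are placeholders.)
! sketch only - numbers and interfaces are placeholders
interface TenGigabitEthernet1/1/10
 channel-group 10 mode active
!
interface TenGigabitEthernet2/1/10
 channel-group 10 mode active
!
interface Port-channel10
 switchport mode trunk
 switchport trunk allowed vlan 129,134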
Further observations:
- if, instead of unplugging the cable from the switchport that connects to the receiving VM, we shut down the port, failover works like a charm (the shutdown test is shown after these two points)
- if, instead of unplugging the cable from that switchport, we power off the whole switch, failover does not work (same as unplugging the cable)
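(The shutdown test is simply this on the port facing the receiving host, nothing more:)
configure terminal
 interface TenGigabitEthernet2/1/10
  shutdown
! and "no shutdown" afterwards to bring the port back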
This is really mind-boggling for us, so any ideas are very much appreciated.
Thanks for reading
Markus