The company I work for is finalizing acceptance testing for a UCS + VMware + Nexus 1K installation, and we are encountering a problem.
First, a description of the infrastructure (see the attached image; the same in words below):
Northbound of the UCS we have a stack of 2x 3750s and a 6500.
The 3750 stack and the 6500 are EACH connected via one 1 Gb uplink to each of the FI modules of the UCS (Cisco 6120s). This single link is configured as a port channel (we configured it as a port channel because reconfiguring it later, once we do get 2 links per FI, would mean downtime).
At this moment we cannot provide the UCS with redundant links, so we have to work with what we've got.
Southbound of the UCS we have PALO adapters and blades.
Each blade has 4 vNICs and has ESXi 4.1 U1 installed. None of the vNICs are configured for fabric failover.
vNIC0, vNIC2 - FI-A
vNIC1, vNIC3 - FI-B
The Fabric Interconnects are configured with CDP lossless (when an uplink fails, this is signaled to the vNICs, which pass it on to ESXi).
We used 2 vNICs for the VMware vSwitch (vNIC0, vNIC1).
We used 2 vNICs for the Nexus dvSwitch (vNIC2, vNIC3).
vSwitch has 3 profiles:
Nexus VSM management (VLAN 20)
Control+packet (VLAN 10)
The vSwitch is configured for EXPLICIT failover order, with NO FAILBACK. This gives us predictable traffic patterns.
Primary Adapter for VLAN 10 and VLAN 20 - vNIC0 (meaning FI-A)
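To make the vSwitch behaviour we expect concrete, here is a minimal Python sketch of explicit failover order with failback disabled. It is only a toy model of my understanding, not VMware code; the class `ExplicitFailoverTeam` and its method names are made up for illustration.

```python
class ExplicitFailoverTeam:
    """Toy model of a vSwitch NIC team using explicit failover order."""

    def __init__(self, order, failback=False):
        self.order = list(order)                      # preferred adapter first
        self.failback = failback
        self.link_up = {nic: True for nic in self.order}
        self.active = self.order[0]

    def _first_up(self):
        # First adapter in the configured order whose link is up, if any.
        for nic in self.order:
            if self.link_up[nic]:
                return nic
        return None

    def set_link(self, nic, up):
        """Record a link state change and return the adapter now in use."""
        self.link_up[nic] = up
        if self.active is None or not self.link_up[self.active]:
            # The active adapter is gone: fail over to the next one that is up.
            self.active = self._first_up()
        elif self.failback:
            # Only with failback enabled do we return to a recovered adapter.
            self.active = self._first_up()
        return self.active


team = ExplicitFailoverTeam(["vNIC0", "vNIC1"], failback=False)
after_loss = team.set_link("vNIC0", False)    # FI-A uplink lost
after_restore = team.set_link("vNIC0", True)  # FI-A uplink restored
print(after_loss, after_restore)
```

In this model, VLAN 10 and VLAN 20 move to vNIC1 (FI-B) when FI-A loses its uplink and stay there after FI-A comes back, because failback is off.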
Nexus has VM traffic+vMotion+Fault Tolerance
Initially the Nexus was configured for MAC-pinning AUTO.
We then decided to pin traffic to the uplinks so that traffic of the same type stays on the same Fabric Interconnect (for example: vMotion traffic, control/packet traffic). We have 1 Gb uplinks and had flaky vMotion performance because of that.
We defined 2 subgroups on the Nexus (pinning ID 2 and 3).
ID 2 was mapped to vNIC2's (FI-A)
ID 3 was mapped to vNIC3's (FI-B)
We pinned Control+Packet to ID 2
We pinned vMotion+FT to ID 3
We left VM traffic unpinned
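The pinning scheme above can be written down as a small lookup table. This is only a Python sketch to summarize the mapping, not Nexus 1000V configuration syntax; the dictionary and function names are made up for illustration.

```python
# Sub-group (pinning) ID -> vNIC -> Fabric Interconnect, as described above.
SUBGROUP_TO_VNIC = {2: "vNIC2", 3: "vNIC3"}
VNIC_TO_FABRIC = {"vNIC2": "FI-A", "vNIC3": "FI-B"}

# Traffic type -> pinned sub-group ID (None means left unpinned).
TRAFFIC_PINNING = {
    "control": 2,
    "packet": 2,
    "vmotion": 3,
    "ft": 3,
    "vm": None,
}


def preferred_fabric(traffic):
    """Return the fabric a traffic type is pinned to, or None if unpinned."""
    subgroup = TRAFFIC_PINNING[traffic]
    if subgroup is None:
        return None
    return VNIC_TO_FABRIC[SUBGROUP_TO_VNIC[subgroup]]


def traffic_on(fabric):
    """List the traffic types whose pinned (preferred) path is this fabric."""
    return sorted(t for t in TRAFFIC_PINNING if preferred_fabric(t) == fabric)


print(traffic_on("FI-A"))
print(traffic_on("FI-B"))
```

So under normal conditions control and packet traffic should be on FI-A, vMotion and FT on FI-B, and VM traffic on either.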
We then did redundancy testing.
We disconnected FI-B from the core switches, half our vNICs went down, all ok though. VMs lost 1 PING.
We connected it back, again, no problem.
We disconnected FI-A from the core switches, half our vNICs went down, all ok though. VMs lost 1 PING.
We connected it back, here is what we noticed, and the problem:
The VSM lost the heartbeat to the VEMs (we saw it in the logs: "removing VEM heartbeats lost")
After the VEMs lost heartbeat, they disconnected all VMs.
VMs lost 8-10 seconds traffic
VEMs got reconnected and everything went back to normal.
For reference: we only did the whole pinning exercise after we had already configured "mac-pinning auto" on the uplink profile and seen the same problem. It was meant to give us predictable traffic patterns for troubleshooting, not only to steer traffic.
We also tried setting the VMware vSwitches to load balancing, but it didn't help (I knew it would not help, but we gave it a go anyway).
To be honest, my expectation was that disconnecting and reconnecting either FI would not cause a significant disruption in network traffic (10 seconds is enough to send Windows clusters spinning and cause other disruptions).
As far as we can tell, the root cause of the VM traffic loss is that the VEMs get disconnected from the VSM.
Does anyone have any troubleshooting tips or a solution to our problem?
I can give you more detailed configs we did, just let me know what you need.