I am using a Nexus 1000v and a FI 6248 with a Nexus 5K in a redundant architecture, and I am seeing strange behavior with VMs.
I use port profiles without any problems, but in one case I have the following issue.
I have 2 VMs assigned to the same port profile
When the two VMs are on the same ESX host, each VM can ping the gateway and the other VM. Now when I move one of the VMs to another ESX host (same chassis or not), both VMs can still ping the gateway and a remote IP, and a remote PC can ping both VMs, but the VMs are unreachable from each other.
I checked the MAC table: from the N5K it's OK, from the FI 6248 it's OK, but on the N1K I am unable to see the MAC address of either VM.
What I tried (I cleared the MAC table at each step):
Assigned the VM to another vmnic: it works.
On UCS, moved it to another vmnic: it works.
On UCS, changed the QoS policy: it works.
Reassigned the original configuration: the old behavior came back.
I checked all the trunk links; they are OK.
So I don't understand why I have this strange behavior, and how can I troubleshoot it more deeply?
I would like to avoid it if possible, but the next step will be to create a new vmnic with the same policy, then delete the old vmnic and recreate it.
No, I am still working with Cisco support, but they haven't found any problem with the configuration; everything seems correctly configured.
The next steps are to:
1 - Delete all VLANs and vmnics impacted by the problem and recreate them.
2 - Perform an upgrade of the FI and the N1K.
I will try to do it before the end of the week, but I need to do it with care because some VMs are now in production, although on a VDS architecture.
We were able to resolve the problem on the two VMs by manually changing the pinning ID on one of the VMs' veth interfaces. We still do not know why the issue occurred with just these two VMs.
From what you mentioned, here are my thoughts.
When the two VMs are on the same host, they can reach each other. This is because they're switched locally in the VEM, so it doesn't tell us much other than that the VEM is working as expected.
When you move one of the VMs to a different UCS ESX host, the path changes. Let's assume you've moved one VM to a different host, within the UCS system.
UCS-Blade1(Host-A) - VM1
UCS-Blade2(Host-B) - VM2
There are two path options from VM1 -> VM2:
1. VM1 -> Blade1 Uplink -> Fabric Interconnect A -> Blade2 Uplink -> VM2
2. VM1 -> Blade1 Uplink -> Fabric Interconnect A -> Upstream Switch -> Fabric Interconnect B -> Blade2 Uplink -> VM2
Of the two options, I've seen many instances where the first works fine but the second doesn't. Why? As you can see, option 1 has a path from Host-A to FI-A and back down to Host-B. In this path there's no northbound switching outside of UCS. This requires both VMs to be pinned to host uplinks going to the same Fabric Interconnect.
In the second option, the path goes from Host-A up to FI-A, then northbound to the upstream switch, then eventually back down to FI-B and Host-B. When this path is taken and the two VMs can't reach each other, you have a problem with your upstream switches. If both VMs reside in the same subnet, it's a Layer 2 problem; if they're in different subnets, it's a Layer 2 or Layer 3 problem somewhere north of UCS.
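To see where the Layer 2 path breaks when option 2 is in play, you can trace the VM MAC addresses hop by hop on the switches north of UCS. A sketch (the VLAN number, MAC address, and hostname below are placeholders):

```
! On each upstream switch (e.g. the N5K), check that both VM MACs
! are learned in the VLAN and point out the expected interfaces
N5K# show mac address-table vlan 100
N5K# show mac address-table address 0050.56aa.bbcc

! Verify the VLAN is actually allowed on the trunks toward both FIs
N5K# show interface trunk
```

If one VM's MAC is missing, or is learned on the wrong interface, on the switch sitting between the two FIs, that hop is where the frames are being dropped.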
So knowing this - why did manual pinning on the N1K fix your problem? Pinning forces a VM onto a particular uplink. What likely happened in your case is that you pinned both VMs to host uplinks that go to the same UCS Fabric Interconnect (avoiding northbound switching). Your original problem still exists, so you're not out of the woods yet.
Ask yourself: why are just these two VMs affected? Are they possibly the only VMs using a particular VLAN or subnet?
An easy test to verify the pinning is to use the command below, where "x" is the module number of the host the VMs are running on.
module vem x execute vemcmd show port-old
I explain the command further in another post here -> https://supportforums.cisco.com/message/3717261#3717261. In your case you'll be looking for the VM1 and VM2 LTLs, finding out which subgroup ID they use, and then which SG_ID belongs to which VMNIC.
I bet you'll find that the manual pinning "that works" takes the path from each host to the same FI. If that's the case, look northbound for your Layer 2 problem.
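As an illustration of what to look for in that output (the values below are made up, and the exact column layout varies by N1K version):

```
n1000v# module vem 4 execute vemcmd show port-old
  LTL   VSM Port  Admin Link  State  SGID  Vem Port
   17     Eth4/2    UP   UP    FWD      0  vmnic1
   18     Eth4/3    UP   UP    FWD      1  vmnic2
   49     Veth10    UP   UP    FWD      0  VM1.eth0
```

Here Veth10 (the VM) carries SGID 0, which matches vmnic1, so that VM's traffic leaves the host on vmnic1. Repeat on the second host and compare which FI each chosen vmnic cables up to.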
Thanks for your reply.
The VMs are in the same VLAN and subnet.
VMs on the same host can reach each other.
On the FI I created 4 pairs of vmnics, one pair each for path A and path B (for HA purposes); the reason is to use different FI QoS policies.
The issue is the same for both options.
VMs are reachable from/to the outside (e.g., a remote PC).
2 pairs of vmnics are working and 2 others are not, but when I create a new vmnic I have the same issue.
It's not linked to a VLAN, because when I move a VLAN from a failed vmnic to a working vmnic, the VMs can reach each other.
I can use the pinning ID (it's mandatory) to choose the right vmnic and therefore the path (A or B).
I can check it with "module vem x execute vemcmd show ports" or "vemcmd show pinning".
Anyway, when I apply the bad pinning ID, the VMs are unreachable from anywhere.
I am working with Cisco on this issue, without any solution for the moment.
We captured data at the FI, the VMs, and the N5K (Wireshark).
Also, at one point we forgot to unconfigure something and it worked.
When I leave the VLAN on 2 pairs of vmnics (a failed one and a good one) and configure static pinning to the failed vmnic, it works, but it's not normal behavior and it seems we are not using the path shown by the commands.
In summary, we are behaving like a private VLAN without that function being activated.
So for the moment we are using a VDS for production VMs.
But I will read the link you sent more carefully.
One more question: is MAC pinning relying on CDP a better solution?
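For reference, the two sub-group options on an N1K Ethernet uplink port profile look like this (a sketch; "UPLINK" is a placeholder profile name):

```
port-profile type ethernet UPLINK
  switchport mode trunk
  ! static source-MAC pinning, no dependency on the upstream switch
  channel-group auto mode on mac-pinning
  ! alternative: derive sub-groups from CDP information instead
  ! channel-group auto mode on sub-group cdp
```

CDP-based sub-grouping depends on CDP actually being received on every uplink, so if CDP is disabled or filtered anywhere the sub-group assignment falls apart; with UCS, plain mac-pinning is the commonly recommended choice.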
I have about half a dozen incidents of this happening... We have a couple hundred small 1000v deployments, all configured exactly the same way.
- VMs are on different servers
- VMs can ping gateway and out into the network
- VMs can't ping each other.
- VMs can ping each other when on the same server.
- everything looks good; the ARP caches and FDB are all fine.
- The upstream switches have the correct configuration
I fix this by bouncing the veth ports or moving the VMs to the same server (then separating them again).
I've always had to apply the quick fix to get the systems back up, so I never really had a chance to keep the error state for TAC to look at. I'd love to find out why this happens.
edit - Just wanted to add that when this occurs, there seems to be a power incident beforehand... i.e., the site lost power or someone bounced my cluster... Could it be some kind of race condition where the cluster comes up first and doesn't register something with Virtual Center?
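For anyone hitting the same thing, bouncing a veth from the VSM looks like this (Veth10 is a placeholder for the affected interface):

```
n1000v# configure terminal
n1000v(config)# interface vethernet 10
n1000v(config-if)# shutdown
n1000v(config-if)# no shutdown
```

That forces the VEM to reprogram the port (including its pinning), which is presumably why it clears the error state without ever explaining it.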
On your QLogic adapters, are you doing NPAR? If so, check whether the MAC Learning feature is enabled. Also check out the latest drivers at VMware.