Dropped packets on ESXi management network with UCS mini and Nexus switches

Hi All,

 

We are encountering a very strange and frustrating issue at the moment with our ESXi 6.0 hosts running on B200M4 blades in UCS Mini environments. I am hoping someone has seen this issue before or can help.

 

Daily, and with no apparent pattern, the management IP of an ESXi host (this happens on multiple hosts in different units/locations) will stop responding to pings for anything from 30 seconds to 5 minutes, then come back up by itself.

 

Having investigated this issue for weeks we have been able to see the following:

 

- The issue occurs across multiple UCS firmware versions, 3.2(2c) and 3.3(2f).

 

- Only the ESXi management network is affected. Other networks on the same VIC card (e.g. NFS, vMotion and VM network traffic) are fine.

 

- Logging onto the ESXi console via KVM while the issue is occurring and checking the stats for the management vmnic shows 0 packets dropped: esxcli network nic stats get -n vmnic0 (see the command sketch after this list).

 

- esxcli network nic list shows all links as up

 

- Pinging the management IP address of the second ESXi host in the chassis works as normal.
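
For anyone wanting to repeat the checks above, this is roughly what we run from the KVM console while an outage is in progress (vmnic0 is our management uplink; adjust the names for your environment):

Check drop/error counters on the management uplink:

# esxcli network nic stats get -n vmnic0

Confirm the link state of all vmnics:

# esxcli network nic list

List the VMkernel interfaces and the ARP entries the host can see:

# esxcli network ip interface list
# esxcli network ip neighbor list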

 

All affected chassis are linked to the Nexus switches via 10 Gb Twinax or fibre cables.

 

From the tests we have done so far, I think we can deduce that the issue is beyond the VIC, as the links stay up and we are able to ping other hosts in the same chassis.

 

The issue also seems to be beyond the FI: again, you can ping hosts in the same chassis, but not outside it, during these outage periods.

 

Has anyone come across anything similar to this before on their UCS Mini environments? We have confirmed with VMware/Cisco that the ESXi drivers are correct (we are using the Cisco build of ESXi 6.0). The MTU (9000) is also verified as being correct all the way through from the UCS environment to VMware. Despite this, we have not been able to get any further with establishing a root cause. Does anyone have any ideas on anything else to check?
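
For what it's worth, the way we verify the jumbo path end to end is a don't-fragment vmkping with the largest payload that fits in a 9000-byte MTU (the target IP is just a placeholder for another host on the management VLAN):

# vmkping -d -s 8972 x.x.x.x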

 

Any advice appreciated.

 

Thanks

 

 

9 Replies

Kirk J
Cisco Employee

Greetings.

I would set up a SPAN session on the upstream Nexus devices for the ports connected to the FIs.

This should fairly quickly eliminate the FIs from the problem/root cause if you see the ICMP requests coming up into the switches but not being delivered to the ports going to the destination FI.

You might also want to disable ports on your upstream switches to force all UCSM traffic in/out of only one upstream switch (I'm assuming a vPC setup) to help isolate whether one switch or the other is causing the drops.
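
Something along these lines on the Nexus side should do it (the interface numbers are only placeholders: Eth1/1 for an FI-facing uplink and Eth1/48 for a spare port with a capture device attached):

interface ethernet 1/48
  switchport monitor

monitor session 1
  source interface ethernet 1/1 both
  destination interface ethernet 1/48
  no shut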

 

Kirk...

Wes Austin
Cisco Employee

Sounds like it may be related to this if it's only the management vmknic (vmk0) that is affected. Try the workaround and see if you have any success.

 

https://kb.vmware.com/s/article/1031111

 

To delete a vmknic from a port group, use this command:

# esxcfg-vmknic -d -p pgName

or

# esxcfg-vmknic -d pgName

To add a vmknic to a port group, run the command:

# esxcfg-vmknic -a -i DHCP -p pgName

or

# esxcfg-vmknic -a -i x.x.x.x -n 255.255.255.0 pgName
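
To check the result, you can list the VMkernel NICs before and after and confirm the port group, IP and MAC look as expected:

# esxcfg-vmknic -l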

Hi Wes,

 

I checked this out and, unfortunately, there are no duplicated MACs other than the one shared by vmnic0 and vmk0, which is normal on every ESXi host. Thanks for your suggestion though.

I have still seen this exact same behavior you mention, and that KB fixes it. I would just delete and re-create the vmknic on a host that has had the issue consistently and see if it has any impact on the behavior.

Hi Wes,

 

At this point, I'll try anything! I'll give it a go and come back.

 

Thanks

 

Dilpesh

Wes, you seem to have succeeded where every other member of TAC failed! We had multiple sites with this issue. After recreating vmk0 at the affected sites, the problem seems to have gone away! It would be nice to know what is causing this issue in the first place, but it's great that we now have the fix! Thank you so much for your advice. We've been trying to resolve this for months with various calls to VMware and to TAC on the UCS and network sides, with nobody from these areas even suggesting this. THANK YOU!

mojafri
Cisco Employee

Hi @eamehostedservices,

Could you please share the topology? 

1. Do you have a separate link northbound carrying the mgmt VLAN?

2. Is the vmk part of a DVS or a standard vSwitch? What is the load-balancing method?

3. From the statement below, were the MACs for both vmks learned on the same FI?

- pinging the management IP address of the second ESXi host in the chassis works as normal. 

4. Uplink switch model/Version? 

5. Do you have any other UCS domain connected to same uplink switch?

6. From the statement below, where are you sourcing the ping from? An ESXi host within UCS, or somewhere northbound?

the management IP of an ESXi host (happens on multiple hosts in different units/locations), will stop responding to pings for anything from 30 seconds to 5 minutes.

7. What about MAC learning on the uplink switch during the time of the outage?

 

Regards,

MJ

Hi Everyone,

 

It looks as if the celebrations a few months back were premature. The issue seems to be back in some of our environments. Last year we did as Wes suggested and deleted and recreated the vmk0 management network connection on all affected hosts. This seemed to resolve the issue. However, on some hosts the issue has returned after the host was rebooted, and the previous fix unfortunately does not work when tried again.

 

Newer hosts that we have built fresh on ESXi 6.7 on B200M5 hardware are also having the same problem. Deleting and recreating the vmk0 interface does not help on these either.

 

Most of the environments where we recreated the interfaces last year are still OK, so it seems we are not quite back to square one, but it would still be really good if someone could let us know if they have seen this behaviour before.

 

FYI, this behaviour persists across multiple hosts, different hardware versions, different UCS firmware versions and locations, along with different versions of VMware ESXi (6.0 and 6.7). I am hoping that the fact that we were able to solve the issue for a little while by deleting and recreating vmk0 might give someone an insight into where we may look next.

 

Any advice appreciated

 

Thanks

 

Answers below:

 

Could you please share the topology? 

All affected sites are UCS Mini chassis. The blades are either B200M4 or B200M5, running VMware ESXi 6.0 or 6.7 from SD cards. Each FI has 2x 10 Gb connections to two separate Cisco Nexus 3524 switches, i.e. FI-A has one connection each to Nexus A and Nexus B, and the same for FI-B.

 

Each 10 Gb pair on an FI is bundled into an LACP port channel. All of the VLANs required for ESXi (e.g. management traffic, VM traffic) are trunked down these port channels. All configuration is managed via UCS Central.
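
For anyone trying to picture it, the Nexus side of each FI uplink pair looks roughly like this (the interface, port-channel and VLAN numbers are placeholders, and this assumes the vPC setup Kirk mentioned, since each FI's two links land on different switches):

interface port-channel11
  switchport mode trunk
  switchport trunk allowed vlan 10,20,30
  vpc 11

interface ethernet 1/1
  channel-group 11 mode active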

 

On the UCS side, there are 8 vNICs defined for ESXi.

 

vNIC 0 (FI-A) - management network - connected to a routable management network used for ESXi management and UCS CIMC traffic. In VMware, this is an active/active configuration and is defined on a standard virtual switch.

 

vNIC 1 (FI-B) - management network - connected to the same routable management network used for ESXi management and UCS CIMC traffic. Again, an active/active configuration defined on a standard virtual switch.

 

As the management network is the only one we are having an issue with, I'll leave the explanation there. 

 

1. Do you have separate link towards northbound carrying mgmt vlan? 

It depends what you mean: all traffic physically goes down the same pair of connections for each FI, but it is logically separated using VLANs and a vNIC pair for each traffic type. The management traffic that we are having the issue with is on its own VLAN and is trunked on the Nexus switches down to the FIs. Northbound, the Nexus switches are usually connected to CDS core switches, but this varies depending on the site.

 

 

2. vmk is part of DVS or vswitch? What is the load-balancing method?

The vmk is part of a standard vSwitch. Load balancing is based on originating virtual port ID (the default).
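
For reference, this is how we check the teaming/load-balancing policy (vSwitch0 here is just the name of our management vSwitch):

# esxcli network vswitch standard policy failover get -v vSwitch0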

 

3. From the below statement, were MAC for both vmk's learned on same FI? 

- pinging the management IP address of the second ESXi host in the chassis works as normal. 

I'm not quite sure what you mean here, but when ESXi is installed it takes the MAC from the first vNIC it sees (vNIC 0, i.e. vmnic0) as the MAC of the vmk for the management network. As vNIC 0 is always on FI-A, I guess the MAC would be learned on the A side.
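
If it's useful, the quickest way I know to compare the two MACs on the host side is to list both the VMkernel interfaces and the vmnics (on a default install, vmk0's MAC matches vmnic0):

# esxcli network ip interface list
# esxcli network nic list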

 

 

4. Uplink switch model/Version? 

Cisco Nexus 3524

 

5. Do you have any other UCS domain connected to same uplink switch?

No

 

6. From the statement below, where are you sourcing the ping from? An ESXi host within UCS, or somewhere northbound?

The management IP of an ESXi host (happens on multiple hosts in different units/locations), will stop responding to pings for anything from 30 seconds to 5 minutes.

I have tried multiple ways. From a client PC on our network I get the above behaviour. When running a continuous vmkping from one ESXi host to another, the ping doesn't seem to drop, even when it has dropped from my client.
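
For clarity, the host-to-host test is along these lines, forcing the ping out of the management interface and using a large count so it keeps running (the target address is a placeholder for the other host's vmk0 IP):

# vmkping -I vmk0 -c 1000 x.x.x.x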

 

7. What about MAC learning on the uplink switch during the time of the outage?

I'm not a network person, so I'm not sure what the answer to this question is. Please clarify and I will talk to our network guys to find out.
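
From what I understand of the question, the check would be something like the following on the Nexus switches while an outage is in progress, to see whether the host's vmk0 MAC is still in the table and which port it points at (the VLAN number and MAC are placeholders); I'll ask our network team to run it:

show mac address-table vlan 10
show mac address-table address aaaa.bbbb.cccc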

 

Thanks for any advice given.
