I wanted to see if anyone here had any recommendations for me in troubleshooting a problem that has been ongoing ever since we got our UCS equipment...
Initially we noticed through SolarWinds monitoring that many of our ESX hosts were showing receive discards, usually in the neighborhood of 3k-5k per day. I have tried to diagnose this on my own and through several TAC cases over time, but I could never get anyone willing to help because I couldn't produce a packet capture of the traffic, since the drops were sporadic throughout the day.
Lately one of our ESX hosts has started showing close to 1k packets discarded every hour, which I thought would make it easier to capture what TAC needed to tell me why these packets are being dropped. I even isolated the excessive drops to one particular VM. So I have one B230 blade in a UCS chassis running ESX 5.0U1 with a single Windows 2008 R2 VM with a VMXNet3 interface, and on this host I am getting 1000+ packets per hour being discarded. As an FYI, we are also running Nexus 1000V.
We went with all of this information to TAC and at first were told that Solarwinds was misreporting and there was no problem. After we pointed out that the same stats show within vCenter for the hosts they agreed to look a bit closer, and engaged VMWare.
VMWare after first reviewing the case referred us to this article: http://kb.vmware.com/kb/1010071
At first I was hopeful but I tried doubling both of the RX buffers listed in this article but it had no effect at all. We are not using Jumbo frames on this VM.
After VMware saw that this did not fix the issue, they requested a packet capture. Unfortunately, they want us to determine the other end of the conversation that is sending the dropped packets and get a capture there as well, so they can see which packets are being dropped. We don't know what is transmitting the frames that get discarded, so we can't capture the other end.
I feel like there has to be a way of seeing what is dropping somewhere within UCS. The packets are entering the fabric interconnect fine and are somehow being dropped by the Palo NIC, so there has to be some way of capturing this, no? There is absolutely no way we are exceeding 10G of traffic to this one VM, so it seems bizarre that packets would still be dropping. We are not exceeding CPU or memory limits for either the VM or the host. Also, to rule out issues specific to this particular host, we migrated the VM to another host and saw the excessive drops follow it. The other hosts still show 3k-5k per day as before (which ideally I would also like to resolve some day), but the 25k number moves to whichever ESX host this particular VM resides on. The only thing this VM does is run an IIS server.
We are running UCS 2.0(4a), but this discard behavior has persisted through many different versions of UCS. VMware version 5.0.0 8.21926. enic driver version 184.108.40.206.
If anyone has any ideas or thoughts on how to figure out what is being dropped, I would love to hear them. Thanks.
I'm going to assume your suspect VM is behind the 1000v?
Have you tried doing a packet capture with the vempkt utility? Does the 1000v show any dropped packets or are they reaching the VM and then being dropped by the VM's network driver?
I'm sure TAC would have taken a look at the Cisco VIC's Rx/Tx drops, but most likely the VM's traffic is passing fine through UCS and is being dropped at the 1000v or VM driver level.
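If you haven't used vempkt before, a session looks roughly like this. The syntax below is from memory and the LTL number is a placeholder, so run `vempkt` with no arguments to get the exact usage on your N1K release:

```shell
# On the ESXi host (VEM). First find the LTL of the suspect port:
vemcmd show port            # note the LTL column for the vmnic/vEth in question

# Set up and run a capture on that LTL (18 is just a placeholder):
vempkt capture ingress ltl 18
vempkt start
# ...reproduce the drops, then:
vempkt stop
vempkt display detail all   # dump the captured frames
```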
The suspect VM is behind the 1000v.
I have not tried a packet capture with the vempkt utility. TAC asked me to use Wireshark in the VM, which isn't helpful since the packets aren't reaching the VM and I don't know the other end of the conversations being dropped. Should I use vempkt on the host to compare against Wireshark in the VM and try to spot missing packets?
The receive discards are showing on the ESXi host's vmnic interface - not on the actual VM.
I am occasionally seeing some transmit drops from the VSM side and also some Tx drops when issuing the vemcmd show stats command on the line module. However, the amounts I am seeing do not correlate with the vmnic discards showing on the ESXi host.
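For reference, these are the counters I've been checking from the host shell. Whether `ethtool -S` exposes drop counters depends on the enic driver release, so treat the grep pattern as a guess:

```shell
# Physical NIC counters as the enic driver reports them:
ethtool -S vmnic0 | grep -i -E 'drop|no_buf'

# Live per-port drop percentages: run esxtop, press 'n' for the network
# view, and watch the %DRPRX / %DRPTX columns.
esxtop

# VEM-side statistics for comparison:
vemcmd show stats
```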
Which version of N1K are you running?
Also, I need you to provide a few outputs.
From your N1K:
show run port-profile [uplink port profile name]
show int trunk
(You'll have to work out the vEth # of your ESX uplinks from this output)
What I want you to check is the VLANs configured on the vNICs of your ESX service profiles in UCS. Specifically the vNICs used as Uplinks for your VEM. We want to compare that with the "allowed VLAN" on your N1K uplink port profiles.
If there are any discrepancies (such as permitting some VLANs on the UCS vNIC but not allowing them on the N1K uplink port profile), this will contribute to your Rx drops.
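If the allowed-VLAN lists are long, a quick way to diff them is `comm` on sorted copies. The VLAN numbers below are made-up placeholders; substitute the lists you pulled from UCSM and from the N1K uplink port profile:

```shell
# Hypothetical allowed-VLAN lists copied out of UCSM and the N1K uplink profile
printf '%s\n' 100 200 300 400 | sort > ucs_vnic_vlans.txt
printf '%s\n' 100 200 300 | sort > n1k_uplink_vlans.txt

# VLANs permitted on the UCS vNIC but missing from the N1K uplink profile --
# these are the candidates for the Rx drops described above:
comm -23 ucs_vnic_vlans.txt n1k_uplink_vlans.txt
```

Anything this prints (here, VLAN 400) is allowed in on the UCS vNIC but has nowhere to go on the N1K uplink.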
Version 4.2(1)SV1(5.2). I just went to check as I was writing this and see they released a new version last week. Maybe I'll try upgrading tonight.
But I have seen this behavior persist through several versions as well.
I saw your earlier comment, which is no longer showing, about the vmnic discards being expected behavior... this would help to explain a lot.
I have a couple of questions though if this is the case...
- Why is TAC so bewildered by this if it is expected behavior?
- Why would I see so many discards on the host this one particular VM is on, compared to all of the other hosts which have many more VMs? It seems like it is just broadcast traffic that would be discarded, so shouldn't the value be about the same as the others?
I just confirmed that behavior was changed a couple of major versions ago, which is why I deleted it - assuming you were well beyond N1K version 1.3.
See my previous post for next actions to check.
When I compare the VLANs configured on the vNICs within the service profile to the N1K uplink profiles, they match up exactly.
There are four VLANs that show up on the Port channel uplink to the network on UCS that are not listed in the N1k port profiles:
- One of these is a VLAN used for a heartbeat between two UCS blades that are running Windows and SQL without ESX, so this should not be required.
- Two of these are FCoE VLANs (one for each fabric side) that I was required to create during a UCS upgrade a while back. We are not using FCoE so I should think these also would not be required on the N1K uplinks.
- The last one is the default VLAN 1. We don't use VLAN 1 for anything internally other than management traffic that flows across it by default. Is VLAN 1 something I should have as part of my vmnic profiles anyway?
I tried moving the VM from the 1000V to the regular vSwitch today and am still seeing the discards at the same rate.
Not sure if this gives you any ideas?
Seems at least to exclude the 1000v as source of the problem.
You say you're noticing the receive discards on the vmnic of the ESX host, so what makes you think a VM is at fault?
Any chance you're using Jumbo Frames and/or Multicast?
We are seeing receive discards on several hosts. Most of these hosts show about 3-7k discards per 24 hours.
The one particular host that one specific VM resides on averages upwards of 25k discards per 24 hours. Right now it is the only VM running on the host.
If I migrate the VM to another host, those discards "follow" the VM and start appearing in greater number on the new host it has migrated to.
Because of this, I felt it would be easiest to isolate the "worst case" for troubleshooting.
So right now we can even take 1000v out of it.
I have a single B230 blade with 20 cores, 128GB RAM, and two 10G Palo NICs running the latest drivers on ESX 5.0U1.
On this blade I have a single VM running Windows Server 2008 R2. The VM has 8 vCPUs assigned, 16GB of RAM (within a NUMA node), and one VMXNet3 Ethernet adapter. Its primary application is IIS. Its CPU, memory, and network I/O graphs all appear to show values within the limitations of the hardware.
One single VM running on hardware that should be plenty capable, but for some reason ESX just randomly drops packets all day long. This just seems like a mystery we should be able to solve.
But I don't know where to go from here...
One more update....
I don't know if this is coincidental or not, but moving the VM from the N1k to the standard vSwitch appears to have reduced the discards, while still not eliminating them. If you look below, I made the change at 2PM CDT yesterday. The overall amount has dropped substantially, but the number is still likely to reach close to 7k today, which still doesn't make sense to me given this setup.
We made some progress on this case with TAC and VMWare yesterday.
After some digging on VMware's end, they traced the issue back to an rx_no_buffs counter the Palo adapter was reporting to the host, which matched exactly the number of drops we are seeing.
The Cisco TAC engineer then delved into UCS and found a statistic matching this number on the adapter under rq_drops.
He had me go into the Adapter policy for my ESX hosts on UCS and increase the receive queue ring size from 512 to 1024.
I was hopeful this would resolve it but it has not yet. I am waiting to hear back from the TAC engineer but just wanted to see if anyone had any insight into the rq_drops statistics on Palo adapters and if there are any other suggestions.
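In case it helps anyone following along, the rq_drops statistic can be read from the host side as well as from UCSM. The counter names below come from the enic driver and may differ between releases, so treat this as a sketch:

```shell
# Host-side view of the Palo/VIC receive-queue drops via the enic driver:
ethtool -S vmnic0 | grep -i -E 'rq_drop|rx_no_buf'

# In UCSM the matching statistic lives under the adapter's vNIC statistics
# (roughly Equipment > Chassis > Server > Adapter in the GUI), alongside
# the receive queue ring size set by the adapter policy.
```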
I figured this out with TAC's help finally.
So after changing the network adapter policy I asked if a reboot was required and was told no at first. Upon further research the TAC engineer decided that a reboot was required for the change to take effect - even though UCS Manager doesn't indicate anything about it. This time we went back in and changed the ring size to the maximum value of 4096. We also enabled RSS.
After rebooting our blades and utilizing the new adapter policy all of the rq_drops went away.
After some time of running with this configuration I started to see Tx drops on the VSM ports and Rx drops on the VM adapter interfaces. However, these numbers were drastically less than I was seeing at the host level.
For the VMs I was seeing this on I logged onto the OS and went to the advanced configuration of the network adapter properties.
This time I changed the following settings:
RSS - Enabled
RX Ring #1 Size - 4096
Small Rx Buffers - 8192
Following this change our entire environment has started running EXTREMELY clean. I see fewer than 50 drops per day total across all of our VMs now, compared to the tens of thousands I was seeing before changing these settings.
It seems like the default settings on the Palo adapter as provided by the VMWare adapter policy are not optimal for certain environments. In our environment the buffers were too small and were simply dropping packets. I am not sure why this would be the default since the hardware is capable of handling more.
In any case, I hope this helps someone down the line.
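To verify a change like this actually took effect after the reboot, the ring size and drop counters can be re-checked from the host shell. The vmnic numbering is a placeholder and ethtool support depends on the enic driver build:

```shell
# Confirm the larger RX ring is active on the enic vmnic:
ethtool -g vmnic0           # current vs. maximum ring parameters

# Re-check the drop counters once traffic has been flowing for a while:
ethtool -S vmnic0 | grep -i -E 'rq_drop|rx_no_buf'
vemcmd show stats
```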
Were you actually seeing errors AND discards in SolarWinds? We have a similar incident going on right now with a new UCS deployment, but we are seeing receive discards and CRC errors on the 10GE Catalyst Core side, both on two separate ports and the port-channel encompassing both of those ports.
Just trying to see if we are in the same boat as you were. Did you ever get TAC's reasoning on why you needed to do this? Was it in any of their best practices documentation?