During some network connectivity troubleshooting I discovered that certain Veth interfaces on my VSM are showing regularly occurring OutDiscards.
I was able to diagnose the larger network issues as being related to something else, but now I have returned my attention to trying to determine why certain interfaces are still showing these discards.
It is only happening on select Veth interfaces, and it is very sporadic. One interface gets about 1000 of these per day, but they come in bunches of about 100 at a time, so there are roughly 5-10 events per day that each cause many drops at once. Other interfaces exhibiting the problem will show anywhere from 10-500 discards per day. About 10 interfaces in total are showing this behavior. The VMs themselves are mostly Windows 2008 R2, but a couple of Linux virtual appliances are showing small numbers of drops as well.
Reading online, I found reports of people having these problems with e1000-type interfaces in VMware, or with driver issues.
All of the most problematic interfaces are in fact VMXNET3 interfaces, and I have tried completely uninstalling and reinstalling VMware Tools, deleting and re-adding the NIC within vCenter, and completely deleting and reinstalling the drivers in Windows, all without any improvement.
I did contact TAC at one point, but their only suggestion was to get a capture of the traffic as it's happening for further analysis. Unfortunately, the 15-30 second window suggested to grab this traffic is just unrealistic. I am not able to sit all day long and run capture after capture on the VEM and hope that I get lucky. Even if I tried, I doubt I would be so lucky.
So I was hoping someone on here might have a better suggestion for me of what I could try to alleviate these drops or at least determine what is causing them.
The hardware setup is UCS B230 blades with Palo adapters, and everything on the network is Cisco. The blades connect up to a 6120 fabric interconnect pair, which is vPC'd into our core Nexus 7010 switches. If any other details are required to provide assistance, please let me know and I'd be happy to provide them.
Sounds like you've done your due diligence. Output veth discards come from the 1000v towards your VM. As you've found, the best way to capture this would be to do a vempkt capture on the 1000v filtering on drops; then we can see "what" exactly is being dropped. As you mention, it happens at random times of the day, so capturing this would be difficult.
I would start logging all the occurrences as best you can, tracking placement of the VMs, times of day, etc. By trending you might find a clue. I would be looking for any similarities in:
- Hosts running the VM
- VM network properties (VLAN, subnet etc)
- Applications running on the affected VMs (backup agents, services etc)
- Time of day this occurs (does it coincide with any tasks or jobs)
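One way to do that trending is sketched below in Python. This is only an illustration: the `DropEvent` fields, the host names, VLAN IDs, timestamps, and drop counts are all hypothetical placeholders, not values from this environment. The idea is simply to record each cluster of drops and then total the drops per host, per VLAN, and per hour of day to see whether anything stands out.

```python
from collections import Counter
from dataclasses import dataclass
from datetime import datetime


@dataclass
class DropEvent:
    """One observed cluster of drops (all fields are hypothetical)."""
    when: datetime
    veth: str
    host: str
    vlan: int
    drops: int


def tally(events, key_fn):
    """Sum drops per key and return (key, total) pairs, largest first."""
    totals = Counter()
    for ev in events:
        totals[key_fn(ev)] += ev.drops
    return totals.most_common()


# Made-up sample log; replace with your own observations.
events = [
    DropEvent(datetime(2012, 1, 10, 2, 15), "Vethernet57", "esx-01", 100, 120),
    DropEvent(datetime(2012, 1, 10, 14, 5), "Vethernet57", "esx-02", 100, 95),
    DropEvent(datetime(2012, 1, 11, 2, 20), "Vethernet12", "esx-01", 200, 40),
]

by_host = tally(events, lambda ev: ev.host)
by_vlan = tally(events, lambda ev: ev.vlan)
by_hour = tally(events, lambda ev: ev.when.hour)

print("by host:", by_host)
print("by VLAN:", by_vlan)
print("by hour:", by_hour)
```

If one host, VLAN, or hour dominates the totals, that is where to look first.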
Can you clear all interface counters, then the next day get the output of:
show int | egrep -i "ethernet|drop"
(kudos to Mike P)
Also provide the following:
- Previous TAC SR #
- Version of ESX and 1000v
Thank you for your response.
In the past 24 hours, oddly, only the most problematic interface has been showing these drops. All of the other ones are clean. The one interface has accumulated 1671 drops during this time.
Snip of the command you requested for the interface:
Vethernet57 is up
0 Input Packet Drops 1671 Output Packet Drops
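For tracking this over time, a small Python sketch like the following can turn the filtered output into per-interface counts. It assumes the two-line layout shown in the snippet above (an "<interface> is <state>" line followed by a drops line); other software versions may format this differently. `parse_drops` is a name made up for this example, and the `Vethernet12` entry in the sample is hypothetical.

```python
import re

# "<name> is up/down" line, e.g. "Vethernet57 is up"
IFACE_RE = re.compile(r"^(\S*[Ee]thernet\S*) is ")
# "<n> Input Packet Drops <m> Output Packet Drops"
DROPS_RE = re.compile(
    r"(\d+)\s+Input Packet Drops\s+(\d+)\s+Output Packet Drops", re.IGNORECASE
)


def parse_drops(text):
    """Return {interface: (input_drops, output_drops)} from filtered output."""
    results = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        m = IFACE_RE.match(line)
        if m:
            current = m.group(1)
            continue
        m = DROPS_RE.search(line)
        if m and current is not None:
            results[current] = (int(m.group(1)), int(m.group(2)))
    return results


sample = """\
Vethernet57 is up
    0 Input Packet Drops 1671 Output Packet Drops
Vethernet12 is up
    0 Input Packet Drops 0 Output Packet Drops
"""

counts = parse_drops(sample)
# Keep only interfaces that actually show output drops.
dropping = {name: out for name, (_, out) in counts.items() if out > 0}
print(dropping)
```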
I have seen drops on all VLANs, multiple different hosts, and both of our UCS chassis when I am seeing the drops on other interfaces. 24 hours is a long time for it to only happen on this one VM based on the past week's history. It would be great if the problem were at least limited to only one VM now, but since I haven't changed anything to impact this, I am guessing it will start happening again soon on the others. This is a good illustration of why it has been so hard for me to pinpoint though: it just seems completely random. The TAC engineer I spoke with seemed to think it was okay if we were seeing some drops, but from my past experience regularly occurring packet loss is never good. Are regular OutDiscards really expected and seen by others using the N1kV?
From an application perspective, I think some sort of HTTP service is actually running on all of the hosts that have been showing the drops. I will continue to keep an eye out for this and try to validate this theory but not sure how this helps?
Even from a time-of-day perspective it is hard to determine. I cleared counters around this time yesterday, then sometime at night I saw the number go up to 220 or so. This morning it was about the same. Sometime between then and now it has experienced these 1400 additional drops. SolarWinds Orion is actually tracking this for us and you can see here how the numbers just spike in bunches. Below are the times of the additional 1400 drops. It looks like there were 4 specific instances when this happened.
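To make the "spikes in bunches" pattern explicit from polled counter values, a minimal Python sketch follows. It assumes you can export (timestamp, cumulative OutDiscards) samples from whatever is polling the interface; the sample timestamps and intermediate values below are hypothetical, and `find_bursts` and the threshold of 50 are just choices made for this example.

```python
from datetime import datetime


def find_bursts(samples, threshold=50):
    """Find polling intervals where the cumulative counter jumped.

    samples: list of (datetime, cumulative_count), sorted by time.
    Returns a list of (interval_start, interval_end, delta).
    """
    bursts = []
    for (t0, c0), (t1, c1) in zip(samples, samples[1:]):
        delta = c1 - c0
        if delta < 0:
            # Counter went backwards: assume it was cleared in between,
            # so everything now on the counter accumulated since then.
            delta = c1
        if delta >= threshold:
            bursts.append((t0, t1, delta))
    return bursts


# Hypothetical poll results for a single interface.
samples = [
    (datetime(2012, 1, 10, 15, 0), 0),
    (datetime(2012, 1, 10, 23, 0), 220),
    (datetime(2012, 1, 11, 8, 0), 220),
    (datetime(2012, 1, 11, 10, 0), 580),
    (datetime(2012, 1, 11, 11, 0), 580),
    (datetime(2012, 1, 11, 13, 0), 1671),
]

bursts = find_bursts(samples)
for start, end, delta in bursts:
    print(f"{start:%Y-%m-%d %H:%M} to {end:%H:%M}: +{delta} discards")
```

The shorter the polling interval, the tighter the time window each burst can be pinned to.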
The previous TAC SR# is 620218877
We are running the latest versions of everything:
ESXi 5.0.0 Build 515841 and Nexus 1000V 4.2(1)SV1(4a)
Let me know if that gives you any ideas?
As an update to yesterday's post....
I am now showing three Veth interfaces with drops in the previous 48 hours.
The one that had 1671 yesterday is now up to 2018. Here is the trending chart....
The other two are at 167 and 55 drops. Not sure the command output you asked for is really needed, since it just reiterates these values?
The other two hosts are both Windows. However, one is a VDI desktop running Windows 7 (no web services) and the other is Windows 2008 R2 running IIS. I think the Windows 7 machine rules out any commonality in OS or configuration settings.
Really not sure what else to look for at this point. I am glad it seems to be happening on fewer machines, but it still doesn't seem right that VMs are dropping so many packets regularly.
Since this issue is intermittent, the only other thing I might suggest is to look at the times the clusters of drops occur, and correlate this with the VM system logs (Windows App & System logs) and see if there are any services or tasks running during this time.
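As a rough sketch of that correlation in Python: given the times of the drop clusters and a list of (timestamp, message) entries exported from the VM's event logs, list the log entries that fall within a few minutes of each cluster. The function name `correlate`, the 5-minute window, and all of the sample times and messages are hypothetical.

```python
from datetime import datetime, timedelta


def correlate(burst_times, log_events, window=timedelta(minutes=5)):
    """Map each burst time to the log messages within +/- window of it."""
    matches = {}
    for bt in burst_times:
        matches[bt] = [msg for ts, msg in log_events if abs(ts - bt) <= window]
    return matches


# Hypothetical drop-cluster times and exported log entries.
burst_times = [
    datetime(2012, 1, 11, 2, 15),
    datetime(2012, 1, 11, 13, 40),
]
log_events = [
    (datetime(2012, 1, 11, 2, 13), "Backup agent job started"),
    (datetime(2012, 1, 11, 2, 18), "Service restarted"),
    (datetime(2012, 1, 11, 9, 0), "User logon"),
]

matches = correlate(burst_times, log_events)
for bt, msgs in matches.items():
    print(bt, msgs if msgs else "no log entries nearby")
```

A task or service that shows up next to several clusters is a good candidate to investigate. Make sure the VM clock and the monitoring system clock agree before trusting the matches.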
I'll continue to dig on my end for what else we can do to help track this down. Keep looking for patterns in the meantime.
What is the uplink method you're using? (Mac pinning, Mode Active/On).
Also, are you using any multicast applications within your network or within the VLANs of the affected VMs? If you're not 100% sure, you can simply do a packet capture on the VLAN (from anywhere in your network) of one of the affected VMs and see if there's any traffic with a multicast destination address (224.x.x.x through 239.x.x.x).
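If you export the destination addresses from such a capture, a few lines of Python can flag the multicast ones. IPv4 multicast is the 224.0.0.0/4 block (224.x.x.x through 239.x.x.x), which the standard `ipaddress` module already knows about. The address list below is hypothetical and `multicast_only` is just a helper name for this sketch.

```python
import ipaddress


def multicast_only(addresses):
    """Return the addresses that are multicast (IPv4 224.0.0.0/4 or IPv6 ff00::/8)."""
    return [a for a in addresses if ipaddress.ip_address(a).is_multicast]


# Hypothetical destination addresses pulled from a capture.
destinations = [
    "10.1.1.5",
    "224.0.0.251",
    "230.1.2.3",
    "239.255.255.250",
    "255.255.255.255",  # limited broadcast, not multicast
]

found = multicast_only(destinations)
print(found)
```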
Uplink mode command is "channel-group auto mode on mac-pinning" per the best practice documentation I read.
There is no multicast on our network anywhere.
Few drops on the most impacted VM in the last two days. It is only up to 2248 total discards, with nothing in the past 24 hours. Nothing has changed, but this shifting behavior is what I have been seeing the whole time.
Seeing drops on a total of five interfaces at this time since the initial clear 4 days ago.
One of them is a VMware View Linked-Clone golden image that is usually off, but I turned it on for a brief period last night to apply Windows updates before a recompose. This was late at night, with very little traffic on our network at the time, and it shouldn't have been doing anything else network-wise other than checking for and applying Windows updates.