19129 Views · 5 Helpful · 18 Replies

ESX vmnic Receive Discards on UCS

timsilverline
Level 4

I wanted to see if anyone here had any recommendations for me in troubleshooting a problem that has been ongoing ever since we got our UCS equipment...

Initially it was noticed through SolarWinds monitoring that many of our ESX hosts were showing receive discards, usually in the neighborhood of 3k-5k per day.  I have tried to diagnose this on my own and through several TAC cases over time, but I always failed to get anyone willing to help me out because I couldn't produce a packet capture of the traffic; the drops were sporadic and scattered throughout the day.
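
In case it helps anyone chasing the same thing, one way to at least narrow down when the drops happen (since they are too sporadic to catch in a capture) is to poll the vmnic counters from the ESXi shell and log timestamps.  This is only a rough sketch, assuming the enic driver exposes its stats through ethtool -S; the counter names vary by driver version and vmnic0 is just a placeholder:

  # log a timestamped snapshot of RX drop/error counters every 60 seconds
  while true; do
      date
      ethtool -S vmnic0 | grep -i -E 'drop|err|no_buf'
      sleep 60
  done

Comparing those timestamps against the SolarWinds/vCenter graphs at least tells you whether the discards line up with traffic bursts or trickle in at a steady rate.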

Lately I have had one of our ESX hosts start showing close to 1k packets discarded every hour, which I thought would make it easier to capture what TAC needed in order to tell me why these packets are being dropped.  I even isolated the excessive drops to one particular VM.  So I have one B230 blade in a UCS chassis running ESX 5.0U1 with a single Windows 2008 R2 VM using a VMXNet3 interface, and on this host I am getting 1000+ packets per hour discarded.  As an FYI we are also running Nexus 1000V.
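
Since we are on the 1000V, it is also worth pulling the VM's own vEthernet counters from the VSM to confirm which guest the dropped traffic is destined for.  Roughly something like this (the veth number is just an example; it has to be looked up from the VM name first):

  n1kv# show interface virtual            <-- lists the vEth-to-VM mapping
  n1kv# show interface vethernet 12       <-- per-vNIC counters for the suspect VM
  n1kv# show interface port-channel 1     <-- uplink port-profile counters for comparison

If the vEth counters show the same pattern as the vmnic discards, that at least confirms the drops follow this one VM's traffic.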

We went to TAC with all of this information and at first were told that SolarWinds was misreporting and there was no problem.  After we pointed out that the same stats show up within vCenter for the hosts, they agreed to look a bit closer and engaged VMware.

After first reviewing the case, VMware referred us to this article:  http://kb.vmware.com/kb/1010071

At first I was hopeful, but I tried doubling both of the RX buffers listed in this article and it had no effect at all.  We are not using jumbo frames on this VM.
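
For what it's worth, there is also a host-side way to check whether the vmxnet3 RX rings are actually running dry before blaming the guest buffers.  I am going from memory here, so treat the vsish path, the portset name, and the port ID below as approximate examples (the port ID comes from the network view in esxtop, and with the 1000V in the picture the portset will be the DVS portset):

  # in esxtop, press 'n' and note the PORT-ID of the VM's vmxnet3 vNIC
  vsish -e get /net/portsets/DvsPortset-0/ports/50331662/vmxnet3/rxSummary
  # if the "ring is full" / "running out of buffers" counters are not
  # incrementing, the discards are probably not a guest RX ring problem

That would be consistent with the buffer change having no effect.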

After VMware saw that this did not fix the issue, they requested a packet capture.  Unfortunately, they want us to determine the other end of the conversation that is sending the dropped packets and get a capture there as well, so they can see which packets are being dropped.  We don't know who is sending the traffic that gets discarded, so we can't capture the other end.

I feel like there has to be a way of seeing what is dropping somewhere within UCS.  The packets are entering the fabric interconnect fine and are somehow being dropped at the Palo NIC, so there has to be some way of capturing this, no?  There is absolutely no way that we are exceeding 10G of traffic to this one VM, so it just seems bizarre that the packets would still be dropping.  We are not exceeding CPU or memory limits for either the VM or the host.  Also, to rule out issues specific to this particular host, we migrated the VM to another host and saw the excessive drops follow it.  The other hosts still show 3k-5k per day as before (which ideally I would also like to resolve some day), but the ~25k/day number moves to whichever ESX host this particular VM resides on.  The only thing this VM does is run an IIS server.
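
The closest thing I have found to "seeing it inside UCS" is dropping into the NX-OS shell on the fabric interconnect and watching the error and queuing counters on the uplink and server-facing ports.  A rough sketch is below; the interface numbers are just examples and would need to be mapped to whatever uplink and IOM ports the chassis/blade actually uses:

  UCS-A# connect nxos a
  UCS-A(nxos)# show interface ethernet 1/1 counters errors     <-- uplink toward the upstream switch
  UCS-A(nxos)# show interface ethernet 1/5 counters errors     <-- server port toward the chassis IOM
  UCS-A(nxos)# show queuing interface ethernet 1/1             <-- per-class ingress/egress drop counters

If those all stay clean while the vmnic keeps discarding, that points further down toward the adapter/driver rather than the fabric.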

We are running UCS v2.0(4a), but this discard behavior has persisted through many different versions of UCS.  VMware version 5.0.0 build 821926, enic driver version 2.1.2.22.

If anyone has any ideas or thoughts on how to figure out what is being dropped, I would love to hear them.  Thanks.

18 Replies

Hi Matt-

No, I was NOT seeing any errors on the interfaces; it was just discards.

The reasoning I explained above was just that the buffers were filling up more quickly than the packets could be dequeued.  Making the buffers larger allowed the host to better absorb bursty traffic without overflowing the queue.  There are no best-practice guides around this at all, nor anything the TAC engineer could provide explaining it.
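
For reference, if I remember right the two buffers in KB 1010071 are the vmxnet3 "Small Rx Buffers" and "Rx Ring #1 Size" advanced properties in the Windows guest (Device Manager > vmxnet3 adapter > Advanced).  On a Linux guest the rough equivalent would be something like the following; the maximums depend on the driver, so treat the numbers as examples rather than a recommendation:

  # check the current and maximum RX ring sizes on the vmxnet3 vNIC
  ethtool -g eth0
  # raise the RX ring toward its maximum to absorb bursts
  ethtool -G eth0 rx 4096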

CRC errors usually indicate an L1/L2 problem.  Are you seeing them on all of the interfaces or just one?

If it is only happening on certain interfaces, I would recommend swapping GBICs and fiber, and triple-checking any speed/duplex settings.

If it is happening all over, that is pretty odd.  Have you opened a TAC case?
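
If you want to rule out the optics before swapping anything, something like this on the Catalyst side will show the light levels and a per-port error breakdown (assuming the transceivers support DOM; exact command support varies a bit by platform and IOS release):

  Switch# show interfaces Te1/4/3 transceiver detail     <-- optical rx/tx power and alarm thresholds
  Switch# show interfaces Te1/4/3 counters errors        <-- CRC/align/runt breakdown per port
  Switch# show etherchannel summary                      <-- confirm which members are in each port-channel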

Thanks for the quick response on this.

I too think it may be an L1/L2 problem, so we are starting there.  There are currently 4x 10GE ports, in two port channels, from a VSS 6509 to the new UCS-A and UCS-B FIs.  Right now all of the data is running over the FI to the UCS-B chassis, and there are errors on both the port-channel and the individual ports:

Switch#sh int Te1/4/3
TenGigabitEthernet1/4/3 is up, line protocol is up (connected)
  Hardware is C6k 10000Mb 802.3, address is c464.1304.91e2 (bia c464.1304.91e2)
  Description:
  MTU 1500 bytes, BW 10000000 Kbit, DLY 10 usec,
     reliability 253/255, txload 1/255, rxload 1/255
  Encapsulation ARPA, loopback not set
  Keepalive set (10 sec)
  Full-duplex, 10Gb/s
  input flow-control is off, output flow-control is off
  ARP type: ARPA, ARP Timeout 04:00:00
  Last input 00:00:28, output 00:00:38, output hang never
  Last clearing of "show interface" counters never
  Input queue: 0/2000/539564/0 (size/max/drops/flushes); Total output drops: 0
  Queueing strategy: fifo
  Output queue: 0/40 (size/max)
  5 minute input rate 942000 bits/sec, 594 packets/sec
  5 minute output rate 3255000 bits/sec, 647 packets/sec
     434969045 packets input, 220506190906 bytes, 0 no buffer
     Received 838344 broadcasts (439593 multicasts)
     0 runts, 0 giants, 0 throttles
     539564 input errors, 313755 CRC, 0 frame, 0 overrun, 0 ignored
     0 watchdog, 0 multicast, 0 pause input
     0 input packets with dribble condition detected
     1540935135 packets output, 1141605974908 bytes, 0 underruns
     0 output errors, 0 collisions, 2 interface resets
     0 babbles, 0 late collision, 0 deferred
     0 lost carrier, 0 no carrier, 0 PAUSE output
     0 output buffer failures, 0 output buffers swapped out

The other ports on the non-primary FI are showing clean.  The plan right now is to fail everything over to the other fabric interconnect, force the traffic to use the other FI, and see if the problem follows.  If it does, it seems a TAC case is in order.  If it runs clean, it seems we may have some bad fiber or transceivers on one end or the other.
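
Before the failover, an easy sanity check is to clear the counters and watch whether the CRCs keep climbing on just that one member; that should show pretty quickly whether it is a single link or something wider.  Roughly like this (the FI-side interface number is just an example):

  Switch# clear counters TenGigabitEthernet1/4/3
  Switch# show interface Te1/4/3 | include CRC|input errors     <-- re-run periodically and watch the deltas
  ! then the same check from the FI end of that link:
  UCS-B# connect nxos b
  UCS-B(nxos)# show interface ethernet 1/19 counters errors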

SVO RMAOne
Level 1

Test

Sent from Cisco Technical Support iPad App

Strange. Same problem here. Newer firmware and drivers, but same problem.

All my VMs living on UCS blades have this problem.  If I migrate a VM to Dell rack servers, the drops on the guests go away.
