We are having a problem with what appears to be NIC hardware buffers being overrun.
We run B200 M3 servers with 1 x E5-2670 v2 10C CPU + 32GB RAM and the VIC 1240 NIC on RHEL 6.4.
These servers run on both 6120 and 6248 FIs running firmware 2.1(3a).
These servers are running applications which receive UDP traffic at high packet rates when subscribing to approx 16 multicast groups.
During high-load events (we believe these to be microbursts of traffic), the processes running on these servers report missing message sequences, and the process then either re-syncs the missing data or kicks itself out of the cluster.
Neither outcome is acceptable.
These errors coincide with the rx_no_bufs value increasing.
We have tried increasing the receive queues to 8 and the ring size to the maximum of 4096 buffers, with an interrupt timer of 10us, but we still get drops.
The defaults of 1 queue and either 512 or 1024 buffers (interrupt timer 10us) did not cut it, and we had lots of errors.
eth1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP qlen 10000
link/ether 00:xx:xx:xx:xx:xx brd ff:ff:ff:ff:ff:ff
RX: bytes packets errors dropped overrun mcast
2901579300 1590139645 0 8481 0 1589598683
TX: bytes packets errors dropped carrier collsns
60491368 661314 0 0 0 0
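To put the counters above in proportion, here is a minimal Python sketch that parses the RX block of `ip -s link` style output and reports drops relative to packets received. The helper names (parse_rx_counters, drop_ratio) are hypothetical, and the sample text is simply the output pasted above.

```python
def parse_rx_counters(ip_output: str) -> dict:
    """Return the RX counters from `ip -s link show <dev>` output as a dict."""
    lines = [line.strip() for line in ip_output.splitlines()]
    for i, line in enumerate(lines):
        if line.startswith("RX:"):
            # Header names follow "RX:", the values are on the next line.
            names = line[len("RX:"):].split()
            values = [int(v) for v in lines[i + 1].split()]
            return dict(zip(names, values))
    raise ValueError("no RX: header found")


def drop_ratio(counters: dict) -> float:
    """Dropped packets relative to packets received."""
    return counters["dropped"] / counters["packets"]


# Sample taken from the output posted in this thread.
SAMPLE = """\
RX: bytes packets errors dropped overrun mcast
2901579300 1590139645 0 8481 0 1589598683
TX: bytes packets errors dropped carrier collsns
60491368 661314 0 0 0 0
"""

if __name__ == "__main__":
    rx = parse_rx_counters(SAMPLE)
    print(f"dropped={rx['dropped']} ratio={drop_ratio(rx):.2e}")
```

A low overall ratio can still hide bursts, so it is worth sampling the counter at short intervals around a high-load event rather than relying on the lifetime totals.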
The packet drops seem to be related to how many multicast groups we subscribe to: performance tests across 1 multicast group run fine, with per-second message counts far exceeding what comes down the line when we subscribe to all 16 multicast groups.
It looks like the performance of the VIC 1240 is not up to it.
Would a different network card such as the VIC 1280 help? From what we can tell, the ASIC is the same and it just has more channels to enable 80 Gb/s.
We are having QoS issues in parallel with this issue at the moment and we are looking at implementing a no-drop QoS policy on the VLAN on which the server operates.
Is it logical to think that this is independent of the drops on the server NIC, as these look like a plain case of the card not keeping up?
Any input would be much appreciated.
Driver output is :
There are 2 vNICs: 1 for the multicast backbone network, 1 for normal TCP traffic.
The following Adapter Profile NIC settings were applied via UCS Manager:
Transmit Queues = 1
Ring Size = 1024
Receive Queues = 8
Ring Size = 1024 - 4096 ( same behaviour with either setting )
Completion Queues = 9
Interrupts = 11
Transmit Checksum Offload = Enabled
Receive Checksum Offload = Enabled
TCP Segmentation Offload = Enabled
TCP Large Receive Offload = Enabled
Receive Side Scaling = Enabled
Failback Timeout = 5
Interrupt mode = MSI-X
Int Coalescing = MIN
Interrupt Timer = 10us
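The queue counts above follow a pattern that can be checked mechanically: 1 transmit + 8 receive queues gives 9 completion queues, and 11 interrupts is completion queues + 2. The sketch below assumes that sizing rule (completion queues = TX + RX, interrupts = completion queues + 2). It matches the values posted here, but please confirm it against Cisco's documentation for your release. AdapterPolicy and check_policy are hypothetical names, not part of any UCS API.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class AdapterPolicy:
    tx_queues: int
    rx_queues: int
    completion_queues: int
    interrupts: int


def check_policy(p: AdapterPolicy) -> List[str]:
    """Return a list of mismatches against the assumed sizing rule."""
    problems = []
    expected_cq = p.tx_queues + p.rx_queues   # assumed rule: CQ = TX + RX
    expected_intr = expected_cq + 2           # assumed rule: interrupts = CQ + 2
    if p.completion_queues != expected_cq:
        problems.append(
            f"completion queues {p.completion_queues}, expected {expected_cq}")
    if p.interrupts != expected_intr:
        problems.append(
            f"interrupts {p.interrupts}, expected {expected_intr}")
    return problems


if __name__ == "__main__":
    # Values from the adapter policy listed above.
    posted = AdapterPolicy(tx_queues=1, rx_queues=8,
                           completion_queues=9, interrupts=11)
    print(check_policy(posted) or "policy is consistent with the assumed rule")
```

Running the same check before raising the receive queue count helps avoid a policy where the extra queues have no completion queue or interrupt to go with them.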
I have attached the interrupt output from the kernel in the original posting.
I faced the same issue on Red Hat 6.5 with the 2.6.32-431.23.3.el6.x86_64 kernel.
The proprietary driver 188.8.131.52 (from cisco.com) is installed according to the recommendation from http://www.cisco.com/web/techdoc/ucs/interoperability/matrix/matrix.html for our UCS software version 5.2(3)N2(2.23d), UCS-IOM-2204XP, UCSN-MLOM-40G-01 (VIC 1240).
On the OS side I can neither see the ring (buffer) value nor change it:
ethtool --show-ring eth2
Ring parameters for eth2:
Cannot get device ring settings: Operation not supported
ethtool --set-ring eth2 rx 1024 tx 1024
Cannot get device ring settings: Operation not supported
The same commands work well for the same Red Hat version under VMware (ESXi), which makes me think I am facing a problem with the Cisco proprietary driver.
As a result we have TCP packet drops at around the 1 Gbps level.
Has anybody found a solution to the problem?
What could be checked:
1) Install current drivers for the NIC according to the UCS SW/HW version (http://www.cisco.com/web/techdoc/ucs/interoperability/matrix/matrix.html ). Increase the number of queues on the UCS side from the default to the recommended value, and increase the buffer size (http://toreanderson.github.io/2015/10/08/cisco-ucs-multi-queue-nics-and-rss.html ).
2) Check the current ring value on the OS side with ethtool -g, or if that is not supported (CSCuy51507), check it in the dmesg log:
enic 0000:06:00.0: vNIC MAC addr 00:25:b5:00:00:0f wq/rq 4096/4096 mtu 1500
enic 0000:06:00.0: vNIC csum tx/rx yes/yes tso/lro yes/yes rss yes intr mode any type min timer 125 usec loopback tag 0x0000
enic 0000:06:00.0: vNIC resources avail: wq 8 rq 8 cq 16 intr 18
3) irqbalance should be enabled. If the kernel was updated, irqbalance could be disabled, but interrupt/queue balancing should then be set up manually (set_affinity script) at the same time.
4) Possibly the kernel is not getting enough time to drain the ring buffer.
This could be fixed with # sysctl -w net.core.netdev_budget=600 (to make it permanent, add net.core.netdev_budget=600 to /etc/sysctl.conf).
In my case this step helped to solve the ring buffer drops problem.
5) Check the "How can I tune the TCP Socket Buffers" article:
# netstat -sn | egrep "prune|collap"; sleep 30; netstat -sn | egrep "prune|collap"
If necessary, tune /proc/sys/net/ipv4/tcp_rmem (requires an application restart) or the application settings.
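Steps 2 and 4 above can be checked from a script. The following is a rough Python sketch, not a definitive tool: it pulls the wq/rq ring sizes out of an enic dmesg line of the form shown in step 2, and it compares two snapshots of /proc/net/softnet_stat to see which CPUs ran out of softirq budget in between (the third hexadecimal column, time_squeeze). A growing time_squeeze is the usual hint that raising net.core.netdev_budget may help. The function names are hypothetical.

```python
import re
from typing import List, Optional, Tuple

# Matches the "wq/rq 4096/4096" fragment of the enic probe message.
ENIC_RING_RE = re.compile(r"wq/rq (\d+)/(\d+)")


def parse_enic_rings(dmesg_line: str) -> Optional[Tuple[int, int]]:
    """Return (wq, rq) ring sizes from an enic dmesg line, or None."""
    m = ENIC_RING_RE.search(dmesg_line)
    return (int(m.group(1)), int(m.group(2))) if m else None


def parse_softnet_stat(text: str) -> List[Tuple[int, int, int]]:
    """Per CPU: (processed, dropped, time_squeeze); the columns are hex."""
    rows = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) >= 3:
            rows.append(tuple(int(f, 16) for f in fields[:3]))
    return rows


def squeezed_cpus(before, after) -> List[int]:
    """CPU indexes whose time_squeeze counter grew between two snapshots."""
    return [cpu for cpu, (b, a) in enumerate(zip(before, after))
            if a[2] > b[2]]


if __name__ == "__main__":
    import time
    with open("/proc/net/softnet_stat") as f:
        first = parse_softnet_stat(f.read())
    time.sleep(30)
    with open("/proc/net/softnet_stat") as f:
        second = parse_softnet_stat(f.read())
    print("CPUs that ran out of budget:", squeezed_cpus(first, second))
```

If no CPU shows a growing time_squeeze while rx_no_bufs still climbs, the budget is probably not the bottleneck and the queue count or interrupt affinity deserves a closer look.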
Excellent article with hints that saved us from months of troubleshooting!
We have had issues with NFS storage networking for over 8 months on two Citrix XenServer clusters (a RHEL-derived distribution) based on UCS blade hardware. We escalated the issue with Citrix and concluded that packets were being lost outside XenServer, at the vNIC level.
By increasing the tx/rx queues (the default was 1/1) in the Ethernet Adapter Policies according to this article, the issue seems to be resolved.
A million thanks to the community!
I am facing the same issue of RX drops on my UCS physical NIC connected to the Fabric Interconnect.
I also see that the RX drops are almost all on devices connected to the Fabric Interconnect, and the bad thing is that almost everything is connected to the FI.
Is there any step-by-step solution we can apply to fix the buffer issue or RX drops? We have already updated the drivers, increased the queues to 8 and increased the buffer size to 4096. That reduced the drop counts but did not fix the problem completely.
We still see around 50000 drops in the last 12 hours...
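About 50000 drops in 12 hours averages out to just over one drop per second, but an average cannot show whether the drops are steady or arrive in short bursts. As a rough sketch, the Python below samples the standard Linux counter /sys/class/net/<iface>/statistics/rx_dropped and turns timestamped samples into per-interval rates. The interface name "eth1" and the helper names are assumptions for illustration only.

```python
import time
from typing import List, Tuple


def read_rx_dropped(iface: str) -> int:
    """Read the kernel's rx_dropped counter for one interface."""
    with open(f"/sys/class/net/{iface}/statistics/rx_dropped") as f:
        return int(f.read())


def interval_rates(samples: List[Tuple[float, int]]) -> List[float]:
    """Drops per second for each pair of consecutive (time, counter) samples."""
    rates = []
    for (t0, c0), (t1, c1) in zip(samples, samples[1:]):
        if t1 > t0:
            rates.append((c1 - c0) / (t1 - t0))
    return rates


if __name__ == "__main__":
    iface = "eth1"  # assumed interface name, adjust to your system
    samples = []
    for _ in range(10):
        samples.append((time.time(), read_rx_dropped(iface)))
        time.sleep(1)
    print("peak drops/s over the window:", max(interval_rates(samples)))
```

If the peak rate is far above the average, the drops line up with bursts, and lining those timestamps up against application load is a reasonable next step.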