We are having a problem with what appears to be NIC hardware buffers being overrun.
We run B200 M3 servers with 1 x E5-2670 v2 10C CPU + 32GB RAM and the VIC 1240 adapter on RHEL 6.4.
These servers are attached to both 6120 and 6248 FIs running firmware 2.1(3a).
The servers run applications which receive UDP traffic at high packet rates while subscribed to approximately 16 multicast groups.
During high load events (we believe these to be microbursts of traffic), the processes running on these servers report missing message sequences, and the affected process either re-syncs the missing data or kicks itself out of the cluster.
Neither outcome is acceptable.
These errors coincide with the rx_no_bufs counter increasing.
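To make that correlation easier to check, here is a minimal sketch of how two snapshots of the NIC statistics could be compared. It assumes the text comes from `ethtool -S eth1`, which prints one `name: value` pair per line. The function names and the sample values are hypothetical and only illustrate the idea; they are not measurements from these servers.

```python
import re

def parse_ethtool_stats(text):
    """Parse 'name: value' lines (as printed by `ethtool -S <iface>`) into a dict."""
    stats = {}
    for line in text.splitlines():
        # Only lines of the form "<counter name>: <integer>" are kept,
        # so the "NIC statistics:" header line is skipped.
        match = re.match(r"^\s*([\w.\-\[\]]+):\s*(\d+)\s*$", line)
        if match:
            stats[match.group(1)] = int(match.group(2))
    return stats

def counter_delta(before, after, name="rx_no_bufs"):
    """How much a counter grew between two snapshots (missing counters count as 0)."""
    return after.get(name, 0) - before.get(name, 0)

# Made-up snapshots for illustration only.
SAMPLE_BEFORE = """NIC statistics:
     rx_frames_ok: 1000000
     rx_no_bufs: 8400
"""
SAMPLE_AFTER = """NIC statistics:
     rx_frames_ok: 1250000
     rx_no_bufs: 8481
"""

before = parse_ethtool_stats(SAMPLE_BEFORE)
after = parse_ethtool_stats(SAMPLE_AFTER)
delta = counter_delta(before, after)
print("rx_no_bufs grew by %d" % delta)
```

Taking a snapshot every second and logging the delta with a timestamp would let the increases be lined up against the application's sequence-gap errors.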
We have tried increasing the receive queues to 8 and the maximum buffers to 4096, with interrupt timers of 10us, but we still get drops.
The defaults of 1 queue and either 512 or 1024 buffers (interrupt timer 10us) did not cut it, and we had lots of errors.
eth1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP qlen 10000
    link/ether 00:xx:xx:xx:xx:xx brd ff:ff:ff:ff:ff:ff
    RX: bytes   packets     errors  dropped  overrun  mcast
    2901579300  1590139645  0       8481     0        1589598683
    TX: bytes   packets     errors  dropped  carrier  collsns
    60491368    661314      0       0        0        0
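For scale, a quick sketch of the drop ratio implied by the RX counters above. The packet and drop counts are copied from the output; the helper name is hypothetical. Note the assumption that these counters are cumulative, so the average will understate how bad an individual burst is.

```python
def drop_ratio(packets, dropped):
    """Fraction of received packets that were counted as dropped."""
    if packets == 0:
        return 0.0
    return float(dropped) / packets

# Values taken from the RX line of the interface statistics above.
RX_PACKETS = 1590139645
RX_DROPPED = 8481

ratio = drop_ratio(RX_PACKETS, RX_DROPPED)
per_million = ratio * 1e6
print("dropped packets per million received: %.2f" % per_million)
```

On average this works out to only a handful of drops per million packets, which fits the picture of short bursts rather than a sustained overload.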
The packet drops seem to be related to how many multicast groups we subscribe to: performance tests across 1 multicast group run fine, with per-second message counts far exceeding the rates coming down the line when we subscribe to all 16 multicast groups.
It looks like the performance of the VIC 1240 is not up to the job.
Would a different adapter such as the VIC 1280 help? From what we can tell, the ASIC is the same and it just has more channels to enable 80 Gb/s.
We are having QoS issues in parallel with this issue at the moment and we are looking at implementing a no-drop QoS policy on the VLAN on which the server operates.
Is it logical to think that the QoS problem is independent of the drops on the server NIC, since the drops look like a plain case of the card not keeping up?
Any input would be much appreciated.
Hope you are doing well.
What is the enic driver version?
How many vNICs are configured on the adapter?
We need more granular details (Hyper-Threading, interrupts, etc.).
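As one way to gather the interrupt detail, here is a rough sketch that summarises per-CPU interrupt counts for one interface from the contents of /proc/interrupts. The sample text is made up, and the IRQ names (`eth1-rx-0` and so on) are an assumption about how the driver labels its vectors; check the real file on the host.

```python
def irq_counts_for_iface(text, iface):
    """Return {irq_name: [count_cpu0, count_cpu1, ...]} for IRQs belonging to iface."""
    lines = text.splitlines()
    if not lines:
        return {}
    num_cpus = len(lines[0].split())  # header row: CPU0 CPU1 ...
    result = {}
    for line in lines[1:]:
        fields = line.split()
        # Expect "<irq>:", one count per CPU, then at least one description field.
        if len(fields) < num_cpus + 2 or not fields[0].endswith(":"):
            continue
        name = fields[-1]
        if name != iface and not name.startswith(iface + "-"):
            continue
        try:
            result[name] = [int(f) for f in fields[1:1 + num_cpus]]
        except ValueError:
            continue
    return result

# Made-up sample in the layout of /proc/interrupts, for illustration only.
SAMPLE_INTERRUPTS = """           CPU0       CPU1
 64:     120000          0   PCI-MSI-edge      eth1-rx-0
 65:         10     118000   PCI-MSI-edge      eth1-rx-1
 66:        500          0   PCI-MSI-edge      eth1-tx-0
NMI:          0          0   Non-maskable interrupts
"""

irqs = irq_counts_for_iface(SAMPLE_INTERRUPTS, "eth1")
for name in sorted(irqs):
    print("%s %s" % (name, irqs[name]))
```

If all receive queues turn out to be serviced by the same CPU, that would be worth including in the details you post back.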
Do you have a TAC SR for this issue? If not, I would suggest opening one so that TAC can assist you further.