02-20-2014 08:11 AM - edited 03-01-2019 11:32 AM
We are having a problem with what appears to be NIC hardware buffers being overrun.
We run B200 M3 servers with 1 x E5-2670 v2 10C CPU + 32GB RAM and the VIC 1240 NIC on RHEL 6.4.
These servers run on both 6120 and 6248 FIs running firmware 2.1(3a).
These servers are running applications which receive UDP traffic at high packet rates when subscribing to approximately 16 multicast groups.
During high load events (we believe these to be microbursts of traffic) the processes running on these servers report missing message sequences, and the process either re-syncs the missing data or kicks itself out of the cluster.
Neither outcome is acceptable.
These errors coincide with the rx_no_bufs value increasing.
NIC statistics:
tx_frames_ok: 661216
tx_unicast_frames_ok: 547203
tx_multicast_frames_ok: 113970
tx_broadcast_frames_ok: 43
tx_bytes_ok: 60485096
tx_unicast_bytes_ok: 51535636
tx_multicast_bytes_ok: 8946708
tx_broadcast_bytes_ok: 2752
tx_drops: 0
tx_errors: 0
tx_tso: 0
rx_frames_ok: 1576022571
rx_frames_total: 1576031052
rx_unicast_frames_ok: 546721
rx_multicast_frames_ok: 1575481611
rx_broadcast_frames_ok: 2720
rx_bytes_ok: 656139514804
rx_unicast_bytes_ok: 356487232
rx_multicast_bytes_ok: 655789236895
rx_broadcast_bytes_ok: 185672
rx_drop: 0
rx_no_bufs: 8481
rx_errors: 0
rx_rss: 0
rx_crc_errors: 0
rx_frames_64: 5856
rx_frames_127: 12490212
rx_frames_255: 693609513
rx_frames_511: 456069311
rx_frames_1023: 242721785
rx_frames_1518: 171134375
rx_frames_to_max: 0
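To put the counters above in proportion, here is a minimal Python sketch that parses "name: value" lines like these and compares rx_no_bufs against the frame totals. The function names (`parse_nic_stats`, `rx_loss_summary`) are made up for illustration and only a few of the counters are reproduced in the sample.

```python
def parse_nic_stats(text):
    """Parse 'name: value' lines (as printed above) into a dict of ints."""
    stats = {}
    for line in text.splitlines():
        name, sep, value = line.partition(":")
        value = value.strip()
        # Skip headers such as "NIC statistics:" that carry no number.
        if sep and value.isdigit():
            stats[name.strip()] = int(value)
    return stats


def rx_loss_summary(stats):
    """Return (total minus ok frames, rx_no_bufs, rx_no_bufs / total frames)."""
    missing = stats["rx_frames_total"] - stats["rx_frames_ok"]
    no_bufs = stats["rx_no_bufs"]
    fraction = no_bufs / stats["rx_frames_total"]
    return missing, no_bufs, fraction


# A subset of the counters posted above.
sample = """\
NIC statistics:
rx_frames_ok: 1576022571
rx_frames_total: 1576031052
rx_drop: 0
rx_no_bufs: 8481
"""

missing, no_bufs, fraction = rx_loss_summary(parse_nic_stats(sample))
print(f"total - ok = {missing}, rx_no_bufs = {no_bufs}, fraction = {fraction:.2e}")
```

With these numbers the difference between rx_frames_total and rx_frames_ok is exactly the rx_no_bufs count, so every frame that was not received OK is accounted for by a lack of buffers. The overall fraction is only a few frames per million, which fits short bursts rather than sustained overload.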
We have tried increasing the receive queues to 8 and the maximum buffers to 4096, with interrupt timers of 10us, but we still get drops.
The defaults of 1 queue and either 512 or 1024 buffers (interrupt timer 10us) did not cut it and we had lots of errors.
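As a rough sanity check on those buffer sizes, the following Python sketch estimates how long a single receive ring can absorb a burst before it overflows. The 10 Gb/s link speed, the 256-byte frame size and the worst case of the host draining nothing during the burst are all assumptions for illustration, not measurements from these servers; the helper names are hypothetical.

```python
def line_rate_pps(link_bps, frame_bytes):
    """Maximum Ethernet frames per second for a given frame size.

    Each frame also occupies 20 extra bytes on the wire:
    8 bytes of preamble/SFD plus a 12-byte inter-frame gap.
    """
    return link_bps / ((frame_bytes + 20) * 8)


def ring_fill_time(ring_entries, arrival_pps, drain_pps=0.0):
    """Seconds until a ring of `ring_entries` buffers overflows.

    Returns infinity when the host drains at least as fast as frames arrive.
    """
    net = arrival_pps - drain_pps
    if net <= 0:
        return float("inf")
    return ring_entries / net


# Assumed values: 10 Gb/s link, 256-byte frames, host not draining at all.
burst_pps = line_rate_pps(10e9, 256)
for entries in (512, 1024, 4096):
    micros = ring_fill_time(entries, burst_pps) * 1e6
    print(f"{entries:5d} buffers last about {micros:.0f} us at {burst_pps:,.0f} pps")
```

Under these assumptions a 4096-entry ring buys a little under a millisecond of line-rate burst, and a 512-entry ring roughly an eighth of that. If the burst lands on one queue, adding more queues does not add headroom for it.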
eth1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP qlen 10000
link/ether 00:xx:xx:xx:xx:xx brd ff:ff:ff:ff:ff:ff
RX: bytes packets errors dropped overrun mcast
2901579300 1590139645 0 8481 0 1589598683
TX: bytes packets errors dropped carrier collsns
60491368 661314 0 0 0 0
The packet drops seem to be related to how many multicast groups we subscribe to: performance tests across 1 multicast group run fine, with per-second message counts far exceeding what comes down the line when we subscribe to all 16 multicast groups.
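One way to compare the 1-group and 16-group tests is to count missing sequence numbers per multicast group on the receiver. The Python sketch below is only an illustration: it assumes each message carries a per-group sequence number that increases by 1, which may not match the real application protocol, and the `GapTracker` class and the group addresses are hypothetical.

```python
from collections import defaultdict


class GapTracker:
    """Count missing sequence numbers separately for each multicast group."""

    def __init__(self):
        self.expected = {}               # group -> next sequence number expected
        self.missing = defaultdict(int)  # group -> messages never seen

    def observe(self, group, seq):
        expected = self.expected.get(group)
        # A jump past the expected number means the ones in between were lost.
        if expected is not None and seq > expected:
            self.missing[group] += seq - expected
        # Late or duplicate messages do not move the expectation backwards.
        if expected is None or seq >= expected:
            self.expected[group] = seq + 1


tracker = GapTracker()
# Hypothetical traffic: group A loses sequence numbers 3 and 4, group B loses nothing.
for group, seq in [("239.1.1.1", 1), ("239.1.1.1", 2), ("239.1.1.2", 1),
                   ("239.1.1.1", 5), ("239.1.1.2", 2)]:
    tracker.observe(group, seq)

print(dict(tracker.missing))
```

Logging these per-group counts alongside rx_no_bufs would show whether the losses are spread across all 16 groups or concentrated in a few busy ones.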
It looks like the performance of the VIC 1240 is not up to it.
Would a different network card such as the VIC 1280 help? From what we can see the ASIC is the same and it just has more channels to reach 80 Gb/s, is that right?
We are having QoS issues in parallel with this issue at the moment and we are looking at implementing a no-drop QoS policy on the VLAN on which the server operates.
Is it logical to think that the QoS issue is independent of the drops on the server NIC, given that those look like a plain case of the card not keeping up?
Any input would be much appreciated.
Rob.