
VIC1240 running out of buffers

igucs_support
Level 1

We are having a problem with what appears to be NIC hardware buffers being overrun.

We run B200 M3 servers with 1 x E5-2670 v2 10C CPU + 32 GB RAM and the VIC 1240 NIC card, on RHEL 6.4.

These servers run on both 6120 and 6248 FIs running firmware 2.1(3a).

These servers are running applications which receive UDP traffic at high packet rates when subscribing to approx 16 multicast groups.

During high-load events (we believe these to be microbursts of traffic) the processes running on these servers error out with missing message sequences, and each affected process then either re-syncs the missing data or kicks itself out of the cluster.

Neither outcome is acceptable.

These errors coincide with the rx_no_bufs value increasing.

NIC statistics:

     tx_frames_ok: 661216

     tx_unicast_frames_ok: 547203

     tx_multicast_frames_ok: 113970

     tx_broadcast_frames_ok: 43

     tx_bytes_ok: 60485096

     tx_unicast_bytes_ok: 51535636

     tx_multicast_bytes_ok: 8946708

     tx_broadcast_bytes_ok: 2752

     tx_drops: 0

     tx_errors: 0

     tx_tso: 0

     rx_frames_ok: 1576022571

     rx_frames_total: 1576031052

     rx_unicast_frames_ok: 546721

     rx_multicast_frames_ok: 1575481611

     rx_broadcast_frames_ok: 2720

     rx_bytes_ok: 656139514804

     rx_unicast_bytes_ok: 356487232

     rx_multicast_bytes_ok: 655789236895

     rx_broadcast_bytes_ok: 185672

     rx_drop: 0

     rx_no_bufs: 8481

     rx_errors: 0

     rx_rss: 0

     rx_crc_errors: 0

     rx_frames_64: 5856

     rx_frames_127: 12490212

     rx_frames_255: 693609513

     rx_frames_511: 456069311

     rx_frames_1023: 242721785

     rx_frames_1518: 171134375

     rx_frames_to_max: 0
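
For anyone trying to correlate the drops with the load spikes in real time, a simple way to watch the counter deltas (assuming the interface is eth1, as in the output further down) is:

# watch -n 1 -d 'ethtool -S eth1 | egrep "rx_no_bufs|rx_frames_ok"'

rx_no_bufs counts frames dropped because the receive queue had no free buffers, so any increment during a suspected microburst points at the ring being overrun.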

We have tried upping the receive queues to 8 and the ring size to the maximum of 4096 buffers, with an interrupt timer of 10 us, however we still get drops.

The defaults of 1 queue and either 512 or 1024 buffers (interrupt timer 10 us) did not cut it and we had lots of errors.

    eth1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP qlen 10000

    link/ether 00:xx:xx:xx:xx:xx brd ff:ff:ff:ff:ff:ff

    RX: bytes  packets  errors  dropped overrun mcast

    2901579300 1590139645 0       8481    0       1589598683

    TX: bytes  packets  errors  dropped carrier collsns

    60491368   661314   0       0       0       0
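
(For reference, that output comes from something like # ip -s link show dev eth1 — note the "dropped" count of 8481 matches the rx_no_bufs counter above.)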

The packet drops seem to be related to how many multicast groups we subscribe to: performance tests across a single multicast group perform fine, with per-second message counts far exceeding what comes down the line when we subscribe to all 16 groups.
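
To confirm how many groups the interface has actually joined, the kernel's multicast memberships can be listed (interface name assumed):

# ip maddr show dev eth1
# cat /proc/net/igmp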

It looks like the performance of the VIC 1240 is not up to it.

Would a different network card such as the VIC 1280 help? From what we can tell the ASIC is the same; it just has more channels to enable 80 Gb/s.

In parallel with this, we are also having QoS issues and are looking at implementing a no-drop QoS policy on the VLAN the server operates on.

Is it logical to think that the QoS work is independent of the drops on the server NIC, since the latter look like a plain case of the card not keeping up?

Any input would be much appreciated.

Rob.

7 Replies

padramas
Cisco Employee

Hello Rob,

Hope you are doing good.

What is the ENIC driver version?

How many vNICs are configured on it?

We need more granular details (HT, interrupts, etc.).

Do you have a TAC SR for this issue? If not, I would suggest you open one so we can further assist you.

Padma

Hi Padma,

Driver output is:

driver: enic

version: 2.1.1.39

firmware-version: 2.1(3a)

bus-info: 0000:06:00.0

supports-statistics: yes

supports-test: no

supports-eeprom-access: no

supports-register-dump: no

supports-priv-flags: no

There are 2 vNICs: one for the multicast backbone network, one for normal TCP traffic.

The following Adapter Profile NIC settings were applied via UCS Manager:

Transmit Queues = 1

Ring Size = 1024

Receive Queues = 8

Ring Size = 1024 - 4096 ( same behaviour with either setting )

Completion Queues = 9

Interrupts = 11

Transmit Checksum Offload = Enabled

Receive Checksum Offload = Enabled

TCP Segmentation Offload = Enabled

TCP Large Receive Offload = Enabled

Receive Side Scaling = Enabled

Failback Timeout = 5

Interrupt Mode = MSI-X

Int Coalescing = MIN

Interrupt Timer = 10us

I have attached the interrupt output from the kernel in the original posting.
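
For anyone following along, the same information can be checked live in /proc/interrupts; with 8 receive queues it is worth confirming the per-queue vectors are spread across cores rather than all landing on CPU0 (interface name assumed):

# grep eth1 /proc/interrupts
# cat /proc/irq/<N>/smp_affinity    # <N> = an IRQ number taken from the line above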

I faced the same issue on Red Hat 6.5 (2.6.32-431.23.3.el6.x86_64 kernel).

The Cisco driver 2.1.1.75 (from cisco.com) is installed according to the recommendation at http://www.cisco.com/web/techdoc/ucs/interoperability/matrix/matrix.html for our UCS SW version 5.2(3)N2(2.23d), UCS-IOM-2204XP, UCSN-MLOM-40G-01, VIC 1240.

modinfo enic
filename:       /lib/modules/2.6.32-431.23.3.el6.x86_64/weak-updates/enic/enic.ko
version:        2.1.1.75

On the OS side I can neither see the ring (buffer) value nor change it:

ethtool --show-ring eth2

Ring parameters for eth2:
Cannot get device ring settings: Operation not supported

ethtool --set-ring eth2 rx 1024 tx 1024

Cannot get device ring settings: Operation not supported

The same commands work fine for the same Red Hat version under VMware (ESXi), which makes me think I am facing a problem with the Cisco driver.

As a result we see TCP packet drops at around the 1 Gbps level.

Did anybody find a solution to this problem?

What could be checked:

1) Install the correct drivers for the NIC according to the UCS SW/HW version (http://www.cisco.com/web/techdoc/ucs/interoperability/matrix/matrix.html). Increase the number of queues and the buffer size on the UCS side from the defaults to the recommended values (http://toreanderson.github.io/2015/10/08/cisco-ucs-multi-queue-nics-and-rss.html).

2) Check the current ring value on the OS side with ethtool -g, or if that is not supported (CSCuy51507), check it in the dmesg log:

enic 0000:06:00.0: vNIC MAC addr 00:25:b5:00:00:0f wq/rq 4096/4096 mtu 1500
enic 0000:06:00.0: vNIC csum tx/rx yes/yes tso/lro yes/yes rss yes intr mode any type min timer 125 usec loopback tag 0x0000
enic 0000:06:00.0: vNIC resources avail: wq 8 rq 8 cq 16 intr 18
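
(Those lines can be pulled out of the log with something like # dmesg | grep -i enic; the PCI address and the resource counts will differ depending on the adapter policy.)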

3) irqbalance should be enabled; or, if the kernel was updated and irqbalance was disabled, interrupt/queue balancing should be set up manually (e.g. a set_affinity script) instead. A sketch is below.
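
A minimal manual-pinning sketch, assuming the enic MSI-X vectors appear in /proc/interrupts as eth2-rx-0, eth2-rx-1, ... and using example IRQ numbers only:

# service irqbalance stop                 # only if you are pinning manually instead
# echo 1 > /proc/irq/45/smp_affinity      # eth2-rx-0 -> CPU0 (hex bitmask)
# echo 2 > /proc/irq/46/smp_affinity      # eth2-rx-1 -> CPU1
# echo 4 > /proc/irq/47/smp_affinity      # eth2-rx-2 -> CPU2

The IRQ numbers 45-47 are placeholders; read the real ones from /proc/interrupts first.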

4) The kernel is probably not getting enough time to clear the ring buffer.

This can be fixed with # sysctl -w net.core.netdev_budget=600 (to make it permanent, add net.core.netdev_budget=600 to /etc/sysctl.conf).

In my case this step solved the ring buffer drop problem.
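
A quick way to see whether the budget is actually being exhausted is the third column of /proc/net/softnet_stat (a hex counter, one row per CPU, of how often the NAPI poll loop ran out of budget or time):

# awk '{print NR": "$3}' /proc/net/softnet_stat

If that column keeps climbing together with rx_no_bufs, raising net.core.netdev_budget as above is a reasonable next step.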

5) Check the "How can I tune the TCP Socket Buffers" article:

# netstat -sn | egrep "prune|collap"; sleep 30; netstat -sn | egrep "prune|collap"

If necessary, tune /proc/sys/net/ipv4/tcp_rmem (requires an application restart) or the application's own settings.
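
For reference, a sketch of inspecting and raising the receive buffer limits (the three tcp_rmem values are min/default/max in bytes; the numbers here are an example, not a recommendation):

# sysctl net.ipv4.tcp_rmem
# sysctl -w net.ipv4.tcp_rmem="4096 87380 16777216"
# sysctl -w net.core.rmem_max=16777216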

jrandles65
Level 1

Was there a resolution to this issue? What fixed it?

Excellent article with hints that saved us from months of troubleshooting!

We had issues with NFS storage networking for over 8 months on two Citrix XenServer clusters (a RHEL-derived distribution) based on UCS blade hardware. We escalated the issue with Citrix and concluded that packets were being lost outside XenServer, at the vNIC level.

By increasing the tx/rx queues (the default was 1/1) in the Ethernet Adapter Policy according to this article, the issue appears to be resolved.

A million thanks to the community!

Hi Team,

I am facing the same issue of RX drops on my UCS physical NIC connected to the Fabric Interconnect.

I also see RX drops on almost every device connected to the Fabric Interconnect, and the bad thing is that almost everything is connected to the FI.

Is there a step-by-step solution we can apply to fix the buffer issue or RX drops? We have already updated the drivers, increased the queues to 8 and increased the buffer size to 4096. That reduced the drop counts but did not fix the problem completely.

We still see around 50,000 drops in the last 12 hours...

Thanks

Arun Kumar
