Slow data access with VMware / Nexus 5548UP / NetApp

Hello,

We've been experiencing a lot of performance issues over the last few months, and I'm still trying to figure out where the problem is.

We have an IBM N series 6040 (= NetApp) storage system (2 controllers), connected to two Nexus 5548UP switches with Fibre Channel (2 fibres per controller, one to each Nexus 5548, so 4 fibres in total).

Our ESXi 5.1 servers are connected with FCoE to the two Nexus 5548UP switches through QLogic 8152 CNAs. Paths are managed with ALUA (the NetApp provides ALUA support), so Round Robin runs over the optimized paths.
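In case it matters, this is how I verify on the ESXi side that the NetApp LUNs really picked up the ALUA SATP and the Round Robin policy (just a sketch using the standard esxcli commands on 5.x; the grep pattern is only an example):

esxcli storage nmp device list | grep -iA9 netapp
    -> each LUN should show "Storage Array Type: VMW_SATP_ALUA"
       and "Path Selection Policy: VMW_PSP_RR"
esxcli storage nmp path list
    -> lists every path and its group state, to confirm Round Robin
       only uses the active/optimized paths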

I updated all the firmware on the hosts (Novascale R460, Dell R710) and on the QLogic 8152 cards.

The NetApp (IBM) support says "it's not the storage"... so I'm now investigating the Nexus config.

I still see huge latencies, reported by esxtop and by alerts on the hosts ("latency increased to ..."). This latency sometimes reaches 300 ms, which according to VMware is unacceptable.
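For reference, this is how I read the latency in esxtop to see where the delay sits (host queuing versus fabric/array):

esxtop              (via SSH on the ESXi host)
  d                 -> disk adapter view (the vmhba of the QLogic CNA)
  u                 -> disk device view (per LUN)
  DAVG/cmd          -> latency below the host (fabric + array)
  KAVG/cmd          -> latency added by the VMkernel (queuing on the host)
  GAVG/cmd          -> DAVG + KAVG, what the guest actually sees

When the 300 ms spikes show up mostly in DAVG, the delay is outside the host, i.e. on the fabric or on the array.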

Am I missing something in the Nexus config? I've never had any formal training on it...

Here is the config, which is the same on both Nexus switches, except for the VSAN VLAN ID.

interface Ethernet1/7
  description ESX7
  switchport mode trunk
  switchport trunk allowed vlan xx,xxx,xxx,xxxx,xxxxxx
  channel-group 7

interface port-channel7
  description ESX7
  switchport mode trunk
  switchport trunk allowed vlan xx,xxx,xxx,xxxx,xxxxxx
  speed 10000
  vpc 7

interface vfc7
  bind interface port-channel7
  switchport description VFC ESX7
  no shutdown
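Besides the error counters further down, these are the checks I run on the FCoE/vPC side (command names from memory on NX-OS 5.x, so double-check the exact syntax on your release):

show interface vfc7                       -> vfc is trunking, carries the right VSAN, FCoE traffic counters
show vpc                                  -> vPC 7 is up and consistent on both peers
show vpc consistency-parameters vpc 7     -> no type-1 mismatch that would suspend the vPC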

SAN Connection:

interface fc1/32
  switchport trunk allowed vsan Y
  switchport description N6040-CTRLA
  no shutdown

interface fc2/16
  switchport trunk allowed vsan Y
  switchport description N6040-CTRLB
  no shutdown
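On the native FC side towards the NetApp, this is what I use to confirm logins and zoning (standard show commands, nothing exotic):

show flogi database vsan Y      -> the N6040 target ports are logged in with the expected WWPNs
show fcns database vsan Y       -> host initiators and NetApp targets registered in the same VSAN
show zoneset active vsan Y      -> the active zoneset really contains those initiator/target pairs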

sho inter eth1/7 counters errors

--------------------------------------------------------------------------------
Port          Align-Err    FCS-Err   Xmit-Err    Rcv-Err  UnderSize OutDiscards
--------------------------------------------------------------------------------
Eth1/7                0          0          0          0          0           0
--------------------------------------------------------------------------------
Port         Single-Col  Multi-Col   Late-Col  Exces-Col  Carri-Sen       Runts
--------------------------------------------------------------------------------
Eth1/7                0          0          0          0          0           0
--------------------------------------------------------------------------------
Port          Giants SQETest-Err Deferred-Tx IntMacTx-Er IntMacRx-Er Symbol-Err
--------------------------------------------------------------------------------
Eth1/7             0          --           0           0           0          0

sho inter po7 counters errors

--------------------------------------------------------------------------------
Port          Align-Err    FCS-Err   Xmit-Err    Rcv-Err  UnderSize OutDiscards
--------------------------------------------------------------------------------
Po7                   0          0          0          0          0           0
--------------------------------------------------------------------------------
Port         Single-Col  Multi-Col   Late-Col  Exces-Col  Carri-Sen       Runts
--------------------------------------------------------------------------------
Po7                   0          0          0          0          0           0
--------------------------------------------------------------------------------
Port          Giants SQETest-Err Deferred-Tx IntMacTx-Er IntMacRx-Er Symbol-Err
--------------------------------------------------------------------------------
Po7                0          --           0           0           0          0

sho inter fc1/30 counters

fc1/30
    1 minute input rate 1776008 bits/sec, 222001 bytes/sec, 140 frames/sec
    1 minute output rate 360048 bits/sec, 45006 bytes/sec, 81 frames/sec
    4532202353 frames input, 5364654889448 bytes
      0 class-2 frames, 0 bytes
      4532202353 class-3 frames, 5364654889448 bytes
      0 class-f frames, 0 bytes
      0 discards, 0 errors, 0 CRC
      0 unknown class, 0 too long, 0 too short
    9317351721 frames output, 14089148468736 bytes
      0 class-2 frames, 0 bytes
      9317351721 class-3 frames, 14089148468736 bytes
      0 class-f frames, 0 bytes
      0 discards, 0 errors
    0 input OLS, 0 LRR, 0 NOS, 0 loop inits
    1 output OLS, 1 LRR, 0 NOS, 0 loop inits
    0 link failures, 0 sync losses, 0 signal losses
    0 transmit B2B credit transitions from zero
    0 receive B2B credit transitions from zero
    16 receive B2B credit remaining
    3 transmit B2B credit remaining
    0 low priority transmit B2B credit remaining

sho inter fc1/32 counters

fc1/32
    1 minute input rate 222837768 bits/sec, 27854721 bytes/sec, 14937 frames/sec
    1 minute output rate 86227648 bits/sec, 10778456 bytes/sec, 6377 frames/sec
    119702843694 frames input, 206144432348384 bytes
      0 class-2 frames, 0 bytes
      119702843694 class-3 frames, 206144432348384 bytes
      0 class-f frames, 0 bytes
      0 discards, 0 errors, 0 CRC
      0 unknown class, 0 too long, 0 too short
    44140587957 frames output, 56851588018912 bytes
      0 class-2 frames, 0 bytes
      44140587957 class-3 frames, 56851588018912 bytes
      0 class-f frames, 0 bytes
      0 discards, 0 errors
    2 input OLS, 2 LRR, 0 NOS, 0 loop inits
    7 output OLS, 2 LRR, 4 NOS, 0 loop inits
    3 link failures, 1 sync losses, 0 signal losses
    0 transmit B2B credit transitions from zero
    0 receive B2B credit transitions from zero
    16 receive B2B credit remaining
    1 transmit B2B credit remaining
    0 low priority transmit B2B credit remaining
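The one thing that catches my eye is that fc1/32 (the busy link to controller A) only has 1 transmit B2B credit remaining in this snapshot, even though the "transmit B2B credit transitions from zero" counter is still 0. What I plan to do is re-check this during a latency spike:

show interface fc1/32 counters | include B2B
    -> run it a few times, a minute apart, while esxtop shows the high DAVG;
       if "transmit B2B credit transitions from zero" starts climbing, the
       NetApp target port is not returning credits fast enough
show interface fc1/32 bbcredit
    -> (if available on your NX-OS release) configured vs remaining buffer-to-buffer credits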

I don't have flow control enabled (could enabling it solve the problem? The SAN itself is connected with native FC, not FCoE).

sho inter eth1/7 flowcontrol

--------------------------------------------------------------------------------
Port         Send FlowControl  Receive FlowControl  RxPause   TxPause
             admin    oper     admin    oper
--------------------------------------------------------------------------------
Eth1/7       off      off      off      off         0                 0
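From what I understand, 802.3x link-level flow control being off is normal here: for the FCoE traffic the 5548UP relies on per-priority pause (PFC) and the default no-drop class-fcoe policy instead. So rather than the flowcontrol counters above, I'm looking at these (syntax from memory, check your release):

show interface ethernet 1/7 priority-flow-control    -> PFC admin/oper state and per-priority pause frames sent/received
show queuing interface ethernet 1/7                  -> per-class queue usage and drops; the FCoE no-drop class should show none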

Has anyone already experienced these issues, or could someone give me some advice?

Thanks in advance...

5 Replies

sven.meinks
Level 1

Hi,

Were you able to find the cause? We have a very similar problem here. The only difference is that we're using NFS instead of FCoE.

Thanks and regards,

Sven

Having the exact same issue here. Using 5548UPs, same configs on my Nexus as you, same port reports... going native FC 4Gb to a NetApp 2040 and FCoE to Emulex CNAs on HP servers. ESXi reports up to 30000 ms (yes, 30K) when doing heavy I/O; at idle it is normal. No drops, no errors at all. Windows 2012 does the same thing... but even worse!

So we know it is not the OS, because two different OSes do it, and we know it is not the CNA, because you have QLogic and we have Emulex. That leaves only the NetApp or the Nexus. Well, we also run iSCSI on the NetApp and do not see this there, which leads me to believe there is some major issue with FCoE on the Nexus... particular to NetApp perhaps? We also have a NetApp 3250 behind the same Nexus 5548UPs doing native FCoE on the array as well, so native FCoE to native FCoE: same issues. We should be getting 500 MB/s and we are getting 100 MB/s peak, which drops to 50 MB/s during heavy I/O. When we ran 1Gb iSCSI on Cisco 3750s we got 50-100 MB/s; now on the Nexus with 10Gb and FC/FCoE we are getting literally the same performance. Huh?

payex_rjo
Level 1

We are having similar issues as well.

A dot1q-tunneling network built on WS-C3750X-24T-L switches, with 10Gbit between them.

Edge switches connected with LACP and EtherChannel to the 3750X devices.

NetApp connected via 10Gbit to a 3750X, with a 2x1Gbit LACP fallback to an edge switch.

VMware/ESXi 5.1 connected to an edge switch with EtherChannel, 2x1Gbit.

When we activate the 10Gbit interface for the NetApp, the latency goes through the roof. Wireshark shows a lot of strange things such as duplicate ACKs and TCP zero window.
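For anyone trying to narrow this down in a capture, a display filter along these lines isolates exactly those symptoms (standard Wireshark analysis fields, nothing specific to our capture):

tcp.analysis.duplicate_ack || tcp.analysis.zero_window || tcp.analysis.retransmission

Whether the zero-window frames come from the NetApp side or from the ESXi side tells you which end is running out of receive buffer.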

When we connect an ESXi host directly to a 10Gbit port on a new switch and add the NetApp there as well, that part of the communication works very well; at the same time, the remote ESXi hosts still have the strange latency issues, since their traffic travels over the dot1q network.

Did you guys find any solution or pinpoint the cause of this behavior?

//Rob

The reason for our issue was what the Cisco engineer called "microbursts": short bursts that fill up the egress queue but don't show up as high bandwidth usage in the averaged counters.

So there were drops on the outgoing interface towards our 3750 stack with its 4x1Gbit EtherChannel.
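In case it helps anyone else chasing microbursts: the interface-level utilization looked fine, so the place to look is the per-queue drop counters on the 1Gbit side, where the 10Gbit-to-1Gbit speed step-down overflows the buffers. On a 3750X that would be something like the following (gi1/0/1 is just an example port):

show mls qos interface gi1/0/1 statistics    -> "output queues dropped" per queue and threshold
show interfaces gi1/0/1 counters errors      -> OutDiscards at the interface level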

We rebuilt the network, removed this device, and made it a true 10Gbit network. Now everything works just fine!

Steven Williams
Level 4

IBM, that's the problem. LOL