
VM Datastore Latency

Configuration Notes:

    • UCS v2.1(3f)
    • 4x B200 M2 w/ M81KR VIC (Palo)
    • 4 vNICs provisioned per ESX host: 2 for Production traffic / 2 for Storage & vMotion (each is on a separate VLAN w/ no SVI, flat VLANs). The vNIC/vmkernel layout can be verified with the commands shown after this list.
    • Storage & vMotion vNICs:  1 connected to Fabric A / 1 connected to Fabric B.
    • Chassis Discovery: 4-link w/ IOM 2104XP / Link Grouping Preference: None
    • Storage VLAN is L2-only
    • 6248s are dual VPC'd to a pair of 5548s
    • 5548s are Etherchannel-linked to 2 6500s w/ dual Sup720 (no VSS)
    • 6500-A is the root bridge for the Storage VLAN  
    • vSphere version 5.1 update 3
    • 2 vDS provisioned: 1 for Production traffic and vMgmt / 1 for Storage & vMotion
    • NIC Teaming & Load Balancing: IP Hash
    • FAS 3170 v8.1.2P4 7-Mode
    • NFS v3 exports hosted on a vfiler w/ VIFs.  FAS1 hosts the vfiler and only contains one vfiler.
    • Dual 3170 Filer heads are VPC'd to the 5548s
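
For reference, a quick way to sanity-check the vNIC/vmkernel layout above from the ESXi shell (interface names will vary per host; this is just a sketch):

    # List vmkernel ports with their IP, portgroup and MTU
    esxcfg-vmknic -l

    # List the physical vmnics (the four UCS vNICs) with link state and speed
    esxcli network nic list

    # Show vSwitch/vDS-to-uplink mapping
    esxcfg-vswitch -l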

Description/Troubleshooting Performed:

    • We noticed that our ESX hosts were experiencing high read/write latency to the VM datastore (NFS export) via both the A and B paths: 138 ms write / 50 ms read on the B path, and 20-30 ms less on the A path.  Normal latency is considered to be <5 ms write / <1 ms read.
    • Latency issues are seen when a host uses either the A or B path to get to a datastore.  We tested this by removing the A-fabric vNIC from the ESX host, rebooting the host and then monitoring the Performance tab > Datastore in vSphere while a VM (guest OS) booted (see the esxtop sketch after this list).  When we re-added the A-fabric vNIC to the host, removed the B-fabric vNIC and rebooted, the latency issues returned but with "less" latency (20-30 ms less).
    • All L2 pathing and spanning tree configurations have been checked and found to be correct.  As a test measure, we changed the root bridge for vStorage to 5548a and then changed the root bridge to 5548b.  Same results: the B-Fabric vNIC still experiences latency.  We have reset the root bridge to its original configuration (6500a).
    • We have checked the Port-Channel configurations between the 5548s and the 6248s and found that spanning tree port type network was correctly set as suggested below.
    • Monitored tcpdump captures from the hosts' CLI and have not seen any re-transmits, resets or excessive ARPing from the hosts to the NetApp.
    • vmkping'd the NetApp with a count of 100.  No dropped packets or excessive latency seen.  We also validated our jumbo framing by vmkping'ing the datastore with a size of 8000 and the DF bit set.

    • Studied the TX/RX bandwidth utilization for each vPC to the FIs and the FAS heads.  The highest utilization seen was 4% TX / 10% RX from 5548a on its 10GbE interface to FAS2.  The 2nd interface of the vPC on 5548b showed 0% TX/RX.
    • Investigated the NetApp as a potential culprit.  Output of the nfs_hist command shows minimal latency for read and write operations, 0.01 ms avg read / 1.95 ms write.  Output of sysstat -x 1 shows minimal CPU utilization, 2-18% avg with spikes as high as 36%.  Disk utilization shows 2-7% avg with spikes up to 20%.  CP type shows normal flushing to WAFL (dashes and Tf codes in the CP ty column).  I honestly think the NetApp is sleeping, yet the latency still exists, which leads me to believe the issue is elsewhere, but I've gotten NetApp involved to get an outside opinion (the filer-side commands are listed after this list).
    • Reconfigured flow control on all links between the NetApp and UCS.  When we first disabled flow control the latency started to drop.  To validate that flow control was the culprit, we turned it back on, watched the latency climb and then shut it off.  Ironically, when we shut off flow control, the latency did not decrease as we'd seen before.  I think it's safe to say that flow control is not the issue and what we saw was just a coincidence.
    • Attempted to isolate the path to the NetApp from the 5548s.  We disabled the 5548B vPC uplink to FAS1 but the latency did not subside.  Attempted the same test from 5548A to FAS1 but again, the latency did not subside.
    • VMware dispatched a health assessment technician to our site to look over the status of the vCenter/ESXi environment and provide a configuration audit.  A few recommendations were to install VAAI and VSC in order to ensure NFS datastore best-practice configurations are applied when new volumes/datastores are created.  One recommendation that stood out was changing the NIC Teaming & Load Balancing configuration to Load Based Teaming instead of IP Hash.  A few other recommendations were to set the NetIOC Fault Tolerance policy to high, apply network traffic flow policies, consistently configure SIOC at the datastore level and set the storage congestion threshold to a value based on the disk type at the storage array.  Will implement these changes and see what effect they have.
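
For anyone repeating the A/B-fabric test, this is roughly the esxtop workflow referenced above for watching latency while a guest boots (interactive keys; column names may differ slightly between builds):

    esxtop
    # press 'v' for the per-VM virtual disk view (read/write latency per VM)
    # press 'n' for the network view to confirm which vmnic (fabric A or B)
    #   the storage vmkernel port is actually using
    # press 's', then enter 2, to refresh every 2 seconds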
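And the filer-side commands referenced above, for anyone who wants to compare numbers (7-Mode; nfs_hist needs an elevated privilege level, so treat it as read-only diagnostics):

    sysstat -x 1        # CPU, disk util, NFS ops and CP type, once per second
    ifstat -a           # per-interface counters, including the VIF members
    priv set diag
    nfs_hist            # NFS read/write latency histogram
    priv set admin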

Has anyone seen issues like this before?  At first we thought it could have been a compatibility matrix issue, but as far as I can see from the compatibility checker tool we are fine.  Any thoughts/ideas would be greatly appreciated as we are all scratching our heads.  Thanks in advance for comments or suggestions.

Edit: updated troubleshooting steps taken. 3/24/2016 1117 EST

8 Replies

Walter Dey
VIP Alumni

For storage and vmotion you have vmkernel interfaces.

Do you have multiple vmkernel interfaces in the same IP subnet ?

What is the exact configuration of these vmkernel interfaces?

The vmstorage vmkernel interfaces for each host are on the same subnet / VLAN.  Additionally, the vfiler has a vmstorage VLAN VIF with an IP address from the vmstorage subnet.  vMotion's vmkernel IPs are on a separate subnet; vMotion likewise has a VIF on the vfiler with its own IP address, separate from vmstorage.

Bump.

Kirk J
Cisco Employee

Greetings.

Assuming your UCSM is in the default 'end host mode', can you confirm all your northbound links have 'spanning-tree port type edge trunk' configs? Need to make sure your UCSM uplinks aren't needlessly getting entangled in STP events.
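
Something like this on each 5548 should confirm it (port-channel numbers below are just examples, substitute your FI-facing vPCs):

    show spanning-tree interface port-channel 101 detail
    show running-config interface port-channel 101 | include spanning-tree
    # the FI-facing vPCs should show 'spanning-tree port type edge trunk'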

It might be helpful to get a packet capture of the NFS VMK, and see if we are getting re-transmits, arp issues, etc:

  • tcpdump-uw -i vmk3 -s 9014 -B 9  (if you are using jumbo frames, and where VMK3 is your NFS VMK)
  • tcpdump-uw -i vmk3 -s 1514 (if you aren't using jumbo frames, and where VMK3 is your NFS VMK)
If you issue a vmkping from your NFS vmk, does it appear you ever lose any packets?
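
Something along these lines from the ESXi shell should exercise the path (vmk3 and the filer IP are placeholders; the -I option may not exist on older builds):

    # standard-MTU test from the NFS vmkernel port
    vmkping -I vmk3 -c 100 <filer-ip>

    # jumbo-frame test: 8972-byte payload (9000 minus IP/ICMP headers) with DF set
    vmkping -I vmk3 -d -s 8972 <filer-ip>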
 
Thanks,
Kirk...

Kirk,

Thanks for the reply!  We have checked the northbound links and each of the VPCs are set to spanning-tree port type edge trunk.  

We also monitored the tcp dumps from the hosts CLI and have not seen any retransmits, resets or excessive arp'ing from the hosts to the NetApp.

We also vmk ping'd the NetApp with a count of 100.  No dropped packets seen.

According to the datastore performance tab in vSphere, this issue seems to have been fixed but I'm not thoroughly convinced...

Earlier today, we completely removed all flow control settings between UCS, the 5548s and the NetApp.  Prior to removing the flow control settings the datastore latency appeared to have lowered.  I don't have specific numbers as I was removing flow control, but the latency was lower than what I saw last week.  Since removing flow control, we've seen latency averages drop further, currently between 1 ms and 5 ms on average.  Not sure if this is the real fix or if the issue "went away" as it has in the past.  Continuing to monitor.
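
For reference, the flow-control knobs we've been toggling on the 5548 ports toward UCS and the filer look roughly like this (interface numbers are placeholders):

    configure terminal
    interface Ethernet1/17
      flowcontrol receive off
      flowcontrol send off
    end
    show interface ethernet 1/17 flowcontrol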

On a side note, I went through all of the interfaces between UCS and the NetApp looking for tx/rx pause frames.  In UCS Manager I didn't see any pause frames.  On the 5548s I did not see any pause frames on the links to UCS.  However, when I looked at the links from the 5548s to the NetApp, I saw rx pause frames from the NetApp.  When I went to the NetApp and looked at the ifstat output (the NetApp's version of "sh int") for each uplink to the 5548s, I expected to see tx pause frames sent to the 5548s, but the counters on the NetApp showed 0!  Neither the 5548s nor the NetApp have had any counters cleared in the last 30 days AFAIK.  Thinking that pause frames might be counted at the virtual interface on the NetApp, I checked there, and it doesn't have a counter for pause frames.  However, it did have a queue overflow counter, but that was 0 as well.  Completely confused.
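
For the record, these are roughly the counters I've been checking (interface and VIF names are placeholders for ours):

    # 5548 side: pause counters on the links toward UCS and the filer
    show interface ethernet 1/17 flowcontrol
    show interface ethernet 1/17 | include pause

    # NetApp side (7-Mode): per-link and VIF counters
    ifstat e1a
    ifstat storage_vif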

On a related side note, I'm now going to be looking for dropped frames.  If flow control was working as intended, it could have been saving us from a much larger issue.
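
The drop counters I plan to watch, roughly (again, interface numbers are placeholders):

    # error/discard counters across the 5548 interfaces
    show interface counters errors

    # ingress/egress queue drops on the links toward UCS and the filer
    show queuing interface ethernet 1/17

    # NetApp side: input/output errors per interface
    netstat -i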

It appears as if my gut feeling was right.  Disabling flow control did not resolve the issue.  I re-enabled the flow control settings and the latency did return; however, when I re-disabled it, the latency issues persisted.

To aid in troubleshooting we have installed evaluation copies of SolarWinds Storage Resource Monitor, Virtualization Manager, and Server & Application Manager.  Hopefully this will shed some light on where the issue is.

I know this is a few years old, but a persuasive salesperson managed to get unknowing purchasers to buy older hardware, and we are having the same issues you had when this was written.  Did you ever find a solution for this issue?
