01-21-2016 01:35 PM - edited 03-01-2019 12:33 PM
Configuration Notes:
Description/Troubleshooting Performed:
We vmkping'd the NetApp with a count of 100 and saw no dropped packets or excessive latency. We also validated our jumbo frame configuration by vmkping'ing the datastore with a size of 8000 and the DF bit set.
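For reference, the tests above can be reproduced from the ESXi shell roughly as follows (vmk1 and the target IP are placeholder values; substitute your storage vmkernel interface and NetApp address):

```shell
# Basic reachability from the storage vmkernel interface: 100 pings,
# watching for drops or latency spikes.
vmkping -I vmk1 -c 100 192.0.2.50

# Jumbo-frame validation: -d sets the DF (don't fragment) bit and -s
# sets the ICMP payload size. A size of 8000 proves frames well above
# 1500 bytes pass end to end; with a 9000-byte MTU the largest payload
# that fits unfragmented is 8972 (9000 - 20 IP - 8 ICMP headers).
vmkping -I vmk1 -d -s 8972 192.0.2.50
```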
Has anyone seen issues like this before? At first we thought it could have been a compatibility matrix issue, but as far as I can tell from the compatibility checker tool we are fine. Any thoughts or ideas would be greatly appreciated, as we are all scratching our heads. Thanks in advance for comments or suggestions.
Edit: updated troubleshooting steps taken. 3/24/2016 1117 EST
01-21-2016 02:40 PM
For storage and vmotion you have vmkernel interfaces.
Do you have multiple vmkernel interfaces in the same IP subnet ?
What is the exact configuration of these vmkernel interfaces?
01-21-2016 03:56 PM
The vmstorage vmkernel interfaces for each host are on the same subnet/VLAN. Additionally, the vfiler has a vmstorage VLAN VIF with an IP address from the vmstorage subnet. vMotion's vmkernel IPs are on a separate subnet, and vMotion likewise has its own VIF on the vfiler with an IP address separate from vmstorage.
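The vmkernel configuration in question can be dumped from the ESXi shell; a sketch using standard ESXi commands:

```shell
# Show every vmkernel NIC with its IP, netmask, MTU, and port group --
# handy for spotting multiple vmks in the same IP subnet.
esxcfg-vmknic -l

# Equivalent esxcli views: interface inventory plus IPv4 addressing.
esxcli network ip interface list
esxcli network ip interface ipv4 get
```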
01-23-2016 12:46 PM
Bump.
01-24-2016 04:51 AM
Greetings.
Assuming your UCSM is in the default 'end host mode', can you confirm all your northbound links have 'spanning-tree port type edge trunk' configured? We need to make sure your UCSM uplinks aren't needlessly getting entangled in STP events.
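On the Nexus side, the relevant configuration might look like the following (interface names are placeholders for your actual UCS-facing uplinks):

```
interface port-channel10
  description Uplink to UCS Fabric Interconnect
  switchport mode trunk
  spanning-tree port type edge trunk

! Verify the operational port type afterward:
! show spanning-tree interface port-channel 10
! show running-config interface port-channel 10
```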
It might also be helpful to get a packet capture on the NFS vmkernel interface and see whether we are getting retransmits, ARP issues, etc.
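A capture like that can be taken on the host itself; a minimal sketch, assuming the NFS traffic rides on vmk1:

```shell
# Capture all traffic on the NFS vmkernel interface to a pcap file
# for offline analysis in Wireshark (vmk1 is an assumed name).
pktcap-uw --vmk vmk1 -o /tmp/nfs-vmk.pcap

# Alternatively, tcpdump-uw with a filter on NFS (TCP 2049):
tcpdump-uw -i vmk1 -s 0 -w /tmp/nfs-vmk.pcap port 2049
```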
01-25-2016 09:37 AM
Kirk,
Thanks for the reply! We have checked the northbound links and each of the VPCs are set to spanning-tree port type edge trunk.
We also monitored tcpdumps from the hosts' CLI and have not seen any retransmits, resets, or excessive ARPing from the hosts to the NetApp.
We also vmk ping'd the NetApp with a count of 100. No dropped packets seen.
02-01-2016 02:13 PM
According to the datastore performance tab in vSphere, this issue seems to have been fixed, but I'm not thoroughly convinced...
Earlier today, we completely removed all flow control settings between UCS, the 5548s, and the NetApp. Prior to removing the flow control settings the datastore latency already appeared to have dropped; I don't have specific numbers from while I was removing flow control, but the latency was lower than what I saw last week. Since removing flow control, the latency averages have dropped further and currently sit between 1 ms and 5 ms. Not sure if this is the real fix or if the issue "went away" as it has in the past. Continuing to monitor.
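For the record, disabling flow control on the Nexus side is typically done per interface, roughly like this (interface names are placeholders; note that flow control changes can bounce the link on some platforms):

```
interface ethernet 1/1
  flowcontrol receive off
  flowcontrol send off

! Confirm the negotiated state afterward:
! show interface flowcontrol
```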
On a side note, I went through all of the interfaces between UCS and the NetApp looking for tx/rx pause frames. In UCS Manager I didn't see any pause frames, and on the 5548s I did not see any pause frames on the links toward UCS. However, on the links from the 5548s to the NetApp, I saw rx pause frames coming from the NetApp. I then went to the NetApp and looked at the ifstat summary (NetApp's version of "sh int") for each uplink to the 5548s. I expected to see tx pause frames sent to the 5548s, but the counters on the NetApp showed 0! Neither the 5548s nor the NetApp have had any counters cleared in the last 30 days, AFAIK. Thinking that pause frames might be counted at the virtual interface on the NetApp, I checked there, and it doesn't have a counter for pause frames. It did have a queue overflow counter, but that was 0 as well. Completely confused.
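The counters described above can be pulled from each side; a sketch, with interface names as placeholders:

```shell
# Nexus 5548: pause frames sent/received per interface.
show interface flowcontrol
show interface ethernet 1/1 | include pause

# NetApp 7-mode: per-NIC statistics for one uplink to the 5548s; the
# pause/Xoff counters appear under the RECEIVE and TRANSMIT sections.
ifstat -a e1a
```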
On a side-side note, I'm now going to be looking for dropped frames. If flow control was working as intended, it could have been saving us from a much larger issue.
02-03-2016 06:53 AM
It appears my gut feeling was right: disabling flow control did not resolve the issue. I re-enabled the flow control settings and the latency returned; however, when I re-disabled it, the latency issues persisted.
To aid in troubleshooting we have installed evaluation copies of SolarWinds Storage Resource Monitor, Virtualization Manager, and Server & Application Manager. Hopefully this will shed some light on where the issue is.
06-25-2020 11:37 AM
I know this is a few years old, but a nice salesperson managed to talk some unknowing purchasers into older hardware, and we are having the same issues you had when this was written. Did you ever find a solution for this issue?