09-04-2015 06:17 AM - edited 03-01-2019 12:21 PM
Hi,
Recently we discovered that when UCS server ports are in port-channel mode, the loss of a member link disrupts FC traffic in a VMware environment. The VMware host never sees a path down, and the guests go into an I/O timeout wait. Once the guest timeout period has expired, a SCSI reset re-establishes connectivity to storage. We have replicated this at multiple customers and have been told that it is expected behavior. This concerns us, particularly in how it affects design decisions.
Any input is appreciated.
Thank you.
09-04-2015 06:50 AM
It's true that you have to wait for the SCSI timeout, after which the OS should perform a FLOGI again. I assume, of course, that you have FC multipathing set up and working, in which case there should be no disruption of traffic.
Without a port-channel, the vHBAs on the failing fabric will go down, and FC multipathing is the only solution.
09-04-2015 09:15 AM
Hi Walter,
Thanks for the confirmation. We have spent a lot of time troubleshooting this and were surprised to find out that it's expected behavior. Do you see customers moving away from port-channels on the IOM Server ports due to this?
Kevin
09-04-2015 09:31 AM
Hi again Walter,
Just re-read your response. We are testing with Linux guests whose disk timeout is set to 180 seconds by vmtools. When we bring a port down, the host does not see a loss of path to the LUN. The guests all go into a timeout state with I/O completely halted. Once they hit 180 seconds, the guest sends a SCSI reset, which causes the host to recognize that the path is down and move to an active path. The guests then resume I/O. If we lower the timeout in the guest, the I/O halt becomes correspondingly shorter.
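For anyone wanting to confirm which timeout a Linux guest is actually using, here is a minimal sketch in Python. It assumes the disks show up as `sd*` devices and that the per-device SCSI command timeout is exposed in sysfs at `/sys/block/<dev>/device/timeout`; the function name `read_scsi_timeouts` is just an illustrative choice, not part of any tool.

```python
from pathlib import Path


def read_scsi_timeouts(sys_block="/sys/block"):
    """Return {device_name: timeout_seconds} for sd* devices under sys_block.

    Assumption: each SCSI disk exposes its command timeout (in seconds)
    in <sys_block>/<dev>/device/timeout. Devices without a readable,
    numeric timeout file are skipped.
    """
    timeouts = {}
    for dev in sorted(Path(sys_block).glob("sd*")):
        timeout_file = dev / "device" / "timeout"
        try:
            timeouts[dev.name] = int(timeout_file.read_text().strip())
        except (OSError, ValueError):
            # Missing file, permission problem, or unexpected content.
            continue
    return timeouts


if __name__ == "__main__":
    for name, seconds in read_scsi_timeouts().items():
        print(f"{name}: {seconds}s")
```

Run inside a guest, this only reads the current values and changes nothing, so it is safe to use for checking whether a vmtools update has put the timeout back to 180.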
Decreasing the timeout on all guests is not an option, since any vmtools update would revert it. So our thought is that we have to switch back to non-port-channel mode for all customers, unless they are comfortable with a potential 3-minute pause in production systems...
Do you have any thoughts on this?
Thanks! It's great to find someone familiar with the issue.
Kevin