I am testing our Fibre Channel infrastructure before we bring some new systems online, and during the FC testing I noticed some odd behavior that I don't think should be happening. Configuration:
2x UCS 6248 FIs, 2.0(4a)
2x MDS9148 5.2(2a)
B200 M3 blades with mLOM
Each FI has a 4x 8Gb port-channel in trunk mode to its respective MDS switch, for a 32Gb port channel. Our storage array is of course connected to the MDS9148 switches. From both a physical UCS blade and an ESXi server, I have IOmeter running, slamming the array with traffic. The array is active/active (3PAR), and each LUN has four concurrently active paths. Round robin I/O is configured in both VMware and Windows, and all storage ports show equally balanced traffic.
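For reference, this is roughly how I'm verifying the path count and round-robin policy before each test (the device identifier below is a placeholder for one of our 3PAR LUNs, not the real ID):

```shell
# ESXi side: expect PSP of VMW_PSP_RR and four active paths on the LUN
esxcli storage nmp device list -d naa.XXXXXXXX    # placeholder device ID
esxcli storage core path list -d naa.XXXXXXXX     # all four paths should show "active"

# Windows side: list MPIO disks and confirm the load-balance policy / path count
mpclaim -s -d
```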
The problem presents itself when I physically disconnect a *single cable* in the port channel on just one fabric. At that point I would expect the FIs to detect the lost link and, in less than a second, re-route traffic over the three remaining port-channel links. But what I actually see, when monitoring the storage ports, is throughput for one of the test hosts dropping to practically 0 KB/s for 10-60 seconds, across both fabrics (it can happen to either the ESXi or the Windows host). Neither VMware nor Windows logs any path failures, as no paths are down: three port-channel links are still up and I/Os can reach the array. If I look at storage performance in vCenter, it shows a large disturbance in I/Os when the cable is pulled.
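While pulling the cable I'm watching the port channel from both sides with commands along these lines (the san-port-channel numbers are specific to our setup, so treat them as placeholders):

```shell
# On the MDS: the port channel should stay up and simply drop to three member links
show port-channel summary
show port-channel database

# On the FI (after "connect nxos"): same view from the UCS side
show san-port-channel summary

# Check whether any fabric logins were disturbed by the link loss
show flogi database
```

In every test the port channel itself behaves as expected here, which is why the throughput drop surprises me.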
Now, if the host had to fail over because of an entirely failed fabric, I would expect the host MPIO software to take under 30 seconds to reconfigure around the failure. But pulling one of four port-channel links should be transparent to the hosts and the storage array, so I can't understand the big drop in I/Os.