We had an instance where four out of six blades lost connectivity to one LUN. For some odd reason, this caused our ESXi hosts to become non-responsive, although the VMs stayed up. The solution was to remap the LUN from our Fibre Channel storage array (Compellent) to those hosts, but we still had to reboot all of the ESXi hosts.
The event autopsy led to two possible issues: 1) I had neglected to change the VSAN VLAN ID from the default of 1. We think this might have been an issue because we trunk VLAN 1 from our Foundry (Brocade) core switch down to the UCS. 2) We are hard zoned (with Brocade 8 Gb Fibre Channel switches), which some people have said is a disaster waiting to happen and others have said is just fine.
Which of these do you think is most likely to have been the problem? We are going live on SAP next week on our UCS, and heads will roll if we have any more "glitches".
VSAN 1/VLAN 1 are defaults and are generally used for management purposes; best practice is to use something other than VSAN 1/VLAN 1. Hard zoning is more secure than soft zoning. I would not initially suspect either of these to have caused this issue.
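For anyone following along, the hard vs. soft zoning distinction comes down to how zone members are identified on the Brocade switch: by physical domain,port (hard/port zoning) or by device WWN (soft/WWN zoning). A rough sketch of both styles in Fabric OS CLI syntax — the alias names, zone names, ports, and WWNs here are all made up for illustration:

```shell
# --- Port-based (hard) zoning: members are "domain,port" pairs.
# If a cable moves to a different port, the zone no longer matches.
zonecreate "esx_blade1_compellent", "1,4; 1,12"

# --- WWN-based (soft) zoning: members follow the device wherever it plugs in.
alicreate "esx_blade1_hba0", "10:00:00:00:c9:aa:bb:cc"
alicreate "compellent_fe1",  "50:00:d3:10:00:11:22:33"
zonecreate "esx_blade1_compellent_wwn", "esx_blade1_hba0; compellent_fe1"

# Either way, the zone has to be added to a config and activated:
cfgcreate "prod_cfg", "esx_blade1_compellent_wwn"
cfgsave
cfgenable "prod_cfg"
```

Neither style should cause hosts to hang on LUN loss; the choice mostly affects security and how painful recabling is.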
I would not expect loss of connectivity to a LUN to make an ESXi host unresponsive. Typically these issues can be resolved by a rescan after connectivity to the LUN has been restored. Have you collected a vm-support bundle? You may also want to consider performing a KB search on VMware's web site, based on the version of ESXi that you are running.
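For reference, the triage Bill describes would look something like this from an SSH session on the affected host. This is a sketch assuming ESXi 5.x-era esxcli syntax (older 4.x hosts used esxcfg-rescan instead), and the grep patterns are just illustrative:

```shell
# Watch the vmkernel log for ongoing storage errors (path failures, aborts, etc.):
tail -f /var/log/vmkernel.log | grep -i -E 'scsi|nmp|lun'

# After the array-side remap, force a rescan of all HBAs:
esxcli storage core adapter rescan --all

# Check whether any paths are still dead after the rescan:
esxcli storage core path list | grep -i dead

# Generate a vm-support bundle to hand to VMware support:
vm-support
```

Of course, as the original poster notes below, none of this helps much if the host is too wedged to accept commands at all.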
Thanks for your response, Bill. It was definitely bizarre, that's for sure. The triggering event seemed to be when we tried to map some LUNs to two new blades in the UCS. We got an error on our Compellent, and then things just got weird after that. My memory has faded a bit, as I've had a vacation and VMworld since, so I probably won't find an answer. I don't mind problems so much when I can find somebody else with the same problem, but I posted on VMware's forum and all I got were crickets chirping. I wish I knew why the VMware hosts went all zombie on me. Basically, the vmkernel log was out of control with storage errors that would not resolve until a reboot of the host. I couldn't perform a rescan, and since the hosts were unresponsive, vMotions were out of the question.