1474 Views · 0 Helpful · 3 Replies

UCS upgrade from 2.2(6e) to 3.1(3b) caused major outage during FI-B reboot

dylan.ebner
Level 1

Hello-

Today, while performing an upgrade from 2.2(6e) to 3.1(3b), we had a major outage during the reboot of FI-B. The outage cleared after the 10 or so minutes it took FI-B to restart.

At first, I thought some of our servers might be configured incorrectly, with their vNIC or vHBA paths all going to the same FI. However, after things came back up and we could get back into vCenter, we learned that couldn't be the case: in multiple instances we had two guests on the same host, yet one guest was available during the outage and the other wasn't.

We also checked the VMware logs on the servers; half the NICs went down as expected, but vCenter still lost access to the servers.

I have a TAC case open, but I thought I would reach out to the community to see if anyone has had a similar issue.

We are holding off on acknowledging the reboot of FI-A until we hear back from TAC.

Thanks

Accepted Solution

Kirk J
Cisco Employee

Greetings.

Do you happen to be running ESXi 6.5?

Please make sure you aren't bumping into https://bst.cloudapps.cisco.com/bugsearch/bug/CSCvd52310

It was reported on 3.1(2) versions, but it is not clear what other versions are affected.

Of particular interest is the Further Problem Description from the bug:

Further Problem Description:
When one FI is rebooted, VMs whose vmnic is pinned to the rebooted FI lose network connectivity.
While the FI is rebooting, the vmnic on that FI shows as down in VMware; however, the VMs don't fail over to the other vmnic, they stay pinned to the vmnic that is already down. In the "n" view of esxtop, we can still see the VMs using the same vmnic as uplink as before. Because of this, the VMs' MAC addresses are not learned on the other FI, so network connectivity is lost.
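The failure mode described there can be sketched abstractly: with an active/standby uplink pair, a healthy virtual switch repins a VM's traffic to the surviving vmnic when its current uplink goes down, so the peer FI learns the VM's MAC; the buggy behavior leaves the VM pinned to the dead uplink. A minimal Python model of that logic (all class and method names here are hypothetical illustrations, not VMware or Cisco APIs):

```python
# Toy model of vmnic pinning on a virtual switch. Hypothetical names;
# this only illustrates the failover logic described in CSCvd52310.

class VSwitch:
    def __init__(self, uplinks, repin_on_link_down=True):
        self.up = {u: True for u in uplinks}          # link state per vmnic
        self.pin = {}                                  # vm -> pinned vmnic
        self.repin_on_link_down = repin_on_link_down   # False models the bug

    def attach(self, vm, vmnic):
        self.pin[vm] = vmnic

    def link_down(self, vmnic):
        self.up[vmnic] = False
        if self.repin_on_link_down:
            alive = [u for u, ok in self.up.items() if ok]
            for vm, u in self.pin.items():
                if u == vmnic and alive:
                    # Failover: traffic moves, so the other FI learns the MAC.
                    self.pin[vm] = alive[0]

    def reachable(self, vm):
        # A VM has connectivity only if its pinned uplink is up.
        return self.up[self.pin[vm]]

# Expected behavior: the VM fails over when FI-B's uplink drops.
good = VSwitch(["vmnic0", "vmnic1"])
good.attach("vm-a", "vmnic1")
good.link_down("vmnic1")        # FI-B reboots
print(good.reachable("vm-a"))   # True: repinned to vmnic0

# Buggy behavior: the VM stays pinned to the dead vmnic.
bad = VSwitch(["vmnic0", "vmnic1"], repin_on_link_down=False)
bad.attach("vm-b", "vmnic1")
bad.link_down("vmnic1")
print(bad.reachable("vm-b"))    # False: outage for this VM only
```

This also matches the symptom in the thread: two guests on the same host can differ, because only the VMs pinned to the rebooting FI's vmnic lose connectivity.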

I would try disabling a vNIC from the Equipment tab view, check the esxtop output on the ESXi host, and see if this is a match.

Thanks,

Kirk...


3 Replies

Walter Dey
VIP Alumni

I hope you scheduled a maintenance window for doing this upgrade.

If you verified beforehand that your multihoming was working correctly, this should indeed not happen.

My advice: before doing an upgrade like this, open a preventive TAC case and let them check whether you would hit any known bugs.

dylan.ebner
Level 1

Hi Kirk-

This is exactly the bug we were bumping into. TAC figured it out late yesterday.

Thanks

