
Possible I/O bug

jseide86
Level 1

Hello

 

I've been working on a strange issue with VMware for a good while now. In our vSphere environment, which runs on Cisco UCS blade and rack servers, virtual machines running any flavor of Windows or Windows Server show the same symptom: on every third reboot within a window of 10-15 minutes, the VM boots into either Windows Recovery Mode or a bluescreen (NTFS FILE SYSTEM).

The problem also affects VMware Horizon (VDI) Linked Clones and Instant Clones, where approximately 30% of provisioned clones boot to Windows Recovery Mode or bluescreen (NTFS FILE SYSTEM), causing provisioning to stall and deployment of desktop pools to partially fail.

 

We've been doing a lot of testing with VMware Global Support Services to try to pinpoint the root cause of what we're seeing. Neither the ESXi, vCenter, nor Windows logs have been helpful, and we've thoroughly tested VMs running on the following:

* Most (if not all) releases of ESXi, from ESXi 6.0 GA through ESXi 6.5 Update 2

* All VM Tools versions going back to vSphere 5
* VMware Virtual Hardware versions 9 through 13.
* Different vSphere environments (three different vCenter envs, different versions)
* Different kinds of storage backing for the VMs, including Fibre Channel SAN, NAS (IP storage) and local disk
* Different Windows versions, from Server 2008 to Server 2016, as well as Windows 7 and 10.
* Different UCS generations and server types, mainly B200 M3 and M4, as well as C240 M4.
* Different UCS firmware versions 3.2(2b) and 3.2(3a)

Every configuration of software and hardware has presented the issue in our environment. When asking fellow VMware administrators, a few have suggested this could be an I/O issue somewhere in UCS. I've been looking over the policies we're using in our Service Profile Templates for our UCS servers and confirmed that all of them are set up per the best practices provided by Cisco in the UCS administration guides.

VMware has given up on troubleshooting this issue and has suggested we contact Microsoft for assistance in reading the Windows system logs to figure out what goes wrong during boot. They did look at these logs themselves, but couldn't see anything revealing.

 

My team and I are stumped by this. Clearly we're seeing an issue that's present somewhere deep in our infrastructure, on all of our ESXi hosts in all of our datacenters. We've been testing so many configurations of hardware and software now that the only suggestion we have left is that this is caused by something in firmware.

I could not find a known issue that describes anything like this, but perhaps I'm looking at it wrong. Are there any known issues in firmware that would cause this behaviour?

 

Thanks for reading.


12 Replies

Kirk J
Cisco Employee

Greetings.

You may want to look at the details of https://www.cisco.com/c/en/us/td/docs/unified_computing/ucs/sw/SA/SW_Advisory_CSCvj32984.html. That advisory primarily impacts Fibre Channel.

I would check all your interfaces between the FIs and IOMs to make sure you aren't seeing CRC errors.

I would definitely open a TAC case, if you don't already have one open.

You may also need to consider setting up a port mirror in UCSM, as well as a SPAN capture on the Ethernet switches above the FIs, to see if frame corruption is occurring anywhere.

You mention local storage as well, which would not be susceptible to FC- or IP-based storage protocol transport issues.

Thanks,

Kirk...

 

 

Thanks Kirk, I'll get around to opening a TAC case on this as soon as I can.

 

I had already seen the Software Advisory you linked, but we've ruled it out as the culprit since the issue also occurs when we use local storage.

We think the issue is compute-related, i.e. the servers themselves. Since we see the issue on both rack and blade servers, we're thinking firmware.

 

Good tip to check for CRCs and frame corruption, I'll get around to doing that as well.
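For my own notes (and in case it's useful to others reading this thread): my plan is to start with a quick fault sweep from Cisco UCS PowerTool before digging into per-port CRC counters in UCSM. A rough sketch, assuming PowerTool is installed; the module name and fault property names may differ slightly between PowerTool versions:

# Rough sketch, not yet verified against our environment.
# Assumes the Cisco UCS PowerTool (UCS Manager module) is installed.
Import-Module Cisco.UCSManager

# Connect to the UCS Manager VIP (placeholder address) and list active
# major/critical faults - a general health check, not per-port CRC counters.
Connect-Ucs -Name ucsm.example.com -Credential (Get-Credential)

Get-UcsFault |
    Where-Object { $_.Severity -in @('major', 'critical') } |
    Select-Object Severity, Cause, Descr, Dn |
    Format-Table -AutoSize

Disconnect-Ucs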

 

Thanks!

When did this issue first appear?

Can you confirm you still see this issue when you create a generic VM (not from a clone or template) and install Windows?

Also, can you spin up a couple of Linux guest VMs and run them through the same reboot tests?

 

Thanks,

Kirk...

I noticed the issue this time last year in a new UCS pod we were setting up for VMware Horizon. I hadn't pursued it until recently because I didn't have a need for Linked and Instant Clones in our environment.

 

I have tested this on non-template VMs as well as on VMs deployed from a template; there's no difference.

 

I have tested Linux VMs. They are more resilient to the issue, but the GRUB boot loader notices something is wrong and halts the boot, waiting for the user to select a boot option instead of booting straight into the OS as it should.

My organization has been experiencing a similar issue - when certain Windows Server VMs reboot, they go into recovery mode.  In our environment, we have seen this when:

 

* The VMware host has been upgraded from vSphere 5.5 Update 3a to vSphere 6.5 U1 (build 7388607) or 6.5 U2 (build 8294253) using VMware Update Manager (i.e. an in-place upgrade). Until I read this post, I hadn't considered the Cisco drivers and firmware, which are also being updated.

* Standalone (not UCS FI-connected) C240 M4s running on, or upgraded to, firmware versions 3.0(3f) and 3.0(4d). We have not tried the upgrade on any Cisco blades yet.

* Multiple versions of Windows Server - 2008 and 2012 R2 (I don't think we've seen it on Win2016).  We don't have any Linux VMs on any of the upgraded clusters.

* Not all Windows guests experience this. In one cluster, out of 80+ Windows VMs, only 20 booted into recovery mode. Once a VM has booted into recovery mode, it hasn't happened again (but we haven't tried 3 consecutive reboots in a 15-min window).

* Multiple versions of VMware Tools (sorry, I didn't track the actual versions)

* Multiple versions of VMware virtual hardware - versions 8 and 10

* All hosts/guests affected are connected to the same vCenter Server - 6.5 U1c

* I believe all of the Windows guests have been on NetApp MetroCluster NFS storage (but 95% of our VMs are on NFS)

 

A case was opened with VMware Support; they looked through the host logs, and they were quick to declare it wasn't their issue.

A case was opened with Microsoft and logs were collected, but they couldn't find anything. They wanted us to disable recovery mode (to catch the blue screen), but every time it's been disabled, the server has rebooted without issue.

Thanks for your reply, it's somewhat comforting to hear that another org. is seeing the same issues, if you know what I mean.

 


@adsfasdfasdfasdfasdf wrote:

They wanted us to disable recovery mode (to catch the blue screen), but every time it's been disabled, the server has rebooted without issue. 


I tried this:

 

bcdedit /set bootstatuspolicy ignoreallfailures

bcdedit /set recoveryenabled No

bcdedit /set {default} bootstatuspolicy ignoreallfailures

bcdedit /set {default} recoveryenabled No

And it resulted in Windows booting normally. I've only tried it on one VM so far, but I ran my reboot script to force multiple reboots on several VMs side-by-side to provoke the issue, and the VM that had Recovery Mode disabled never once had an issue (while the other, unmodified VMs still experienced the same issue: recovery mode on every third boot).

 

This is interesting, but not really a fix. It may work for our Linked/Instant Clone VMs (as they don't really need a recovery function), but it might not be the best idea to disable recovery mode and failure detection on our production servers.
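If we do end up using this as a workaround on anything beyond clones, my thinking is to script both directions so recovery can be switched back on after a maintenance window. A rough sketch from an elevated PowerShell prompt in the guest (the "defaults" I restore below are my assumption of what Windows ships with, so check bcdedit /enum {default} on an untouched VM first):

# Disable boot-failure detection and automatic recovery before a rapid-reboot test
# (braces are quoted so PowerShell passes the identifier through literally)
bcdedit /set '{default}' bootstatuspolicy ignoreallfailures
bcdedit /set '{default}' recoveryenabled No

# ...run the reboot tests / maintenance window...

# Re-enable afterwards - DisplayAllFailures and recoveryenabled Yes are what I
# believe the out-of-the-box values are; verify on a clean VM before relying on this
bcdedit /set '{default}' bootstatuspolicy DisplayAllFailures
bcdedit /set '{default}' recoveryenabled Yes

# Confirm the current settings
bcdedit /enum '{default}'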

 

I'll bring it up during my next call with Cisco, thanks for posting!

@jseide86 wrote:

I tried this:
bcdedit /set bootstatuspolicy ignoreallfailures

bcdedit /set recoveryenabled No

bcdedit /set {default} bootstatuspolicy ignoreallfailures

bcdedit /set {default} recoveryenabled No

And it resulted in Windows booting normally. I've only tried it on one VM so far, but I ran my reboot script to force multiple reboots on several VMs side-by-side to provoke the issue, and the VM that had Recovery Mode disabled never once had an issue (while the other, unmodified VMs still experienced the same issue: recovery mode on every third boot).


Exactly what happened to us. We made the bcdedit changes (per Microsoft) on 11 servers, and they rebooted without issue. There was one Windows VM that got missed, and it booted into Recovery Mode! So we weren't sure whether just the act of logging into the server (with RDP) caused it not to go into Recovery Mode, whether the bcdedit command did it, or whether it was just luck...

 

So to mirror your testing - if we reboot a test Windows server 3 times within 10-15 minutes, it should boot into Recovery Mode one of those times?  Edit:  Reboot it without changing the bcdedit settings.


@adsfasdfasdfasdfasdf wrote: 

So to mirror your testing - if we reboot a test Windows server 3 times within 10-15 minutes, it should boot into Recovery Mode one of those times?  Edit:  Reboot it without changing the bcdedit settings.


I'm not 100% certain on the timeframe, but I do know that if you wait long enough between reboots the issue won't occur. I thought I tested it to be around 15 minutes, but I can't find my notes so I cannot confirm it right now.

I do know that if you rapidly reboot the VMs, you're able to reproduce the issue.

 

Here, I'll share my test method:

 

VMware PowerCLI script (a bit crude but works): https://pastebin.com/RVsMSH7q

 

In practice it looks like this:

[Screenshot: 2018-06-05 11_04_34-.png]
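In case the pastebin link ever goes away: the script boils down to roughly the sketch below (a simplified reconstruction, not the exact script; the vCenter address, VM names and counts are placeholders). Note it uses Restart-VM, which is a hard reset, rather than Restart-VMGuest, which is a graceful guest restart - that difference matters, as discussed further down the thread.

# Simplified sketch of the rapid-reboot repro loop; assumes VMware PowerCLI
# is installed and the account has permission to reset the test VMs.
Import-Module VMware.PowerCLI

Connect-VIServer -Server vcenter.example.com     # placeholder vCenter address

$vmNames     = 'TEST-VM-01', 'TEST-VM-02', 'TEST-VM-03'   # placeholder test VMs
$reboots     = 4      # consecutive reboots per VM
$waitSeconds = 180    # time to let each guest finish booting before the next reset

for ($i = 1; $i -le $reboots; $i++) {
    foreach ($name in $vmNames) {
        # Restart-VM is a hard reset (power cycle); Restart-VMGuest would do a
        # graceful, VMware Tools-assisted restart instead.
        Get-VM -Name $name | Restart-VM -Confirm:$false
    }
    Start-Sleep -Seconds $waitSeconds
}

Disconnect-VIServer -Confirm:$false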

No I/O bug; this is intended behaviour on Windows VMs. Run the commands mentioned in the post above if you often need to reboot your VM.

@jseide86 - just curious - did you do any more troubleshooting on this? I'm the person who originally replied to this. We are seeing this again after we upgraded another cluster from 5.5 U3 to 6.5 U1. Yet 4+ other clusters have been upgraded without causing this issue!

We eventually ruled out Cisco from this problem as we managed to replicate it on a set of HPE servers.

Also, if you do a graceful restart of the VMs (not just "Reset guest" via VMRC), the machines do not exhibit the Recovery Mode behaviour.

Our take from this is that the "every third reboot" issue is intended behaviour in Windows: it triggers recovery if the machine shuts down improperly too many times in a row.

In other words, no bug. Working as intended.
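If anyone wants to sanity-check that theory on their own VMs (this is my own assumption about how to verify it, not something we went through with Microsoft), the System event log should show Kernel-Power event 41 - "rebooted without cleanly shutting down first" - after each hard reset, and several of those close together fits the boot-failure-detection explanation. For example, from PowerShell inside the guest:

# List unexpected-shutdown events (Kernel-Power 41) from the last day
Get-WinEvent -FilterHashtable @{
    LogName      = 'System'
    ProviderName = 'Microsoft-Windows-Kernel-Power'
    Id           = 41
    StartTime    = (Get-Date).AddDays(-1)
} -ErrorAction SilentlyContinue |
    Select-Object TimeCreated, Id, Message |
    Format-Table -AutoSize -Wrap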

OK, so your issue is explainable, but ours is not. Our Windows servers have a scheduled task that does an "OS-friendly" reboot once a month. After the hosts have been upgraded from 5.5 to 6.5 and a VM moves to a 6.5 host, some (not all) of the VMs go into the recovery console on the first scheduled reboot. Much like your scenario, it seems we can stop this from happening by disabling the Recovery Console, changing recoveryenabled to No via the command:

bcdedit /set {default} recoveryenabled NO

 

Thanks for replying! 
