New UCS-C220-M5 "lost" its RAID controller two months into service

ac513
Level 1

Had a weird problem last night and this morning, and I'm curious whether anyone has seen such a thing, whether it might be a known quirk, and whether this was a fluke or a sign of impending hardware failure. I've opened a TAC case, but wanted to bounce this event off the community as well.


About two months ago, I deployed a standalone UCS-C220-M5 as one of five new domain controllers in our environment. Or at least "new" to me: it was brand new in the box, but had a mid-2020 manufacture date and firmware version 4.1(3c). It was a pretty basic install with all factory-default settings: two SSDs attached to the Cisco 12Gb SAS RAID controller (the one that shows up as an Avago controller) in a RAID1 virtual drive, and Windows Server 2019 installed as a UEFI boot option.


I ran Windows Updates on this box last night, after it had been in service and rebooting/patching just fine for its two months of service thus far. Upon rebooting this time, however, the server was no longer bootable. It landed at the system's default EFI shell for lack of any other available boot options. I hopped into the CIMC to look for issues but found nothing at first glance: the RAID controller was showing as "online", both SSDs were showing "online", and the RAID1 volume containing the server OS was showing as online. I worked the problem on the theory that the EFI boot entry for the OS had simply vanished for some reason, perhaps due to a bug. After some trial and error, and a lot of Cisco forum posts on similar-but-not-exactly-the-same boot device issues, I came up short and gave up sometime in the middle of the night.
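
For anyone who lands in the same spot: the standard recovery for a vanished Windows boot entry, from the built-in EFI shell, looks roughly like the below. This is a sketch that assumes the shell can actually map a filesystem on the virtual drive; the fsX: number varies by system.

```
# Rescan devices and refresh filesystem mappings
map -r
# Switch to the EFI system partition, if one got mapped
fs0:
# Confirm the Windows boot loader is still present
ls EFI\Microsoft\Boot
# List the current boot entries
bcfg boot dump -v
# Re-add the Windows entry at the top of the boot order
bcfg boot add 0 fs0:\EFI\Microsoft\Boot\bootmgfw.efi "Windows Boot Manager"
```

In hindsight, the real clue is earlier in that sequence: if the controller has dropped off the bus entirely, no fsX: mapping for the virtual drive shows up at all.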


Came into the office first thing this morning and compared anything and everything in the CIMC and UEFI settings between this server and another similar piece of hardware pending deployment. Eventually I found that, in the UEFI settings of the second server, the RAID controller was showing as an available device in one specific submenu a few levels deep in the advanced settings. That caught my attention because I didn't recall seeing it in the same spot in the broken server's UEFI settings moments earlier. I went back to the CIMC of the broken server and, lo and behold, the RAID controller and virtual drive were still showing as "online", but now there was a large red-text message stating "this is cached data".
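
One takeaway: the CIMC GUI can keep reporting stale "online" status for a while. A quick way to cross-check what the management controller currently sees is the Redfish API that CIMC 4.x exposes. A rough sketch below; the resource layout is standard Redfish, but the exact system and storage IDs vary by firmware, so browse down from /redfish/v1 to confirm.

```
# -k because the CIMC usually presents a self-signed certificate;
# with only a username given, curl prompts for the password
curl -k -u admin https://CIMC-IP/redfish/v1/Systems
curl -k -u admin https://CIMC-IP/redfish/v1/Systems/SYSTEM-ID/Storage
# Drill into the returned storage resources and check Status.Health
# on the controller entries
```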


It sounded like it might be a RAID controller issue, or another hardware issue further back in the chain, that might need a hard power cycle and/or a controller reseat to fix. So I drove to the datacenter where the server is located, hard power cycled it, and then reseated the RAID controller riser card as well as all cabling between the card and the SAS backplane.


Powered everything back up, and voila: the RAID controller was detected, Windows Boot Manager's EFI entry was back in the boot order, and the server booted up as expected. Once it was up, I made sure AD replication caught up and finished correctly, and then we rebooted successfully one more time.
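
For anyone curious, checking that replication had caught up boils down to the standard tooling, run from an elevated prompt (DC-NAME below is a placeholder for the recovered DC):

```
rem Summary of replication status across all DCs
repadmin /replsummary
rem Per-partner inbound replication status for the recovered DC
repadmin /showrepl DC-NAME
rem Replication-focused dcdiag pass against the recovered DC
dcdiag /test:replications /s:DC-NAME
```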


Diagnostics never showed any impending failures from what I can tell, so I'm not sure whether this was a fluke or an intermittent hardware/firmware issue. And we're all now wondering when it will happen again.


Has anyone seen similar weirdness with a C220-M5 or comparable hardware? Any ideas on what I should be checking next, besides firmware updates?

1 Reply

Kirk J
Cisco Employee

If the unit is under contract, open a TAC case.

We would need to take a look at the CIMC support bundle (and the RAID controller internal logs that come with it).

The logs should indicate whether the RAID controller experienced some sort of reset.

RAID controllers have their own firmware and mini OS, and can occasionally have issues that are usually resolved via firmware updates.
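
If you want a first look at the controller's own logs while the case is in progress, the Avago/LSI storcli utility can pull them from the host side. A rough sketch, assuming storcli64 is installed on the Windows host and the controller enumerates as /c0:

```
rem Controller state, firmware level, and alarms
storcli64 /c0 show all
rem Dump the controller event log to a file
storcli64 /c0 show events file=events.txt
rem Firmware terminal log; controller resets tend to show up here
storcli64 /c0 show termlog
```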


Kirk...
