05-13-2024 07:34 AM - edited 05-13-2024 07:48 AM
We have a UCS C240 M4SX that is now out of coverage (no contract). The plan is to replace it with a new server, but for now we're stuck with it. The problem: virtual disk 0 is down because two disks (slots 1 and 2) were in a predicted-failure state (but still online), and someone swapped out both of those disks at the same time. As I understand it, we would have been OK if only one of the disks had been replaced, then we had waited for the rebuild (which can take several hours), and THEN replaced the other disk that was predicted to fail.
Both of the replaced disks showed 'foreign data'. My idea was to have someone remove ONE of the new disks and reinsert just ONE of the original disks, hoping that if those old disks had not actually failed yet, having 4 originals (there are 5 in the RAID/virtual disk) would bring the virtual disk back up and allow the server to rebuild the one new disk still in the server. The issue is, the original disk that was reinserted shows 'unconfigured good' for its State (and 'Moderate fault' for Health), so we reinserted the other original disk as well (all originals are now inserted), but both of the original disks that were removed and then reinserted (after a few days) currently show a state of 'unconfigured good'.
Is there any hope of recovering from this without losing the VMs we had on that virtual disk/datastore? If so, what could we try?
Correct me if I'm wrong, but my understanding was that RAID 5 can tolerate only one disk failure. That's why I was hoping we could reinsert the old disks that were in the 'Online' State before removal ('Predicted Failure' for Status), get the virtual disk back online, let it rebuild the first new disk, then swap out the second original disk that was showing errors and let that second new disk rebuild as well.
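For context, RAID 5's single-failure tolerance comes from XOR parity. A minimal sketch (a hypothetical 4-data + 1-parity stripe, not the controller's actual on-disk layout) shows why one missing disk is recoverable and two are not:

```python
def parity(blocks):
    """XOR all blocks together to produce (or reconstruct) a block."""
    result = 0
    for b in blocks:
        result ^= b
    return result

# One stripe across 5 disks: 4 data blocks + 1 parity block (values are
# made up for illustration).
data = [0x11, 0x22, 0x33, 0x44]
p = parity(data)

# Lose ONE disk (say the one holding 0x33): XOR the four survivors and
# the missing block falls out.
rebuilt = parity([0x11, 0x22, 0x44, p])
assert rebuilt == 0x33

# Lose TWO disks: one XOR equation with two unknowns has no unique
# solution, which is why pulling both predicted-failure disks at once
# took the virtual disk offline.
```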
I think the biggest problem was that the person who swapped out the first disk should have waited several hours for the rebuild to complete before replacing the second disk that was having errors. He didn't realize a rebuild could take hours, so he swapped out the other disk too; at that point the virtual disk went offline, and now we're in this situation, hoping there's a way to recover without rebuilding the VMs that were on this virtual disk/datastore.
This is the status BEFORE any changes were made and then the status after the two original disks were reinserted a few days later:
05-13-2024 11:19 AM - edited 05-13-2024 11:19 AM
"As I understand it, we would have been OK if only one of the disks was replaced and then wait for the rebuild (can be several hours) and THEN replace the other disk that is predicted to fail. " - This is correct
If disks were removed before/during the rebuild, the foreign data is likely incomplete and not enough to recover from. If you had not removed the predicted-failure disks (or had replaced only one at a time), the RAID would have been able to rebuild. At this point, with the virtual disk offline, you will need to create a new virtual disk and recover from a backup. If the data is critical, a data recovery company may be able to recover some or all of it, but from the UCS perspective there is not much you can do. You may get responses from others with potential workarounds.
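Before recreating anything, it may be worth confirming what the controller actually saw. A sketch using Broadcom's StorCLI utility, assuming it is installed and the controller is /c0 (adjust to your setup); these are read-only queries:

```shell
# Overall controller and virtual drive status
storcli /c0 show

# State of every physical drive (Online / UGood / Foreign, etc.)
storcli /c0 /eall /sall show

# Preview any foreign configuration the controller is holding
storcli /c0 /fall show

# Dump the controller event log to a file for review
storcli /c0 show events file=events.txt
```

The event log is what would tell you exactly when the rebuild started and when the second drive pull aborted it.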
05-13-2024 11:28 AM
That's what I was afraid of. The two original disks are installed again, though, and their State is 'unconfigured good'. The only disks that show 'foreign data' as the state are the replacement disks we tried. I would have assumed that if the disks were still in good enough shape (as they were before removal), they would show as 'online' again and we could start over, rebuilding one new disk at a time.
Does the fact that they now show 'unconfigured good' mean we're too late and something happened to them in the meantime, or is there some workaround to get those two synced up with the other three so we can start over with the correct process?
I looked at the log and noticed a rebuilding message around the time our person removed the second disk. I think he didn't realize it would take hours, not the 50 minutes or so he waited before swapping out the second disk. But again, the originals are back in the server... or is it too late anyway, because aborting the rebuild by pulling the second disk broke the whole virtual disk? Hope you know what I mean.
05-13-2024 12:02 PM
If they are showing Unconfigured Good, they are just waiting to be added to a new virtual drive. The replacement disks show "foreign config" because they probably picked up partial metadata from the RAID, possibly when you installed one of the two new ones. All of this is speculation without reviewing the RAID controller log to understand exactly what happened.
I do understand what you are saying, but I do not know of any way to recover from here if your virtual drive is completely offline. You would need to delete the VD and recreate it.
05-13-2024 12:35 PM
OK, thank you. Sounds like we'll probably have to recreate the VD (not something I've done before). If there's any good news, it's that the CUCM publisher and Unity publisher are on a different server that's still running fine. So I assume we can reinstall the CUCM and Unity subscribers with all the same info as before (IP, etc.) and the publishers will sync all the data back to those servers.
05-15-2024 01:24 PM
05-15-2024 01:58 PM
Thank you. This brings up another question. We don't really need the extra disk space, so we're thinking of assigning the first 5 disks (that were part of virtual drive 0) as 'hot spares'. That got me thinking about ESXi itself (VMware): do you know WHERE it is installed? I tried to do some research, but all I can tell so far is that ESXi is (I think) installed on one or more of the hard drives. Is that how Cisco usually does it? (I read that in some cases, maybe not Cisco, it can be installed on an SD card in the server, but I don't think this server has any SD cards installed.) Do you know where, or on which disk(s), Cisco normally installs ESXi on these UCS servers? I want to be very careful not to break that. Or is it truly not on the hard drives?
Also wondering why Virtual Drives 1, 2, and 3 all show 'Cache Degraded' as the status, and whether that's related to Virtual Drive 0 being down because of our ongoing HDD issue in slots 1 and 2.
05-16-2024 06:18 AM
Assigning 5 disks as hot spares is a bit overkill; maybe dedicate 1 or 2. This is completely up to you, but I have never seen 5 hot spares.
You would have to check the CIMC boot order to determine where your server is booting from. If the server still boots to ESXi, it's not installed on VD0. If it fails to boot after your recent disk issue, it's probably installed on VD0. The boot order is available in the Compute tab of the CIMC, and from there you can see where ESXi is installed.
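Since the host is still up, another way to check (a sketch, assuming SSH/shell access is enabled on the ESXi host) is to see which volume the bootbank lives on; the naa device name below is a placeholder:

```shell
# /bootbank is a symlink to the volume ESXi booted from
ls -l /bootbank

# Map that volume UUID to a datastore/device
esxcli storage filesystem list

# List the partition table on a suspect device to spot the ESXi
# boot layout (multiple partitions vs. a single VMFS partition)
partedUtil getptbl /vmfs/devices/disks/naa.xxxxxxxx
```

If the bootbank UUID maps to the device backing the offline VD0, the install is on VD0.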
The cache degraded error means you have a bad BBU/supercap, or the cache is potentially missing from the RAID controller. It depends on whether it ever worked before.
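If StorCLI is available, the battery/supercap status can be queried directly (controller number /c0 assumed; only one of the two applies depending on whether the controller uses a BBU or a CacheVault supercap):

```shell
# Battery-backed cache units
storcli /c0 /bbu show

# CacheVault / supercap units
storcli /c0 /cv show
```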
05-16-2024 06:53 AM
OK. Thanks for the advice.
Here's what I found on the server with the problem. See screenshots below. What is 'Bus 04 Dev 00' as it relates to a HDD? Is it a specific disk? We haven't tried rebooting the server since this happened and I'm afraid to do it now
I looked at one of our other servers that has no problem (C240 M5) and noticed the first entry in its list is 'Bus 18 Dev 00'
Two screenshots from the bad server. One with hovering the mouse over 'Bus 04 Dev 00'
05-16-2024 07:02 AM
Bus 04 Dev 00 is the RAID controller. You are booting from one of the virtual disks (likely VD0).
If you reboot and ESXi loads up, you are booting from one of the other VDs that are not offline.
05-16-2024 07:12 AM
That's what I was afraid of. I was wondering if Bus 04 is actually disk 4; if so, that disk is in VD0, and so far it still shows online for status/state. My fear is that if we delete and recreate VD0, that will wipe out ESXi too. Is that true? Sorry for my ignorance on the topic, but I just want to be sure.
Also, if we rebooted it as it is now, are you saying that even if ESXi is installed on disk 3, 4, or 5 (the disks in VD0 that are still online), it still wouldn't boot because VD0 has a problem? Or would it boot as long as the actual disk ESXi is installed on (disk 3-5) is still online?
05-16-2024 07:31 AM
If VD0 is where the OS is installed and it's offline, ESXi is just running in memory at this point. Try to copy down any information you may need for the reinstall, like IP addresses, etc. If VMware is installed on the offline VD, it's too late anyway: once you reboot, it won't come back up. Deleting VD0 will wipe out the OS, but if the VD is offline there is really no other option.
If ESXi is installed on VD1, VD2, or VD3, it will come back up when you reboot. I do not think that's the case, though; usually the OS is installed on VD0. A VD is a group of physical disks, so VD0 is probably physical disks 1, 2, 3, 4 and VD1 is physical disks 3, 4, 5, 6, etc.
05-16-2024 07:46 AM
Thanks, I really appreciate it. Since ESXi is still running (probably in memory, as you mentioned), I looked at Config > Storage Devices and can't see any partition info for the device backing datastore 1; I'm sure that's because VD0 is offline. For the devices backing datastores 2, 3, and 4, I see only one partition, and from what I understand (looking at a working server), I should expect to see multiple partitions on whichever device/datastore has ESXi installed.
Anyway, it looks like we'll have to reinstall it. That brings up one more question. For the VMs that were on datastore 1/VD0 (which is down), I rebuilt those VMs on another datastore. If we delete and recreate VD0, which will likely wipe out ESXi (maybe gone already, as you mentioned), then reinstall ESXi, would datastores 2, 3, and 4 still be there with the VM files in place, so that we could just reimport/add those VMs back to the inventory and be OK? (Assuming we redo all the vSwitch setup and everything else we originally had in ESXi.)
05-16-2024 07:59 AM
Once ESXi is reinstalled, you should be able to see or import the datastores from the other VDs on the server into vSphere. Once you can see those datastores in ESXi, you can add the VMs back to inventory and boot them up. Here is the process:
https://knowledge.broadcom.com/external/article?legacyId=1006160
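For reference, the re-registration described in that article can also be done from the ESXi shell; the datastore and VM names below are placeholders:

```shell
# Confirm the surviving datastores mounted after the reinstall
esxcli storage filesystem list

# Register an existing VM back into inventory (path is a placeholder;
# point it at the .vmx file on the surviving datastore)
vim-cmd solo/registervm /vmfs/volumes/datastore2/MyVM/MyVM.vmx

# Verify it shows up, then power it on using the ID returned above
vim-cmd vmsvc/getallvms
vim-cmd vmsvc/power.on <vmid>
```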
05-16-2024 08:26 AM
OK, good. That's what I was hoping. So now, it's really just a matter of (tell me if you agree)...