05-13-2024 07:34 AM - edited 05-13-2024 07:48 AM
We have a UCS C240 M4SX that is now out of coverage (no contract). The plan is to replace it with a new server, but for now we're stuck with it. The problem: virtual disk 0 is down because two disks (slots 1 and 2) were in a predicted-failure state (but still online), and someone swapped out both of those disks at the same time. As I understand it, we would have been OK if only one of the disks had been replaced, then we had waited for the rebuild (which can take several hours), and THEN replaced the other disk that was predicted to fail.
Both of the replaced disks showed 'foreign data'. My idea was to have someone remove ONE of the new disks and reinsert just ONE of the original disks, hoping that if those old disks had not actually failed yet, having 4 originals (there are 5 in the RAID/virtual disk) would bring the virtual disk back up and allow the server to rebuild the one new disk still in the server. The issue is, the original disk that was reinserted shows 'unconfigured good' for its State (and 'Moderate fault' for Health), so we reinserted the other original disk as well (all originals are now inserted), but both of the original disks that were removed and then reinserted (after a few days) currently show a state of 'unconfigured good'.
Is there any hope of recovering from this without losing the VMs we had on that virtual disk/datastore? If so, what could we try?
Correct me if I'm wrong, but my understanding was that RAID 5 can tolerate only one disk failure. That's why I was hoping we could reinsert the old disks that were in the 'Online' State before removal ('Predicted Failure' for Status), get the virtual disk back online, let it rebuild the first new disk, then swap out the second original disk that was showing errors and let that second new disk rebuild as well.
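For context, RAID 5's single-failure tolerance comes from XOR parity. A minimal sketch (a hypothetical 4-data + 1-parity stripe, not the controller's actual on-disk layout) shows why one missing disk is recoverable and two are not:

```python
def parity(blocks):
    """XOR all blocks together to produce (or reconstruct) a block."""
    result = 0
    for b in blocks:
        result ^= b
    return result

# One stripe across 5 disks: 4 data blocks + 1 parity block (values are
# made up for illustration).
data = [0x11, 0x22, 0x33, 0x44]
p = parity(data)

# Lose ONE disk (say the one holding 0x33): XOR the four survivors and
# the missing block falls out.
rebuilt = parity([0x11, 0x22, 0x44, p])
assert rebuilt == 0x33

# Lose TWO disks: one XOR equation with two unknowns has no unique
# solution, which is why pulling both predicted-failure disks at once
# took the virtual disk offline.
```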
I think the biggest problem was that the person who swapped out the first disk should have waited several hours for the rebuild to complete before replacing the second disk that was having errors. He didn't realize a rebuild could take hours, so he swapped out the other disk too; at that point the virtual disk went offline, and now we're in this situation, hoping there's a way to recover without rebuilding the VMs that were on this virtual disk/datastore.
This is the status BEFORE any changes were made and then the status after the two original disks were reinserted a few days later:
05-13-2024 11:19 AM - edited 05-13-2024 11:19 AM
"As I understand it, we would have been OK if only one of the disks was replaced and then wait for the rebuild (can be several hours) and THEN replace the other disk that is predicted to fail. " - This is correct
If disks were removed before/during the rebuild, the foreign data is likely incomplete and not enough to recover from. If you had not removed the predicted-failure disks (or had replaced only one at a time), the RAID would have been able to rebuild. At this point, with the virtual disk offline, you will need to create a new virtual disk and recover from a backup. If the data is critical, a data recovery company may be able to recover some or all of it, but from the UCS perspective there is not much you can do. You may get responses from others with potential workarounds.
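Before recreating anything, it may be worth confirming what the controller actually saw. A sketch using Broadcom's StorCLI utility, assuming it is installed and the controller is /c0 (adjust to your setup); these are read-only queries:

```shell
# Overall controller and virtual drive status
storcli /c0 show

# State of every physical drive (Online / UGood / Foreign, etc.)
storcli /c0 /eall /sall show

# Preview any foreign configuration the controller is holding
storcli /c0 /fall show

# Dump the controller event log to a file for review
storcli /c0 show events file=events.txt
```

The event log is what would tell you exactly when the rebuild started and when the second drive pull aborted it.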
05-13-2024 11:28 AM
That's what I was afraid of. The two original disks are installed again, though, and their State is 'unconfigured good'. The only disks that show 'foreign data' as the state are the replacement disks we tried. I would have assumed that if the disks were still in good enough shape (as they were before removal), they would show as 'online' again and we could start over, rebuilding one new disk at a time.
Does the fact that they now show 'unconfigured good' mean we're too late and something happened to them in the meantime, or is there some workaround to get those two synced up with the other three so we can start over with the correct process?
I looked at the log and noticed a rebuilding message around the time our person removed the second disk. I think he didn't realize it would take hours, not the 50 minutes or so he waited before swapping out the second disk. But again, the originals are back in the server... or is it too late anyway, because aborting the rebuild by pulling the second disk broke the whole virtual disk? Hope you know what I mean.
05-13-2024 12:02 PM
If they are showing Unconfigured Good, they are just waiting to be added to a new virtual drive. The replacement disks show "foreign config" because they probably picked up partial metadata from the RAID, possibly when you installed one of the two new ones. All of this is speculation without reviewing the RAID controller log to understand exactly what happened.
I do understand what you are saying, but I do not know of any way to recover from here if your virtual drive is completely offline. You would need to delete the VD and recreate it.
05-13-2024 12:35 PM
OK, thank you. Sounds like we'll probably have to recreate the VD (not something I've done before). If there's any good news, it's that the CUCM publisher and Unity publisher are on a different server that's still running fine. So I assume we can reinstall the CUCM and Unity subscribers with all the same info as before (IP, etc.) and the publishers will sync all the data back to those servers.
05-15-2024 01:24 PM
05-15-2024 01:58 PM
Thank you. This brings up another question. We don't really need the extra disk space, so we're thinking of assigning the first 5 disks (that were part of virtual drive 0) as 'hot spares'. That got me thinking about ESXi itself (VMware): do you know WHERE it is installed? I tried to do some research, but all I can tell so far is that ESXi is (I think) installed on one or more of the hard drives. Is that how Cisco usually does it? (I read that in some cases, maybe not Cisco, it can be installed on an SD card in the server, but I don't think this server has any SD cards installed.) Do you know where, or on which disk(s), Cisco normally installs ESXi on these UCS servers? I want to be very careful not to break that. Or is it truly not on the hard drives?
Also wondering why Virtual Drives 1, 2, and 3 all show 'Cache Degraded' as the status, and whether that's related to Virtual Drive 0 being down because of our ongoing HDD issue in slots 1 and 2.
05-16-2024 06:18 AM
Assigning 5 disks as hot spares is a bit overkill; maybe dedicate 1 or 2. This is completely up to you, but I have never seen 5 hot spares.
You would have to check the CIMC boot order to determine where your server is booting from. If the server still boots to ESXi, it's not installed on VD0. If it fails to boot after your recent disk issue, it's probably installed on VD0. The boot order is available in the Compute tab of the CIMC, and from there you can see where ESXi is installed.
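Since the host is still up, another way to check (a sketch, assuming SSH/shell access is enabled on the ESXi host) is to see which volume the bootbank lives on; the naa device name below is a placeholder:

```shell
# /bootbank is a symlink to the volume ESXi booted from
ls -l /bootbank

# Map that volume UUID to a datastore/device
esxcli storage filesystem list

# List the partition table on a suspect device to spot the ESXi
# boot layout (multiple partitions vs. a single VMFS partition)
partedUtil getptbl /vmfs/devices/disks/naa.xxxxxxxx
```

If the bootbank UUID maps to the device backing the offline VD0, the install is on VD0.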
The cache degraded error means you have a bad BBU/supercap, or the cache is potentially missing from the RAID controller. It depends on whether it ever worked before.
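If StorCLI is available, the battery/supercap status can be queried directly (controller number /c0 assumed; only one of the two applies depending on whether the controller uses a BBU or a CacheVault supercap):

```shell
# Battery-backed cache units
storcli /c0 /bbu show

# CacheVault / supercap units
storcli /c0 /cv show
```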
05-16-2024 06:53 AM
OK. Thanks for the advice.
Here's what I found on the server with the problem. See screenshots below. What is 'Bus 04 Dev 00' as it relates to a HDD? Is it a specific disk? We haven't tried rebooting the server since this happened and I'm afraid to do it now
I looked at one of our other servers that has no problem (C240 M5) and noticed the first entry in its list is 'Bus 18 Dev 00'
Two screenshots from the bad server. One with hovering the mouse over 'Bus 04 Dev 00'
05-16-2024 07:02 AM
Bus 04 Dev 00 is the RAID controller. You are booting from one of the virtual disks (likely VD0).
If you reboot and ESXi loads up, you are booting from one of the other VDs that are not offline.
05-16-2024 07:12 AM
That's what I was afraid of. I was wondering if Bus 04 is actually disk 4; if so, that disk is in VD0, and so far it still shows online for status/state. My fear is that if we delete and recreate VD0, that will wipe out ESXi too. Is that true? Sorry for my ignorance on the topic, but I just want to be sure.
Also, if we rebooted it as it is now, are you saying that even if ESXi is installed on disk 3, 4, or 5 (the disks in VD0 that are still online), it still wouldn't boot because VD0 has a problem? Or would it boot as long as the actual disk ESXi is installed on (disk 3-5) is still online?
05-16-2024 07:31 AM
If VD0 is where the OS is installed and it's offline, ESXi is just running in memory at this point. Try to copy down any information you may need for the reinstall, like IP addresses, etc. If VMware is installed on the offline VD, it's too late anyway: once you reboot, it won't come back up. Deleting VD0 will wipe out the OS, but if the VD is offline there is really no other option.
If ESXi is installed on VD1, VD2, or VD3, it will come back up when you reboot. I do not think that's the case, though; usually the OS is installed on VD0. A VD is a group of physical disks, so VD0 is probably physical disks 1, 2, 3, 4 and VD1 is physical disks 3, 4, 5, 6, etc.
05-16-2024 07:46 AM
Thanks, I really appreciate it. Since ESXi is still running (probably in memory, as you mentioned), I looked at Config > Storage Devices and can't see any partition info for the device backing datastore 1; I'm sure that's because VD0 is offline. For the devices backing datastores 2, 3, and 4, I see only one partition, and from what I understand (looking at a working server), I should expect to see multiple partitions on whichever device/datastore has ESXi installed.
Anyway, it looks like we'll have to reinstall it. That brings up one more question. For the VMs that were on datastore 1/VD0 (which is down), I rebuilt those VMs on another datastore. If we delete and recreate VD0, which will likely wipe out ESXi (maybe gone already, as you mentioned), then reinstall ESXi, would datastores 2, 3, and 4 still be there with the VM files in place, so that we could just reimport/add those VMs back to the inventory and be OK? (Assuming we redo all the vSwitch setup and everything else we originally had in ESXi.)
05-16-2024 07:59 AM
Once ESXi is reinstalled, you should be able to see or import the datastores from the other VDs on the server into vSphere. Once you can see those datastores in ESXi, you can add the VMs back to inventory and boot them up. Here is the process:
https://knowledge.broadcom.com/external/article?legacyId=1006160
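For reference, the re-registration described in that article can also be done from the ESXi shell; the datastore and VM names below are placeholders:

```shell
# Confirm the surviving datastores mounted after the reinstall
esxcli storage filesystem list

# Register an existing VM back into inventory (path is a placeholder;
# point it at the .vmx file on the surviving datastore)
vim-cmd solo/registervm /vmfs/volumes/datastore2/MyVM/MyVM.vmx

# Verify it shows up, then power it on using the ID returned above
vim-cmd vmsvc/getallvms
vim-cmd vmsvc/power.on <vmid>
```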
05-16-2024 08:26 AM
OK, good. That's what I was hoping. So now, it's really just a matter of (tell me if you agree)...