12-08-2017 02:35 AM - edited 03-01-2019 01:23 PM
Hello,
We have an issue where the VMware hardware status is showing a memory error.
But in UCS Manager there is no fault, warning, or info message. I have rebooted the host, but the issue is still there. Any advice on where this error is coming from?
Thanks
Frank
12-08-2017 09:31 AM
VMware relies on the Cisco Integrated Management Controller to collect data on the DIMMs within the server. Thus when we see a discrepancy between what VMware reports from its front-end versus what the UCS reports, it is usually related to the CIMC. One other item to consider before moving forward is that this may be a difference between the way that we report and handle ECC/UECC errors. As ECC errors on a DIMM are correctable and thus not hardware failures, the UCS may not report faults outwardly even if ECC errors are present and incrementing.
As for clearing the error, rebooting the host OS or the server itself will not reboot the CIMC, and vice versa. The CIMC is what should be rebooted to clear this. The procedure differs depending on the hardware (B-series vs C-series), and I have posted the links below.
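For a B-series blade managed by UCS Manager, the CIMC reset can be sketched from the UCS Manager CLI roughly as follows. The prompt name Ca-1-A and the server ID 2/3 (chassis 2, slot 3) are placeholders taken from later in this thread, and the command sequence is an assumption that should be checked against the CLI configuration guide for your UCS Manager release. Standalone C-series servers are reset through the CIMC itself and follow a different procedure.

```
Ca-1-A# scope server 2/3
Ca-1-A /chassis/server # scope cimc
Ca-1-A /chassis/server/cimc # reset
Ca-1-A /chassis/server/cimc* # commit-buffer
```

As noted above, this resets only the management controller and does not reboot the host OS. UCS Manager may briefly lose visibility of the server while the CIMC restarts.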
If the errors continue to recur, you may want to open a ticket to have the memory array reviewed by TAC.
Thanks!
12-18-2017 02:19 AM
Hello Evan,
Thanks for your help, but that doesn't fix the issue. I guess I have to raise a call.
Thanks
Frank
12-18-2017 07:57 AM
I would suggest a call as well. Have the memory array analyzed for errors and replace hardware as necessary.
Thanks for the reply and have a great week! I do hope you get this sorted out in short order.
01-25-2018 01:37 AM
This was our solution:
Ca-1-A# scope server 2/3
Ca-1-A /chassis/server # reset-all-memory-errors
Ca-1-A /chassis/server* # commit
Frank