VMware Memory Sensor

fr.mueller
Level 1

Hello,

We have an issue where the VMware hardware status is showing a memory error.

 

[Attached screenshot: memroy.JPG]

But UCS Manager shows no fault, warning, or info message. I have rebooted the host, but the issue is still there. Any advice on where this error is coming from?

Thanks

Frank

Accepted Solutions

Evan Mickel
Cisco Employee

VMware relies on the Cisco Integrated Management Controller (CIMC) to collect data on the DIMMs within the server. So when there is a discrepancy between what VMware reports in its front end and what UCS reports, it is usually related to the CIMC. One other item to consider before moving forward is that this may come down to a difference in how the two sides report and handle ECC/UECC errors. Because ECC errors on a DIMM are correctable and thus not hardware failures, UCS may not raise a fault even if ECC errors are present and incrementing.

As for clearing the error: rebooting the host OS or the server itself does not reboot the CIMC, and vice versa. The CIMC is what should be rebooted to clear this. The procedure differs by hardware platform (B-series vs. C-series); I have posted the links below.

 

https://www.cisco.com/c/en/us/td/docs/unified_computing/ucs/c/sw/gui/config/guide/1-1-2/b_Cisco_UCS_C-Series_Servers_Integrated_Management_Controller_Configuration_Guide_1_1_2/Cisco_UCS_C-Series_Servers_Integrated_Management_Controller_Configuration_...

 

https://www.cisco.com/c/en/us/td/docs/unified_computing/ucs/sw/gui/config/guide/2-2/b_UCSM_GUI_Configuration_Guide_2_2/managing_blade_servers.html#task_5D5BB478DE454021990B1A9E8CBBDF32

 

If the errors continue to recur, you may want to open a ticket to have the memory array reviewed by TAC.

 

 

 

Thanks!

 

Hello Evan,

 

Thanks for your help, but that didn't fix the issue. I guess I have to raise a call.

Thanks

Frank

I would suggest a call as well. Have the memory array analyzed for errors and replace hardware as necessary.

 

Thanks for the reply and have a great week!  I do hope you get this sorted out in short order.

This was our solution:

  1. Reset the memory-error counters on server 2/3 by running the following commands on the CLI:

     Ca-1-A# scope server 2/3
     Ca-1-A /chassis/server # reset-all-memory-errors
     Ca-1-A /chassis/server* # commit

  2. Clear the SEL logs for the server from the UCSM GUI.
  3. After resetting the memory errors and clearing the SEL, monitor the server for 30 minutes and report back whether this cleared the errors or whether they appeared again.
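
For anyone who wants to automate the 30-minute watch in the last step, here is a minimal Python sketch. It is not from TAC and not a Cisco or VMware tool. The `read_count` callable is hypothetical: you would have to supply your own function that returns the current memory-error count from whatever source you trust (this thread does not say which command or API to use). The helper only polls that function and tells you whether the count rose above the first reading taken after the reset.

```python
import time
from typing import Callable, List, Tuple


def watch_error_count(
    read_count: Callable[[], int],          # hypothetical: supplied by you
    duration_s: float = 30 * 60,            # 30 minutes, as in the step above
    interval_s: float = 60,                 # assumed polling interval
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> Tuple[bool, List[int]]:
    """Poll read_count() until duration_s seconds have elapsed.

    Returns (reappeared, samples). reappeared is True if any later
    sample is higher than the baseline taken at the start.
    """
    start = clock()
    samples = [read_count()]                # baseline right after the reset
    while clock() - start < duration_s:
        sleep(interval_s)
        samples.append(read_count())
    reappeared = max(samples) > samples[0]
    return reappeared, samples
```

The `sleep` and `clock` parameters are only there so the loop can be dry-run with fake timing before pointing it at a real server.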

 Frank
