Just trying to get to the bottom of what this requires to fix. I understand what it's telling me and I was just going to reset the CIMC, but on investigation I am a little confused.
It states in the early paragraphs that once the memory is degraded it will no longer get re-evaluated until it is changed, even if you perform a CIMC reset - but then later states that you can indeed force re-evaluation by resetting the CIMC?
So are they saying that once you see this error the threshold has been reached and you need new RAM, as the current RAM has performed below expectations - or that you should reset the CIMC and see if it breaches the threshold again (as it has been reset)?
My worry is that I reset the CIMC but the ECC threshold is no longer being evaluated and the DIMM fails fully.
Just to clarify: reset memory errors does not equal reset CIMC.
Actually, resetting CIMC should never be done to clear DIMM errors - doing so is equivalent to sweeping a potential problem under the rug, and it has the side effect of deleting files in CIMC that may be helpful in investigating the cause of the error.
In 1.3 and earlier firmware, resetting CIMC was the easiest way to get UCSM to re-evaluate the DIMM status based on what it was seeing from CIMC (another, more disruptive method would be to decommission and re-acknowledge the blade). For errors that do not occur frequently, this could result in the DIMM status being reset to operable in UCSM without much impact on the operation of the system, but if the error returned, what have you accomplished?
This behavior changed in 1.4 firmware and later. In 1.4 and later, resetting CIMC has no effect on the DIMM status in UCSM. Once a DIMM goes degraded or inoperable, the only ways to clear that state in UCSM are to change the FRU information on the DIMM (i.e. replace it), decommission and re-acknowledge the server (i.e. the server starts over from scratch), or use the reset memory errors functionality.
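To make the difference between firmware versions easier to see, here is a minimal sketch that simply encodes what is described above as a lookup. It does not talk to UCSM or CIMC; the function name and the action labels are made up for illustration, and the 1.3 entry lists only the methods mentioned in this reply.

```python
import re

def actions_that_clear_dimm_status(firmware: str) -> set[str]:
    """Return the actions that clear a degraded/inoperable DIMM status in UCSM.

    Hypothetical helper: the labels are illustrative, not UCSM commands.
    Accepts version strings such as "1.3", "1.4(1i)" or "4.0.2d".
    """
    match = re.match(r"(\d+)\.(\d+)", firmware)
    if match is None:
        raise ValueError(f"unrecognized firmware version: {firmware!r}")
    version = (int(match.group(1)), int(match.group(2)))

    if version <= (1, 3):
        # 1.3 and earlier: a CIMC reset made UCSM re-evaluate the DIMM.
        return {"reset_cimc", "decommission_and_reacknowledge"}

    # 1.4 and later: a CIMC reset no longer affects the DIMM status.
    return {
        "replace_dimm",
        "decommission_and_reacknowledge",
        "reset_memory_errors",
    }

print(sorted(actions_that_clear_dimm_status("1.4(1i)")))
```

Note that the lookup only says what clears the status, not what should be done: clearing the status without addressing the cause still hides a potential problem.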
Reset memory errors was added to 1.4 and later firmware because in 1.3 firmware, UCSM essentially ignored correctable errors. During testing of upgrades from 1.3 to 1.4 it was found that if a system had many correctable errors that occurred long ago, once UCSM was upgraded it would suddenly see all those historical correctable errors as new ones and set the DIMM status to degraded. Reset errors was added to clear that specific condition as well as clear any other false positive DIMM degraded or inoperable status. Use of reset errors outside of this context is similar to resetting CIMC - sweeping a potential problem under the rug.
Regarding your specific problem - if the number of correctable errors continues to increase then yes, the recommended course of action would be as you suggest - replace the DIMM.
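As a rough illustration of "continues to increase", the check below compares correctable error counts that you have noted down by hand over time (for example after a reset of the memory errors). It is only a sketch: the function names are hypothetical, no UCSM API is involved, and no particular threshold is assumed.

```python
def errors_still_increasing(counts: list[int]) -> bool:
    """True if the latest reading is higher than the first one.

    `counts` is a list of correctable error counts for one DIMM,
    oldest reading first, collected manually (an assumption of this sketch).
    """
    return len(counts) >= 2 and counts[-1] > counts[0]

def recommended_action(counts: list[int]) -> str:
    """Map the trend to the course of action suggested in this reply."""
    if errors_still_increasing(counts):
        return "replace DIMM"
    return "keep monitoring"

# Example: the count keeps climbing after the errors were reset.
print(recommended_action([0, 2, 7]))
```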
Just wondering: we recently tried to add 2 M200 servers with firmware 4.1.1b. Our UCS Manager version is 4.0.2d, and we got an error saying it cannot downgrade. I know we can update UCS Manager from 4.0 to 4.1, but I wonder whether we can downgrade instead.