08-15-2012 04:48 AM - edited 03-01-2019 10:34 AM
Hi Guys,
Just trying to get to the bottom of what this requires to fix.
I understand what it's telling me and I was just going to reset the CIMC, but on investigation I am a little confused.
It states in the early paragraphs that once the memory is degraded it will no longer get re-evaluated until changed, even if you perform a CIMC reset - but then later states that you can indeed force re-evaluation by resetting the CIMC?
So are they saying that once you see this error the threshold has been reached and you need new RAM, as the current RAM has performed below expectations - or should I reset the CIMC and see if it breaches the threshold again (as it has been reset)?
My worry is that I reset the CIMC but the ECC threshold is no longer being evaluated and the DIMM fails fully.
Steve.
08-15-2012 04:56 AM
Just checking the events now and the ECC errors are massive - New DIMM required.
08-15-2012 10:28 AM
Hi Steven,
Just to clarify: reset memory errors does not equal reset CIMC.
Actually, resetting CIMC should never be done to clear DIMM errors - doing so is equivalent to sweeping a potential problem under the rug, and it has the side effect of deleting files in CIMC that may be helpful in investigating the cause of the error.
In 1.3 and earlier firmware, resetting CIMC was the easiest way to get UCSM to re-evaluate the DIMM status based on what it was seeing from CIMC (another, more impacting method would be to decommission and re-acknowledge the blade). For errors that do not occur frequently, this could result in the DIMM status being reset to operable in UCSM without much impact on the operation of the system, but if the error returned, what have you accomplished?
This behavior changed in 1.4 firmware and later. In 1.4 and later, resetting CIMC has no effect on the DIMM status in UCSM. Once a DIMM goes degraded or inoperable, the only ways to clear that state in UCSM are to change the FRU information on the DIMM (i.e. replace it), decommission and re-acknowledge the server (i.e. the server starts over from scratch), or use the reset memory errors functionality.
Reset memory errors was added to 1.4 and later firmware because in 1.3 firmware, UCSM essentially ignored correctable errors. During testing of upgrades from 1.3 to 1.4 it was found that if a system had many correctable errors that occurred long ago, once UCSM was upgraded it would suddenly see all those historical correctable errors as new ones and set the DIMM status to degraded. Reset errors was added to clear that specific condition as well as clear any other false positive DIMM degraded or inoperable status. Use of reset errors outside of this context is similar to resetting CIMC - sweeping a potential problem under the rug.
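To summarize the version split described above, here is a small Python sketch. It is only a lookup of what this post says, not anything UCSM exposes: the names `Action`, `parse_major_minor` and `actions_described` are made up for illustration, and the pre-1.4 branch lists only the two methods mentioned above for that firmware.

```python
import re
from enum import Enum


class Action(Enum):
    # Hypothetical labels for the actions discussed in this thread.
    RESET_CIMC = "reset CIMC"
    DECOMMISSION_REACK = "decommission and re-acknowledge the server"
    REPLACE_DIMM = "replace the DIMM"
    RESET_MEMORY_ERRORS = "reset memory errors"


def parse_major_minor(version):
    """Extract (major, minor) from a version string such as '1.4' or '1.4(3q)'."""
    match = re.match(r"(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"unrecognized firmware version: {version!r}")
    return int(match.group(1)), int(match.group(2))


def actions_described(version):
    """Return the actions this post describes as changing DIMM status in UCSM."""
    if parse_major_minor(version) < (1, 4):
        # 1.3 and earlier: a CIMC reset made UCSM re-evaluate the DIMM status.
        # Reset memory errors did not exist yet.
        return {Action.RESET_CIMC, Action.DECOMMISSION_REACK}
    # 1.4 and later: a CIMC reset no longer affects DIMM status.
    return {
        Action.REPLACE_DIMM,
        Action.DECOMMISSION_REACK,
        Action.RESET_MEMORY_ERRORS,
    }


for fw in ("1.3", "1.4"):
    names = sorted(a.value for a in actions_described(fw))
    print(fw, "->", ", ".join(names))
```

Being in the set does not make an action a good idea: clearing the status without understanding the errors still hides a potential problem.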
Regarding your specific problem - if the number of correctable errors continues to increase then yes, the recommended course of action would be as you suggest - replace the DIMM.
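If you want to check whether the count "continues to increase", one rough way is to note the correctable error count for the DIMM at intervals and compare the readings. The Python sketch below assumes you have collected those readings by hand into a list; the function name `new_correctable_errors` is hypothetical, and treating a drop in the count as a counter reset is an assumption for illustration, not documented UCSM behavior.

```python
def new_correctable_errors(samples):
    """Return how many correctable errors were added across the readings.

    `samples` is a list of cumulative error counts, oldest first.
    Assumption: if a reading is lower than the previous one, the counter
    was reset in between, so the new reading is counted from zero.
    """
    total = 0
    for prev, cur in zip(samples, samples[1:]):
        if cur >= prev:
            total += cur - prev
        else:
            total += cur
    return total


readings = [120, 120, 185, 260]  # made-up example values
added = new_correctable_errors(readings)
if added > 0:
    print(f"{added} new correctable errors since the first reading")
else:
    print("no new correctable errors since the first reading")
```

A result that keeps growing between checks is the situation described above, where replacing the DIMM is the recommended course of action.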
Hope that helps,
Mike