B250 blades with DIMM status as degraded and operability as inoperable

mtimm · 10-07-2010

Prior to UCS release 1.3(1c), a warm reboot of an OS could cause spurious DIMM errors on B250 blades, and UCSM and the CLI would incorrectly report those errors. A warm reboot is a reboot in which the OS is restarted without the power being cycled.

There is usually nothing wrong with the memory. The actual problem is CSCtd37817, which has been addressed in 1.3(1c). There are a few ways to determine whether you are running into this bug:

Are you running a version prior to 1.3(1c)?
Does the BIOS see all of the memory as installed? If it does, then the memory is not actually degraded. However, if the BIOS sees the memory as "Failed" then the memory likely does have an actual problem.
If going into the BIOS is not an option, you can check the total memory as seen by the current operating system. If the total memory seen by the operating system matches the expected physical memory (allowing for the small amount that firmware and the kernel reserve), then the memory is fine. However, if the operating system sees noticeably less than the expected physical memory, then the memory likely does have an actual problem.
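For anyone who wants to script the first and third checks, here is a minimal Python sketch. Everything in it is my own assumption rather than a Cisco tool: the function names are hypothetical, the release parser assumes strings shaped like 1.3(1c) sort by their numeric parts and then the trailing letter, the memory check assumes a Linux guest exposing MemTotal in /proc/meminfo, and the "shortfall smaller than one DIMM" rule is a rough threshold you may need to tune for your configuration.

```python
import re


def parse_ucs_release(release):
    """Parse a release string such as '1.3(1c)' into a sortable tuple.

    Assumption: releases order by major, minor, maintenance number,
    then the trailing letter.
    """
    m = re.fullmatch(r"(\d+)\.(\d+)\((\d+)([a-z]*)\)", release.strip())
    if m is None:
        raise ValueError(f"unrecognized release string: {release!r}")
    major, minor, maint, letter = m.groups()
    return (int(major), int(minor), int(maint), letter)


def predates_fix(release, fixed_in="1.3(1c)"):
    """Return True if `release` sorts before the release carrying the fix."""
    return parse_ucs_release(release) < parse_ucs_release(fixed_in)


def read_mem_total_kib(meminfo_text):
    """Extract the MemTotal value (in kB) from /proc/meminfo-style text."""
    for line in meminfo_text.splitlines():
        if line.startswith("MemTotal:"):
            return int(line.split()[1])
    raise ValueError("MemTotal not found")


def memory_looks_complete(mem_total_kib, expected_gib, dimm_gib):
    """Return True if the OS-visible memory is short by less than one DIMM.

    The OS normally reports a little less than the installed memory, so an
    exact comparison is too strict. The one-DIMM threshold is an assumption.
    """
    shortfall_kib = expected_gib * 1024 * 1024 - mem_total_kib
    return shortfall_kib < dimm_gib * 1024 * 1024
```

On the blade's OS you could pass the contents of /proc/meminfo to read_mem_total_kib() and then call memory_looks_complete() with the installed capacity and DIMM size for that blade. If predates_fix() is true for your release and the memory looks complete, the degraded status is more likely to be the bug than a bad DIMM.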

If the memory is reported as degraded as a result of CSCtd37817, the system can be upgraded to 1.3(1c) or later to clear the state and prevent it from happening again. If an upgrade is not possible right away and you want to clear the state, you can reset the CIMC by following this procedure after connecting to the fabric interconnect CLI:

scope server x/y
scope bmc
reset
commit-buffer

This will reset the CIMC and will not impact the OS running on the blade.

Another way to clear the state is to reseat the blade. Unlike the CIMC reset, reseating removes power from the blade and so interrupts the running OS.

This bug does not impact the B200 blades.