10-16-2020 10:37 AM - edited 10-16-2020 10:40 AM
Our HX clusters have been updated to 4.0(2c), with some on 3.5(2h), per the matrix. All are all-flash.
- For no apparent reason, DDR4 DIMMs are going bad left and right, and in some cases we have had outages too. I'm not sure whether anything is configured incorrectly. Any help would be much appreciated.
By the way, this is the best hyper-converged solution. I love HyperFlex. I just need some ideas/help.
e.g.
09/26/2020 02:47:08 UTC | CIMC | Memory DDR4_P1_A2_ECC #0xd5 | read 4 correctable ECC errors on CPU1 DIMM A2 | Asserted
09/26/2020 02:47:13 UTC | BIOS | Memory #0x02 | DURING RUNTIME: Uncorrectable ECC/other uncorrectable memory error | uncorrectable multibit memory error for CPU1 DIMM A2. | Asserted
09/26/2020 02:47:13 UTC | CIMC | Processor P_CATERR #0x03 | Predictive Failure asserted | Asserted
09/26/2020 02:47:13 UTC | BIOS | Processor #0x00 | Configuration Error | | Asserted
###
# 80 01 00 00 01 02 00 00 06 15 43 5E 20 00 04 0C E5 00 00 00 7E 86 58 29 # 180 | 02/11/2020 20:56:38 UTC | CIMC | Memory DDR4_P2_J2_ECC #0xe5 | read 22662 correctable ECC errors on CPU2 DIMM J2 | Asserted
10-21-2020 06:56 AM
What UCS firmware level are you on?
There have been some recent firmware fixes for DIMM and related error-reporting issues, such as CSCvo48003 and CSCvu14656.
Kirk...
10-21-2020 07:11 AM - edited 10-21-2020 07:23 AM
Some clusters are on HX 3.5(2h) and most of them are on 4.0(2c), per the matrix.
UCS firmware is on 4.0(4h) and 4.0(4i).
The memory faults occurred during production runtime, not during a reboot.
Server models we use in our HyperFlex environments:
HXAF240C-M5SX
UCSC-C240-M5SX
UCSC-C220-M5SX
10-21-2020 11:21 AM
Both of those versions correctly report the DIMM slot that is actually triggering the errors.
You do want to get off of 4.0(4h), as that version was deferred and can cause the servers to hang at POST if any ECC errors are being detected.
That issue is fixed in 4.0(4i) and later.
Ultimately, TAC will need to take a holistic view of what the failing DIMMs have in common, if anything.
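Before opening the case, it may help to tally the errors per DIMM slot across your servers so any pattern is easier to see. The following is only a minimal Python sketch, not a Cisco tool: it assumes the SEL entries are saved as plain text in the same pipe-delimited form quoted earlier in this thread, and the function name `tally_ecc` and both regular expressions are hypothetical helpers written for this illustration.

```python
import re
from collections import Counter

# Assumption: lines of interest name the slot as "CPU<n> DIMM <slot>",
# as in the SEL excerpts quoted in this thread.
DIMM_RE = re.compile(r"CPU(\d+) DIMM ([A-Z]\d+)")
CORRECTABLE_RE = re.compile(r"read (\d+) correctable ECC errors")


def tally_ecc(lines):
    """Return (correctable, uncorrectable) Counters keyed by (cpu, slot)."""
    correctable = Counter()    # sum of reported correctable ECC counts
    uncorrectable = Counter()  # number of uncorrectable events logged
    for line in lines:
        dimm = DIMM_RE.search(line)
        if dimm is None:
            continue  # not a DIMM-specific entry (e.g. processor events)
        key = (int(dimm.group(1)), dimm.group(2))
        count = CORRECTABLE_RE.search(line)
        if count:
            correctable[key] += int(count.group(1))
        elif "uncorrectable" in line.lower():
            uncorrectable[key] += 1
    return correctable, uncorrectable


# Sample entries copied from the logs posted above.
SAMPLE = [
    "09/26/2020 02:47:08 UTC | CIMC | Memory DDR4_P1_A2_ECC #0xd5 | read 4 correctable ECC errors on CPU1 DIMM A2 | Asserted",
    "09/26/2020 02:47:13 UTC | BIOS | Memory #0x02 | DURING RUNTIME: Uncorrectable ECC/other uncorrectable memory error | uncorrectable multibit memory error for CPU1 DIMM A2. | Asserted",
    "09/26/2020 02:47:13 UTC | CIMC | Processor P_CATERR #0x03 | Predictive Failure asserted | Asserted",
    "180 | 02/11/2020 20:56:38 UTC | CIMC | Memory DDR4_P2_J2_ECC #0xe5 | read 22662 correctable ECC errors on CPU2 DIMM J2 | Asserted",
]

if __name__ == "__main__":
    corr, uncorr = tally_ecc(SAMPLE)
    for slot, total in corr.most_common():
        print(slot, "correctable:", total, "uncorrectable events:", uncorr[slot])
```

Running this over the logs from each server and lining the results up against DIMM part numbers, slot positions, and firmware level could give TAC a quicker starting point, though it is no substitute for their analysis.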
Kirk...