
Hyperflex Cluster - Multiple DDR4 failing

Nilay Patel
Level 1

Our HX clusters have been updated to 4.0(2c), with some still on 3.5(2h), in line with the compatibility matrix. All are all-flash.

 

- For no apparent reason, DDR4 DIMMs are going bad left and right, and in some cases we have had outages too. We are not sure whether anything is configured wrong. Any help would be appreciated.

 

By the way, this is the best hyper-converged solution. I love HyperFlex. I just need some ideas/help.

 

e.g.

09/26/2020 02:47:08 UTC | CIMC | Memory DDR4_P1_A2_ECC #0xd5 | read 4 correctable ECC errors on CPU1 DIMM A2  | Asserted

09/26/2020 02:47:13 UTC | BIOS | Memory #0x02 | DURING RUNTIME: Uncorrectable ECC/other uncorrectable memory error | uncorrectable multibit memory error for CPU1 DIMM A2. | Asserted

09/26/2020 02:47:13 UTC | CIMC | Processor P_CATERR #0x03 | Predictive Failure asserted | Asserted

09/26/2020 02:47:13 UTC | BIOS | Processor #0x00 | Configuration Error |  | Asserted

 

###

 

# 80 01 00 00 01 02 00 00 06 15 43 5E 20 00 04 0C E5 00 00 00 7E 86 58 29 # 180 | 02/11/2020 20:56:38 UTC | CIMC | Memory DDR4_P2_J2_ECC #0xe5 | read 22662 correctable ECC errors on CPU2 DIMM J2  | Asserted
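When many hosts report events like these, it can help to tally them per DIMM slot before opening a case. The following is a minimal sketch, not a Cisco tool: it assumes the pipe-delimited SEL text format shown above, the sample lines are copied (trimmed of the raw hex prefix) from this post, and the names `SEL_LINES` and `summarize` are made up for illustration. Summing the reported counts assumes each event carries its own count rather than a running total, which the log itself does not confirm.

```python
import re
from collections import defaultdict

# Sample SEL lines copied from the post above (raw hex prefix trimmed).
SEL_LINES = [
    "09/26/2020 02:47:08 UTC | CIMC | Memory DDR4_P1_A2_ECC #0xd5 | read 4 correctable ECC errors on CPU1 DIMM A2  | Asserted",
    "09/26/2020 02:47:13 UTC | BIOS | Memory #0x02 | DURING RUNTIME: Uncorrectable ECC/other uncorrectable memory error | uncorrectable multibit memory error for CPU1 DIMM A2. | Asserted",
    "09/26/2020 02:47:13 UTC | CIMC | Processor P_CATERR #0x03 | Predictive Failure asserted | Asserted",
    "02/11/2020 20:56:38 UTC | CIMC | Memory DDR4_P2_J2_ECC #0xe5 | read 22662 correctable ECC errors on CPU2 DIMM J2  | Asserted",
]

# Patterns based only on the message wording seen in this thread.
CORRECTABLE = re.compile(r"read (\d+) correctable ECC errors on (CPU\d+) DIMM (\w+)")
UNCORRECTABLE = re.compile(r"uncorrectable multibit memory error for (CPU\d+) DIMM (\w+)")


def summarize(lines):
    """Return {(cpu, dimm): {"correctable": n, "uncorrectable": m}}."""
    summary = defaultdict(lambda: {"correctable": 0, "uncorrectable": 0})
    for line in lines:
        match = CORRECTABLE.search(line)
        if match:
            count, cpu, dimm = match.groups()
            # Assumption: each event reports its own count, so we add them up.
            summary[(cpu, dimm)]["correctable"] += int(count)
            continue
        match = UNCORRECTABLE.search(line)
        if match:
            cpu, dimm = match.groups()
            summary[(cpu, dimm)]["uncorrectable"] += 1
    return dict(summary)


if __name__ == "__main__":
    for (cpu, dimm), counts in sorted(summarize(SEL_LINES).items()):
        print(cpu, dimm, counts)
```

Lines that match neither pattern (such as the processor events) are ignored, so the result only covers the memory events.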

 

Kirk J
Cisco Employee

What UCS firmware level are you on?

There have been some recent firmware fixes for DIMM and related error-reporting issues, such as CSCvo48003 and CSCvu14656.

 

Kirk...

Some clusters are on HX 3.5(2h) and most are on 4.0(2c), per the matrix.

UCS is on 4.0(4h) and 4.0(4i).

The memory faults occurred during production runtime, not during a reboot.

Server models we use in our HyperFlex environments:

HXAF240C-M5SX

UCSC-C240-M5SX

UCSC-C220-M5SX

Both of those versions correctly report the DIMM slot that is actually triggering the errors.

You do want to get off 4.0(4h), as that version was deferred and can cause the servers to hang at POST if any ECC errors are being detected.

That issue is fixed in 4.0(4i) and later.
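With servers spread across 4.0(4h) and 4.0(4i), a quick script can flag which hosts are still below the fixed release. This is only a sketch: it assumes version strings in the `major.minor(maintenance + letter)` form used in this thread and assumes that plain tuple ordering matches release ordering within the 4.0 train. The function names are hypothetical, and the result says nothing about other release trains.

```python
import re


def parse_ucs_version(text):
    """Split a string such as '4.0(4h)' into (4, 0, 4, 'h')."""
    match = re.fullmatch(r"(\d+)\.(\d+)\((\d+)([a-z])\)", text.strip().lower())
    if match is None:
        raise ValueError(f"unrecognized version string: {text!r}")
    major, minor, maintenance, patch = match.groups()
    return (int(major), int(minor), int(maintenance), patch)


def has_post_hang_fix(version, fixed_in="4.0(4i)"):
    """True if `version` sorts at or after the release named as fixed.

    Assumption: tuple comparison reflects release order; this was not
    confirmed in the thread.
    """
    return parse_ucs_version(version) >= parse_ucs_version(fixed_in)


if __name__ == "__main__":
    for version in ("4.0(4h)", "4.0(4i)"):
        print(version, has_post_hang_fix(version))
```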

 

Ultimately TAC will need to take a holistic view of what the DIMMs have in common, if anything.

 

Kirk...