01-05-2019 03:12 PM
We have a UCS B200 M3 blade running ESXi 6.0 that suddenly crashed with a PSOD. The server has 2x E5-2690v2 CPUs and 384GB of memory. At the time of the crash, the SEL logs indicated a memory issue. However, I recently upgraded the CPUs from v1 to v2 and maxed out the DIMMs, so I don't want to jump to conclusions yet.
65a | 01/05/2019 13:44:21 | BIOS | Memory #0x02 | DURING RUNTIME: Uncorrectable ECC/other uncorrectable memory error | uncorrectable multibit memory error for CPU2 DIMM F2. | Asserted
65b | 01/05/2019 13:44:21 | BIOS | Memory #0x02 | DURING RUNTIME: Uncorrectable ECC/other uncorrectable memory error | uncorrectable multibit memory error for CPU2 DIMM F2. | Asserted
65c | 01/05/2019 13:44:22 | CIMC | Processor CATERR_N #0x70 | Predictive Failure asserted | Asserted
65d | 01/05/2019 13:44:22 | CIMC | Platform alert LED_BLADE_STATUS #0x95 | LED color is amber | Asserted
65e | 01/05/2019 13:44:23 | CIMC | Processor CATERR_N #0x70 | Predictive Failure deasserted | Asserted
65f | 01/05/2019 13:44:24 | CIMC | Memory DDR3_P2_F0_ECC #0x81 | read 14254 correctable ECC errors on CPU2 DIMM F0 | Asserted
660 | 01/05/2019 13:44:24 | CIMC | Memory DDR3_P2_F1_ECC #0x82 | read 8944 correctable ECC errors on CPU2 DIMM F1 | Asserted
661 | 01/05/2019 13:44:24 | CIMC | Memory DDR3_P2_F2_ECC #0x83 | read 9684 correctable ECC errors on CPU2 DIMM F2 | Asserted
662 | 01/05/2019 13:44:25 | CIMC | Memory DDR3_P2_F2_ECC #0x83 | Upper Non-recoverable - going high | Asserted | Non-Correctable ECC occurred on this DIMM
663 | 01/05/2019 13:44:25 | CIMC | Processor MCERR #0x98 | Predictive Failure asserted | Asserted
664 | 01/05/2019 13:44:27 | CIMC | Processor MCERR #0x98 | Predictive Failure deasserted | Asserted
665 | 01/05/2019 13:46:11 | CIMC | Processor CATERR_N #0x70 | Predictive Failure asserted | Asserted
666 | 01/05/2019 13:46:12 | CIMC | Processor CATERR_N #0x70 | Predictive Failure deasserted | Asserted
667 | 01/05/2019 13:46:12 | CIMC | Processor IERR #0x99 | Predictive Failure asserted | Asserted
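For anyone comparing SEL output across several blades, a rough way to tally the memory events per DIMM is sketched below. It is only an illustration: the function name summarize_sel and the two regular expressions are my own assumptions, written against the message wording in the entries above, and other firmware levels may phrase these events differently.

```python
import re
from collections import defaultdict

# A few entries copied from the SEL output in this post.
SEL_SAMPLE = """\
65a | 01/05/2019 13:44:21 | BIOS | Memory #0x02 | DURING RUNTIME: Uncorrectable ECC/other uncorrectable memory error | uncorrectable multibit memory error for CPU2 DIMM F2. | Asserted
65b | 01/05/2019 13:44:21 | BIOS | Memory #0x02 | DURING RUNTIME: Uncorrectable ECC/other uncorrectable memory error | uncorrectable multibit memory error for CPU2 DIMM F2. | Asserted
65f | 01/05/2019 13:44:24 | CIMC | Memory DDR3_P2_F0_ECC #0x81 | read 14254 correctable ECC errors on CPU2 DIMM F0 | Asserted
660 | 01/05/2019 13:44:24 | CIMC | Memory DDR3_P2_F1_ECC #0x82 | read 8944 correctable ECC errors on CPU2 DIMM F1 | Asserted
661 | 01/05/2019 13:44:24 | CIMC | Memory DDR3_P2_F2_ECC #0x83 | read 9684 correctable ECC errors on CPU2 DIMM F2 | Asserted
"""

# Assumed patterns, based only on the wording seen in the entries above.
CORRECTABLE = re.compile(r"read (\d+) correctable ECC errors on (CPU\d+ DIMM \w+)")
UNCORRECTABLE = re.compile(r"uncorrectable multibit memory error for (CPU\d+ DIMM \w+)")


def summarize_sel(sel_text):
    """Return (correctable error totals, uncorrectable event counts) keyed by DIMM."""
    correctable = defaultdict(int)
    uncorrectable = defaultdict(int)
    for line in sel_text.splitlines():
        match = CORRECTABLE.search(line)
        if match:
            # Add the reported error count to that DIMM's running total.
            correctable[match.group(2)] += int(match.group(1))
            continue
        match = UNCORRECTABLE.search(line)
        if match:
            # Each matching SEL entry counts as one uncorrectable event.
            uncorrectable[match.group(1)] += 1
    return dict(correctable), dict(uncorrectable)


if __name__ == "__main__":
    corr, uncorr = summarize_sel(SEL_SAMPLE)
    for dimm in sorted(corr):
        print(f"{dimm}: {corr[dimm]} correctable")
    for dimm in sorted(uncorr):
        print(f"{dimm}: {uncorr[dimm]} uncorrectable event(s)")
```

Run against the entries above, the tally shows correctable errors on all three CPU2 channel F DIMMs (F0, F1, and F2), not only on F2, where the uncorrectable error was logged.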
The firmware versions I am running are as follows:
BIOS: B200M3.2.2.6d.0.062220160055
Board Controller: 15.0
CIMC: 3.1(21a)
SAS Controller: 20.12.1-0250|4.37.00|NA
VIC: 4.1(1g)
I'm looking for input on what else, if anything, could have caused the crash. I have multiple other servers running the exact same hardware setup and the same firmware levels, so I want to rule out any possible configuration issues as well.
01-06-2019 05:29 AM
Greetings.
I'm guessing the DIMM errors are probably red herrings.
It's likely there is a problem with the system board or the processors.
You may want to double-check the processor socket pins on the motherboard to make sure none got bent during the processor upgrade, as I'm assuming this system board had a clean track record before that...
If you don't see any bent pins, try pulling one processor (and its related DIMMs) out, and run the diagnostics ISO.
Repeat with the other CPU in a single-CPU configuration, again with the diagnostics ISO, to see whether the errors are triggered by a specific processor.
Else, open a TAC case.
Kirk...