Re: UCS B200 M3 ESXi 6.0 PSOD

docsystems · 01-05-2019

We have a UCS B200 M3 blade running ESXi 6.0 that suddenly crashed with a PSOD. The server has 2x E5-2690v2 CPUs and 384GB of memory. At the time of the crash, the SEL logs indicated a memory issue. However, I recently upgraded the CPUs from V1 to V2 and maxed out the DIMMs, so I don't want to jump to conclusions yet.

65a | 01/05/2019 13:44:21 | BIOS | Memory #0x02 | DURING RUNTIME: Uncorrectable ECC/other uncorrectable memory error | uncorrectable multibit memory error for CPU2 DIMM F2. | Asserted 
65b | 01/05/2019 13:44:21 | BIOS | Memory #0x02 | DURING RUNTIME: Uncorrectable ECC/other uncorrectable memory error | uncorrectable multibit memory error for CPU2 DIMM F2. | Asserted 
65c | 01/05/2019 13:44:22 | CIMC | Processor CATERR_N #0x70 | Predictive Failure asserted | Asserted 
65d | 01/05/2019 13:44:22 | CIMC | Platform alert LED_BLADE_STATUS #0x95 | LED color is amber | Asserted 
65e | 01/05/2019 13:44:23 | CIMC | Processor CATERR_N #0x70 | Predictive Failure deasserted | Asserted 
65f | 01/05/2019 13:44:24 | CIMC | Memory DDR3_P2_F0_ECC #0x81 | read 14254 correctable ECC errors on CPU2 DIMM F0  | Asserted 
660 | 01/05/2019 13:44:24 | CIMC | Memory DDR3_P2_F1_ECC #0x82 | read 8944 correctable ECC errors on CPU2 DIMM F1  | Asserted 
661 | 01/05/2019 13:44:24 | CIMC | Memory DDR3_P2_F2_ECC #0x83 | read 9684 correctable ECC errors on CPU2 DIMM F2  | Asserted 
662 | 01/05/2019 13:44:25 | CIMC | Memory DDR3_P2_F2_ECC #0x83 | Upper Non-recoverable - going high | Asserted | Non-Correctable ECC occurred on this DIMM 
663 | 01/05/2019 13:44:25 | CIMC | Processor MCERR #0x98 | Predictive Failure asserted | Asserted 
664 | 01/05/2019 13:44:27 | CIMC | Processor MCERR #0x98 | Predictive Failure deasserted | Asserted 
665 | 01/05/2019 13:46:11 | CIMC | Processor CATERR_N #0x70 | Predictive Failure asserted | Asserted 
666 | 01/05/2019 13:46:12 | CIMC | Processor CATERR_N #0x70 | Predictive Failure deasserted | Asserted 
667 | 01/05/2019 13:46:12 | CIMC | Processor IERR #0x99 | Predictive Failure asserted | Asserted
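For anyone sorting through a SEL dump like the one above, here is a minimal Python sketch that tallies correctable and uncorrectable memory events per DIMM. It assumes the pipe-separated layout shown in this post and only matches entries that spell out "CPUn DIMM xx" in the description. The names `summarize_sel`, `DIMM_RE`, and `CORRECTABLE_RE` are made up for illustration and are not part of any Cisco tool.

```python
import re
from collections import defaultdict

# A few entries copied from the SEL excerpt in this post.
SEL_SAMPLE = """\
65a | 01/05/2019 13:44:21 | BIOS | Memory #0x02 | DURING RUNTIME: Uncorrectable ECC/other uncorrectable memory error | uncorrectable multibit memory error for CPU2 DIMM F2. | Asserted
65b | 01/05/2019 13:44:21 | BIOS | Memory #0x02 | DURING RUNTIME: Uncorrectable ECC/other uncorrectable memory error | uncorrectable multibit memory error for CPU2 DIMM F2. | Asserted
65f | 01/05/2019 13:44:24 | CIMC | Memory DDR3_P2_F0_ECC #0x81 | read 14254 correctable ECC errors on CPU2 DIMM F0 | Asserted
660 | 01/05/2019 13:44:24 | CIMC | Memory DDR3_P2_F1_ECC #0x82 | read 8944 correctable ECC errors on CPU2 DIMM F1 | Asserted
661 | 01/05/2019 13:44:24 | CIMC | Memory DDR3_P2_F2_ECC #0x83 | read 9684 correctable ECC errors on CPU2 DIMM F2 | Asserted
"""

# Assumed patterns, based only on the wording seen in this log.
DIMM_RE = re.compile(r"CPU(\d+) DIMM ([A-Z]\d+)")
CORRECTABLE_RE = re.compile(r"read (\d+) correctable ECC errors")


def summarize_sel(text):
    """Return {dimm: {"correctable": n, "uncorrectable": m}} for a SEL dump."""
    summary = defaultdict(lambda: {"correctable": 0, "uncorrectable": 0})
    for line in text.splitlines():
        fields = [f.strip() for f in line.split("|")]
        if len(fields) < 5:
            continue  # not a SEL entry in the expected layout
        # Everything after the sensor column is treated as the description.
        description = " | ".join(fields[4:])
        dimm = DIMM_RE.search(description)
        if not dimm:
            continue  # entries that do not name a CPU/DIMM are skipped
        key = f"CPU{dimm.group(1)} DIMM {dimm.group(2)}"
        count = CORRECTABLE_RE.search(description)
        if count:
            summary[key]["correctable"] += int(count.group(1))
        elif "uncorrectable" in description.lower():
            summary[key]["uncorrectable"] += 1
    return dict(summary)


if __name__ == "__main__":
    for dimm, counts in sorted(summarize_sel(SEL_SAMPLE).items()):
        print(dimm, counts)
```

In this excerpt every DIMM that logs errors is an F slot on CPU2 (F0, F1, F2), so it may be worth looking at what those slots share before blaming one module.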

The firmware versions I am running are as follows:

BIOS: B200M3.2.2.6d.0.062220160055

Board Controller: 15.0

CIMC: 3.1(21a)

SAS Controller: 20.12.1-0250|4.37.00|NA

VIC: 4.1(1g)

I'm looking for input on what else, if anything, could have caused the crash. I have multiple other servers running the exact same hardware setup and firmware levels, so I want to rule out any possible configuration issues as well.

Kirk J · 01-06-2019

Greetings.

I'm guessing the DIMM errors are probably red herrings.

It's likely there is a problem with the system board or the processors.

You may want to double-check the processor socket pins on the motherboard to make sure none got bent during the processor upgrade, as I'm assuming this system board had a clean track record before that.

If you don't see any bent pins, try pulling one processor (and its related DIMMs) and running the diagnostics ISO.

Repeat with the other CPU in a single-CPU configuration, again running the diagnostics ISO, to see whether the problem triggers with a specific processor.

Otherwise, open a TAC case.

Kirk...