UCS B200 M3 ESXi 6.0 PSOD

docsystems
Level 1

We have a UCS B200 M3 blade running ESXi 6.0 that suddenly crashed with a PSOD. The server has 2x E5-2690 v2 CPUs and 384 GB of memory. At the time of the crash, the SEL logs indicated a memory issue. However, I recently upgraded the CPUs from v1 to v2 and maxed out the DIMMs, so I don't want to jump to conclusions yet.

 

65a | 01/05/2019 13:44:21 | BIOS | Memory #0x02 | DURING RUNTIME: Uncorrectable ECC/other uncorrectable memory error | uncorrectable multibit memory error for CPU2 DIMM F2. | Asserted 
65b | 01/05/2019 13:44:21 | BIOS | Memory #0x02 | DURING RUNTIME: Uncorrectable ECC/other uncorrectable memory error | uncorrectable multibit memory error for CPU2 DIMM F2. | Asserted 
65c | 01/05/2019 13:44:22 | CIMC | Processor CATERR_N #0x70 | Predictive Failure asserted | Asserted 
65d | 01/05/2019 13:44:22 | CIMC | Platform alert LED_BLADE_STATUS #0x95 | LED color is amber | Asserted 
65e | 01/05/2019 13:44:23 | CIMC | Processor CATERR_N #0x70 | Predictive Failure deasserted | Asserted 
65f | 01/05/2019 13:44:24 | CIMC | Memory DDR3_P2_F0_ECC #0x81 | read 14254 correctable ECC errors on CPU2 DIMM F0  | Asserted 
660 | 01/05/2019 13:44:24 | CIMC | Memory DDR3_P2_F1_ECC #0x82 | read 8944 correctable ECC errors on CPU2 DIMM F1  | Asserted 
661 | 01/05/2019 13:44:24 | CIMC | Memory DDR3_P2_F2_ECC #0x83 | read 9684 correctable ECC errors on CPU2 DIMM F2  | Asserted 
662 | 01/05/2019 13:44:25 | CIMC | Memory DDR3_P2_F2_ECC #0x83 | Upper Non-recoverable - going high | Asserted | Non-Correctable ECC occurred on this DIMM 
663 | 01/05/2019 13:44:25 | CIMC | Processor MCERR #0x98 | Predictive Failure asserted | Asserted 
664 | 01/05/2019 13:44:27 | CIMC | Processor MCERR #0x98 | Predictive Failure deasserted | Asserted 
665 | 01/05/2019 13:46:11 | CIMC | Processor CATERR_N #0x70 | Predictive Failure asserted | Asserted 
666 | 01/05/2019 13:46:12 | CIMC | Processor CATERR_N #0x70 | Predictive Failure deasserted | Asserted 
667 | 01/05/2019 13:46:12 | CIMC | Processor IERR #0x99 | Predictive Failure asserted | Asserted 
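For anyone who wants to tally entries like these per DIMM, here is a minimal Python sketch. It is not a Cisco tool: the function name `summarize_sel` and both regular expressions are assumptions derived only from the line format in the excerpt above, and the sample list holds just the five memory lines that name a DIMM in their description.

```python
import re
from collections import defaultdict

# Five lines copied from the SEL excerpt above (the ones naming a DIMM).
SEL_LINES = [
    "65a | 01/05/2019 13:44:21 | BIOS | Memory #0x02 | DURING RUNTIME: Uncorrectable ECC/other uncorrectable memory error | uncorrectable multibit memory error for CPU2 DIMM F2. | Asserted",
    "65b | 01/05/2019 13:44:21 | BIOS | Memory #0x02 | DURING RUNTIME: Uncorrectable ECC/other uncorrectable memory error | uncorrectable multibit memory error for CPU2 DIMM F2. | Asserted",
    "65f | 01/05/2019 13:44:24 | CIMC | Memory DDR3_P2_F0_ECC #0x81 | read 14254 correctable ECC errors on CPU2 DIMM F0  | Asserted",
    "660 | 01/05/2019 13:44:24 | CIMC | Memory DDR3_P2_F1_ECC #0x82 | read 8944 correctable ECC errors on CPU2 DIMM F1  | Asserted",
    "661 | 01/05/2019 13:44:24 | CIMC | Memory DDR3_P2_F2_ECC #0x83 | read 9684 correctable ECC errors on CPU2 DIMM F2  | Asserted",
]

# Assumed patterns, based solely on the wording of the lines above.
CORRECTABLE = re.compile(r"read (\d+) correctable ECC errors on (CPU\d+ DIMM \w+)")
UNCORRECTABLE = re.compile(r"uncorrectable multibit memory error for (CPU\d+ DIMM \w+)")


def summarize_sel(lines):
    """Return (correctable error counts, uncorrectable event counts) keyed by DIMM."""
    correctable = defaultdict(int)
    uncorrectable = defaultdict(int)
    for line in lines:
        match = CORRECTABLE.search(line)
        if match:
            # Group 1 is the reported error count, group 2 is the DIMM label.
            correctable[match.group(2)] += int(match.group(1))
            continue
        match = UNCORRECTABLE.search(line)
        if match:
            # Each matching line counts as one uncorrectable event.
            uncorrectable[match.group(1)] += 1
    return dict(correctable), dict(uncorrectable)


if __name__ == "__main__":
    correctable, uncorrectable = summarize_sel(SEL_LINES)
    for dimm, count in sorted(correctable.items()):
        print(f"{dimm}: {count} correctable ECC errors")
    for dimm, events in sorted(uncorrectable.items()):
        print(f"{dimm}: {events} uncorrectable event(s)")
```

On the sample lines this reports correctable errors on CPU2 DIMMs F0, F1 and F2 and two uncorrectable entries for CPU2 DIMM F2, all logged within a few seconds of each other. A tally like this only summarizes what the log says; it does not distinguish a bad DIMM from a CPU or board fault.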

I am running the following firmware versions:

BIOS: B200M3.2.2.6d.0.062220160055

Board Controller: 15.0

CIMC: 3.1(21a)

SAS Controller: 20.12.1-0250|4.37.00|NA

VIC: 4.1(1g)

 

I'm looking for input on what else, if anything, could have caused the crash. I have several other servers with the exact same hardware and firmware levels, so I want to rule out configuration issues as well.

1 Reply

Kirk J
Cisco Employee

Greetings.

I'm guessing the DIMM errors are probably red herrings.

It's likely there is a problem with the system board or the processors.

You may want to double-check the processor socket pins on the motherboard to make sure none got bent during the proc upgrades/replacements, as I'm assuming this system board had a clean track record before that...

If you don't see any bent pins, try pulling one proc (and its related DIMMs) and running the diag ISO.

Repeat with the other CPU in a single-CPU config, again with the diag ISO, to see whether the problem follows a specific proc.

Otherwise, open a TAC case.

 

Kirk...

 
