Solved: Re: UCS C240 - Pink screen of death

Jasmine Misal · ‎11-24-2017

On Tuesday we got a pink screen of death on our ESXi host. We tried consoling into the host and it rebooted with the pink screen below. Could someone tell me how to figure out what happened, or how to read this? Thanks

Model: UCSC-C240-M45SX

ESXi 5.5.0

BIOS version: C240M4.3.0.3a.0.0321172111

Qiese Dides · ‎11-27-2017

Hi Jasmine,

PSODs (Purple Screens of Death) can be tricky to troubleshoot. Following these steps can help you get to the correct answer :). At the end of this post I will upload a screenshot of how to read a PSOD.

1) Proper Log Collection

(Gather a screenshot, which you did.) Based on the screenshot, the VMware KB article below applies; it gives us an idea of what a PF Exception 14 means. We will need more information, though.

https://kb.vmware.com/s/article/1020181

* We will need to gather an ESXi log bundle!! (Below is how to do this)

An ESXi log bundle is a .tgz file generated by the automated log collection process

See KB 653 for the log collection process

NO VCENTER LOGS!!!!!!!!!! - VC logs are irrelevant to a crash. All they report is “Host was there and now it’s gone!”
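Once you have the bundle, a quick sanity check before uploading it to a case is to confirm it actually opens and contains kernel logs. Below is a minimal Python sketch, assuming the bundle is a gzip-compressed tar archive (the .tgz mentioned above). The function name `list_bundle_members` and the default keyword `vmkernel` are illustrative assumptions, not part of any Cisco or VMware tool, and the exact file layout inside a bundle can differ between ESXi versions.

```python
import tarfile


def list_bundle_members(bundle_path, keyword="vmkernel"):
    """Return the sorted names of regular files in a .tgz bundle
    whose path contains `keyword` (hypothetical helper)."""
    with tarfile.open(bundle_path, mode="r:gz") as bundle:
        return sorted(
            member.name
            for member in bundle.getmembers()
            if member.isfile() and keyword in member.name
        )
```

An empty result would suggest the bundle is incomplete or that the keyword does not match how your ESXi version names its kernel logs.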

* Once you have gathered the log bundles, if you want to open a case with Cisco or VMware to find the root cause, have answers to these questions ready when opening the case (it will get you a speedier resolution):

How widely spread is the issue? One host? Two? All hosts in the cluster? Only the new hosts?

When did the issue start? Just now, last week, or since install?

How often does this issue occur? Every day? Every week? Just this one time?

Any changes to the host or the environment recently?

Have you already run hardware diagnostics? If so, what was the result?

Was there any specific action that led to the crash or was it just sitting there?

Key for the screenshot above:

1. ESXi version and build

2. Exception and/or failure message

3. PTEs (only shown w/ exception 14 type crash)

4. CPU register info

5. PCPU & world generating the crash

6. Uptime

7. Address of frame in memory

8. Address of code in memory

9. The backtrace

10. Dump to disk is configured

11. Status of DiskDump

12. Dump to file is not configured

13. Availability of local debugging

Breakdown of the Purple Screen of Death error message:

The exception or failure message (#2)
- This is the first piece of information to look at for PSODs
- This tells us the nature of the crash and sometimes explicitly why it crashed
- For a description of exception types: http://support.microsoft.com/en-us/kb/117389
- Other common crashes will be covered later
- Our example crash is an exception 14 (page fault) which is very common

The fabled backtrace (#9, aka stacktrace)
- This is a list of functions running on the world (aka thread) that caused or experienced the crash
- The function at the top of the stack is the function currently running
- Going down the stack is like going backwards in the process’s execution
- It shows how we ended up here
- The backtrace is what we typically use to match a crash to a known issue. Think of it as the fingerprint of the crash.

Ex: Line at the top of our sample stack

0x4123c111db10:[0x4180262f8abb]LibAIODrainMergeQueue@vmkernel#nover+0x153 stack: 0x123c111db60

Function names are between the code address and the ‘@’ symbol
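If you have the backtrace as text (typed out from the screenshot, or copied from a log), the function names can be pulled out with a short script. The Python sketch below assumes every frame follows the same layout as the sample line above: frame address, code address in brackets, function name, then `@` and the rest. The group names `module` and `version` for the `vmkernel#nover` part are my own labels, and `parse_frame` and `fingerprint` are hypothetical helpers, not part of any VMware or Cisco tool.

```python
import re

# Layout assumed from the sample line in this post:
# <frame addr>:[<code addr>]<function>@<module>#<version>+<offset> stack: ...
FRAME_RE = re.compile(
    r"^(?P<frame>0x[0-9a-fA-F]+):"
    r"\[(?P<code>0x[0-9a-fA-F]+)\]"
    r"(?P<function>[^@]+)@"
    r"(?P<module>[^#\s]+)#(?P<version>[^+\s]+)"
    r"\+(?P<offset>0x[0-9a-fA-F]+)"
)


def parse_frame(line):
    """Split one backtrace line into its fields, or return None if it does not match."""
    match = FRAME_RE.match(line.strip())
    return match.groupdict() if match else None


def fingerprint(lines):
    """Join the function names, top of stack first, skipping lines that do not parse."""
    frames = (parse_frame(line) for line in lines)
    return " <- ".join(f["function"] for f in frames if f)


sample = (
    "0x4123c111db10:[0x4180262f8abb]LibAIODrainMergeQueue"
    "@vmkernel#nover+0x153 stack: 0x123c111db60"
)
frame = parse_frame(sample)
print(frame["function"])  # LibAIODrainMergeQueue
```

The joined function names give you a compact string to search for when trying to match the crash against known issues.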

At the end of the day, Cisco will use this information to look for hardware faults (memory, CPU, motherboard). You also want to make sure your drivers (FNIC / ENIC) on the operating system are always up to date. You can find these driver versions by navigating the tool at the link below.

https://ucshcltool.cloudapps.cisco.com/public/#

Finding the root cause of a PSOD will require log bundles. If you do open a Cisco TAC case let me know and I will be happy to assist you.

If this post helped you please mark it as correct so other members are able to reference the information given here.


Walter Dey · ‎11-25-2017

Questions

- is this a standalone or UCS managed server

- which UCS version

- which enic/fnic driver version on ESXi ?

- is it the only server crashing, or do you have others as well ?

- does the crash happen after some time, or immediately after reboot

- did you do a recent firmware upgrade ?

Qiese Dides · ‎11-27-2017