
P_CATERR_N Processor Error

zink000011
Level 1

Hello,

 

We had a Server Shutdown on UCS C220.

In the IMC Console we see this

 

[F0174][critical][equipment-inoperable][sys/rack-unit-1/board] P_CATERR_N: A catastrophic fault has occurred on one of the processors: Please check the processors' status.

P_CATERR_N: Processor sensor, Predictive Failure asserted

 

I searched the web for P_CATERR_N, but could not find anything.

 

What is wrong? Any ideas? Thanks a lot.

 

Armin

9 Replies

Ashok Kumar
Cisco Employee

Hi,

Well, this is not a good sign. I have seen this error resolved by either a firmware upgrade or an RMA. Please open a TAC case for it.

 


- Ashok


 

 

 

Keny Perez
Level 8

P_CATERR_N means a Processor Catastrophic Error on your server. Sometimes these errors show up during server POST and then go away the next second, so the best advice is to open a TAC case and check whether the time of your crash matches the time of the CATERR error in the logs, so we can tell you whether that is the real cause of the reboot/shutdown.

 

-Kenny
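
The timestamp comparison suggested above could be sketched roughly like this. It is only an illustration, not a Cisco tool: it assumes the event log has been exported as text lines of the form `YYYY-MM-DD HH:MM:SS Severity Message`, and the names `faults_near`, `FAULT_NAMES`, the sample crash time, and the 60-second window are all hypothetical choices.

```python
from datetime import datetime, timedelta

# Substrings treated as processor fault events (an assumption for this sketch).
FAULT_NAMES = ("CATERR", "MCERR", "IERR")


def parse_line(line):
    """Split 'YYYY-MM-DD HH:MM:SS Severity Message' into (datetime, severity, message)."""
    date_part, time_part, severity, message = line.strip().split(" ", 3)
    stamp = datetime.strptime(f"{date_part} {time_part}", "%Y-%m-%d %H:%M:%S")
    return stamp, severity, message


def faults_near(lines, crash_time, window_seconds=60):
    """Return (timestamp, message) pairs for fault events within the window around crash_time."""
    window = timedelta(seconds=window_seconds)
    hits = []
    for line in lines:
        try:
            stamp, _severity, message = parse_line(line)
        except ValueError:
            continue  # skip blank lines and lines that do not match the assumed format
        is_fault = any(name in message for name in FAULT_NAMES)
        if is_fault and abs(stamp - crash_time) <= window:
            hits.append((stamp, message))
    return hits


# Example log lines in the assumed format; the crash time is made up.
SAMPLE_LOG = [
    "2022-07-14 14:42:18 Warning P_CATERR: Processor sensor, Predictive Failure asserted was asserted",
    "2022-07-14 14:42:19 Warning MCERR: Processor sensor, Predictive Failure asserted was asserted",
    "2022-07-14 14:42:23 Warning IERR: Processor sensor, Predictive Failure asserted was asserted",
]
crash = datetime(2022, 7, 14, 14, 42, 30)
for stamp, message in faults_near(SAMPLE_LOG, crash):
    print(stamp, message)
```

If no fault events land inside the window, the CATERR entry is more likely one of the transient POST-time messages than the cause of the shutdown, though only TAC can confirm that from the full logs.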

Anyone ever have success working around/through this? Running Cisco ESXi 6.5 on our UCSC-C240-M3S with BIOS Version: C240M3.3.0.3a.0 (Build Date: 03/15/17)

This is almost certainly caused by attempting to pass through a single PCI device -- an nVidia Quadro 2000 (we have three equipped, just trying to get a single GPU). Thank you kindly for your attention to our little matter.

CIMC shows an EQUIPMENT_INOPERABLE fault, [0174][critical][equipment-inoperable][sys/rack-unit-1/board] P_CATERR_N: A catastrophic fault has occurred on one of the processors: Please check the processors' status... which is then immediately resolved upon power cycling the machine. Unfortunately, it takes everything down with it. The single node is brought to its knees.

I have attempted to refer to this documentation:  https://www.cisco.com/c/en/us/td/docs/unified_computing/ucs/c/sw/fault/reference/guide/Cisco_UCS_C-Series_Servers_CIMC_Faults/CIMC_Faults.html

#1 mage.

Definitely related to PCI Passthrough on Vmware-ESXi-6.5d.0-5310538-Custom-Cisco-6.5.0.3 (CISCO) -- still trying to work around/through this for our VDI testing.
#1 mage.

@peeat I'm facing the same issue. Were you able to solve yours?

No sir, @Ali Amir, I was never able to successfully work through this. I was able to get a Windows 10 VM to briefly support the same/similar setup (a single Quadro 2000 -- my Cisco rack server has three of these installed in it, perfect for VDI deployment) in a custom HPE 6.7 ESXi build, but that only lasted until I rebooted the VM. Now the entire host hangs and shows other odd behaviors. If I wait a very long time, the machine finally comes up and I can access it remotely, but if you were to stand there locally and watch the screen, the progress bar gets stuck and never appears to finish loading.

Trying to update firmware on my C240-M3S to the latest version, 3.0(4j), and hopefully try again with the Cisco Custom Image for ESXi 6.7 U1 GA. Will keep you posted if I make any progress, sir. Please do the same if you have found a resolution. Hate having three GPUs in this machine taking up space and wasting electricity with no ability to properly utilize them. I know they are working fine, as I booted the machine into Windows and was able to apply drivers and test all three cards independently, so it's definitely an issue with either VMware's product or, more likely, my configuration.

 

EDIT:  I can tell you that simply rebooting the host resolves the "catastrophic" failure, but boy that certainly scared me the first time I saw it come up and wasn't sure what I had done, if I had truly messed up.  Thankfully it was just related to the PCI Passthrough.  Will be looking to invest some time in the coming days/weeks to hopefully give it another go and see if we can work around the issue.

There was this thread that I attempted to refer to, it seemed to have significantly more information than anyone around here was able to provide:  https://forums.servethehome.com/index.php?threads/troubleshooting-gpu-passthrough-esxi-6-5.12631/

#1 mage.

This appeared on one of our servers; however, we do not use GPU cards in these systems. We are running the latest firmware (ucs-c240m5-huu-4.0.2f) and the ESXi 6.5 U2 Custom ISO for Cisco (VMware-ESXi-6.5.0-9298722-Custom-Cisco-6.5.2.2). Before opening a TAC case I wanted to see if anyone came to a resolution.

@brian-henry my apologies, I missed your reply. Sadly, I wasn't ever able to resolve the issue, despite being able to replicate it on demand -- I got sick of crashing the entire host, so I just gave up and stopped messing with them, and the machine has continued to operate wonderfully (as long as I'm not tinkering with PCI Passthrough). I've heard others talk about temperature sensors going haywire and causing random issues, but I'm hardly an expert on such things.

Hopefully in the months since you've been able to resolve your issue(s).

#1 mage.

SulemanKhalil
Level 1

I am facing the same issue on HX 220C M5SX Firmware Version: 4.1(1d)

 

 
2022-07-14 14:42:18 Warning P_CATERR: Processor sensor, Predictive Failure asserted was asserted
 
Along with this error, I am also getting the other warning messages below.
 
2022-07-14 14:42:19 Warning MCERR: Processor sensor, Predictive Failure asserted was asserted
2022-07-14 14:42:23 Warning IERR: Processor sensor, Predictive Failure asserted was asserted
2022-07-14 14:42:23 Warning System Software event: Node manager sensor sensor, Record type: 02, Sensor Number: 17, Event Data: A0 0D 03, Raw SEL: 2c 60 04 DC 17 75 A0 0D 03 was asserted.
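
For anyone trying to read the `Raw SEL` bytes in that last line, here is a rough sketch of how they could be split up. It assumes the nine bytes are the tail of a standard IPMI SEL event record (generator ID, event message revision, sensor type, sensor number, event direction/type, three event data bytes); that layout matches the `Sensor Number: 17` and `Event Data: A0 0D 03` fields printed in the same line, but it is an assumption, not something confirmed by Cisco. The function name `decode_raw_sel` is made up for illustration.

```python
def decode_raw_sel(raw):
    """Split a 9-byte 'Raw SEL' hex string into its assumed IPMI event fields."""
    data = bytes.fromhex(raw)  # fromhex() skips the spaces between byte pairs
    if len(data) != 9:
        raise ValueError(f"expected 9 bytes, got {len(data)}")
    return {
        "generator_id": data[0] | (data[1] << 8),  # two bytes, low byte first
        "evm_rev": data[2],                        # event message format revision
        "sensor_type": data[3],
        "sensor_number": data[4],
        "asserted": (data[5] & 0x80) == 0,         # top bit clear means assertion
        "event_type": data[5] & 0x7F,              # lower seven bits
        "event_data": " ".join(f"{b:02X}" for b in data[6:9]),
    }


fields = decode_raw_sel("2c 60 04 DC 17 75 A0 0D 03")
for name, value in fields.items():
    shown = f"0x{value:02X}" if isinstance(value, int) and not isinstance(value, bool) else value
    print(f"{name}: {shown}")
```

Under the IPMI specification, sensor type 0xDC and event type 0x75 both fall in the OEM-defined ranges, so the meaning of the event data bytes depends on vendor documentation; this sketch only separates the fields and does not interpret them.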