
UCS C220 M3 spontaneous shutdowns without errors or warnings

Fallstop
Level 1

Hello,

I have an old Cisco UCS C220 M3 server that was stable for a couple of weeks and has since become increasingly unstable, to the point that it no longer survives 10 minutes booted into an OS. There is no pattern to diagnose: just a very sudden power-off at no particular point, even during POST, and a suspicious lack of errors. The status shows as fully healthy with every component listed, and absolutely nothing is logged except for one "normal" message being spammed in the system event log (SEL). This is pretty concerning, because CIMC should log something on shutdown no matter what.

The "normal" message being spammed: "MAIN_POWER_PRS: Presence sensor, Device Removed / Device Absent was asserted"

Here is an idea of how much spam there is:

Screenshot 2021-03-23 174604.png
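
For anyone who wants to check whether the spam is burying other entries, here is a minimal Python sketch. It assumes the SEL has been exported as plain text with one entry per line; the function name `split_sel_noise` and the `EXAMPLE_SENSOR` sample line are hypothetical, and only the MAIN_POWER_PRS text comes from the message quoted above.

```python
def split_sel_noise(lines, noise_marker="MAIN_POWER_PRS"):
    """Separate repeated 'noise' entries from everything else in a SEL text export.

    Assumption: one SEL entry per line; the exact export layout is not known.
    """
    noise_count = 0
    remaining = []
    for line in lines:
        text = line.strip()
        if not text:
            continue  # ignore blank lines
        if noise_marker in text:
            noise_count += 1
        else:
            remaining.append(text)
    return noise_count, remaining


# Hypothetical sample data, not taken from the real log.
sample = [
    "MAIN_POWER_PRS: Presence sensor, Device Removed / Device Absent was asserted",
    "MAIN_POWER_PRS: Presence sensor, Device Removed / Device Absent was asserted",
    "",
    "EXAMPLE_SENSOR: placeholder entry that is not part of the spam",
    "MAIN_POWER_PRS: Presence sensor, Device Removed / Device Absent was asserted",
]

noise_count, remaining = split_sel_noise(sample)
print(f"{noise_count} spam entries, {len(remaining)} other entries")
for entry in remaining:
    print(entry)
```

If `remaining` comes back empty on the real export, that would confirm the log holds nothing but the presence-sensor message.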

Inventory:

https://pastebin.com/vXTuPkk3

BIOS and CIMC are updated to the latest 3.x release, and the other devices are also updated to their respective versions.

BTW, I'm attempting to run such an old system because I picked up this server plus a parts server for 200 USD.

Thanks in advance.

6 Replies

Kirk J
Cisco Employee

Normally, a TAC case would look at the support bundle and at other logs in addition to the SEL for evidence of power issues, thermal shutdown, or an OS-triggered shutdown/crash.

You might want to try reseating components: DIMMs, CPUs, PCIe cards, PSUs, etc.

Some of the CIMCs (I don't have an M3 to check) can record both the startup/reboot sequence and a 'Crash recording', which may capture console output/errors as the server goes down. On M5s, this is under CIMC's 'Compute/troubleshooting' area.

Kirk...

I have looked at the console logs, and no errors are logged on the OS side. I have watched it crash/power off multiple times on the KVM, and it just flashes to the No Signal screen.

I've only reseated the power supply so far; I will also reseat the RAM and test each CPU individually.

Reseating and testing all the RAM and CPUs did not help, and a third power supply showed the same problem. That most likely leaves the motherboard, which is the only part I don't have a working spare of. Also, the problem is rapidly getting worse: most of the time it no longer gets past the POST screen.

Did you ever solve your problem? I'm facing a similar issue with a C220 M5 that randomly shuts off, with no significant messages in the logs.

If you're experiencing this issue on a C220 M5 server, I recommend opening a TAC case, as the M5s are still within support.

Fallstop
Level 1

Nope

I now just have the server's parts split across some others I picked up; I never managed to source a replacement motherboard to test.
