UCS C250 - First Memory Voltage problems. Now Power Supply failures

dsachsmusic · ‎04-22-2014

I have a problem with my Cisco UCS C250 M1 Rack Server and I am not sure of the cause. I suspect it may be getting dirty power, so I may need a power conditioner. But that is just me trying to correlate the problem with what I have, which is the power specs: http://www.cisco.com/c/en/us/td/docs/unified_computing/ucs/c/hw/C250M1/install/c250M1/techspec.html

The power supply seems to be the problem. The symptoms were originally unreliable voltage in RAM and power supply faults and loss, according to the system event log (SEL) in the CIMC.

I upgraded the firmware of the machine using the Cisco HUU (which updates all firmware). Now the RAM no longer seems to be a problem, but the SEL and LEDs are still reporting power supply faults and loss.

The System Fault and PSU LEDs on the front of the machine show solid amber after a PSU fault/loss event appears in the SEL, and they stay lit until restart.

On the back panel, there is a solid amber Heat Fault LED.

To check whether the fault and loss indications were valid, I tried running the server on one power supply or the other. In each case, the server restarts intermittently, at the times when the SEL reports a power supply fault/loss.

Unfortunately, support for the power supply cannot be purchased: http://www.cisco.com/c/en/us/products/collateral/servers-unified-computing/ucs-c-series-rack-servers/end_of_life_c51-649352.html

The server was unused from 2011 to 2014. It had power running into it but was not serving any role in the company. When I started it up, the memory faults were the first problems I found. Actually, my colleague found the problem: his report was that the memory fault LED was on and that certain DIMMs had their LEDs lit, so he removed those DIMMs. He later put them back, though, because I told him memory needed to be populated in sets of 4 DIMMs in this server.

At that point, the SEL was full – but I took screen shots of the critical entries. Here are some of them:
FRU_RAM P1V1_IOH: Voltage sensor for FRU_RAM, non-recoverable event, Lower Non-Recoverable going low (0.000 < 1.000 V) was asserted
FRU_RAM P1V5_IOH: Voltage sensor for FRU_RAM, non-recoverable event, Lower Non-Recoverable going low (0.000 < 1.344 V) was asserted
FRU_RAM P1V1_IOH: Voltage sensor for FRU_RAM, non-recoverable event, Lower Non-Recoverable going low (0.000 < 1.000 V) was asserted
FRU_RAM P1V5_IOH: Voltage sensor for FRU_RAM, non-recoverable event, Lower Non-Recoverable going low (0.000 < 1.344 V) was asserted
FRU_RAM P1V1_IOH: Voltage sensor for FRU_RAM, non-recoverable event, Lower Non-Recoverable going low (0.000 < 1.000 V) was asserted
FRU_RAM P1V2_NIC_STBY: Voltage sensor for FRU_RAM, non-recoverable event, Lower Non-Recoverable going low (0.000 < 1.064 V) was asserted
FRU_RAM P1V1_IOH: Voltage sensor for FRU_RAM, non-recoverable event, Lower Non-Recoverable going low (0.000 < 1.000 V) was asserted
FRU_RAM P1V5_IOH: Voltage sensor for FRU_RAM, non-recoverable event, Lower Non-Recoverable going low (0.000 < 1.344 V) was asserted
FRU_RAM P1V5_IOH: Voltage sensor for FRU_RAM, non-recoverable event, Lower Non-Recoverable going low (0.000 < 1.344 V) was asserted
FRU_RAM P1V1_IOH: Voltage sensor for FRU_RAM, non-recoverable event, Lower Non-Recoverable going low (0.000 < 1.000 V) was asserted
FRU_RAM POWER_USAGE: Power Supply sensor for FRU_RAM, non-recoverable event, Upper Non-Recoverable going high (954 > 864 Watts) was asserted
FRU_RAM POWER_USAGE: Power Supply sensor for FRU_RAM, non-recoverable event, Upper Non-Recoverable going high (954 > 864 Watts) was asserted
FRU_RAM POWER_USAGE: Power Supply sensor for FRU_RAM, non-recoverable event, Upper Non-Recoverable going high (1020 > 864 Watts) was asserted
FRU_RAM POWER_USAGE: Power Supply sensor for FRU_RAM, non-recoverable event, Upper Non-Recoverable going high (1020 > 864 Watts) was asserted
FRU_RAM POWER_USAGE: Power Supply sensor for FRU_RAM, non-recoverable event, Upper Non-Recoverable going high (1020 > 864 Watts) was asserted
FRU_RAM POWER_USAGE: Power Supply sensor for FRU_RAM, non-recoverable event, Upper Non-Recoverable going high (1020 > 864 Watts) was asserted
FRU_RAM POWER_USAGE: Power Supply sensor for FRU_RAM, non-recoverable event, Upper Non-Recoverable going high (954 > 864 Watts) was asserted
FRU_RAM POWER_USAGE: Power Supply sensor for FRU_RAM, non-recoverable event, Upper Non-Recoverable going high (1020 > 864 Watts) was asserted
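
For anyone sifting through a similar SEL dump, here is a minimal Python sketch (not a Cisco tool) that pulls the reading and threshold out of threshold-style entries like the ones above and tallies them per sensor. The regular expression and the names `EVENT_RE` and `parse_threshold_events` are assumptions based only on the line format pasted in this post, not on any documented CIMC format.

```python
import re
from collections import Counter

# Pattern assumed from the SEL lines pasted in this post; not an official format spec.
EVENT_RE = re.compile(
    r"FRU_RAM (?P<sensor>\S+): .*?"
    r"\((?P<reading>[\d.]+) (?P<op>[<>]) (?P<limit>[\d.]+) (?P<unit>\w+)\)"
)

def parse_threshold_events(text):
    """Return (sensor, reading, op, limit, unit) for each threshold event found."""
    return [
        (m["sensor"], float(m["reading"]), m["op"], float(m["limit"]), m["unit"])
        for m in EVENT_RE.finditer(text)
    ]

# Three lines copied from the post; the last one holds two events run together.
sample = "\n".join([
    "FRU_RAM P1V1_IOH: Voltage sensor for FRU_RAM, non-recoverable event, "
    "Lower Non-Recoverable going low (0.000 < 1.000 V) was asserted",
    "FRU_RAM P1V2_NIC_STBY: Voltage sensor for FRU_RAM, non-recoverable event, "
    "Lower Non-Recoverable going low (0.000 < 1.064 V) was asserted",
    "FRU_RAM POWER_USAGE: Power Supply sensor for FRU_RAM, non-recoverable event, "
    "Upper Non-Recoverable going high (954 > 864 Watts) was asserted "
    "FRU_RAM POWER_USAGE: Power Supply sensor for FRU_RAM, non-recoverable event, "
    "Upper Non-Recoverable going high (1020 > 864 Watts) was asserted",
])

events = parse_threshold_events(sample)
counts = Counter(sensor for sensor, *_ in events)

for sensor, reading, op, limit, unit in events:
    print(f"{sensor}: read {reading} {unit}, threshold {op} {limit} {unit}")
print(dict(counts))
```

Because the pattern is applied with `finditer`, it also picks up two events that were pasted onto one line.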

I then went ahead and updated the firmware on the server, using Cisco HUU (Host Update Utility). After doing this and clearing the SEL, the memory fault LED no longer lights up, nor do I see RAM events in the SEL. But the power supply problems are still present, as indicated in the SEL and by the power supply fault LED on the chassis. Also, as I mentioned, there is a heat fault LED (though there are no Heat Fault events in the SEL, unless they are there and I just don't know how to recognize them).

Here are some of the recurring SEL events (they occur along with the server restarting when only one power supply is plugged in):
PS_RDNDNT_MODE: PS Redundancy sensor, Redendancy Degraded was asserted
PS_RDNDNT_MODE: PS Redundancy sensor, Redundancy Lost was asserted
FRU_PSU- PSU0_STATUS: Power Supply sensor for FRU_PSU0, Power Supply input lost (AC/DC) was asserted
FRU_PSU1 PSU_STATUS: PowerSupply sensor for FRU_PSU1, Power Supply Failure detected was asserted
PS_RDNDNT_MODE: PS Redundancy sensor, Redendancy Degraded was DEasserted
PS_RDNDNT_MODE: PS Redundancy sensor, Redundancy Lost was DEasserted
FRU_PSU- PSU0_STATUS: Power Supply sensor for FRU_PSU0, Power Supply input lost (AC/DC) was DEasserted
FRU_PSU1 PSU_STATUS: PowerSupply sensor for FRU_PSU1, Power Supply Failure detected was DEasserted

I did get one other interesting alert in the SEL after running HUU (updating the firmware):
IRQ_P2_RDIM_EVNT: Processor sensor, Limit Exceeded was asserted.
But that was a one-time event.

Finally, I noticed there is an amber LED on the mobo near header P35.

I hope somebody can help me, maybe someone who has had a similar problem. I am not finding an answer in the Cisco manuals.

Server Profile
Dual 750 Watt power supply
96 GB RAM
2 CPUs (Xeon 5550)
No Cisco VIC
RAID with LSI SAS3081E-R card
CIMC FIRMWARE is at 1.4(3)
Power Supply Firmware – Z1.00.20

One more thing –
I eyeballed the logged events in the CIMC and pulled a couple – mainly based on the keywords “power” and “power supply”:
Notice-BMC:pmbus_pwrsply_mngr:567- pmbus_pwrsply_mngr.c:970:Power Supply 0x1, version Z1.00.20, Successfully changed PWM to 0x1E00. Old value=
Notice-BMC:bioscom:--lv_mode_dimm_support.c:217:[transition_function]Transition to [High Voltage Mode] Success
Notice-BMC:pmbus_pwrsply_mngr:567- pmbus_pwrsply_mngr.c:1258:Power Supply 0x1 Presence Changed to present.
Warning-BMC:kernel:--<4>[platform_power_state_irq_handler]:19:Platform is LA: Asserted
Debug-BMC:bioscom-pwrcap_set_pwr_ctrl.c:71:[pwrcap_read_config_file]Power Cap configuration (read) - PW:0 CW:0 NCA:3 Enable:0
Notice-BMC:kernel:--<5>[pilot2_power_init]:814:INIT Blade Power State is ... [ OFF ]
-Warning-BMC:kernel:--<4>Memory policy: ECC disabled, Data cache writeback
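
Rather than eyeballing the log, the same keyword search could be scripted. This is only a rough sketch: the keyword list and the function name `filter_log_lines` are my own choices for illustration, and the sample lines are copied from the entries above.

```python
# Illustrative keyword list; adjust to whatever you are hunting for.
KEYWORDS = ("power", "pwrsply", "psu")

def filter_log_lines(lines, keywords=KEYWORDS):
    """Keep the lines that contain any keyword, ignoring case."""
    lowered = tuple(k.lower() for k in keywords)
    return [line for line in lines if any(k in line.lower() for k in lowered)]

bmc_lines = [
    "Notice-BMC:pmbus_pwrsply_mngr:567- pmbus_pwrsply_mngr.c:970:Power Supply 0x1, "
    "version Z1.00.20, Successfully changed PWM to 0x1E00. Old value=",
    "Notice-BMC:bioscom:--lv_mode_dimm_support.c:217:[transition_function]"
    "Transition to [High Voltage Mode] Success",
    "Warning-BMC:kernel:--<4>[platform_power_state_irq_handler]:19:Platform is LA: Asserted",
    "-Warning-BMC:kernel:--<4>Memory policy: ECC disabled, Data cache writeback",
]

hits = filter_log_lines(bmc_lines)
for line in hits:
    print(line)
```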

One last thing – the manual does recommend pulling the power supply out and pushing it back in as a troubleshooting step. I have not done that yet. If I do that and the events/LEDs stop appearing, can I assume that the problem was a fluke? Have other people solved problems like this by just reinserting the PSU?

Thanks again.

Keny Perez · ‎04-22-2014

Hello,

*The P1VX_IOH sensors point to a power problem on the motherboard, not in the PSUs.

*IRQ_P2_RDIM_EVNT: Processor sensor, Limit Exceeded was asserted. << This refers to a processor thermal event that has already cleared.

In the M1/M2 versions, I remember one needed to pull both of the PSU cords for about 10-15 seconds to complete the upgrade. You may want to give that a try too, to make sure the upgrade is 100% completed. If the behavior is still present after this, you may want to check whether the motherboard is MAYBE still under warranty somehow.

Please rate ALL helpful answers and mark the question as correct if it solves your problem.

-Kenny

dsachsmusic · ‎04-22-2014

I am going to try that right now - re-seating the power supplies.

I was about to change the power sources to UPSs with better filtering than the UPSs the PSUs are connected to right now, but that can wait. I had only just discovered these higher end UPSs are available for me - and I will use them as a best practice however this trial goes.

dsachsmusic · ‎04-22-2014

You said cords, not PSUs, I realize. Still, I believe I have had both cords unplugged for 15 seconds since the firmware upgrade, so I will try re-seating the PSUs as I said.

You don't think I need to run the HUU again and then pull the cords in sequence, do you?

dsachsmusic · ‎04-22-2014

I just re-seated the power supplies and fired up the server, and for the first time in a long time the CIMC console does not indicate a fault upon startup!

If we get a day or two without the PSU failure problem I think this will be problem solved.

Keny Perez · ‎04-22-2014

I am glad to read that, keep me posted...

-Kenny

dsachsmusic · ‎04-23-2014

Unfortunately, I arrived this morning to find a couple of "FRU_PSU1 PSU_STATUS: PowerSupply sensor for FRU_PSU1, Power Supply Failure detected was asserted" / "DEasserted" events in the log.

Also, that amber LED at header P35 never went away.

Keny Perez · ‎04-23-2014

Can you paste a few lines of the SEL log here? Sometimes those asserted> deasserted> asserted messages are tricky :)

-Kenny

dsachsmusic · ‎04-23-2014

This brings us to the present (i.e. no SEL events since 3:47:29 today; note the server is on West Coast time, though I am on EST, so 6:47:29 my time).

2014-04-23 03:47:29 Informational LED_HLTH_STATUS: Platform sensor, GREEN was asserted

2014-04-23 03:47:29 Informational LED_PSU_STATUS: Platform sensor, GREEN was asserted

2014-04-23 03:47:29 Normal FRU_PSU1 PSU1_STATUS: Power Supply sensor for FRU_PSU1, Power Supply Failure detected was deasserted

2014-04-23 03:47:29 Normal PS_RDNDNT_MODE: PS Redundancy sensor, Fully Redundant was asserted

2014-04-23 02:51:46 Critical PS_RDNDNT_MODE: PS Redundancy sensor, Redundancy Lost was asserted

2014-04-23 02:51:40 Informational LED_HLTH_STATUS: Platform sensor, AMBER was asserted

2014-04-23 02:51:39 Informational LED_PSU_STATUS: Platform sensor, AMBER was asserted

2014-04-23 02:51:39 Critical FRU_PSU1 PSU1_STATUS: Power Supply sensor for FRU_PSU1, Power Supply Failure detected was asserted

2014-04-22 17:33:06 Informational LED_HLTH_STATUS: Platform sensor, GREEN was asserted

2014-04-22 17:33:06 Informational LED_PSU_STATUS: Platform sensor, GREEN was asserted

2014-04-22 17:33:06 Normal FRU_PSU1 PSU1_STATUS: Power Supply sensor for FRU_PSU1, Power Supply Failure detected was deasserted

2014-04-22 17:33:06 Normal PS_RDNDNT_MODE: PS Redundancy sensor, Fully Redundant was asserted

2014-04-22 16:43:36 Informational LED_HLTH_STATUS: Platform sensor, AMBER was asserted

2014-04-22 16:43:36 Informational LED_PSU_STATUS: Platform sensor, AMBER was asserted

2014-04-22 16:43:35 Critical FRU_PSU1 PSU1_STATUS: Power Supply sensor for FRU_PSU1, Power Supply Failure detected was asserted

2014-04-22 16:43:35 Critical PS_RDNDNT_MODE: PS Redundancy sensor, Redundancy Lost was asserted

2014-04-22 11:01:42 Normal OEM event data record, Record type: DC, Sensor Number: DD, Event Data: 56 53 00, Raw SEL: 37 01 00 00 DD 4B 56 53 00

2014-04-22 11:01:42 Normal System Software event: OS Event sensor, C: boot completed was asserted

2014-04-22 11:00:30 Normal BIOS_POST_CMPLT: Presence sensor, Device Inserted / Device Present was asserted-

2014-04-22 11:00:25 Normal System Software event: System Event sensor, OEM System Boot Event was asserted
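
To see how long each PSU1 failure lasted, the asserted/deasserted pairs can be matched up by timestamp. The following is a minimal sketch that assumes the line layout pasted above (timestamp in the first 19 characters, newest entry first); `psu_fault_windows` is a hypothetical helper name, not part of any Cisco utility.

```python
from datetime import datetime

def psu_fault_windows(sel_lines):
    """Pair each 'Power Supply Failure detected was asserted' with the next 'deasserted'."""
    events = []
    for raw in sel_lines:
        line = raw.strip()
        if "Power Supply Failure detected" not in line:
            continue
        stamp = datetime.strptime(line[:19], "%Y-%m-%d %H:%M:%S")
        cleared = line.lower().endswith("deasserted")
        events.append((stamp, cleared))
    events.sort()  # the SEL was pasted newest-first, so put it in time order

    windows, start = [], None
    for stamp, cleared in events:
        if not cleared and start is None:
            start = stamp                      # fault begins
        elif cleared and start is not None:
            windows.append((start, stamp, stamp - start))
            start = None                       # fault cleared
    return windows

sel_lines = [
    "2014-04-23 03:47:29 Normal FRU_PSU1 PSU1_STATUS: Power Supply sensor for FRU_PSU1, Power Supply Failure detected was deasserted",
    "2014-04-23 02:51:39 Critical FRU_PSU1 PSU1_STATUS: Power Supply sensor for FRU_PSU1, Power Supply Failure detected was asserted",
    "2014-04-22 17:33:06 Normal FRU_PSU1 PSU1_STATUS: Power Supply sensor for FRU_PSU1, Power Supply Failure detected was deasserted",
    "2014-04-22 16:43:35 Critical FRU_PSU1 PSU1_STATUS: Power Supply sensor for FRU_PSU1, Power Supply Failure detected was asserted",
]

windows = psu_fault_windows(sel_lines)
for start, end, duration in windows:
    print(f"{start} -> {end}  ({duration})")
```

On the four PSU1 lines above this yields two fault windows, of 49 min 31 s and 55 min 50 s.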

Keny Perez · ‎04-23-2014

OK, are there no more SEL entries because nothing has happened, or because the log is full? If it is full, just back up the info (if you want) and clear it so it can continue logging.

Question... have you also tried other PSU cords? If yes, I still think there is a motherboard issue.

-Kenny

dsachsmusic · ‎04-24-2014

Hi Kenny,

The logs are not full - that was just the logs from when I restarted to the present.

I did replace one cord because it was rated for 10A and I saw in the manual that 13A power cords are required. I did not try swapping out the other. I could try to find two much heavier gauge cords and try using those.

Also, I still need to try the UPS with better filtering.

I guess if neither of those works, then given the amber LED at header P35, we will assume it is a motherboard issue.

Keny Perez · ‎04-24-2014

It would have been worth it to have a Failure Analysis done for this server (motherboard and PSU), but as you mentioned the PSUs are now EOL and there is no more EFA for them; the same has applied to the motherboard since March 3, 2012 (http://www.cisco.com/c/en/us/products/collateral/servers-unified-computing/ucs-c-series-rack-servers/end_of_life_c51_634854.html).

Try to see whether you at least still have warranty coverage to request hardware troubleshooting and whether an RMA can solve the situation, though I highly doubt it will be under warranty.

I hope I helped somehow :)

-Kenny

dsachsmusic · ‎04-24-2014

So somebody on communities.cisco.com has told me the problem I am having is a known issue, and that I can resolve it by replacing the EOL power supplies.

https://communities.cisco.com/message/151493#151493

(https://tools.cisco.com/bugsearch/bug/CSCtj65645)

Keny Perez · ‎04-22-2014

I don't think so... it actually used to say that you needed a whole AC power cycle to complete the upgrade. Did you ever see that message?

Otherwise, try what you mentioned about the UPSs and see... the logs still say that the issue is a power problem in the motherboard, though.

-Kenny

dsachsmusic · ‎04-23-2014

I do not remember the HUU instructing me to remove the power cords, and I also read the instruction manual, so yes, re-running the HUU is probably not necessary. I will try moving to a better UPS.