dsachsmusic
Beginner

UCS C250 - First Memory Voltage problems. Now Power Supply failures

I have a problem with my Cisco UCS C250 M1 Rack Server, and I am not sure of the cause. I suspect the server may be getting dirty power and that I may need a power conditioner, but that is just me trying to correlate the problem with the one thing I have to go on, which is the power specs: http://www.cisco.com/c/en/us/td/docs/unified_computing/ucs/c/hw/C250M1/install/c250M1/techspec.html

The power supply seems to be the problem. The original symptoms were unreliable RAM voltages and power supply fault and loss events, according to the system event log (SEL) in the CIMC.

I upgraded the firmware of the machine using the Cisco HUU (which updates all firmware). Now the RAM no longer seems to be a problem, but the SEL and LEDs are still reporting power supply faults and loss.

The System Fault and PSU LEDs on the front of the machine show solid amber after a PSU fault/loss event appears in the SEL, and these LEDs stay lit until restart.

On the back panel, there is a solid amber Heat Fault LED.

To check whether the reported faults and losses were valid, I tried running the server on one power supply at a time. In each case, the server restarts intermittently, at the same times that the SEL reports a power supply fault/loss.

Support for the power supply cannot be purchased, unfortunately: http://www.cisco.com/c/en/us/products/collateral/servers-unified-computing/ucs-c-series-rack-servers/end_of_life_c51-649352.html

The server sat unused from 2011 to 2014. It had power running into it but was not serving any role in the company. When I started it up, the memory faults were the first problems I found. Actually, my colleague found the problem: his report was that the memory fault LED was on and that certain DIMMs had their LEDs lit up, so he removed those DIMMs. He later put them back, though, because I told him memory needs to be populated in sets of 4 DIMMs in this server.

At that point, the SEL was full – but I took screen shots of the critical entries.  Here are some of them:
FRU_RAM P1V1_IOH: Voltage sensor for FRU_RAM, non-recoverable event, Lower Non-Recoverable going low (0.000 < 1.000 V) was asserted
FRU_RAM P1V5_IOH: Voltage sensor for FRU_RAM, non-recoverable event, Lower Non-Recoverable going low (0.000 < 1.344 V) was asserted
FRU_RAM P1V1_IOH: Voltage sensor for FRU_RAM, non-recoverable event, Lower Non-Recoverable going low (0.000 < 1.000 V) was asserted
FRU_RAM P1V5_IOH: Voltage sensor for FRU_RAM, non-recoverable event, Lower Non-Recoverable going low (0.000 < 1.344 V) was asserted
FRU_RAM P1V1_IOH: Voltage sensor for FRU_RAM, non-recoverable event, Lower Non-Recoverable going low (0.000 < 1.000 V) was asserted
FRU_RAM P1V2_NIC_STBY: Voltage sensor for FRU_RAM, non-recoverable event, Lower Non-Recoverable going low (0.000 < 1.064 V) was asserted
FRU_RAM P1V1_IOH: Voltage sensor for FRU_RAM, non-recoverable event, Lower Non-Recoverable going low (0.000 < 1.000 V) was asserted
FRU_RAM P1V5_IOH: Voltage sensor for FRU_RAM, non-recoverable event, Lower Non-Recoverable going low (0.000 < 1.344 V) was asserted
FRU_RAM P1V5_IOH: Voltage sensor for FRU_RAM, non-recoverable event, Lower Non-Recoverable going low (0.000 < 1.344 V) was asserted
FRU_RAM P1V1_IOH: Voltage sensor for FRU_RAM, non-recoverable event, Lower Non-Recoverable going low (0.000 < 1.000 V) was asserted
FRU_RAM POWER_USAGE: Power Supply sensor for FRU_RAM, non-recoverable event, Upper Non-Recoverable going high (954 > 864 Watts) was asserted
FRU_RAM POWER_USAGE: Power Supply sensor for FRU_RAM, non-recoverable event, Upper Non-Recoverable going high (954 > 864 Watts) was asserted
FRU_RAM POWER_USAGE: Power Supply sensor for FRU_RAM, non-recoverable event, Upper Non-Recoverable going high (1020 > 864 Watts) was asserted
FRU_RAM POWER_USAGE: Power Supply sensor for FRU_RAM, non-recoverable event, Upper Non-Recoverable going high (1020 > 864 Watts) was asserted
FRU_RAM POWER_USAGE: Power Supply sensor for FRU_RAM, non-recoverable event, Upper Non-Recoverable going high (1020 > 864 Watts) was asserted
FRU_RAM POWER_USAGE: Power Supply sensor for FRU_RAM, non-recoverable event, Upper Non-Recoverable going high (1020 > 864 Watts) was asserted
FRU_RAM POWER_USAGE: Power Supply sensor for FRU_RAM, non-recoverable event, Upper Non-Recoverable going high (954 > 864 Watts) was asserted
FRU_RAM POWER_USAGE: Power Supply sensor for FRU_RAM, non-recoverable event, Upper Non-Recoverable going high (1020 > 864 Watts) was asserted
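For anyone wanting to tally entries like the ones above rather than read them one by one, here is a minimal Python sketch. It assumes the SEL text has been pasted into a list of strings in the same wording as quoted in this post; the regular expression and the function name `summarize_sel` are my own illustration, not anything provided by the CIMC, and a real SEL export may be formatted differently.

```python
import re
from collections import Counter

# Pattern inferred from the entries quoted in this post (an assumption):
#   "<FRU> <SENSOR>: ... (<reading> <op> <limit> <unit>) was asserted"
EVENT_RE = re.compile(
    r"(?P<fru>\S+) (?P<sensor>\S+): .*?"
    r"\((?P<reading>[\d.]+) (?P<op>[<>]) (?P<limit>[\d.]+) (?P<unit>\w+)\) was asserted"
)


def summarize_sel(lines):
    """Count threshold events per sensor and keep the most extreme reading."""
    counts = Counter()
    worst = {}
    for line in lines:
        # finditer also copes with two events pasted onto a single line.
        for m in EVENT_RE.finditer(line):
            sensor = m["sensor"]
            reading = float(m["reading"])
            counts[sensor] += 1
            if m["op"] == "<":
                # "going low": the lowest reading is the most extreme one
                worst[sensor] = min(worst.get(sensor, reading), reading)
            else:
                # "going high": the highest reading is the most extreme one
                worst[sensor] = max(worst.get(sensor, reading), reading)
    return counts, worst


sample = [
    "FRU_RAM P1V1_IOH: Voltage sensor for FRU_RAM, non-recoverable event, "
    "Lower Non-Recoverable going low (0.000 < 1.000 V) was asserted",
    "FRU_RAM POWER_USAGE: Power Supply sensor for FRU_RAM, non-recoverable event, "
    "Upper Non-Recoverable going high (954 > 864 Watts) was asserted "
    "FRU_RAM POWER_USAGE: Power Supply sensor for FRU_RAM, non-recoverable event, "
    "Upper Non-Recoverable going high (1020 > 864 Watts) was asserted",
]

counts, worst = summarize_sel(sample)
for sensor, n in counts.most_common():
    print(sensor, n, worst[sensor])
```

A summary like this makes it easier to see which sensors dominate the log and how far past the threshold the readings went.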

I then went ahead and updated the firmware on the server, using Cisco HUU (Host Update Utility). After doing this and clearing the SEL, the memory fault LED does not light up anymore, nor do I see RAM events in the SEL. But the power supply problems are still present, as indicated in the SEL and by the power supply fault LED on the chassis. Also, as I mentioned, there is a Heat Fault LED (though there are no Heat Fault events in the SEL, unless they are there and I just don't know how to recognize them).

Here are some of the recurring SEL events (they occur along with the server restarting when only one power supply is plugged in).
PS_RDNDNT_MODE: PS Redundancy sensor, Redundancy Degraded was asserted
PS_RDNDNT_MODE: PS Redundancy sensor, Redundancy Lost was asserted
FRU_PSU- PSU0_STATUS: Power Supply sensor for FRU_PSU0, Power Supply input lost (AC/DC) was asserted
FRU_PSU1 PSU_STATUS: PowerSupply sensor for FRU_PSU1, Power Supply Failure detected was asserted
PS_RDNDNT_MODE: PS Redundancy sensor, Redundancy Degraded was DEasserted
PS_RDNDNT_MODE: PS Redundancy sensor, Redundancy Lost was DEasserted
FRU_PSU- PSU0_STATUS: Power Supply sensor for FRU_PSU0, Power Supply input lost (AC/DC) was DEasserted
FRU_PSU1 PSU_STATUS: PowerSupply sensor for FRU_PSU1, Power Supply Failure detected was DEasserted
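Because every fault above is later DEasserted, it can be hard to tell from a long log whether anything is still active. The following Python sketch replays entries in order and reports which conditions were asserted but never cleared. It assumes one event per line, in chronological order, in the wording quoted in this post; `active_faults` is a hypothetical helper of mine, not a CIMC feature.

```python
import re

# Assumed line shape, taken from the entries quoted in this post:
#   "<source>: <detail> was asserted" or "<source>: <detail> was DEasserted"
STATE_RE = re.compile(
    r"^(?P<source>.+?): (?P<detail>.+) was (?P<state>DEasserted|asserted)$"
)


def active_faults(lines):
    """Return (source, detail) pairs that were asserted and never deasserted."""
    active = set()
    for line in lines:
        m = STATE_RE.match(line.strip())
        if not m:
            continue  # skip anything that does not look like an SEL state change
        key = (m["source"], m["detail"])
        if m["state"] == "asserted":
            active.add(key)
        else:
            active.discard(key)
    return sorted(active)


events = [
    "PS_RDNDNT_MODE: PS Redundancy sensor, Redundancy Lost was asserted",
    "FRU_PSU1 PSU_STATUS: PowerSupply sensor for FRU_PSU1, Power Supply Failure detected was asserted",
    "PS_RDNDNT_MODE: PS Redundancy sensor, Redundancy Lost was DEasserted",
]

for source, detail in active_faults(events):
    print("still active:", source, "-", detail)
```

An empty result would suggest the faults are intermittent and self-clearing, which matches the LEDs staying lit only because they latch until restart.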

I did get one other interesting alert in the SEL after running HUU (updating the firmware):
IRQ_P2_RDIM_EVNT: Processor sensor, Limit Exceeded was asserted.
But that was a one-time event.

Finally, I noticed there is an amber LED on the mobo near header P35.


I hope somebody can help me, perhaps someone who has had a similar problem. I am not finding an answer in the Cisco manuals.

Server Profile
Dual 750 Watt power supply
96 GB RAM
2 CPU's (Xeon 5550)
No Cisco VIC
RAID with LSI SAS3081E-R card
CIMC FIRMWARE is at 1.4(3)
Power Supply Firmware – Z1.00.20

One more thing –
I eyeballed the logged events in the CIMC and pulled a couple – mainly based on the keywords “power” and “power supply”:
Notice-BMC:pmbus_pwrsply_mngr:567- pmbus_pwrsply_mngr.c:970:Power Supply 0x1, version Z1.00.20, Successfully changed PWM to 0x1E00. Old value=
Notice-BMC:bioscom:--lv_mode_dimm_support.c:217:[transition_function]Transition to [High Voltage Mode] Success
Notice-BMC:pmbus_pwrsply_mngr:567- pmbus_pwrsply_mngr.c:1258:Power Supply 0x1 Presence Changed to present.
Warning-BMC:kernel:--<4>[platform_power_state_irq_handler]:19:Platform is LA: Asserted
Debug-BMC:bioscom-pwrcap_set_pwr_ctrl.c:71:[pwrcap_read_config_file]Power Cap configuration (read) - PW:0 CW:0 NCA:3 Enable:0
Notice-BMC:kernel:--<5>[pilot2_power_init]:814:INIT Blade Power State is ... [ OFF ]
-Warning-BMC:kernel:--<4>Memory policy: ECC disabled, Data cache writeback
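Picking entries by eye is easy to get wrong, so here is a small Python sketch of the same keyword filter. It assumes the CIMC log has been saved as plain text with one entry per line in the form shown above; the keyword list and the function name `pick_power_entries` are my own choices for illustration.

```python
import re

# Severity prefix as it appears in the entries above, e.g. "Notice-BMC:".
# Some lines carry a stray leading "-", which the pattern tolerates.
SEVERITY_RE = re.compile(r"^-?(?P<severity>[A-Za-z]+)-BMC:")


def pick_power_entries(lines, keywords=("power", "pwrsply", "pwr")):
    """Return (severity, line) pairs for entries mentioning any keyword."""
    picked = []
    for line in lines:
        lowered = line.lower()
        if any(k in lowered for k in keywords):
            m = SEVERITY_RE.match(line)
            severity = m["severity"] if m else "Unknown"
            picked.append((severity, line))
    return picked


log = [
    "Notice-BMC:pmbus_pwrsply_mngr:567- pmbus_pwrsply_mngr.c:1258:Power Supply 0x1 Presence Changed to present.",
    "Warning-BMC:kernel:--<4>[platform_power_state_irq_handler]:19:Platform is LA: Asserted",
    "-Warning-BMC:kernel:--<4>Memory policy: ECC disabled, Data cache writeback",
]

for severity, entry in pick_power_entries(log):
    print(severity, "|", entry)
```

Sorting or grouping the result by severity would then show whether the power-related entries are mostly notices or actual warnings.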

One last thing: the manual does recommend pulling the power supply out and pushing it back in as a troubleshooting step. I have not done that yet. If I do that and the events/LEDs stop appearing, can I assume that the problem was a fluke? Have other people solved problems like this by just reinserting the PSU?

Thanks again.

17 REPLIES

Note - the P1VX_IOH events have not occurred at all since I performed the firmware upgrade with HUU, or maybe since a day before that.

Ok, if that error message is gone, do you see any alert on CIMC about the PSUs? I mean an active alert now, not something in the logs.

-Kenny

I don't always catch the error when it is happening. It self-corrects. Sometimes I do see the error: I see the fault LEDs and the fault alert in the CIMC.

Right now I am getting some advice from an earlier post I made about this problem at communities.cisco.com. Apparently the PSU problem I am having is a known issue.

I'll post the link below.
