
B200 M4 - Processor 2 Running very hot with E5-2699 CPUs

wsanders
Level 1

We've got some newly-purchased B200 M4s with E5-2699 (145W) CPUs.

The blades at the end of the chassis (slots 7 and 8) are throwing temperature alerts for CPU2, which is running as hot as 80 deg C on one of the blades. At the same time, CPU1 is only maxing out at 52 or so.

It would seem that the thermal design for this chassis with this blade and CPU is deficient. Only blades in slots 7 and 8 go over the 80 deg alert threshold. CPU2 in slots 1 and 3 runs at more normal temps, max 67 deg C or so.

There is also a 15 to 20 deg difference between CPU1 and CPU2 on all our 2699-equipped blades.

Are we missing a baffle upgrade for the 145W CPUs? A heat sink upgrade? Perhaps there is a BIOS upgrade to run the fans at full speed? Has anyone else had this problem?

I have not yet pulled a blade for physical inspection - I've seen this problem before; it was when an incompetent FSE swapped front and back heat sinks on some HP servers that, like the B200s, had different size heat sinks for CPU1 (front) and CPU2 (rear). This caused a 25 degree delta between CPU temps. When the heat sinks were swapped back to their normal configuration, the CPUs ran at the same temperature.


Wes Austin
Cisco Employee

Hello,

What firmware are you running?

Are you getting any alarms or alerts in UCSM?

Have you opened a TAC case to investigate?

-Wes

Working on it - the Support Case Manager is down right now so I haven't updated the case I opened with details yet.

BIOS version is B200M4.3.1.2a.0.......

Lars-Rolf Rapp
Level 1

We have that problem too.
TAC Support said this was an issue with an old firmware, but we were already on a newer build at the time. Support said that this is not a problem:

Hi Lars,

 

Thank you for the input,

 

It looks like a transient issue, reported on firmware 2.2(5a),

https://bst.cloudapps.cisco.com/bugsearch/bug/CSCux04770

 

Your system is, however, running 3.1, but I suspect this issue is related to the above bug itself.

From the sensor readings, the temperature is below the Upper Non-Critical threshold, hence it should be OK.

 

    Server 7:

        Package-Vers: 3.1(2e)B

        Upgrade-Status: Ready

    Server 8:

        Package-Vers: 3.1(2e)B

        Upgrade-Status: Ready

 

P1_TEMP_SENS     | 51.000  | degrees C    | OK     | na      | na      | na      | 79.000  | 89.000  | 99.000  |

P2_TEMP_SENS     | 73.500  | degrees C    | OK     | na      | na      | na      | 79.000  | 89.000  | 99.000  |
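
For anyone who wants to track this over time, the two sensor rows above can be turned into a headroom figure. This is only a minimal sketch: it assumes the rows follow the usual ipmitool-style column order (name, reading, unit, status, three lower thresholds, then upper non-critical, upper critical, upper non-recoverable), which matches the 79 degree Upper Non-Critical threshold TAC refers to. The names `SensorReading`, `parse_sensor_line` and `headroom` are made up for illustration and are not part of any Cisco tool.

```python
from typing import NamedTuple, Optional


class SensorReading(NamedTuple):
    name: str
    value: float
    unit: str
    status: str
    upper_non_critical: Optional[float]
    upper_critical: Optional[float]
    upper_non_recoverable: Optional[float]


def _num(field: str) -> Optional[float]:
    """Convert a threshold field to float; 'na' or empty becomes None."""
    field = field.strip()
    return None if field in ("", "na") else float(field)


def parse_sensor_line(line: str) -> SensorReading:
    """Parse one pipe-delimited sensor row (column order is an assumption)."""
    # Drop the trailing pipe so the split yields exactly ten fields.
    fields = [f.strip() for f in line.strip().rstrip("|").split("|")]
    if len(fields) != 10:
        raise ValueError(f"expected 10 fields, got {len(fields)}")
    return SensorReading(
        name=fields[0],
        value=float(fields[1]),
        unit=fields[2],
        status=fields[3],
        upper_non_critical=_num(fields[7]),
        upper_critical=_num(fields[8]),
        upper_non_recoverable=_num(fields[9]),
    )


def headroom(reading: SensorReading) -> Optional[float]:
    """Degrees remaining before the upper non-critical threshold."""
    if reading.upper_non_critical is None:
        return None
    return reading.upper_non_critical - reading.value


lines = [
    "P1_TEMP_SENS     | 51.000  | degrees C    | OK     | na      | na      | na      | 79.000  | 89.000  | 99.000  |",
    "P2_TEMP_SENS     | 73.500  | degrees C    | OK     | na      | na      | na      | 79.000  | 89.000  | 99.000  |",
]

for r in map(parse_sensor_line, lines):
    print(f"{r.name}: {r.value:.1f} C, {headroom(r):.1f} C below upper non-critical")
# P1_TEMP_SENS: 51.0 C, 28.0 C below upper non-critical
# P2_TEMP_SENS: 73.5 C, 5.5 C below upper non-critical
```

With the values pasted above, CPU2 sits only 5.5 degrees under the first alert threshold while CPU1 has 28 degrees to spare, which is the same CPU1/CPU2 gap described at the top of the thread.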

 

# 29 00 00 00 01 02 00 00 2E 91 1E 59 20 00 04 01 4F 00 00 00 81 57 9C 9E # 29 | 05/19/2017 08:31:10 | CIMC | Temperature P2_TEMP_SENS #0x4f | Upper Non-critical - going high | Deasserted | Reading 78 <= Threshold 79 degrees C

# 2B 00 00 00 01 02 00 00 A7 94 1E 59 20 00 04 01 4F 00 00 00 81 57 9B 9E # 2b | 05/19/2017 08:45:59 | CIMC | Temperature P2_TEMP_SENS #0x4f | Upper Non-critical - going high | Deasserted | Reading 77.50 <= Threshold 79 degrees C

 

Please let me know if you are facing any performance issues or business impact due to this.

 

The temperatures mentioned shouldn’t be causing any HW failures on the UCS system.

 

This issue is a little bit awkward, because in VMware I receive more and more warnings about hardware temperature issues, but no warnings in UCS.
I hope this will be fixed soon.

Do you have a TAC SR to reference so I can take a look?

No problem: SR 682370882 : Temperature problem reported. In this case blades 3/7 and 3/8 are reported. Right now we additionally have 1/2, 1/5 and 7/4 with the same issue. Maybe it's only cosmetic, but these hosts are throwing warnings in VMware. If we have to ignore these warnings, in my eyes that is critical, because other warnings might be overlooked.

Thanks. I took a look at the logs and don't see any hardware issues. They all look to circle back to the same defects.

 

Here is another one:

https://bst.cloudapps.cisco.com/bugsearch/bug/CSCuh39242

 

It appears we are working on a solution in later code that will suppress these faults, as they are not an actual problem. I understand that it can be misleading to receive alerts from VMware, apologies for that.

I have this problem on a Cisco UCS with B200 M4 blades 

bios is b200M4.3.1.3e.0.081120161737

Board controller 12

CIMC controller 3.1(21a) 

UCS manager 2.2(8b)

 

Did we ever get a resolution for this, and if so, what code level will resolve the issue? There seem to be no more details than what is in the bug report, which seems to be for lower code levels than I am seeing onsite. We still see vCenter reporting thermal errors, whilst Cisco UCS shows no errors at all.

Fast forward to 2019, these blades still run boiling hot on CPU2, and TAC still quotes that bug and says there is nothing to worry about. It only happens on chassis which have intentionally limited their rear fan module speeds, which, ironically, can occur if you feed them air that is too cold. If you're lucky enough to have a chassis which has raised its fans to a higher speed, you won't see this issue. I've got two chassis in the same rack: one has fans running at 3800 rpm, fully populated, the other runs its fans at 3000 rpm, 5/8 populated. The lesser populated chassis has CPU2s getting well into the mid-80s, some touching the 87-88 range. This is 10 degrees above Intel's Tcase recommendations for the chips in our servers. UCS goes into alert state on and off all day long. TAC says hey, just ignore those alarms, all is well. Meanwhile, the hot chassis keeps having failed memory DIMMs across all the blades; the other has never had that occur....

Same.... even in version 4.0(4c) still getting this a lot when the proc runs hot.

My UCS dumpster fire is still going strong; almost hitting 90 degrees now, UCS itself asserts critical status off and on all day long, VMware goes red, Cisco says "this is normal"...

 

382 | 02/19/2020 15:30:03 GMT | CIMC | Temperature P2_TEMP_SENS #0x5b | Upper critical - going high | Asserted | Reading 89.50 >= Threshold 88 degrees C
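
If you want to count how often these events fire rather than eyeball the log, SEL lines like the one above can be parsed with a few lines of Python. This is a rough sketch that assumes every line has the seven pipe-separated fields seen in this thread (record id, timestamp, source, sensor, event, state, reading detail); `SelEvent` and `parse_sel_line` are hypothetical names for illustration, not a Cisco or IPMI API.

```python
import re
from typing import NamedTuple


class SelEvent(NamedTuple):
    timestamp: str
    sensor: str
    event: str
    asserted: bool
    reading: float
    threshold: float


# Matches e.g. "Reading 89.50 >= Threshold 88 degrees C"
_READING = re.compile(r"Reading\s+([\d.]+)\s*(?:<=|>=)\s*Threshold\s+([\d.]+)")


def parse_sel_line(line: str) -> SelEvent:
    """Parse one SEL line in the layout pasted in this thread (assumed)."""
    fields = [f.strip() for f in line.split("|")]
    if len(fields) != 7:
        raise ValueError(f"expected 7 fields, got {len(fields)}")
    _record, timestamp, _source, sensor_field, event, state, detail = fields
    match = _READING.search(detail)
    if match is None:
        raise ValueError(f"no reading/threshold found in: {detail!r}")
    # "Temperature P2_TEMP_SENS #0x5b" -> "P2_TEMP_SENS"
    sensor = sensor_field.split()[1]
    return SelEvent(
        timestamp=timestamp,
        sensor=sensor,
        event=event,
        asserted=(state == "Asserted"),
        reading=float(match.group(1)),
        threshold=float(match.group(2)),
    )


line = (
    "382 | 02/19/2020 15:30:03 GMT | CIMC | Temperature P2_TEMP_SENS #0x5b | "
    "Upper critical - going high | Asserted | "
    "Reading 89.50 >= Threshold 88 degrees C"
)
e = parse_sel_line(line)
state = "asserted" if e.asserted else "deasserted"
print(f"{e.sensor} {e.event}: {state}, reading {e.reading} vs threshold {e.threshold}")
# P2_TEMP_SENS Upper critical - going high: asserted, reading 89.5 vs threshold 88.0
```

The same function also handles the 2017 lines earlier in the thread, since their raw hex prefix contains no pipe characters and simply lands in the ignored first field.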

 

Nothing like the official stance being that critical errors should simply be ignored and you should disable your monitoring system so it stops bothering you.

nirkaush
Cisco Employee

Hello,

 

Please have a look at this bug

 

https://bst.cloudapps.cisco.com/bugsearch/bug/CSCvu79969

 

This bug has been fixed on the developers' end and is undergoing QA tests. Once the fix is verified in a future release (please save this bug and enable notifications so that you get the known fixed releases), you will no longer see these CPU threshold events in Cisco UCSM, nor in any third-party OS that is able to pull the threshold events.

 

Thanks

Niraj

It is of course concerning that the 'fix' to this is to hide the alerts rather than adjust the fan speed so the high temperature issue doesn't occur to begin with.  Our fully populated chassis do not have this issue because they tend to run the fans about 800 rpm faster under normal load.  The chassis that are not fully populated will allow the CPU2's to bump against Intel's Tcase limit, where the performance is throttled back, speed the fans up long enough to clear the alarm, then throttle them back down, allowing the problem to cycle on and off indefinitely.  We also see increased memory chip failures in these chassis.

How are the blades in your non fully populated chassis installed? Meaning, which slots are they in?

Since each chassis fan cools the two blades in front of it, I'm just wondering: do all fans have a blade in front?

 

First five slots, working top down left to right per row.  The two top slots are the most problematic.  But, like I said, the problem would go away if Cisco would actually increase the fan speed, or let us control it.
