10-14-2018 01:44 PM
My UCS environment temperature threshold policy is configured as follows:
Critical: between 50 & 52 degrees C
Major: between 47 & 49 degrees C
Do these thresholds look good to you guys? The site preparation guide says CPUs shouldn't run higher than 82 degrees C, which we are WELL BELOW!
I have two B200 M3 blades in the same chassis that frequently complain about high temp. It is generally CPU 2 on both blades, which makes sense since it sits at the back, but it is almost always hovering in the 47.5 to 52 degrees C range. I was initially told this was a FW bug that would be resolved once I went to 3.1(3), which I have since done, but we still see those alerts.
Our datacenter is sufficiently cooled and none of our other equipment is complaining about high temp. I have looked at other blades in the UCS domain, in the same chassis and in other chassis, and while some of the CPU 2s are approaching 46 degrees C, none of the other M2s, M3s or M4s ever complain about high temp. My question is: should I open a TAC case and get the CPUs replaced, or should I adjust my temp threshold policy to something like Major around 60 degrees C and Critical around 70?
10-14-2018 11:58 PM
Hi,
The thresholds do seem to be way too low. Taking a B200 M4 as an example,
log on to the CLI:
#connect cimc x/y
(x - chassis / y - blade #)
#sensors
Look for the values:
Sensor Name      | Reading | Unit      | Status | LNR | LC | LNC | UNC | UC     | UNR    |
=================|=========|===========|========|=====|====|=====|=====|========|========|
P1_TEMP_SENS     | 41.000  | degrees C | OK     | na  | na | na  | na  | 86.000 | 91.000 |
P2_TEMP_SENS     | 47.500  | degrees C | OK     | na  | na | na  | na  | 86.000 | 91.000 |
LNR: Lower Non Recoverable Threshold
LC : Lower Critical Threshold
LNC: Lower Non Critical Threshold
UNC: Upper Non Critical Threshold
UC : Upper Critical Threshold
UNR: Upper Non Recoverable Threshold
UNC and UC conditions are informational and do not cause performance degradation.
Only UNR / PROCHOT will cause the Intel CPU to experience performance degradation, because at that point the CPU uses SpeedStep to deliberately and dynamically lower its input power/voltage in order to reduce the temperature and protect itself.
In this example UC is 86 C, which might differ in your case since you have B200 M3s with different CPUs. In general, UCSM will raise a fault on its own if the temperature goes too high, even without a user-configurable thermal policy.
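If you want to check several blades quickly, here is a rough Python sketch that reads pasted `sensors` output and reports how far each reading is from the hardware UC value. It is only an illustration: the function names (`parse_sensors`, `classify`, `headroom`) are made up for this post, the sample text is the B200 M4 output shown above, and comparing with `>=` is my assumption, not necessarily how the firmware evaluates thresholds.

```python
from typing import Dict, List, Optional

# Sample rows copied from the B200 M4 output above; paste your own output here.
SAMPLE = """\
P1_TEMP_SENS | 41.000 | degrees C | OK | na | na | na | na | 86.000 | 91.000 |
P2_TEMP_SENS | 47.500 | degrees C | OK | na | na | na | na | 86.000 | 91.000 |
"""

# Column order as printed by the sensors command in the post.
COLUMNS = ["name", "reading", "unit", "status",
           "lnr", "lc", "lnc", "unc", "uc", "unr"]
NUMERIC = ["reading", "lnr", "lc", "lnc", "unc", "uc", "unr"]


def to_float(text: str) -> Optional[float]:
    """Convert a field to float; 'na' means the threshold is not defined."""
    text = text.strip()
    return None if text.lower() == "na" else float(text)


def parse_sensors(output: str) -> List[Dict]:
    """Parse pipe-separated sensor rows, skipping header and separator lines."""
    rows = []
    for line in output.splitlines():
        fields = [f.strip() for f in line.split("|")]
        if fields and fields[-1] == "":  # trailing '|' leaves an empty field
            fields = fields[:-1]
        if len(fields) != len(COLUMNS):
            continue
        if fields[0].startswith("=") or fields[0] == "Sensor Name":
            continue
        row = dict(zip(COLUMNS, fields))
        for key in NUMERIC:
            row[key] = to_float(row[key])
        rows.append(row)
    return rows


def classify(row: Dict) -> str:
    """Name the highest upper threshold the reading has reached (assumes >=)."""
    reading = row["reading"]
    if reading is None:
        return "unknown"
    for key, label in (("unr", "upper non-recoverable"),
                       ("uc", "upper critical"),
                       ("unc", "upper non-critical")):
        limit = row[key]
        if limit is not None and reading >= limit:
            return label
    return "ok"


def headroom(row: Dict) -> Optional[float]:
    """Degrees C left before the UC threshold, or None if UC is undefined."""
    if row["uc"] is None or row["reading"] is None:
        return None
    return row["uc"] - row["reading"]


if __name__ == "__main__":
    for r in parse_sensors(SAMPLE):
        print(r["name"], r["reading"], classify(r), headroom(r))
```

With the sample above, both sensors come back as "ok" with a wide margin below UC, which is the point: a reading of 47.5 C that trips a user-defined Major threshold is still far from anything the hardware itself considers critical.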
Please rate if you find it helpful.
Regards,
MJ