Get high CPU temp alerts for two blades

nikthatte
Level 1

My UCS environment temperature threshold policy is configured as follows:

Critical: between 50 and 52 degrees C

Major: between 47 and 49 degrees C

Do these thresholds look good to you guys? The site preparation guide says CPUs shouldn't run higher than 82 degrees C, which we are WELL BELOW!

 

I have two B200 M3 blades in the same chassis that frequently complain about high temperature. It is generally CPU 2 on both blades, which makes sense since it sits at the back, but it is almost always hovering in the 47.5 to 52 degrees C range. I was initially told this is a firmware bug that would be resolved by upgrading to 3.1(3), which I have since done, but we still see those alerts.

 

Our datacenter is sufficiently cooled and none of our other equipment is complaining about high temperature. I have looked at other blades in the UCS domain, in the same chassis and in other chassis, and while some of the CPU 2s are approaching 46 degrees C, none of the other M2, M3 or M4 blades ever complain about high temperature. My question is: should I open a TAC case and get the CPUs replaced, or should I adjust my temperature threshold policy to something like Major around 60 degrees C and Critical around 70?

Accepted Solutions

mojafri
Cisco Employee

Hi,

 

Your thresholds do seem far too low. Take a B200 M4 as an example.

Log on to the CLI:

#connect cimc x/y

(x = chassis number, y = blade number)

#sensors

Look for the values:

 

Sensor Name      | Reading | Unit         | Status | LNR     | LC      | LNC     | UNC     | UC      | UNR     |
=================|=========|==============|========|=========|=========|=========|=========|=========|=========|
P1_TEMP_SENS     | 41.000  | degrees C    | OK     | na      | na      | na      | na      | 86.000  | 91.000  |
P2_TEMP_SENS     | 47.500  | degrees C    | OK     | na      | na      | na      | na      | 86.000  | 91.000  |

LNR: Lower Non Recoverable Threshold

LC : Lower Critical Threshold

LNC: Lower Non Critical Threshold

UNC: Upper Non Critical Threshold

UC : Upper Critical Threshold

UNR: Upper Non Recoverable Threshold
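
To compare readings against the hardware thresholds across several blades, the table can be parsed with a short script. The following is only a rough sketch in Python: it assumes the `sensors` output is pipe-delimited exactly as in the sample above, and the names `parse_sensors`, `to_float` and `headroom_to_uc` are made up for illustration, not part of any Cisco tool.

```python
from typing import Dict, List, Optional

# Sample text copied from the sensors output shown above.
SAMPLE = """\
Sensor Name      | Reading | Unit         | Status | LNR     | LC      | LNC     | UNC     | UC      | UNR     |
=================|=========|==============|========|=========|=========|=========|=========|=========|=========|
P1_TEMP_SENS     | 41.000  | degrees C    | OK     | na      | na      | na      | na      | 86.000  | 91.000  |
P2_TEMP_SENS     | 47.500  | degrees C    | OK     | na      | na      | na      | na      | 86.000  | 91.000  |
"""


def to_float(cell: str) -> Optional[float]:
    """Convert a table cell to a float; 'na' means the threshold is not set."""
    cell = cell.strip()
    return None if cell.lower() == "na" else float(cell)


def parse_sensors(text: str) -> List[Dict[str, str]]:
    """Parse the pipe-delimited table into one dict per sensor row."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    header = [h.strip() for h in lines[0].split("|") if h.strip()]
    rows = []
    for ln in lines[1:]:
        if ln.startswith("="):  # skip the separator line under the header
            continue
        # The trailing '|' yields one empty extra cell, so trim to header length.
        cells = [c.strip() for c in ln.split("|")][: len(header)]
        rows.append(dict(zip(header, cells)))
    return rows


def headroom_to_uc(row: Dict[str, str]) -> Optional[float]:
    """Degrees C between the current reading and the Upper Critical threshold."""
    reading = to_float(row["Reading"])
    uc = to_float(row["UC"])
    if reading is None or uc is None:
        return None
    return uc - reading


if __name__ == "__main__":
    for row in parse_sensors(SAMPLE):
        print(row["Sensor Name"], headroom_to_uc(row))
```

On the sample above, both sensors are still tens of degrees below UC, which is the comparison that matters here.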

 

 

UNC and UC conditions are informational and do not cause performance degradation.

Only UNR / PROCHOT causes performance degradation: at that point the Intel CPU uses SpeedStep to deliberately and dynamically lower its input power/voltage, reducing the temperature to protect the CPU.

As you can see, UC is 86 °C in this example; it may differ in your case, since you have B200 M3 blades with different CPUs. In general, UCSM will raise a fault on its own if the temperature gets too high, even without a user-configurable thermal policy.

  

Please rate if you find it helpful.

 

Regards,

MJ

3 Replies

Hey, that's awesome, thanks! What would be reasonable thresholds? Does Major around 60 degrees C and Critical around 70 sound good to you?

Anything up to UC is fine, but yes, you can set the values above as well. I've rarely seen a CPU reach that temperature, but again, the 60-70 range won't cause any issue with server operation.
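
For anyone wanting to sanity-check the numbers in this thread, here is a small Python sketch. It is a simplification, not how UCSM evaluates a threshold policy: it assumes each severity has a single trigger value (the lower end of the ranges quoted in the question), and `classify` is a hypothetical helper. The UC value of 86 comes from the B200 M4 sample above and may differ on other blades.

```python
from typing import Iterable, List


def classify(temp_c: float, major: float, critical: float) -> str:
    """Return the alert level for a reading, assuming simple trigger points."""
    if temp_c >= critical:
        return "critical"
    if temp_c >= major:
        return "major"
    return "ok"


def classify_all(readings: Iterable[float], major: float, critical: float) -> List[str]:
    """Classify each reading against one Major/Critical pair."""
    return [classify(t, major, critical) for t in readings]


# Readings reported for CPU 2 in the question (degrees C).
READINGS = [47.5, 50.0, 52.0]

# Assumption: lower end of each configured range is the trigger point.
CURRENT_POLICY = {"major": 47.0, "critical": 50.0}
PROPOSED_POLICY = {"major": 60.0, "critical": 70.0}

# Upper Critical value from the B200 M4 sample; check your own sensors output.
HARDWARE_UC = 86.0

if __name__ == "__main__":
    print("current: ", classify_all(READINGS, **CURRENT_POLICY))
    print("proposed:", classify_all(READINGS, **PROPOSED_POLICY))
    # The proposed policy should still alert well before the hardware UC value.
    print("below UC:", PROPOSED_POLICY["critical"] < HARDWARE_UC)
```

Under these assumptions, the reported readings trip the current policy but not the proposed one, and the proposed Critical value still sits below the hardware UC threshold.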
