Solved: UCS CPU temperature thresholds

williambiiii · 07-03-2018

I have two data centers with a B200 M4 blade chassis in each location. A few times a year, the temperature in the DCs rises to the point where the systems shut down completely. I am trying to find a way to automate this so that if the servers get too hot, they shut down automatically but gracefully: put the VMware hosts into maintenance mode, deal with any exceptions, then shut down the servers and maybe additional hardware. I was thinking there could be a way to do this with an automation tool like Ansible, but there might also be a way to do it with UCS Director or UCSM. Any ideas?

Kirk J · 07-05-2018

Greetings.

We sometimes see similar requests around UPS/power outages.

I'm not quite sure how you would do this with UCS Director (still looking), but you would need a monitoring/polling device that lets you trigger scripts when a threshold is reached.

I've seen some UPSs that monitor temperature and humidity and could probably do this as well.

There are some USB-stick-based devices (and some Ethernet ones) that monitor temperatures and have agents that run as a service on a listening workstation and could trigger scripts (e.g. VMware PowerCLI, Cisco UCS PowerTool).

You might run a scheduled UCS PowerShell script that polls the CPU#2 temperature every 5 to 10 minutes (CPU#2 is usually the first to get hot due to the airflow design) or the inflow temperature sensor on the blade:

PS C:\Users\kirk> Get-UcsBlade -ChassisId 1 -SlotId 7 | Get-UcsStatistics -Current | Where {$_.Rn -like "temp-stats"}

FmTempSenIo : 22.000000
FmTempSenIoAvg : 22.000000
FmTempSenIoMax : 22.000000
FmTempSenIoMin : 22.000000
FmTempSenRear : 31.000000
FmTempSenRearAvg : 31.000000
FmTempSenRearL : not-applicable
FmTempSenRearLAvg : not-applicable
FmTempSenRearLMax : not-applicable
FmTempSenRearLMin : not-applicable
FmTempSenRearMax : 31.000000
FmTempSenRearMin : 31.000000
FmTempSenRearR : not-applicable
FmTempSenRearRAvg : not-applicable
FmTempSenRearRMax : not-applicable
FmTempSenRearRMin : not-applicable
Intervals : 58982430
Sacl :
Suspect : no
Thresholded :
TimeCollected : 2018-07-06T01:10:02.890
Update : 65557
Ucs : Cisco-UCS-001
Dn : sys/chassis-1/blade-7/board/temp-stats
Rn : temp-stats
Status :
XtraProperty : {}

or

Get-UcsStatistics -Current | Where {$_.Rn -like "temp-stats" -or $_.Dn -like "*cpu*/env-stats"} | Select Ucs,Dn,FmTempSenIo,FmTempSenIoMax,FmTempSenRear,FmTempSenRearMax,Temperature,TemperatureAvg,TemperatureMax,TimeCollected

Ucs : UCS-domain-001
Dn : sys/chassis-1/blade-8/board/cpu-1/env-stats
FmTempSenIo :
FmTempSenIoMax :
FmTempSenRear :
FmTempSenRearMax :
Temperature : 36
TemperatureAvg : 36
TemperatureMax : 36
TimeCollected : 2018-07-06T01:58:09.804

Ucs : F241-03-08-FI
Dn : sys/chassis-1/blade-8/board/cpu-2/env-stats
FmTempSenIo :
FmTempSenIoMax :
FmTempSenRear :
FmTempSenRearMax :
Temperature : 36.5
TemperatureAvg : 36.5
TemperatureMax : 36.5
TimeCollected : 2018-07-06T01:58:09.804

Ucs : UCS-domain-001
Dn : sys/chassis-1/blade-8/board/temp-stats
FmTempSenIo : 23.000000
FmTempSenIoMax : 23.000000
FmTempSenRear : 31.000000
FmTempSenRearMax : 31.000000
Temperature :
TemperatureAvg :
TemperatureMax :
TimeCollected : 2018-07-06T01:58:09.804
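
A scheduled script would then need to compare those readings with a limit. The following is only a sketch: the function names and the default limits are hypothetical (pick values that suit your CPU model), and only the property names Dn, Temperature and FmTempSenIo are taken from the output above.

```powershell
# Sketch only: function names and default limits are assumptions, not Cisco guidance.

# Convert a statistics value to a number; blank or "not-applicable" yields $null.
function ConvertTo-Temperature {
    param($Value)
    [double]$parsed = 0
    $style   = [System.Globalization.NumberStyles]::Float
    $culture = [System.Globalization.CultureInfo]::InvariantCulture
    if ([double]::TryParse([string]$Value, $style, $culture, [ref]$parsed)) {
        return $parsed
    }
    return $null
}

# Return the statistics objects whose CPU or inlet reading is at or above its limit.
function Select-HotReading {
    param(
        [Parameter(Mandatory)] [object[]]$Stats,
        [double]$CpuLimit = 80,    # hypothetical CPU limit
        [double]$InletLimit = 35   # hypothetical inlet (FmTempSenIo) limit
    )
    foreach ($s in $Stats) {
        $cpu   = ConvertTo-Temperature $s.Temperature
        $inlet = ConvertTo-Temperature $s.FmTempSenIo
        if (($null -ne $cpu -and $cpu -ge $CpuLimit) -or
            ($null -ne $inlet -and $inlet -ge $InletLimit)) {
            $s
        }
    }
}
```

With an open UCS PowerTool session, something like `Select-HotReading -Stats (Get-UcsStatistics -Current) -CpuLimit 75` should return only the objects that need attention, and an empty result means nothing is over the limit.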

Once you decide on a threshold, you could run something like the PowerShell power-off script at https://communities.cisco.com/docs/DOC-36750 when it is reached.
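
For the VMware side, the action that runs once the threshold is reached could look roughly like the sketch below. It assumes VMware PowerCLI is installed and a Connect-VIServer session is already open; the function name, the host name in the usage line and the wait time are hypothetical. It does not replace the UCS power-off script linked above, and it has not been tested against a live cluster.

```powershell
# Sketch only: assumes VMware PowerCLI and an existing Connect-VIServer session.
function Stop-HostGracefully {
    param(
        [Parameter(Mandatory)] [string]$VMHostName,
        [int]$WaitSeconds = 300   # hypothetical time allowed for guests to shut down
    )
    $vmHost = Get-VMHost -Name $VMHostName

    # Ask every powered-on guest on this host to shut down (needs VMware Tools in the guest).
    Get-VM -Location $vmHost |
        Where-Object { $_.PowerState -eq 'PoweredOn' } |
        Stop-VMGuest -Confirm:$false

    # Wait until the guests are off or the time allowance runs out.
    $deadline = (Get-Date).AddSeconds($WaitSeconds)
    while ((Get-Date) -lt $deadline -and
           (Get-VM -Location $vmHost | Where-Object { $_.PowerState -eq 'PoweredOn' })) {
        Start-Sleep -Seconds 10
    }

    # Hard power-off for anything still running, so maintenance mode is not blocked.
    Get-VM -Location $vmHost |
        Where-Object { $_.PowerState -eq 'PoweredOn' } |
        Stop-VM -Confirm:$false

    # Enter maintenance mode, then shut the ESXi host down.
    Set-VMHost -VMHost $vmHost -State Maintenance | Out-Null
    Stop-VMHost -VMHost $vmHost -Confirm:$false | Out-Null
}
```

Usage would be along the lines of `Stop-HostGracefully -VMHostName 'esx01.example.local'`, once per host. If vCenter itself runs as a VM on one of these hosts, that host would need to go last or be handled separately.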

The CPU UNC (upper non-critical), UC (upper critical) and UNR (upper non-recoverable) temperature thresholds vary quite a bit by CPU model, but CPU#2 tends to start bouncing around the UNC level once your data center gets into the lower 70s (with a good load).
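
Because readings can bounce around a threshold, a scheduled script may want to act only after several consecutive polls are over the limit rather than on a single spike. This is an addition to the idea above, not something UCS provides: the function name, the state-file path and the count of three are hypothetical.

```powershell
# Sketch only: count consecutive "hot" polls across scheduled runs using a small state file.
function Test-SustainedBreach {
    param(
        [Parameter(Mandatory)] [bool]$IsHot,        # result of this poll
        [Parameter(Mandatory)] [string]$StatePath,  # hypothetical file that holds the counter
        [int]$Required = 3                          # hypothetical number of consecutive hot polls
    )
    $count = 0
    if (Test-Path -LiteralPath $StatePath) {
        # Read the previous counter; anything unreadable counts as zero.
        $count = (Get-Content -LiteralPath $StatePath -TotalCount 1) -as [int]
        if ($null -eq $count) { $count = 0 }
    }
    # A cool poll resets the streak; a hot poll extends it.
    if ($IsHot) { $count++ } else { $count = 0 }
    Set-Content -LiteralPath $StatePath -Value $count
    return ($count -ge $Required)
}
```

The required count should be weighed against the polling interval: three polls at 5 minutes each means roughly 10 to 15 minutes before anything happens, which may be too slow if the room heats up quickly.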

Just a couple of ideas.

Kirk...

williambiiii · 07-06-2018

Hey Kirk! First off, thank you so much for the reply. There are a lot of great ideas in here. Are there any resources for the USB-stick-based devices that you mentioned?

The end goal would be:

Once a significant temperature is reached on the UCS, this would trigger an event in VMware that powers off the VMs gracefully. We do get temperature spikes that cause the B-Series blades to shut down, and this of course powers off the VMs without warning. There might be operations in progress, such as vMotion, that this could interrupt.

Many thanks!

William