
Disable or Modify F0409 and F0410 Faults ONLY

wsanders
Level 1

We have a couple of blades where the "Thermal condition on chassis X is upper-critical" (F0409) and "Thermal condition on chassis 3 is upper-non-critical" (F0410) alerts are constantly flapping. It appears the last couple of blades (8 and sometimes 7) in each chassis run a little hot (88-89 deg C) when the CPUs are under heavy load. Cisco support seems to think this is normal.

Is there a way to disable or modify *just these faults* for *just blade 8* in each chassis? "Fault suppression" lets me only disable all alerts for the blade.

Kirk J
Cisco Employee

Greetings.

Unfortunately, you are correct, in that the UCSM does not currently have a way to disable individual alert types (like you can with ACI), aside from the complete blade fault suppression.

The B200 M3/M4/M5 design does allow for the 2nd CPU to run a bit warmer because of the airflow path (air passes over the 1st CPU before reaching the 2nd CPU).

Usually, when customers bump into the UNC range (which is meant to trigger the chassis fans to rev up), the CPU load is high and the ambient temps tend to be in the lower to mid 70s F.

Please log into the UCSM CLI via putty/ssh:

#connect cimc x/y (chassis#/blade#)

cimc#sensors

 

Look at the P1_TEMP_SENS and P2_TEMP_SENS values, and see what is specified for the upper critical (UC) and upper non-recoverable (UNR) thresholds, to get an idea of what those ranges are.

In 4.01 code, it appears the upper non-critical (UNC) value is no longer defined, so I'm wondering if you would no longer get those types of UNC alerts... What UCSM and blade firmware levels are you on?

What are the current TEMP_SENS_FRONT values?

 

Thanks,

Kirk...

 

Sensor Name     | Reading | Unit      | ... | LNR | LC | LNC | UNC | UC      | UNR     |
TEMP_SENS_FRONT | 24.000  | degrees C | OK  | na  | na | na  | na  | 75.000  | 85.000  |
TEMP_SENS_REAR  | 47.000  | degrees C | OK  | na  | na | na  | na  | 75.000  | 85.000  |
GPU1_TEMP_SENS  | na      | degrees C | na  | na  | na | na  | na  | 162.000 | 170.000 |
P1_TEMP_SENS    | 54.000  | degrees C | OK  | na  | na | na  | na  | 88.000  | 93.000  |
P2_TEMP_SENS    | 84.000  | degrees C | UNC | na  | na | na  | na  | 88.000  | 93.000  |
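
For anyone who wants to compare several blades, here is a rough Python sketch that parses rows pasted from the sensors output above and reports the P2 minus P1 delta and the headroom to the UC threshold. The field names, `parse_sensors`, and `headroom_to_uc` are my own labels for illustration, not anything UCSM or CIMC provides, and the fourth column (shown as "..." in the header) is assumed to be the sensor status.

```python
# Rows copied from the sensors output in this thread (header row left out).
SENSORS_OUTPUT = """\
TEMP_SENS_FRONT | 24.000 | degrees C | OK | na | na | na | na | 75.000 | 85.000 |
TEMP_SENS_REAR | 47.000 | degrees C | OK | na | na | na | na | 75.000 | 85.000 |
GPU1_TEMP_SENS | na | degrees C | na | na | na | na | na | 162.000 | 170.000 |
P1_TEMP_SENS | 54.000 | degrees C | OK | na | na | na | na | 88.000 | 93.000 |
P2_TEMP_SENS | 84.000 | degrees C | UNC | na | na | na | na | 88.000 | 93.000 |
"""

# Hypothetical column labels; "status" is an assumption for the "..." column.
FIELDS = ("name", "reading", "unit", "status",
          "lnr", "lc", "lnc", "unc", "uc", "unr")
NUMERIC = ("reading", "lnr", "lc", "lnc", "unc", "uc", "unr")


def to_float(text):
    """Convert a cell to float; 'na' or any non-numeric text becomes None."""
    try:
        return float(text)
    except ValueError:
        return None


def parse_sensors(output):
    """Parse pipe-delimited sensor rows into a dict keyed by sensor name."""
    sensors = {}
    for line in output.splitlines():
        cells = [c.strip() for c in line.strip().strip("|").split("|")]
        if len(cells) != len(FIELDS):
            continue  # skip blank or malformed lines
        row = dict(zip(FIELDS, cells))
        for key in NUMERIC:
            row[key] = to_float(row[key])
        sensors[row["name"]] = row
    return sensors


def headroom_to_uc(sensor):
    """Degrees C left before the upper-critical threshold, or None if unknown."""
    if sensor["reading"] is None or sensor["uc"] is None:
        return None
    return sensor["uc"] - sensor["reading"]


sensors = parse_sensors(SENSORS_OUTPUT)
delta = sensors["P2_TEMP_SENS"]["reading"] - sensors["P1_TEMP_SENS"]["reading"]
print(f"P2 - P1 delta: {delta:.1f} deg C")
print(f"P2 headroom to UC: {headroom_to_uc(sensors['P2_TEMP_SENS']):.1f} deg C")
```

With the readings above, P2 sits only a few degrees under its UC threshold, which would fit with the alerts flapping as load varies.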

CIMC version (from "version" at the [ sensors ]# prompt) is ver: 3.1(26g). UCS package is 3.2(3g).

 

LNC and UNC are all "na" - so why would we get spam about UNC?

 

I don't really care about the noncritical limits being removed, the UCS will still spam us about the UC and UNR, right? So unless both values can be removed or bumped up, there's still no way to suppress unless we suppress the whole blade?

Now that I see a 30 degree difference between the CPUs I have a new guess about what is happening: Are the front and rear heat sinks different part numbers? I have seen this before in other hardware with a similar layout where the waste heat from CPU1 has to cool CPU2.

 

On other blades in Position 8 I see only a 23 or 23 deg difference, and CPU1 is running at 50 instead of 54 deg

What's your front temp sensor showing on the blades closest to the floor (i.e. high-numbered slots like 7 and 8)?

Yes, there are different heatsink part #s for the front and back heatsinks on B200M4/M5 servers.

Can you confirm some other ambient temp readings in that same environment?

The 75 degree F (24 deg C) reading on the front temp sensor seems a bit on the warm side for a data center.
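
Since this thread mixes Fahrenheit (ambient) and Celsius (sensor readings), here is a minimal conversion sketch in Python. The function names are just illustrative; the 24.0 input is the TEMP_SENS_FRONT reading posted earlier.

```python
def c_to_f(celsius):
    """Convert degrees Celsius to degrees Fahrenheit."""
    return celsius * 9 / 5 + 32


def f_to_c(fahrenheit):
    """Convert degrees Fahrenheit to degrees Celsius."""
    return (fahrenheit - 32) * 5 / 9


# TEMP_SENS_FRONT reading from the sensors output, in deg C.
front_c = 24.0
print(f"{front_c:.1f} deg C = {c_to_f(front_c):.1f} deg F")
```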

 

Thanks,

Kirk...

 
