Disable or Modify F0409 and F0410 Faults ONLY

wsanders · ‎09-28-2018

We have a couple of blades where the "Thermal condition on chassis X is upper-critical" (F0409) and "Thermal condition on chassis 3 is upper-non-critical" (F0410) alerts are constantly flapping. It appears the last couple of blades (8 and sometimes 7) in each chassis run a little hot (88-89 deg C) when the CPUs are under heavy load. Cisco support seems to think this is normal.

Is there a way to disable or modify *just these faults* for *just blade 8* in each chassis? "Fault suppression" only lets me disable all alerts for the blade.

Kirk J · ‎09-28-2018

Greetings.

Unfortunately, you are correct: UCSM does not currently have a way to disable individual alert types (as you can with ACI), aside from complete blade fault suppression.

The B200 M3/M4/M5 design does allow for the 2nd CPU to run a bit warmer due to the airflow design (air passes over the 1st CPU before going past the 2nd CPU).

Usually, when we have customers bumping into the UNC range, which is meant to trigger the chassis fans to rev up, the CPU load is high and the ambient temps tend to be in the low to mid 70s F.

Please log into the UCSM CLI via putty/ssh:

#connect cimc x/y (chassis#/blade#)

cimc#sensors

Look at the P1 and P2 _TEMP_SENS values, and see what is specified for the upper critical and upper non-recoverable values (to get an idea of what those ranges are).

In 4.01 code, it appears the upper non-critical value is no longer defined, so I'm wondering if you would no longer get those types of UNC alerts... What UCSM and blade firmware level are you on?

What are the current TEMP_SENS_FRONT values?

Thanks,

Kirk...

wsanders · ‎09-28-2018

Sensor Name     | Reading | Unit      | ... | LNR | LC | LNC | UNC | UC      | UNR     |
TEMP_SENS_FRONT | 24.000  | degrees C | OK  | na  | na | na  | na  | 75.000  | 85.000  |
TEMP_SENS_REAR  | 47.000  | degrees C | OK  | na  | na | na  | na  | 75.000  | 85.000  |
GPU1_TEMP_SENS  | na      | degrees C | na  | na  | na | na  | na  | 162.000 | 170.000 |
P1_TEMP_SENS    | 54.000  | degrees C | OK  | na  | na | na  | na  | 88.000  | 93.000  |
P2_TEMP_SENS    | 84.000  | degrees C | UNC | na  | na | na  | na  | 88.000  | 93.000  |
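
For anyone who wants to track how close each sensor sits to its UC threshold, here is a minimal Python sketch. It assumes the pipe-delimited, ten-column layout pasted above (name, reading, unit, status, LNR, LC, LNC, UNC, UC, UNR); the function names and the SENSORS_OUTPUT sample are illustrative only and are not part of any Cisco tool.

```python
# Sample rows copied from the sensors output in this thread.
SENSORS_OUTPUT = """\
Sensor Name     | Reading | Unit      | ... | LNR | LC | LNC | UNC | UC      | UNR     |
TEMP_SENS_FRONT | 24.000  | degrees C | OK  | na  | na | na  | na  | 75.000  | 85.000  |
TEMP_SENS_REAR  | 47.000  | degrees C | OK  | na  | na | na  | na  | 75.000  | 85.000  |
GPU1_TEMP_SENS  | na      | degrees C | na  | na  | na | na  | na  | 162.000 | 170.000 |
P1_TEMP_SENS    | 54.000  | degrees C | OK  | na  | na | na  | na  | 88.000  | 93.000  |
P2_TEMP_SENS    | 84.000  | degrees C | UNC | na  | na | na  | na  | 88.000  | 93.000  |
"""


def to_float(field):
    """Convert a table cell to float, treating 'na' as missing."""
    field = field.strip()
    return None if field == "na" else float(field)


def uc_headroom(text):
    """Return {sensor name: degrees C remaining below the UC threshold}.

    Sensors with no reading or no UC threshold are skipped.
    Assumes ten pipe-separated columns, as in the paste above.
    """
    headroom = {}
    for line in text.splitlines():
        # Drop the trailing pipe, then split into cells.
        fields = [f.strip() for f in line.strip().strip("|").split("|")]
        if len(fields) != 10 or fields[0].startswith("Sensor Name"):
            continue  # skip the header and anything that is not a data row
        reading = to_float(fields[1])
        uc = to_float(fields[8])
        if reading is None or uc is None:
            continue
        headroom[fields[0]] = uc - reading
    return headroom


if __name__ == "__main__":
    for name, margin in sorted(uc_headroom(SENSORS_OUTPUT).items(), key=lambda kv: kv[1]):
        print(f"{name}: {margin:.1f} C below UC")
```

On the sample above, P2_TEMP_SENS comes out with the smallest margin (4 C below its 88 C UC threshold), which matches the blade that is flapping.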

CIMC version is [ sensors ]# version: ver: 3.1(26g). UCS package is 3.2(3g)

LNC and UNC are all "na" - so why would we get spam about UNC?

I don't really care about the noncritical limits being removed; the UCS will still spam us about the UC and UNR, right? So unless both values can be removed or bumped up, there's still no way to suppress these faults short of suppressing the whole blade?

wsanders · ‎09-28-2018

Now that I see a 30 degree difference between the CPUs I have a new guess about what is happening: Are the front and rear heat sinks different part numbers? I have seen this before in other hardware with a similar layout where the waste heat from CPU1 has to cool CPU2.

On other blades in Position 8 I see only a 23 or 23 deg difference, and CPU1 is running at 50 instead of 54 deg

Kirk J · ‎09-30-2018

What's your front temp sensor showing on the blades closest to the floor (i.e. high # slots like 7, 8)?

Yes, there are different heatsink part #s for the front and back heatsinks on B200M4/M5 servers.

Can you confirm some other ambient temp readings in that same environment?

The 75 degree F reading on the front temp sensor seems a bit on the warm side for a data center.
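
The sensors output reports degrees C, while the ambient figures in this thread are in degrees F. A quick conversion sketch in Python, using the standard formula (the helper name is just for illustration), shows how the 24.000 C TEMP_SENS_FRONT reading maps to roughly 75 F:

```python
def c_to_f(celsius):
    """Convert degrees Celsius to degrees Fahrenheit."""
    return celsius * 9.0 / 5.0 + 32.0


if __name__ == "__main__":
    # TEMP_SENS_FRONT reading from the sensors output earlier in the thread.
    front_c = 24.0
    print(f"{front_c:.1f} C is about {c_to_f(front_c):.1f} F")
```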

Thanks,

Kirk...