C220 M4 — Fans running at maximum speed without alarms on CIMC 4.1(2m)

Ander Vazquez
Level 1

Hi everyone,

I have a Cisco C220 M4 server with the following firmware versions:

  • BIOS: C220M4.3.0.4c.0.0502191259

  • CIMC: 4.1(2m)

The issue I'm facing is that the fans are spinning at extremely high speeds, generating excessive noise, although they are reported as Normal in CIMC and no alarms are shown. Here are some of the readings:

FAN1_TACH1 Normal 16000 RPM
FAN1_TACH2 Normal 20000 RPM
FAN2_TACH1 Normal 17100 RPM
FAN2_TACH2 Normal 18400 RPM
...

Configured fan policy: Low
Installed PCI devices:

  • Cisco 12G SAS Modular RAID Controller

  • Intel® I350 1Gbps Network Controller

Actions taken:

  • Server reboot

  • Verified fan policies

Despite the reboot and confirming the fan policies are set to Low, the fans remain at these high speeds.

Has anyone experienced a similar situation on CIMC version 4.1, or have any ideas about what could be causing this behavior?

Any suggestions or shared experiences would be greatly appreciated.

Thanks in advance!


 
 



2 Accepted Solutions

populateStorageCard: min_thr_temp: 90, max_temp: 35

 ==== Sensor Number: 67, Unavailable for 460588 ticks
 ==== Front Panel sensor data unavailable, set fan speed window to 90

--

Sometimes the storage controller temperature triggers high fan speeds.

  • mnt/jffs2/storage-data
+controller-temperature: 255

But looking at other logs (var/log/tty_log_SLOT-HBA) this seems like a false value. Maybe.
I have also seen a RAID controller (which sits physically above the MLOM) in a system WITHOUT an MLOM (your system does not have an MLOM) show elevated RAID controller temperatures due to less-than-expected airflow.

--

Went and looked up "Sensor Number: 67" and 67 is:

Sensor: FP_TEMP_SENSOR

 Your logs show (tmp/tech_support) the FP_TEMP_SENSOR sensor as "na".

FP_TEMP_SENSOR   | na      | degrees C    | na     | na      | na      | na      | 40.000  | 45.000  | 50.000  | 
..
Sensor ID : FP_TEMP_SENSOR (0x43)

(Can also confirm what "67" is from the logs as 0x43 hexadecimal is 67 decimal.)
These details line up with that third line stating "Front Panel sensor data unavailable" and setting fans to "90".
The "90" is usually 90 PWM or 90% of full speed.
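
The cross-check above can be reproduced in a few lines. This is only a minimal sketch, assuming the pipe-delimited row layout seen in the tech support excerpt (name, reading, unit, then further columns); the helper name `parse_sensor_row` is hypothetical and is not part of any Cisco tooling.

```python
def parse_sensor_row(row: str) -> dict:
    """Split one pipe-delimited sensor row into name, reading, and unit.

    Assumption: the first three columns are name, reading, unit, as in
    the excerpt above. A reading of "na" is returned as None.
    """
    cols = [c.strip() for c in row.split("|")]
    name, reading, unit = cols[0], cols[1], cols[2]
    value = None if reading.lower() == "na" else float(reading)
    return {"name": name, "value": value, "unit": unit}


row = ("FP_TEMP_SENSOR   | na      | degrees C    | na     | na      "
       "| na      | na      | 40.000  | 45.000  | 50.000  | ")
sensor = parse_sensor_row(row)

# 0x43 from "Sensor ID : FP_TEMP_SENSOR (0x43)" is the same number as
# the decimal 67 in "Sensor Number: 67".
sensor_id = int("0x43", 16)

print(sensor["name"], sensor["value"], sensor_id)  # FP_TEMP_SENSOR None 67
```

A `None` value for FP_TEMP_SENSOR together with ID 67 is the combination that matches the "Front Panel sensor data unavailable" log line.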

I have never seen this exact combination, so my theory is that this is a faulty front panel sensor.
I have next to zero docs on this front panel sensor, so I have no way to test/validate/verify this theory.

--

My workaround-of-last-resort would be to try a power drain (remove both power cables for 60 seconds), which removes power from CIMC and forces CIMC to restart.
I have seen a few things that are NOT resolved by simply rebooting CIMC and that require a full power drain.

If a power drain doesn't fix it, then I am not sure what else to do... buy ear-plugs?

(If this were under support, then maybe RMA + EFA to have engineering take a look in the EFA lab.)


The complete shutdown with power removal for more than 60 seconds had already been performed a couple of times, but without a satisfactory result.
I replaced the front panel with one from a spare UCS we have, and with that, the fans no longer run at maximum speed.

Thank you for everything!


15 Replies

BrianSekleckiGE
Level 1

I have no experience with the older M4, but the Dell/EMC PowerEdge family behaves exactly the same way under any number of other (undocumented) environmental conditions:

* Chassis intrusion sensor alarm or cover removed

* Power supply non-redundant, etc.

* OS not booted (OS Watchdog timer) for an extended period of time...

Thank you for your response:

  • Chassis intrusion sensor alarm or cover removed: No alerts are generated.
  • Power supply non-redundant, etc.: No alerts are generated.
  • OS Not booted (OS Watchdog timer) for an extended period of time: The operating system boots correctly.

BrianSekleckiGE
Level 1

Stupid/silly question, but does the intake air temperature sensor read <= 25 degC?

I found a Reddit forum where someone had this problem, and it was caused by non-Cisco drives being installed.

I found one older Cisco.com community forum post about an M3 where:

  [ https://community.cisco.com/t5/unified-computing-system-discussions/cisco-ucs-c240-m3s-fans-are-loud-and-on-full-blast-24-7/td-p/3687639 ] 

"I have discovered the issue in my case....YOU WON'T BELIEVE IT!!!!! Turns out the Console port on the front of the server has a ribbon cable......and it was not connected. Once reconnecting the cable. The server is 100% quiet!  Hope this helps you all!"

The point is: Think outside the box!

There will be all kinds of hooks in the code that, under certain conditions (ones that do not correlate with an alarm in CIMC), cause this behavior. They were put in there by product managers.

 

Finally, if you're uncertain about the validity of the intake temperatures, check them with a laser thermometer ($15) or a FLIR USB-C add-on for your mobile; those are down to $325 now (down from $30,000 15 years ago).

I’m on that as well. We replaced all the fans and power supplies, and the behavior remained exactly the same.
Thank you!

I tried disconnecting and reconnecting the ribbon cable on the Console port, but unfortunately the server is still making excessive noise. It didn't fix the issue in my case.

I decommissioned my M4s some time back, but maybe there is something odd in the settings. You could try resetting the CIMC to factory default. Has the CIMC been updated without using the HUU image? If the BIOS version and the CIMC version are out of sync, unpredictable things can happen.

The UCS was updated through the HUU image from version 2.0(4a) to 3.0(4), and subsequently to version 4.1(3). Excessive noise was present in all versions.

I will try restoring the factory settings.

Thank you.

Steven Tardy
Cisco Employee

Collect a tech support file from CIMC for review.

You can either upload it to the community forum or email it to me directly: sttardy @ cisco.com

We are observing abnormal behavior in the system logs related to temperature sensors and fan control. Below is an extract from the relevant log entries:


==== Fan Number: 26 Value: 184 Time: 460588
populateStorageCard: min_thr_temp: 90, max_temp: 35

==== Sensor Number: 67, Unavailable for 460588 ticks
==== Front Panel sensor data unavailable, set fan speed window to 90
==== Sensor Number: 211 Value: 24
==== Sensor Number: 212 Value: 22
==== Sensor Number: 213 Value: 21
==== Sensor Number: 214 Value: 24
==== Sensor Number: 81 Value: 27
==== Sensor Number: 87 Value: 25
==== Sensor Number: 179 Value: 25
==== Sensor Number: 142 Value: 32
==== Sensor Number: 501 Value: 35
==== CPU IDLE, Override minPWM to: 90
==== Zone: 1, Max Sensor Number: 81, Value: 27
==== numMaxSensors is now: 30
==== Altitude index = 1
==== FP speed: min = 90 max = 90
==== Set fan speed to 90
==== Zone: 2, Max Sensor Number: 211, Value: 24
==== numMaxSensors is now: 30
==== Altitude index = 1
==== FP speed: min = 90 max = 90
==== Set fan speed to 90
==== Zone: 3, Max Sensor Number: 211, Value: 24
==== numMaxSensors is now: 30
==== Altitude index = 1
==== FP speed: min = 90 max = 90
==== Set fan speed to 90

We noticed that Sensor Number: 67 has been unavailable for 460,588 ticks, which seems to be causing the system to lock the fan minimum PWM at 90% continuously. We believe this sensor might correspond to the ROC temperature sensor from the RAID controller.
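
The pattern in the extract above can be pulled out mechanically. This is a rough sketch only, assuming the line formats match the extract exactly; the regular expressions and the function name `summarize_fan_log` are illustrative and are not part of CIMC, and the sample is a shortened copy of the log lines above.

```python
import re

# Shortened sample taken from the log extract above.
LOG = """\
==== Fan Number: 26 Value: 184 Time: 460588
==== Sensor Number: 67, Unavailable for 460588 ticks
==== Front Panel sensor data unavailable, set fan speed window to 90
==== Sensor Number: 211 Value: 24
==== Sensor Number: 142 Value: 32
==== CPU IDLE, Override minPWM to: 90
==== FP speed: min = 90 max = 90
==== Set fan speed to 90
"""

UNAVAILABLE = re.compile(r"Sensor Number: (\d+), Unavailable for (\d+) ticks")
READING = re.compile(r"Sensor Number: (\d+) Value: (\d+)")
SET_SPEED = re.compile(r"Set fan speed to (\d+)")


def summarize_fan_log(log: str) -> dict:
    """Collect unavailable sensors, sensor readings, and applied fan speeds."""
    unavailable = {}  # sensor number -> ticks unavailable
    readings = {}     # sensor number -> reported value
    speeds = []       # every "Set fan speed to N" value, in order
    for line in log.splitlines():
        m = UNAVAILABLE.search(line)
        if m:
            unavailable[int(m.group(1))] = int(m.group(2))
            continue
        m = READING.search(line)
        if m:
            readings[int(m.group(1))] = int(m.group(2))
            continue
        m = SET_SPEED.search(line)
        if m:
            speeds.append(int(m.group(1)))
    return {"unavailable": unavailable, "readings": readings, "speeds": speeds}


summary = summarize_fan_log(LOG)
print(summary["unavailable"])  # {67: 460588}
print(summary["speeds"])       # [90]
```

In this sample every reported temperature is low while the only unavailable sensor is number 67, which is what points at the missing sensor, not at actual heat, as the reason for the fan speed of 90.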

Could the problem be the cable itself, or perhaps it isn't fully connected on one side or the other? I suppose it could also be the RAID module itself.

The cabling was checked, and the battery cable of the RAID controller was found disconnected. It was reconnected, the controller was disconnected and reconnected, and the server was restarted — but the fans are still running at full speed.

* We believe this sensor might correspond to the ROC temperature sensor from the RAID controller.

ROC – RAID on a Chip
AKA, the RAID HBA >:}

So is this an external thermocouple/RTD sensor wire, or something that should be read from a PCI address built into the IC?

