
C240-M5SX - UCSC-SAS-M5HD getting errant over temperature messages

I am in the process of migrating my lab from M4 to M5 servers, and I have hit a snag. The problem is with my main NAS, which is running TrueNAS 13.0. As part of the upgrade I ordered Western Digital 2TB WD Red SA500 NAS drives to replace the SATA HDDs in the old unit. I installed 2 of the brand-new drives, and after a few moments the fans went to full speed and the HBA showed those 2 drives as over temperature. It seems awfully unlikely that I got 2 DOA drives. I was able to create a ZFS pool on them, so the OS thinks they are performing correctly. BIOS version: C240M5.4.3.2a.0.0613231011, CIMC version: 4.2(3h).

Here are all the particulars.

 

                 Chassis Type          : Rack Mount Chassis
                 Chassis Part Number   : 74-105778-03
                 Chassis Extra         : Cisco Systems Inc
                 Board Mfg             : Cisco Systems Inc
                 Board Product         : UCSC-C240-M5SX
                 Board Part Number     : 74-105773-01
                 Board Extra           : D05V03
                 Board Extra           : 0000000000
                 Product Manufacturer  : Cisco Systems Inc
                 Product Name          : UCSC-C240-M5SX
                 Product Part Number   : 74-105778-03
                 Product Version       : A0
                 Product Asset Tag     : Unknown
                 Product Extra         : D0V03

CPU
  id: 1
  SocketDesignation: CPU1
  ProcessorManufacturer: Intel(R) Corporation
  ProcessorFamily: Xeon
  ThreadCount: 8
  SerialNumber: Not Specified
  ProcessorVersion: Intel(R) Xeon(R) Gold 6144 CPU @ 3.50GHz
  CurrentSpeed: 3500
  CoreCount: 8
  CoreEnabled: 8
  Status: Enabled
  MaxSpeed: 4000
  Signature: Type 0, Family 6, Model 85, Stepping 4

(12) DIMMs installed:
  Speed: 2933
  MemoryType: DDR4
  MemoryTypeDetail: Synchronous Registered (Buffered)
  BankLocator: NODE 0 CHANNEL 0 DIMM 0
  Manufacturer: 0xCE00
  AssetTag: 031922 (Mfg Location:0x03, Mfg Year:19, Mfg Week:22)
  PartNumber: M393A2K40CB2-CVF
  Visibility: Yes
  Operability: Operable
  DataWidth: 64 bits

PCI
  Slot: 5
  PID: NA
  VendorID: 0x8086
  DeviceID: 0x1583
  SubVendorID: 0x1137
  SubDeviceID: 0x013c
  OptionROMLoaded: 0
  ProductName: Intel XL710-QDA2 Dual Port 40Gb QSFP converged NIC
  fwVersion: 0x8000CC14-1.826.0

  Slot: 6
  PID: NA
  VendorID: 0x8086
  DeviceID: 0x2701
  SubVendorID: 0x8086
  SubDeviceID: 0x3904
  OptionROMLoaded: 0
  ProductName: Intel Mass storage controller
  fwVersion: N/A

  Slot: MRAID
  PID: NA
  VendorID: 0x1000
  DeviceID: 0x00ae
  SubVendorID: 0x1137
  SubDeviceID: 0x0211
  OptionROMLoaded: 0
  ProductName: Cisco 12G Modular SAS HBA (max 26 drives)
  fwVersion: 20.0.2.1

  Slot: L
  PID: NA
  VendorID: 0x8086
  DeviceID: 0x1563
  SubVendorID: 0x1137
  SubDeviceID: 0x01a4
  OptionROMLoaded: 0
  ProductName: Intel X550 LOM
  fwVersion: 0x800016F9-1.826.0

(2) of these:
  deviceID: PSU1
  Status: Present
  Status2: equipped
  InputWatt: 10
  Power: on
  OutputMaxWatt: N/A
  OutputWatt: 0
  fwVersion: 10052023
  Model: PS-2771-1S-LF
  MfgRev: 06
  ProductID: UCSC-PSU1-770W

These are the (2) problem SSDs
Western Digital 2TB WD Red SA500 NAS
Product ID: WDC  WDS200T1R0A-68A4W0

(2) of these as boot drives. No problems from them.
  Controller: MRAID
  Description: UNKNOWN
  PID: UNKNOWN
  Vendor: ATA
  Model: INTEL SSDSA2BZ200G3
  powMin: UNKNOWN
  powMax: UNKNOWN

 

5 Replies

This gets even stranger. Here is what the CIMC is reporting.

ElliotDierksen_0-1699551595883.png

Since there are no flames shooting out of my server, I feel pretty confident that the drives aren't actually operating at 1 million degrees. I found this Reddit post from a year ago describing the same behavior.

https://www.reddit.com/r/homelab/comments/xru0ct/cisco_ucsc240m5_reading_smart_temp_incorrectly/ 

I suppose I could maybe understand this if I were using SSDs from Joe's Used Auto Parts and Disk Drives, but these are WD drives intended for use in a NAS. The Reddit poster was using Seagate IronWolf, so that was also a quality drive. What the heck?

The CIMCs are designed for and optimized for gear on the server's spec sheet. The CIMC's capability catalog has profiles for the 'approved' gear from the spec sheet, and knows what the thermal/cooling requirements are.

Any time you use gear that is not Cisco-specific or does not carry Cisco-customized firmware, there is a possibility of mismatched polling values or differences in how the field values are processed. My guess is that the CIMC is reading some SMART field other than the actual temperature one, or the non-Cisco drive isn't presenting the temperature field the way the Cisco PID'd drives would.

If your installed OS supports smartctl, I'd be tempted to run some smartctl commands against those drives, see whether anything matches the "1769504" value, and work out which field that really is.
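A minimal sketch of that search, assuming the attribute table has already been captured as text (for example from `smartctl -A` against the drive; the device path depends on the OS and is not shown here). The function name `find_raw_matches` and the sample rows are only illustrative; the rows use the same column layout smartctl prints for ATA attributes.

```python
# Hypothetical helper: scan a smartctl attribute table for a given raw value.
SAMPLE = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0032   100   100   ---    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   ---    Old_age   Always       -       8
194 Temperature_Celsius     0x0022   073   032   ---    Old_age   Always       -       27 (Min/Max 25/32)
230 Media_Wearout_Indicator 0x0032   001   001   ---    Old_age   Always       -       0x000000000000
"""


def find_raw_matches(smart_text, target):
    """Return names of attributes whose leading RAW_VALUE token equals target."""
    matches = []
    for line in smart_text.splitlines():
        parts = line.split()
        # Attribute rows start with a numeric ID and have at least 10 columns.
        if len(parts) < 10 or not parts[0].isdigit():
            continue
        token = parts[9]  # first token of the RAW_VALUE column
        try:
            raw = int(token, 16) if token.lower().startswith("0x") else int(token)
        except ValueError:
            continue  # skip raw values that are not plain numbers
        if raw == target:
            matches.append(parts[1])
    return matches


print(find_raw_matches(SAMPLE, 1769504))  # no attribute carries this raw value
print(find_raw_matches(SAMPLE, 27))       # the temperature attribute
```

An empty result only means no attribute reports that number directly; the value could still be a combination of several fields.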

Kirk...

Here is one of them.

=== START OF INFORMATION SECTION ===
Model Family:     WD Blue / Red / Green SSDs
Device Model:     WDC  WDS200T1R0A-68A4W0
LU WWN Device Id: 5 001b44 8bf330887
Firmware Version: 411010WR
User Capacity:    2,000,398,934,016 bytes [2.00 TB]
Sector Size:      512 bytes logical/physical
Rotation Rate:    Solid State Device
Form Factor:      2.5 inches
TRIM Command:     Available, deterministic, zeroed
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ACS-4 T13/BSR INCITS 529 revision 5
SATA Version is:  SATA 3.3, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Tue Nov 14 09:31:09 2023 EST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x00) Offline data collection activity
                                        was never started.
                                        Auto Offline Data Collection: Disabled.
Self-test execution status:      ( 249) Self-test routine in progress...
                                        90% of test remaining.
Total time to complete Offline
data collection:                (    0) seconds.
Offline data collection
capabilities:                    (0x11) SMART execute Offline immediate.
                                        No Auto Offline data collection support.
                                        Suspend Offline collection upon new
                                        command.
                                        No Offline surface scan supported.
                                        Self-test supported.
                                        No Conveyance Self-test supported.
                                        No Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:        (   2) minutes.
Extended self-test routine
recommended polling time:        (  10) minutes.

SMART Attributes Data Structure revision number: 4
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0032   100   100   ---    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   100   100   ---    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   ---    Old_age   Always       -       8
165 Block_Erase_Count       0x0032   100   100   ---    Old_age   Always       -       0
166 Minimum_PE_Cycles_TLC   0x0032   100   100   ---    Old_age   Always       -       0
167 Max_Bad_Blocks_per_Die  0x0032   100   100   ---    Old_age   Always       -       72
168 Maximum_PE_Cycles_TLC   0x0032   100   100   ---    Old_age   Always       -       0
169 Total_Bad_Blocks        0x0032   100   100   ---    Old_age   Always       -       1046
170 Grown_Bad_Blocks        0x0032   100   100   ---    Old_age   Always       -       0
171 Program_Fail_Count      0x0032   100   100   ---    Old_age   Always       -       0
172 Erase_Fail_Count        0x0032   100   100   ---    Old_age   Always       -       0
173 Average_PE_Cycles_TLC   0x0032   100   100   ---    Old_age   Always       -       0
174 Unexpected_Power_Loss   0x0032   100   100   ---    Old_age   Always       -       1
184 End-to-End_Error        0x0032   100   100   ---    Old_age   Always       -       0
187 Reported_Uncorrect      0x0032   100   100   ---    Old_age   Always       -       0
188 Command_Timeout         0x0032   100   100   ---    Old_age   Always       -       0
194 Temperature_Celsius     0x0022   073   032   ---    Old_age   Always       -       27 (Min/Max 25/32)
199 UDMA_CRC_Error_Count    0x0032   100   100   ---    Old_age   Always       -       0
230 Media_Wearout_Indicator 0x0032   001   001   ---    Old_age   Always       -       0x000000000000
232 Available_Reservd_Space 0x0033   100   100   004    Pre-fail  Always       -       100
233 NAND_GB_Written_TLC     0x0032   100   100   ---    Old_age   Always       -       0
234 NAND_GB_Written_SLC     0x0032   100   100   ---    Old_age   Always       -       0
241 Host_Writes_GiB         0x0030   253   253   ---    Old_age   Offline      -       0
242 Host_Reads_GiB          0x0030   253   253   ---    Old_age   Offline      -       0
244 Temp_Throttle_Status    0x0032   000   100   ---    Old_age   Always       -       0

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed without error       00%         0         -
# 2  Short offline       Completed without error       00%         0         -
# 3  Short offline       Completed without error       00%         0         -

Selective Self-tests/Logging not supported

I don't see anything in there that matches the 1769504 number.

Steven Tardy
Cisco Employee

I saw this thread the other day and sent a link to someone in UCS engineering.

  • 1769504 decimal is 0x1B0020 hex.
  • 0x1B hex is 27 decimal.
  • 0x20 hex is 32 decimal.
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
..
194 Temperature_Celsius     0x0022   073   032   ---    Old_age   Always       -       27 (Min/Max 25/32)

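I don't know for certain, but it seems these values are getting read incorrectly: 27 matches the current temperature and 32 matches the maximum in the attribute 194 row, so the CIMC appears to be treating two separate fields as one large number.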
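The conversion above can be reproduced with a short sketch. It simply splits the reported number into 16-bit fields; the field width and the helper name `split_words` are assumptions chosen to match the 0x1B / 0x20 breakdown, not a statement about how the CIMC or the drive actually lays out the data.

```python
def split_words(value, width=16):
    """Split a non-negative integer into width-bit fields, most significant first."""
    mask = (1 << width) - 1
    fields = []
    while True:
        fields.append(value & mask)  # take the lowest field
        value >>= width              # drop it and move to the next one
        if value == 0:
            break
    return fields[::-1]


reported = 1769504           # value shown by the CIMC
print(hex(reported))         # 0x1b0020
print(split_words(reported)) # [27, 32]
```

Read this way, the "temperature" is 27 in the upper field and 32 in the lower one, which lines up with the current and maximum temperatures smartctl shows for attribute 194.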

Typically I would say open a TAC case and have TAC open a CDET to get this resolved, but if you are not using "Cisco certified" drives, then I'm pretty sure UCS engineering will consider this an "unsupported" use case and never fix the CDET or the underlying issue.

It may still be worth a TAC case just so the CDET gets created and investigated, to verify the issue and document it for others.

Thanks, I hadn't thought of the decimal-to-hex conversion. This is my home lab, so TAC isn't really an option. I suspect you are correct about the response being "unsupported" even if I did open a TAC case. I am starting down the road of using an LSI 9300 internal PCIe card for those drives, but that requires me to replace the internal SAS cables. This is proving to be a harder upgrade than I was expecting. Sigh....
