11-09-2023 07:13 AM - edited 11-09-2023 07:14 AM
I am in the process of migrating my lab from M4 to M5 servers, and I have hit a snag. The problem is with my main NAS, which runs TrueNAS 13.0. As part of the upgrade I ordered Western Digital 2TB WD Red SA500 NAS SSDs to replace the SATA HDDs in the old unit. I installed 2 of the brand new drives, and after a few moments the fans went to full speed and the HBA showed those 2 drives as over temperature. It seems awfully unlikely that I got 2 DOA drives. I was able to create a ZFS pool on them, so the OS thinks they are working correctly. BIOS version: C240M5.4.3.2a.0.0613231011, CIMC version: 4.2(3h).
Here are all the particulars.
Chassis Type : Rack Mount Chassis
Chassis Part Number : 74-105778-03
Chassis Extra : Cisco Systems Inc
Board Mfg : Cisco Systems Inc
Board Product : UCSC-C240-M5SX
Board Part Number : 74-105773-01
Board Extra : D05V03
Board Extra : 0000000000
Product Manufacturer : Cisco Systems Inc
Product Name : UCSC-C240-M5SX
Product Part Number : 74-105778-03
Product Version : A0
Product Asset Tag : Unknown
Product Extra : D0V03
CPU
id: 1
SocketDesignation: CPU1
ProcessorManufacturer: Intel(R) Corporation
ProcessorFamily: Xeon
ThreadCount: 8
SerialNumber: Not Specified
ProcessorVersion: Intel(R) Xeon(R) Gold 6144 CPU @ 3.50GHz
CurrentSpeed: 3500
CoreCount: 8
CoreEnabled: 8
Status: Enabled
MaxSpeed: 4000
Signature: Type 0, Family 6, Model 85, Stepping 4
(12) DIMMs installed:
Speed: 2933
MemoryType: DDR4
MemoryTypeDetail: Synchronous Registered (Buffered)
BankLocator: NODE 0 CHANNEL 0 DIMM 0
Manufacturer: 0xCE00
AssetTag: 031922 (Mfg Location:0x03, Mfg Year:19, Mfg Week:22)
PartNumber: M393A2K40CB2-CVF
Visibility: Yes
Operability: Operable
DataWidth: 64 bits
PCI
Slot: 5
PID: NA
VendorID: 0x8086
DeviceID: 0x1583
SubVendorID: 0x1137
SubDeviceID: 0x013c
OptionROMLoaded: 0
ProductName: Intel XL710-QDA2 Dual Port 40Gb QSFP converged NIC
fwVersion: 0x8000CC14-1.826.0
Slot: 6
PID: NA
VendorID: 0x8086
DeviceID: 0x2701
SubVendorID: 0x8086
SubDeviceID: 0x3904
OptionROMLoaded: 0
ProductName: Intel Mass storage controller
fwVersion: N/A
Slot: MRAID
PID: NA
VendorID: 0x1000
DeviceID: 0x00ae
SubVendorID: 0x1137
SubDeviceID: 0x0211
OptionROMLoaded: 0
ProductName: Cisco 12G Modular SAS HBA (max 26 drives)
fwVersion: 20.0.2.1
Slot: L
PID: NA
VendorID: 0x8086
DeviceID: 0x1563
SubVendorID: 0x1137
SubDeviceID: 0x01a4
OptionROMLoaded: 0
ProductName: Intel X550 LOM
fwVersion: 0x800016F9-1.826.0
(2) of these:
deviceID: PSU1
Status: Present
Status2: equipped
InputWatt: 10
Power: on
OutputMaxWatt: N/A
OutputWatt: 0
fwVersion: 10052023
Model: PS-2771-1S-LF
MfgRev: 06
ProductID: UCSC-PSU1-770W
These are the (2) problem SSDs
Western Digital 2TB WD Red SA500 NAS
Product ID: WDC WDS200T1R0A-68A4W0
(2) of these as boot drives. No problems from them.
Controller: MRAID
Description: UNKNOWN
PID: UNKNOWN
Vendor: ATA
Model: INTEL SSDSA2BZ200G3
powMin: UNKNOWN
powMax: UNKNOWN
11-09-2023 09:44 AM - edited 11-10-2023 12:24 PM
This gets even stranger. Here is what the CIMC is reporting.
Since there are no flames shooting out of my server, I feel pretty confident that the drives aren't actually operating at 1 million degrees. I found this Reddit post from a year ago describing the same problem.
https://www.reddit.com/r/homelab/comments/xru0ct/cisco_ucsc240m5_reading_smart_temp_incorrectly/
I suppose I could understand this if I were using SSDs from Joe's Used Auto Parts and Disk Drives, but these are WD drives intended for use in a NAS. The Reddit poster was using Seagate IronWolf, so that was also a quality drive. What the heck?
11-13-2023 05:12 AM - edited 11-13-2023 06:18 AM
The CIMCs are designed and optimized for the gear on the server's spec sheet. The CIMC's capability catalog has profiles for the 'approved' gear from the spec sheet, and knows what its thermal/cooling requirements are.
Any time you use gear that is not Cisco specific or does not carry customized Cisco firmware, you run the risk of mismatched polling values or differences in the way the field values are processed. My guess is that the CIMC is reading some SMART field other than the actual temperature one, or the non-Cisco drive isn't presenting the temperature field the way the Cisco PID'd drives would.
If your installed OS supports smartctl, I'd be tempted to run some smartctl commands against those drives and see if you get any matches for the "1769504" value, and see what field that really is.
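A rough sketch of that kind of search in Python, in case it helps. It assumes you have saved the attribute table (for example from `smartctl -A` against the drive) as text. The function name `find_raw_matches` and the two-row sample are just for illustration; the sample rows use the standard smartctl attribute layout.

```python
import re

# Two sample rows in the standard smartctl attribute-table layout.
SAMPLE = """\
194 Temperature_Celsius 0x0022 073 032 --- Old_age Always - 27 (Min/Max 25/32)
199 UDMA_CRC_Error_Count 0x0032 100 100 --- Old_age Always - 0
"""

def find_raw_matches(smart_text, target):
    """Return (id, name, raw) for rows whose RAW_VALUE contains target as a decimal number."""
    matches = []
    for line in smart_text.splitlines():
        # Columns: ID NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
        parts = line.split(None, 9)
        if len(parts) != 10 or not parts[0].isdigit() or not parts[2].startswith("0x"):
            continue  # not an attribute row
        raw = parts[9]
        if str(target) in re.findall(r"\d+", raw):
            matches.append((int(parts[0]), parts[1], raw))
    return matches

print(find_raw_matches(SAMPLE, 1769504))  # no row carries this number directly
print(find_raw_matches(SAMPLE, 27))       # the temperature row matches
```

This only looks for a literal decimal match, so an empty result does not rule out the value being built from several smaller fields.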
Kirk...
11-14-2023 06:34 AM
Here is one of them.
=== START OF INFORMATION SECTION ===
Model Family: WD Blue / Red / Green SSDs
Device Model: WDC WDS200T1R0A-68A4W0
LU WWN Device Id: 5 001b44 8bf330887
Firmware Version: 411010WR
User Capacity: 2,000,398,934,016 bytes [2.00 TB]
Sector Size: 512 bytes logical/physical
Rotation Rate: Solid State Device
Form Factor: 2.5 inches
TRIM Command: Available, deterministic, zeroed
Device is: In smartctl database [for details use: -P show]
ATA Version is: ACS-4 T13/BSR INCITS 529 revision 5
SATA Version is: SATA 3.3, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is: Tue Nov 14 09:31:09 2023 EST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
General SMART Values:
Offline data collection status: (0x00) Offline data collection activity
was never started.
Auto Offline Data Collection: Disabled.
Self-test execution status: ( 249) Self-test routine in progress...
90% of test remaining.
Total time to complete Offline
data collection: ( 0) seconds.
Offline data collection
capabilities: (0x11) SMART execute Offline immediate.
No Auto Offline data collection support.
Suspend Offline collection upon new
command.
No Offline surface scan supported.
Self-test supported.
No Conveyance Self-test supported.
No Selective Self-test supported.
SMART capabilities: (0x0003) Saves SMART data before entering
power-saving mode.
Supports SMART auto save timer.
Error logging capability: (0x01) Error logging supported.
General Purpose Logging supported.
Short self-test routine
recommended polling time: ( 2) minutes.
Extended self-test routine
recommended polling time: ( 10) minutes.
SMART Attributes Data Structure revision number: 4
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
5 Reallocated_Sector_Ct 0x0032 100 100 --- Old_age Always - 0
9 Power_On_Hours 0x0032 100 100 --- Old_age Always - 0
12 Power_Cycle_Count 0x0032 100 100 --- Old_age Always - 8
165 Block_Erase_Count 0x0032 100 100 --- Old_age Always - 0
166 Minimum_PE_Cycles_TLC 0x0032 100 100 --- Old_age Always - 0
167 Max_Bad_Blocks_per_Die 0x0032 100 100 --- Old_age Always - 72
168 Maximum_PE_Cycles_TLC 0x0032 100 100 --- Old_age Always - 0
169 Total_Bad_Blocks 0x0032 100 100 --- Old_age Always - 1046
170 Grown_Bad_Blocks 0x0032 100 100 --- Old_age Always - 0
171 Program_Fail_Count 0x0032 100 100 --- Old_age Always - 0
172 Erase_Fail_Count 0x0032 100 100 --- Old_age Always - 0
173 Average_PE_Cycles_TLC 0x0032 100 100 --- Old_age Always - 0
174 Unexpected_Power_Loss 0x0032 100 100 --- Old_age Always - 1
184 End-to-End_Error 0x0032 100 100 --- Old_age Always - 0
187 Reported_Uncorrect 0x0032 100 100 --- Old_age Always - 0
188 Command_Timeout 0x0032 100 100 --- Old_age Always - 0
194 Temperature_Celsius 0x0022 073 032 --- Old_age Always - 27 (Min/Max 25/32)
199 UDMA_CRC_Error_Count 0x0032 100 100 --- Old_age Always - 0
230 Media_Wearout_Indicator 0x0032 001 001 --- Old_age Always - 0x000000000000
232 Available_Reservd_Space 0x0033 100 100 004 Pre-fail Always - 100
233 NAND_GB_Written_TLC 0x0032 100 100 --- Old_age Always - 0
234 NAND_GB_Written_SLC 0x0032 100 100 --- Old_age Always - 0
241 Host_Writes_GiB 0x0030 253 253 --- Old_age Offline - 0
242 Host_Reads_GiB 0x0030 253 253 --- Old_age Offline - 0
244 Temp_Throttle_Status 0x0032 000 100 --- Old_age Always - 0
SMART Error Log Version: 1
No Errors Logged
SMART Self-test log structure revision number 1
Num Test_Description Status Remaining LifeTime(hours) LBA_of_first_error
# 1 Short offline Completed without error 00% 0 -
# 2 Short offline Completed without error 00% 0 -
# 3 Short offline Completed without error 00% 0 -
Selective Self-tests/Logging not supported
I don't see anything in there that matches the 1769504 number.
11-15-2023 07:28 AM
I saw this thread the other day and sent a link to someone in UCS engineering.
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
...
194 Temperature_Celsius 0x0022 073 032 --- Old_age Always - 27 (Min/Max 25/32)
I don't know for certain, but it seems these values are being read incorrectly.
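The numbers do line up if you convert the CIMC reading from decimal to hex: 1769504 is 0x1B0020, and splitting that into two 16-bit words gives 27 and 32, the same as the current and max temperatures in the smartctl line above. A minimal sketch of the arithmetic follows. The helper name `split_words` is made up for illustration, and the idea that the CIMC is treating a multi-field raw value as one integer is only a guess, not something confirmed.

```python
def split_words(value):
    """Split a 32-bit integer into its high and low 16-bit words."""
    return (value >> 16) & 0xFFFF, value & 0xFFFF

cimc_reading = 1769504  # the "temperature" the CIMC reports for the drive

high, low = split_words(cimc_reading)
print(hex(cimc_reading))  # 0x1b0020
print(high, low)          # 27 32
```

If that guess is right, the drive is reporting a perfectly normal 27 C and the CIMC is just combining two fields into one huge number.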
Typically I would say open a TAC case and have TAC open a CDET to get this resolved, but if you are not using "Cisco certified" drives, then I'm pretty sure UCS engineering will consider this an "unsupported" use case and never fix the CDET or the underlying issue.
It may still be worth a TAC case, just so the CDET gets created and investigated, the issue verified, and the result documented for others.
11-15-2023 08:03 AM
Thanks, I hadn't thought of the decimal to hex conversion. This is my home lab, so TAC isn't really an option. I suspect you are correct about the response being "unsupported" even if I did open a TAC case. I am starting down the road of using an LSI 9300 internal PCI card for those drives, but that requires me to replace the internal SAS cables. This is proving to be a harder upgrade than I was expecting. Sigh....