02-09-2021 03:17 PM
We have 18 UCS servers with 12 disks each, running Hadoop (MapR). MapR reports disk failures fairly often, roughly once per month. The problem is that UCS Manager only seems to report a disk as faulty if smoke is physically coming out of it. So we have been adding the disk back into MapR and waiting for it to be reported as faulty again. I have talked with Cisco support and they certainly aren't keen on RMAing the disks. So the question is: at what point is it reasonable to replace a disk? I can see errors being reported by MapR, and SMART also shows a number of unrecoverable read errors. The number of errors varies from 3 to 140. Should SMART data be enough to get a disk replaced? In the future I would like to have disks replaced before MapR reports them as faulty, so basically be a bit more proactive.
02-09-2021 05:21 PM
What disk controller is in the server (RAID controller, HBA/SAS/pass-through controller)?
UCS is quite ignorant when it comes to the HBA, so disk faults will rarely show up in UCSM/CIMC.
Occasional read errors are almost a fact of life with modern disks, and it is up to the application/RAID/etc. to correct the data and re-write the correct bits.
Provide OS and LSIGet logs when working with Cisco TAC.
If the OS/application is determining the disk to be unreliable and should be replaced, then the disk should be replaced.
02-10-2021 05:13 PM
Took me a while to find it but this is it: LSI MegaRAID SAS 3108
I agree that it makes sense for the application to handle disk errors, and it does seem to do that up to a point. I think it reaches some sort of limit and then throws the disk out. If I run "smartctl --all /dev/sdX" I can see read/write errors. Some disks have just a single error; in the example output here, the far-right column is the unrecoverable read error count. Is there some threshold at which Cisco accepts that a disk is faulty?
Device: /dev/sdd
read: 0 223101 0 223101 12952681 120362.541 1
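To track that far-right column across all disks without reading it by eye, the "read:" line can be parsed with a short script. This is only a rough sketch: it assumes the seven numbers follow the usual smartctl SCSI error counter log order (ECC fast, ECC delayed, rereads/rewrites, total corrected, correction algorithm invocations, gigabytes processed, total uncorrected), and the names `ErrorCounters` and `parse_counter_line` are made up for illustration.

```python
from typing import NamedTuple, Optional


class ErrorCounters(NamedTuple):
    """One row of the error counter log (column order is an assumption)."""
    ecc_fast: int
    ecc_delayed: int
    rereads_rewrites: int
    total_corrected: int
    correction_invocations: int
    gigabytes_processed: float
    total_uncorrected: int


def parse_counter_line(line: str) -> Optional[ErrorCounters]:
    """Parse a line such as 'read: 0 223101 ...'; return None if it does not match."""
    parts = line.split()
    # Expect a label ending in ':' followed by exactly seven numeric fields.
    if len(parts) != 8 or not parts[0].endswith(":"):
        return None
    try:
        int_fields = [int(p) for p in parts[1:6]]
        gigabytes = float(parts[6])
        uncorrected = int(parts[7])
    except ValueError:
        return None
    return ErrorCounters(*int_fields, gigabytes, uncorrected)


# The line posted above for /dev/sdd.
sample = "read: 0 223101 0 223101 12952681 120362.541 1"
counters = parse_counter_line(sample)
if counters is not None:
    print("uncorrected read errors:", counters.total_uncorrected)
    print("corrected read errors:", counters.total_corrected)
```

Running this against each disk on a schedule and keeping the results would show whether the uncorrected count is growing over time, which is likely a stronger argument for a replacement than a single snapshot.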