
When to replace disks

mikekulls
Level 1

We have 18 UCS servers, each with 12 disks, running Hadoop (MapR). MapR reports disk failures fairly often, roughly once per month. The problem is that UCS Manager only seems to report a disk as faulty if smoke is physically coming out of it. So we have been adding the disk back into MapR and waiting for MapR to report it as faulty again. I have talked with Cisco support and they certainly aren't keen on RMAing the disks. So the question is: at what point is it reasonable to replace a disk? I can see errors being reported by MapR, and SMART also shows a number of unrecoverable read errors. The number of errors varies from 3 to 140. Should SMART data be enough to get a disk replaced? In the future I would like to be more proactive and have disks replaced before MapR reports them as faulty.

2 Replies

Steven Tardy
Cisco Employee

What disk controller is in the server (RAID controller, HBA/SAS/pass-through controller)?

UCS is quite ignorant when it comes to the HBA, so faults will rarely show up within UCSM/CIMC.

Occasional read errors are almost a fact of life with modern disks, and it is up to the application/RAID/etc. to correct the data and rewrite the correct bits.

Provide OS and LSIGet logs when working with Cisco TAC.

If the OS/application is determining the disk to be unreliable and should be replaced, then the disk should be replaced.

Took me a while to find it, but this is it: LSI MegaRAID SAS 3108

 

I agree that it makes sense for the application to handle disk errors, and it does seem to do that up to a point. I think it reaches some sort of limit and then throws the disk out. If I run "smartctl --all /dev/sdX" I can see read/write errors. Some disks have just a single error; e.g., in the output below, the far-right column is the count of unrecoverable read errors. Is there some sort of limit at which Cisco accepts a disk as faulty?

 

Device: /dev/sdd
read: 0 223101 0 223101 12952681 120362.541 1
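For anyone wanting to check this counter across many disks, here is a minimal sketch of how the "read:" line above could be parsed. It assumes the standard column order of smartctl's SCSI/SAS error counter log (ECC fast, ECC delayed, rereads/rewrites, total corrected, correction algorithm invocations, gigabytes processed, total uncorrected). The names `ErrorCounters`, `parse_error_counter_line`, `needs_attention` and the `max_uncorrected` threshold are made up for illustration; nothing in this thread gives an actual limit that Cisco or MapR uses.

```python
from typing import NamedTuple, Optional


class ErrorCounters(NamedTuple):
    """One row of smartctl's SCSI error counter log (assumed column order)."""
    operation: str               # "read", "write" or "verify"
    ecc_fast: int
    ecc_delayed: int
    rereads_rewrites: int
    total_corrected: int
    algorithm_invocations: int
    gigabytes_processed: float
    total_uncorrected: int       # the far-right column


def parse_error_counter_line(line: str) -> Optional[ErrorCounters]:
    """Return parsed counters, or None if the line is not a counter row."""
    parts = line.split()
    if len(parts) != 8 or not parts[0].endswith(":"):
        return None
    operation = parts[0].rstrip(":")
    if operation not in ("read", "write", "verify"):
        return None
    try:
        return ErrorCounters(
            operation,
            int(parts[1]), int(parts[2]), int(parts[3]),
            int(parts[4]), int(parts[5]),
            float(parts[6]),
            int(parts[7]),
        )
    except ValueError:
        return None


def needs_attention(counters: ErrorCounters, max_uncorrected: int = 0) -> bool:
    """Hypothetical policy: flag the disk once uncorrected errors exceed a limit."""
    return counters.total_uncorrected > max_uncorrected


# The row reported for /dev/sdd above.
sample = "read: 0 223101 0 223101 12952681 120362.541 1"
counters = parse_error_counter_line(sample)
if counters is not None:
    print(counters.operation, counters.total_uncorrected, needs_attention(counters))
```

With the default threshold of 0, any uncorrected error flags the disk; raising `max_uncorrected` makes the check more tolerant. Which value is appropriate is a judgment call for the operator and support, not something this sketch decides.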
