C240 RAID 5 high disk / datastore latency spikes

Henry Lee · ‎08-09-2016

Anyone else seeing intermittent write latency spikes to local datastore with RAID 5 configuration? I'm seeing this behavior on multiple ESX servers, and the spikes surprise me because these servers aren't in production yet and they have very light load (about 4 VMs running idle). The spikes have some correlation to load but they seem to occur most when we power on or reboot our VMs. Constant idle latency is around 12-15 ms, which is acceptable. But the spikes are out of expected performance levels.

VMware ESXi, 6.0.0.

UCSC-C240-M4SX

Kirk J · ‎08-09-2016

Greetings.

I would expect to see some spikes during both shutdown and, especially, startup of guest VMs, due to the burst of I/O you get with the OS loading, initializing the page file, etc.

On the 12Gb SAS cards, the default write caching mode is write-through, so if you are using hard drives (rather than SSDs), I wanted to confirm that the VD for your RAID 5 is set to write-back with a good BBU. This assumes you have one of the 12Gb LSI-based SAS RAID controller models in use and that it has a superCap. Your average write latency is 1.822 ms, which is good, although you'll need some load to see what it is long term.

If you start all 4 VMs at around the same time, I would expect you to see some spikes.

Things like faster drive spindle RPMs or SSDs can reduce access times, increase IOPS, and reduce latency.

Thanks,

Kirk....

Henry Lee · ‎08-09-2016

Thanks for the reply Kirk. It's a MegaRAID SAS with hard drives. I'll check the configuration you suggested and reply if there's anything to add. I understand there would be some spikes but 90ms is surprising, and actually it regularly spikes much higher. For example, right now I'm seeing 2 spikes in the last hour that are 350 and 450ms.
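For anyone wanting to quantify spikes like these against the idle baseline, here is a minimal sketch. It assumes the latency samples have already been exported as a plain list of millisecond values (for example from the performance charts); the function name `find_spikes`, the `factor` threshold, and the sample data are all hypothetical and only loosely modelled on the figures in this thread.

```python
from statistics import median


def find_spikes(samples_ms, factor=5.0):
    """Return (index, value) pairs for samples above factor x the median.

    The median stands in for the idle baseline, so a handful of large
    spikes does not drag the threshold upward the way a mean would.
    """
    baseline = median(samples_ms)
    threshold = baseline * factor
    return [(i, v) for i, v in enumerate(samples_ms) if v > threshold]


# Hypothetical samples: idle values of 12-15 ms plus spikes of the sizes
# mentioned in this thread. These are not real measurements.
samples = [12, 14, 13, 90, 15, 12, 350, 13, 14, 450, 12]

for index, value in find_spikes(samples):
    print(f"sample {index}: {value} ms")
```

The multiplier of 5 is an arbitrary starting point; tune it to whatever you consider outside expected performance for your workload.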

I see this data sheet agrees with your response.

http://www.cisco.com/c/en/us/products/collateral/servers-unified-computing/ucs-c-series-rack-servers/whitepaper-C11-734798.html

So I'll make sure the "Write Back w/BBU" setting is selected.

Walter Dey · ‎08-09-2016

Hi Henry

I have seen this, and had a TAC case (P1) which took several days to solve (I hope).

The customer was also losing the datastore! Do you see this as well?

The suspect part is the RAID controller!

Field Notice: FN - 63732 - UCS Product Family - LSI RAID Controller Impacted by Several Critical Issues - Replacement Required

RMA of the RAID controller caused a ton of other problems.

Walter.

Kirk J · ‎08-10-2016

Hi Walter.

That FN applies to a small subset of the 6Gb 92xx line of controllers.

Henry has a 12Gb RAID card (assuming PID UCSC-MRAID12G).

I agree, if he is seeing datastore disconnects for local storage, then the local storage system is suspect.

Should probably check the UCS interop matrix to make sure we're good on the driver.

For 12G SAS/93XX controllers, use the lsi_mr3 driver, except when using VSAN; in that case, use the megaraid_sas driver.
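As a rough sketch, the driver rule above can be written down as a small check. It only encodes the rule as stated in this post for 12G SAS/93XX controllers and is not a substitute for the UCS interop matrix; the function names `recommended_driver` and `driver_matches` are hypothetical, and the loaded driver name is assumed to have been read off the host by hand.

```python
def recommended_driver(uses_vsan: bool) -> str:
    """Driver name for a 12G SAS/93XX controller, per the rule in this post."""
    return "megaraid_sas" if uses_vsan else "lsi_mr3"


def driver_matches(loaded_driver: str, uses_vsan: bool) -> bool:
    """Compare the driver currently in use with the recommended one."""
    return loaded_driver.strip().lower() == recommended_driver(uses_vsan)


# Example: a host without VSAN that is currently using megaraid_sas.
print(recommended_driver(uses_vsan=False))
print(driver_matches("megaraid_sas", uses_vsan=False))
```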

If Henry sees extended periods of spikes while the guest VMs are sitting idle, then he may want to open a TAC case and have them look at the RAID controller logs.

Thanks,

Kirk..