02-29-2016 09:57 AM - edited 03-01-2019 12:37 PM
Hi there,
I'm looking for some help figuring out how to pin down a failing SSD in a RAID 10, preferably remotely, as the datacenter is an hour and a half away from me.
A week or so ago we began to get I/O latency warnings on a local VMware datastore. Things like this:
Device <local datastore> performance has deteriorated. I/O latency increased from average value of 201 microseconds to 7209 microseconds.
Then the latency would back off to a more workable level, only to trip a threshold again later. This datastore is two SSDs in a RAID 10 configuration and is being used as the ESXi OS boot partition plus a <100GB local datastore with one VM on it (a vShield Endpoint-compatible anti-virus scanner). It's pretty safe to say that one VM should not be thrashing an SSD under normal conditions, so I'm trying to figure out how I can tell whether one of my disks is failing even if UCSM doesn't see it as such.
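I'm not even sure the RAID controller passes SMART data through for the individual disks, but as a first pass from the ESXi shell I was planning to try something like this (the naa ID below is just a placeholder for whatever the device list reports):

esxcli storage core device list                        # find the local device identifier
esxcli storage core device smart get -d naa.xxxxxxxx   # SMART attributes, if the controller exposes them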
Thank you for your time and help.
02-29-2016 11:35 AM
Greetings.
I'm assuming your SEL log doesn't show any 'SAS' type alerts?
LSI does have a couple of utilities (storcli and lsi-get) that can be run from within the ESXi CLI to pull firmware term logs from the RAID controller.
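As a rough sketch (assuming the storcli VIB installs under /opt/lsi/storcli and the controller enumerates as /c0; adjust to match your host), the interesting bits are the per-drive error counters and the term log:

/opt/lsi/storcli/storcli /c0 show                # controller, virtual drive, and physical drive summary
/opt/lsi/storcli/storcli /c0/eall/sall show all  # per-drive media errors, other errors, predictive failure count, SMART alert flag
/opt/lsi/storcli/storcli /c0 show termlog > /tmp/termlog.txt   # firmware terminal log to attach to a TAC case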
What specific SSD drives are in use? Not all SSDs have fast throughput, especially for writes, and write performance tends to get worse when the drive is fairly full.
From the datastore/storage performance metrics, are both reads and writes showing high latencies?
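If you haven't already, esxtop from the ESXi shell will split that out for you (it's interactive, so this is just a sketch of the keys to hit):

esxtop
#  press 'u' for the disk device view: DAVG/cmd is device latency, KAVG/cmd is kernel latency
#  press 'v' for the VM disk view: LAT/rd and LAT/wr break out read vs write latency per VM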
You might want to open a TAC case to get help looking at the storage related logs.
Thanks,
Kirk...
02-29-2016 11:54 AM
Thanks Kirk,
Correct, I couldn't find any alerts in the SEL logs.
They are the 100GB enterprise drives (UCS-SSD100G0KA2-E), and the datastore only has the one VM on it with 60-some-odd GB free. We have 10 identical hosts (with the same local datastore configuration and VM count), but only this one seems to be experiencing any issues.
From what I can see it's mostly write latencies that spike. While I agree that not all SSDs are equal, any SSD should be able to handle one VM, so I don't think that is the specific issue.
I've worked a similar issue on another ESXi host, and TAC never mentioned the commands you listed, which is why I've been hesitant to call them right off the bat. We ended up replacing each disk, one at a time, to make the problem go away.
I did find something that said ESXi will throw an I/O alert whenever the latency changes by 20-30% or more. I'm trying to confirm whether that's actually the case. It may be that our issue is a couple of 'big' swings rather than an actual drive problem.
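In the meantime I'm pulling the raw warnings out of the vmkernel log to see whether the averages are climbing steadily or just jumping around (log path is the standard /var/log/vmkernel.log on our hosts):

grep "I/O latency increased" /var/log/vmkernel.log            # each line shows the old and new average values
grep -c "performance has deteriorated" /var/log/vmkernel.log  # how often it has tripped since the last rotation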
Thank you for your time and help.
Matt