
ECC error causes ESXi to hang

Steven Misrack
Level 1

Has anyone else experienced this problem?

I had an ECC error yesterday on my B200-M3 blade. The blade recovered properly, but the ESXi (5.1u1) software hung hard.

Is anyone aware of any bugs that might cause this problem?

Thanks,

     -Steve

6 Replies

mtimm
Cisco Employee

Hi Steven,

Yes, there is a voltage regulator issue on the B200-M3 that can cause this. I would suggest opening a case with TAC to verify whether this is indeed what happened.

Thanks,

Mike

Thanks Mike,

   I already opened a ticket: 627424227, but TAC said there is really nothing they can do about it.

I am waiting for an answer back from VMware.

     -Steve

mtimm
Cisco Employee

Hi Steven,

My advice would have been not to let the TAC engineer close it. I would not consider that type of response acceptable.

Please feel free to go back and request that the SR be reopened if you need assistance in resolving this issue. I'll contact the TAC case owner's manager as well and point to this CSC discussion, so you may get contacted because of that. Just looking over the data in the SR, I don't see anything that would allow me to rule out the voltage regulator issue I mentioned before. More info on the voltage regulator issue can be found here:

http://www.cisco.com/en/US/ts/fn/636/fn63651.html

Bottom line: initially the voltage regulator issue manifested itself as parity errors. Later it began to manifest as not only parity errors but also host hangs (CSCue04360). All of this info is readily accessible to TAC and was well distributed, so I'm a bit confused why you would receive the sort of response you received. I guess it is difficult to have hundreds of people all speak with the same tone of voice, if you will. Also, I am making a rather large assumption without looking at much information.

Mike

Hi Mike,

     After working with TAC, it was determined that because I cleared the ECC error (and the engineer could not find it in the event log) there was nothing else he could do. Since the machine was back up and running, I left it to VMware to identify and address.

Now that you have provided me with a new source of the problem, I will be re-opening the SR.

The very odd part of this is that I never got an auto-support (call home) for the ECC error, so I have no record of the problem other than the SYSLOG archive.

Thanks,

    -Steve

Hi Steven,

While this is odd, when I was going through the data from your system I did not see any faults in UCSM. It looks like "reset-errors" was used, which recalculates the counters based on the last read values; since the PECI bus hung, the value wouldn't have changed and the fault would have cleared. We don't trigger SR-opening messages from Call Home for DIMM degraded errors, even though our documentation appears to incorrectly state that we do (now I get to go bang on some doors about that issue too... grrr). We do open SRs for DIMM inoperable faults, based on a recommendation I made to the Call Home team around the end of 2011; it went into production in May 2012 but was broken and didn't actually work until later in 2012, though I don't remember the exact timeframe.

Thinking about your failure symptoms some more, one would imagine that if the host's CPUs could not be reached over the PECI bus for an extended period of time, CIMC should flag that as some sort of compute-inoperable fault, since both the ECC counters and the temperature status of the CPUs were unreadable by CIMC. However, I don't think we have any such faults at this point, nor any plans to implement any. I'll bring it up with some of the TAC escalation folks I communicate with regularly and see what they think. It seems like a reasonable request; however, this sort of issue is actually extremely rare.

As for which types of issues actually generate SRs based on faults, you can check here:

http://www.cisco.com/assets/services/smart-call-home/monitoring-details-for-smartnet-service/

Click on "Monitoring Details for Cisco SMARTnet Service," then "Monitoring Details by Product," and finally "Unified Computing Systems"; under there are the different classifications of faults and which ones open SRs in the currently active version of Smart Call Home. In this case, "sam:dme: fltMemoryUnitDegraded" shows that an SR should have been opened if the fault had been triggered, but I don't believe that - at least not based on current TAC SR records for Call Home and my own experience having worked on this technology (specifically memory-related issues on UCS) for several years.

Mike

Thanks. I was actually curious whether there is some sort of heartbeat that could detect that VMware is not running and reboot the blade. Something in the back of my mind says there should be something we can set to make this happen.
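For illustration only, here is a rough sketch of the kind of external heartbeat I have in mind (vSphere HA host monitoring would be the supported way to get this behavior). The addresses and credentials are placeholders, and it assumes IPMI-over-LAN access to the blade's management controller is enabled (on UCS that generally means an IPMI access profile in the service profile):

```python
#!/usr/bin/env python3
"""Heartbeat sketch: ping the ESXi management address and, after several
consecutive failures, power-cycle the blade over IPMI.

All hosts and credentials below are placeholders, and the ping flags
assume a Linux ping binary.
"""
import subprocess
import time

ESXI_MGMT_IP = "192.0.2.10"   # placeholder ESXi management address
BMC_IP = "192.0.2.100"        # placeholder IPMI/management controller address
IPMI_USER = "admin"           # placeholder credentials
IPMI_PASS = "password"
FAILURE_THRESHOLD = 5         # consecutive missed pings before acting
CHECK_INTERVAL = 30           # seconds between checks


def host_responds(ip: str) -> bool:
    """Return True if a single ICMP ping to the host succeeds."""
    result = subprocess.run(
        ["ping", "-c", "1", "-W", "2", ip],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


def power_cycle_blade() -> None:
    """Hard power-cycle the hung blade via ipmitool over the LAN interface."""
    subprocess.run(
        ["ipmitool", "-I", "lanplus", "-H", BMC_IP,
         "-U", IPMI_USER, "-P", IPMI_PASS,
         "chassis", "power", "cycle"],
        check=True,
    )


def main() -> None:
    failures = 0
    while True:
        if host_responds(ESXI_MGMT_IP):
            failures = 0
        else:
            failures += 1
            if failures >= FAILURE_THRESHOLD:
                power_cycle_blade()
                failures = 0
        time.sleep(CHECK_INTERVAL)


if __name__ == "__main__":
    main()
```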

I was looking at the rest of my B200-M3 blades and they all report the same errors in the log, so I am seeing this problem across my entire farm. I have had to replace a few memory DIMMs of late, so I now wonder whether those were real faults.

On every chassis with B200-M3 blades, the "Overall Status" is reporting an orange triangle with Voltage Problem as the status, but this alert is not escalating up to the top-level faults, so nothing in the system appears to be aware of it.
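To get a consolidated view of those voltage faults across the farm, something like the following sketch against the UCSM XML API might help. The UCSM address and credentials are placeholders; it just logs in, resolves all faultInst objects with configResolveClass, and prints them:

```python
#!/usr/bin/env python3
"""Sketch: list all active faults from UCS Manager via the XML API.

The UCSM VIP and credentials are placeholders, the third-party requests
library is assumed, and certificate verification is disabled for brevity.
"""
import xml.etree.ElementTree as ET

import requests

UCSM_URL = "https://ucsm.example.com/nuova"   # placeholder UCSM virtual IP
USER = "admin"                                 # placeholder credentials
PASSWORD = "password"


def xml_call(body: str) -> ET.Element:
    """POST an XML API request to UCSM and return the parsed response."""
    resp = requests.post(
        UCSM_URL,
        data=body,
        headers={"Content-Type": "application/xml"},
        verify=False,
        timeout=30,
    )
    resp.raise_for_status()
    return ET.fromstring(resp.text)


def main() -> None:
    # Log in and grab the session cookie.
    login = xml_call(f'<aaaLogin inName="{USER}" inPassword="{PASSWORD}" />')
    cookie = login.attrib["outCookie"]
    try:
        # Resolve every instance of the faultInst class (all active faults).
        faults = xml_call(
            f'<configResolveClass cookie="{cookie}" classId="faultInst" '
            'inHierarchical="false" />'
        )
        for fault in faults.iter("faultInst"):
            print(fault.attrib.get("severity"),
                  fault.attrib.get("dn"),
                  fault.attrib.get("descr"))
    finally:
        # Always release the session.
        xml_call(f'<aaaLogout inCookie="{cookie}" />')


if __name__ == "__main__":
    main()
```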

I am planning to upgrade to 2.1(1f) by the end of October. I am unable to do it any sooner. Hopefully, this will fix a number of other issues as well.

Steve
