03-05-2015 06:53 AM - edited 03-01-2019 12:03 PM
Good afternoon, does anyone happen to know how to manage alerting from SEL logs? I can’t get a straight answer from Cisco. For example, we have blades crashing due to ECC errors.
Error Given below:
ff | 03/05/2015 07:40:34 | CIMC | Memory DDR3_P2_E0_ECC #0x7e | | read 406 correctable ECC errors on CPU2 DIMM E0 | Asserted
We occasionally get bad RAM or a bad motherboard in a blade. The alerts start appearing in the blade's SEL log, and the blade eventually crashes; it could take 2 hours or up to 8 hours. We get no alerts from UCS. The ESXi host eventually crashes, which, as expected, causes our VMs to fail over and causes minor outages. We have over 300 blades to manage, and this is becoming a nightmare in our environment. Any thoughts or suggestions? The built-in alerting doesn't seem to go into this level of detail on the SEL logs. I could have caught this proactively when the first ECC errors started happening, moved my VMs, and taken the blade into maintenance. Thanks!
03-05-2015 07:39 AM
I know this is not the answer you are looking for, but it may help.
In Cisco UCS Manager, the state of a Dual In-line Memory Module (DIMM) is based on SEL event records. When the BIOS encounters a noncorrectable memory error during memory test execution, the DIMM is marked as faulty. A faulty DIMM is considered a nonfunctional device.
If you enable DIMM blacklisting, Cisco UCS Manager monitors the memory test execution messages and blacklists any DIMMs that encounter memory errors in the DIMM SPD data, allowing the host to map out any DIMMs that encounter uncorrectable ECC errors.
For Cisco B-Series blade servers, the server firmware must be Release 2.2(1) or later.
Row Hammer Issue
- Industry-wide problem affecting all vendors
- Mostly seen in high-performance systems frequently accessing a shared memory location
- A row hammer pattern means accessing one or a pair of target rows many times within a single refresh period
- Results in significant charge loss and possible data loss due to increased leakage in victim rows
Row Hammer Fix
- Increase the refresh rate (default is 64 ms): reduces the amount of charge loss, effectively reducing the correctable memory error rate; 2x (32 ms) is recommended
- Decrease the patrol scrub interval: frequently checks each memory location to detect and correct errors (20 min)
03-05-2015 09:43 AM
Thank you for your response. We ordered 32 brand-new B200 M3 blades, and almost half had a manufacturing defect (Cisco verified) with slot E0, yet Cisco is only giving us a rolling stock of 4 blades to keep onsite for when these blades crash. I cannot believe that Cisco has no way of alerting on these errors.
03-05-2015 09:59 AM
Hi Bryan
Have you seen this
https://www.ciscolive.com/online/connect/sessionDetail.ww?SESSION_ID=8209&tclass=popup
Walter.
03-10-2015 05:34 AM
Bryan,
I have seen TAC cases for memory errors; I think you just need to use the correct filtering to make it work the way you want.
In regards to the errors, are you seeing these ECC errors very frequently on all your B200 M3 servers? How do you correct the failure? With a memory replacement, or do the issues clear by themselves after a reboot and then move over to another slot?
-Kenny
03-10-2015 05:34 AM
Fixing the errors has always ended up being either replacing the RAM or, when it's a bad DIMM slot on the motherboard, replacing the motherboard.
I can't find a way to make UCS filter only ECC errors and alert me when they occur.
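In the meantime, one workaround is to watch exported SEL entries yourself. Below is a minimal sketch; the threshold value is a made-up example, and the pattern is based only on the pipe-delimited line format quoted earlier in this thread:

```python
import re

# Hypothetical alert threshold; tune for your environment.
CECC_THRESHOLD = 100

# Matches CIMC correctable-ECC SEL entries such as:
# ff | 03/05/2015 07:40:34 | CIMC | Memory DDR3_P2_E0_ECC #0x7e | | read 406 correctable ECC errors on CPU2 DIMM E0 | Asserted
CECC_RE = re.compile(r"read (\d+) correctable ECC errors on (CPU\d+ DIMM \w+)")

def check_sel_line(line):
    """Return (dimm, count) when a SEL line reports correctable ECC
    errors at or above the threshold, otherwise None."""
    m = CECC_RE.search(line)
    if m and int(m.group(1)) >= CECC_THRESHOLD:
        return m.group(2), int(m.group(1))
    return None

line = ("ff | 03/05/2015 07:40:34 | CIMC | Memory DDR3_P2_E0_ECC #0x7e | | "
        "read 406 correctable ECC errors on CPU2 DIMM E0 | Asserted")
print(check_sel_line(line))  # → ('CPU2 DIMM E0', 406)
```

You could run something like this on a schedule against periodically exported SEL files and feed the hits into whatever paging system you already have.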
03-10-2015 05:53 AM
Bryan,
I don't think you can filter on regular ECC errors, as they are correctable and expected behavior for DIMMs. An ECC error takes place when there is a single-bit error on a DIMM, meaning a 0 was turned into a 1 (or vice versa); the error is corrected and is gone (that is normal behavior). You could be seeing hundreds of those errors a day. There is an ECC limit; when it is reached, uncorrectable ECC (UECC) errors take place, and that, most of the time, is when the DIMMs become inoperable.
When the DIMMs become inoperable, you normally get a message like this:
"DIMM 1/13 on server 5/3 operability: inoperable"
The above message is the one you can use to set up Call Home so it can alert you; I would use something like the images attached.
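If you also forward UCSM faults or SEL entries to a log collector, that inoperable message can be matched with a simple pattern. A minimal sketch; the field layout is an assumption based only on the single example quoted above:

```python
import re

# Pattern for the quoted fault text; "DIMM slot/number" and
# "chassis/slot" positions are assumed from that one example.
INOP_RE = re.compile(
    r"DIMM (?P<dimm>\d+/\d+) on server (?P<server>\d+/\d+) operability: inoperable"
)

m = INOP_RE.search("DIMM 1/13 on server 5/3 operability: inoperable")
print(m.group("dimm"), m.group("server"))  # → 1/13 5/3
```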
I hope that helps you move your VMs before the server crashes.
Now, in regards to the failed DIMMs and motherboards, have you had another issue with any of the motherboards replaced so far?
-Kenny
03-10-2015 06:02 AM
What version of firmware are you on? I don't see the screenshot mem-err.png
Yes, I have had to replace almost 16 motherboards due to a manufacturing defect. Our Cisco rep sent us 4 blades as hot spares, as the issue rears its head after the blade bakes for several days or weeks. It's not consistent.
03-10-2015 07:18 AM
Running 2.2(3d).
Have you sent any of those blades for an Engineering Failure Analysis (EFA) so TAC can get to the root cause of the issue? I mean, we can get into the logs and see if there is any known issue taking place against the blade(s), and if nothing matches, we can send it for a hardware analysis.
-Kenny
03-10-2015 07:26 AM
Well, Cisco has identified a certain batch with the defect but was not certain of the exact serial numbers affected. That's why they sent me some rotating stock.
03-10-2015 08:04 AM
OK, that's not normally done, so you are lucky :)
I hope the info about the Call Home setup helps you; if it does, feel free to rate the posts :)
-Kenny
03-10-2015 08:21 AM
Thanks for your help!
Yeah, we have a good rep. It's still a nightmare getting those spare blades swapped out with all the serial-number stuff ported over.
We have 10 UCS domains all over the world. The plan is to get them all upgraded to 2.2(3d); however, the blades affected by this issue are on 2.1(3c), which doesn't have the memory-error feature you described. I did turn it on for the 4 we have on 2.2(3d). Hopefully it will help.
03-10-2015 08:27 AM
Great. If you feel the thread can be marked as answered for future users facing the same challenge, please do so, so that they know you found what you were looking for.
Have a very nice day.
-Kenny
04-09-2019 08:34 PM
We had a new Cisco UCSB-B200-M5 blade lock up due to an "uncorrectable ECC" memory error.
VMware HA did its job and restarted ~30 VMs on the surviving cluster members.
I've sampled the SEL logs (manually exported) and see that "correctable ECC" memory errors are quite common:
SANESX43SEL.txt =========================
106 | 05/24/2018 13:52:36 | CIMC | Memory DDR4_P1_D1_ECC #0x8c | read 1 correctable ECC errors on CPU1 DIMM D1 | Asserted
107 | 05/28/2018 05:42:41 | CIMC | Memory DDR4_P1_D1_ECC #0x8c | read 1 correctable ECC errors on CPU1 DIMM D1 | Asserted
108 | 06/01/2018 18:27:56 | CIMC | Memory DDR4_P1_D1_ECC #0x8c | read 1 correctable ECC errors on CPU1 DIMM D1 | Asserted
109 | 06/02/2018 12:17:33 | CIMC | Memory DDR4_P1_D1_ECC #0x8c | read 1 correctable ECC errors on CPU1 DIMM D1 | Asserted
10a | 06/02/2018 23:42:28 | CIMC | Memory DDR4_P1_D1_ECC #0x8c | read 1 correctable ECC errors on CPU1 DIMM D1 | Asserted
SANESX44SEL.txt =========================
b7 | 02/23/2018 15:39:27 | CIMC | Memory DDR4_P1_B2_ECC #0x87 | read 6 correctable ECC errors on CPU1 DIMM B2 | Asserted
b8 | 02/23/2018 15:39:27 | CIMC | Memory DDR4_P1_B3_ECC #0x88 | read 9487 correctable ECC errors on CPU1 DIMM B3 | Asserted
b9 | 03/01/2018 03:30:08 | CIMC | Memory DDR4_P1_B3_ECC #0x88 | read 548 correctable ECC errors on CPU1 DIMM B3 | Asserted
ba | 03/01/2018 03:33:09 | CIMC | Memory DDR4_P1_B3_ECC #0x88 | read 11233 correctable ECC errors on CPU1 DIMM B3 | Asserted
bb | 03/01/2018 03:33:31 | CIMC | Memory DDR4_P1_B2_ECC #0x87 | read 14 correctable ECC errors on CPU1 DIMM B2 | Asserted
SANESX45SEL.txt =========================
b2 | 03/20/2018 16:17:29 | CIMC | Memory DDR4_P1_B1_ECC #0x86 | read 128 correctable ECC errors on CPU1 DIMM B1 | Asserted
b3 | 03/20/2018 19:10:32 | CIMC | Memory DDR4_P1_B1_ECC #0x86 | read 136 correctable ECC errors on CPU1 DIMM B1 | Asserted
b4 | 03/23/2018 13:31:39 | CIMC | Memory DDR4_P1_B1_ECC #0x86 | read 128 correctable ECC errors on CPU1 DIMM B1 | Asserted
b5 | 03/23/2018 16:24:37 | CIMC | Memory DDR4_P1_B1_ECC #0x86 | read 128 correctable ECC errors on CPU1 DIMM B1 | Asserted
b6 | 03/23/2018 19:17:35 | BIOS | Memory #0x02 | DURING RUNTIME: Uncorrectable ECC/other uncorrectable memory error | uncorrectable multibit memory error for CPU1 DIMM A1. | Asserted
SANESX46SEL.txt =========================
c4 | 03/13/2019 23:45:40 | CIMC | Memory DDR4_P2_G3_ECC #0x97 | read 256 correctable ECC errors on CPU2 DIMM G3 | Asserted
c5 | 03/13/2019 23:50:42 | CIMC | Memory DDR4_P2_G3_ECC #0x97 | read 509 correctable ECC errors on CPU2 DIMM G3 | Asserted
c6 | 03/13/2019 23:51:04 | CIMC | Memory DDR4_P2_G3_ECC #0x97 | read 30490 correctable ECC errors on CPU2 DIMM G3 | Asserted
c7 | 03/13/2019 23:52:26 | CIMC | Memory DDR4_P2_G3_ECC #0x97 | read 404 correctable ECC errors on CPU2 DIMM G3 | Asserted
c8 | 03/13/2019 23:52:47 | CIMC | Memory DDR4_P2_G3_ECC #0x97 | read 30603 correctable ECC errors on CPU2 DIMM G3 | Asserted
SANESX47SEL.txt =========================
be | 12/06/2018 22:46:22 | CIMC | Memory DDR4_P1_A1_ECC #0x83 | read 1 correctable ECC errors on CPU1 DIMM A1 | Asserted
bf | 12/06/2018 23:29:46 | CIMC | Memory DDR4_P1_A1_ECC #0x83 | read 1 correctable ECC errors on CPU1 DIMM A1 | Asserted
c0 | 12/07/2018 01:39:29 | CIMC | Memory DDR4_P1_A1_ECC #0x83 | read 1 correctable ECC errors on CPU1 DIMM A1 | Asserted
c1 | 12/07/2018 04:02:47 | CIMC | Memory DDR4_P1_A1_ECC #0x83 | read 1 correctable ECC errors on CPU1 DIMM A1 | Asserted
c2 | 12/07/2018 04:32:48 | CIMC | Memory DDR4_P1_A1_ECC #0x83 | read 1 correctable ECC errors on CPU1 DIMM A1 | Asserted
Can we have the SEL ECC memory error log messages sent to syslog (Splunk), or sent as SNMP traps, or otherwise proactively alert us BEFORE a DIMM's ECC errors become uncorrectable? That would allow proactive evacuation of the ESXi host (vMotion all VMs off) before it crashes.
We are already pursuing VMware's new predictive DRS for this scenario; we're looking for a better understanding of monitoring/alerting on the correctable -> uncorrectable ECC memory transition.
thanks!
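As a stopgap while waiting for a supported alerting path, the manually exported SEL files above can be aggregated per DIMM with a short script. A sketch assuming the pipe-delimited line format shown (file reading omitted; feed it the lines from each export):

```python
import re
from collections import defaultdict

# Based on the correctable-ECC line format in the exports above.
CECC_RE = re.compile(r"read (\d+) correctable ECC errors on (CPU\d+ DIMM \w+)")

def tally(lines):
    """Sum the reported correctable-ECC counts per DIMM across SEL lines."""
    totals = defaultdict(int)
    for line in lines:
        m = CECC_RE.search(line)
        if m:
            totals[m.group(2)] += int(m.group(1))
    return dict(totals)

sample = [
    "b7 | 02/23/2018 15:39:27 | CIMC | Memory DDR4_P1_B2_ECC #0x87 | read 6 correctable ECC errors on CPU1 DIMM B2 | Asserted",
    "b8 | 02/23/2018 15:39:27 | CIMC | Memory DDR4_P1_B3_ECC #0x88 | read 9487 correctable ECC errors on CPU1 DIMM B3 | Asserted",
    "bb | 03/01/2018 03:33:31 | CIMC | Memory DDR4_P1_B2_ECC #0x87 | read 14 correctable ECC errors on CPU1 DIMM B2 | Asserted",
]
print(tally(sample))  # → {'CPU1 DIMM B2': 20, 'CPU1 DIMM B3': 9487}
```

Sorting that output by total gives a quick worst-first list of DIMMs to watch or proactively evacuate.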