UCS Blade ECC error alerting?

bryanbrooks1
Beginner

Good afternoon. Does anyone know how to manage alerting from SEL logs? I can't get a straight answer from Cisco. For example, we have blades that crash due to ECC errors.

Error Given below:

ff | 03/05/2015 07:40:34 | CIMC | Memory DDR3_P2_E0_ECC #0x7e | | read 406 correctable ECC errors on CPU2 DIMM E0 | Asserted

We occasionally get bad RAM or a bad MB on a blade. These alerts start appearing in the blade's SEL log, and the blade eventually crashes, anywhere from 2 to 8 hours later. UCS alerting raises no alerts. The ESXi host eventually crashes, which as expected causes our VMs to fail over and causes minor outages. We have 300+ blades to manage, and this is causing a nightmare in our environment. Any thoughts or suggestions? The built-in alerting doesn't seem to go into this level of detail on the SEL logs. Had I been alerted when the ECC errors first started, I could have proactively moved my VMs and put the blade into maintenance. Thanks!
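For readers who want to work with exported SEL text directly, here is a minimal Python sketch that parses a correctable-ECC line like the one quoted above. The field layout is inferred only from the sample lines in this thread, and the function name `parse_ecc_line` is made up for illustration; other SEL record types will simply not match.

```python
import re
from datetime import datetime

# Pattern inferred from the sample SEL lines in this thread (an assumption,
# not an official format description). Only correctable-ECC records match.
SEL_ECC = re.compile(
    r"^\s*(?P<id>[0-9a-f]+)\s*\|\s*"
    r"(?P<ts>\d{2}/\d{2}/\d{4} \d{2}:\d{2}:\d{2})\s*\|\s*"
    r"(?P<source>\w+)\s*\|\s*"
    r"Memory (?P<sensor>\S+) #0x[0-9a-f]+\s*\|.*?"
    r"read (?P<count>\d+) correctable ECC errors on "
    r"(?P<cpu>CPU\d+) DIMM (?P<dimm>\w+)",
    re.IGNORECASE,
)


def parse_ecc_line(line):
    """Return a dict describing one correctable-ECC SEL record, or None."""
    m = SEL_ECC.search(line)
    if m is None:
        return None
    return {
        "id": m.group("id"),
        # The samples look like month/day/year, so that is assumed here.
        "time": datetime.strptime(m.group("ts"), "%m/%d/%Y %H:%M:%S"),
        "source": m.group("source"),
        "sensor": m.group("sensor"),
        "cpu": m.group("cpu"),
        "dimm": m.group("dimm"),
        "count": int(m.group("count")),
    }


sample = ("ff | 03/05/2015 07:40:34 | CIMC | Memory DDR3_P2_E0_ECC #0x7e | | "
          "read 406 correctable ECC errors on CPU2 DIMM E0 | Asserted")
record = parse_ecc_line(sample)
```

Once a line is parsed into fields, `record["count"]`, `record["cpu"]` and `record["dimm"]` can be fed into whatever monitoring you already run.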

1 ACCEPTED SOLUTION

Bryan,

I don't think you can filter on just regular ECC errors, as they are correctable and an expected behavior of DIMMs. ECC errors take place when there is a single-bit error on a DIMM, meaning that a 0 was turned into a 1 (or vice versa) in the binary communication; the bit is corrected and the ECC error is gone (that is normal behavior). Now, you would have to be seeing hundreds of those errors a day, because there is an ECC limit that, when reached, leads to UECC (Uncorrectable ECC) errors, and that is, most of the time, when the DIMMs become inoperable.
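To make the single-bit correction described above concrete, here is a toy Python sketch using a Hamming(7,4) code. This is an illustration under assumptions only: real ECC DIMMs protect much wider words with stronger codes, and the names `encode` and `decode` are made up here. It shows one flipped bit being located and repaired, and two flipped bits defeating the correction.

```python
def encode(data):
    """Encode 4 data bits into a 7-bit Hamming codeword (positions 1..7)."""
    c = [0] * 8                       # index 0 unused, positions 1..7
    c[3], c[5], c[6], c[7] = data     # data bits sit at positions 3, 5, 6, 7
    c[1] = c[3] ^ c[5] ^ c[7]         # parity over positions with bit 0 set
    c[2] = c[3] ^ c[6] ^ c[7]         # parity over positions with bit 1 set
    c[4] = c[5] ^ c[6] ^ c[7]         # parity over positions with bit 2 set
    return c[1:]


def decode(codeword):
    """Return (data_bits, syndrome); a nonzero syndrome is the repaired position."""
    c = [0] + list(codeword)
    s1 = c[1] ^ c[3] ^ c[5] ^ c[7]
    s2 = c[2] ^ c[3] ^ c[6] ^ c[7]
    s4 = c[4] ^ c[5] ^ c[6] ^ c[7]
    syndrome = s1 + 2 * s2 + 4 * s4
    if syndrome:
        c[syndrome] ^= 1              # flip the bit the syndrome points at
    return [c[3], c[5], c[6], c[7]], syndrome


original = [1, 0, 1, 1]
word = encode(original)

# One flipped bit (position 5): the decoder finds and fixes it.
single = list(word)
single[4] ^= 1
fixed, position = decode(single)

# Two flipped bits (positions 5 and 6): this toy code "corrects" the wrong bit.
double = list(word)
double[4] ^= 1
double[5] ^= 1
wrong, _ = decode(double)
```

Here `fixed` equals `original`, while `wrong` does not, which is the toy version of correctable versus uncorrectable.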

When the DIMMs become inoperable, you normally get a message like this:

"DIMM 1/13 on server 5/3 operability: inoperable" 

The above message is the one you can use to set up Call Home so it can alert you, so I would use something like the images attached.

I hope that helps you move your VMs before the server crashes.

Now, in regards to the failed DIMMs and MOBOs, have you had another issue with any of the motherboards replaced so far?

 

-Kenny

14 REPLIES

Walter Dey
Advocate

I know this is not the answer you are looking for, but it may help

DIMM Blacklisting

In Cisco UCS Manager, the state of the Dual In-line Memory Module (DIMM) is based on SEL event records. When the BIOS encounters a noncorrectable memory error during memory test execution, the DIMM is marked as faulty. A faulty DIMM is considered a nonfunctional device.

If you enable DIMM blacklisting, Cisco UCS Manager monitors the memory test execution messages and blacklists any DIMMs that encounter memory errors in the DIMM SPD data. This allows the host to map out any DIMMs that encounter uncorrectable ECC errors.

For Cisco B-Series blade servers, the server firmware must be at Release 2.2(1) or a later release.

 

Row Hammer Issue

- Industry-wide problem affecting all vendors
- Mostly seen in high-performance systems that frequently access a shared memory location
- A row hammer pattern means accessing one or a pair of target rows multiple times within a single refresh period
- Results in significant charge loss and possible data loss due to increased leakage in victim rows

Row Hammer Fix

- Increase the refresh rate (the default refresh interval is 64 ms): this reduces the amount of charge loss and effectively reduces the correctable memory error rate. 2x (32 ms) is recommended.
- Decrease the Patrol Scrub time: patrol scrub then checks each memory location more frequently to detect and correct errors (20 min).

 

Thank you for your response. We ordered 32 brand new B200 M3 blades, and almost half had a manufacturing defect (Cisco verified) in slot E0. Cisco is only giving us a rolling stock of 4 blades to keep onsite for when these blades crash. I cannot believe that Cisco has no way of alerting on these errors.

Hi Bryan

Have you seen this?

https://www.ciscolive.com/online/connect/sessionDetail.ww?SESSION_ID=8209&tclass=popup

Walter.

Keny Perez
Collaborator

Bryan,

I have seen TAC cases for memory errors. I think you just need to use the correct filtering to make it work the way you want.

Regarding the errors, are you seeing these ECC errors very frequently on all your B200-M3 servers? How do you correct the failure? With a memory replacement, or do the issues clear by themselves after a reboot and then move over to another slot?

 

-Kenny

Fixing the errors has always ended up meaning replacing the RAM, or, when it's a bad DIMM slot on the MB, replacing the MB.

 I can't find a way to make UCS filter only ECC errors and alert me when they fail. 

What version of firmware are you on? I don't see the screenshot mem-err.png 

Yes, I have had to replace almost 16 MBs due to a manufacturing defect. Our Cisco rep sent us 4 blades as hot swaps, as the issue rears its head after the blade bakes for several days or weeks. It's not consistent.

 

 

Running 2.2.3d.

Have you sent any of those blades for an Engineering Failure Analysis (EFA) so TAC can get to the root cause of the issue? I mean, we can get into the logs and see if there is any known issue affecting the blade(s), and if nothing matches, we can send it for a hardware analysis.

 

-Kenny

Well, Cisco has identified a certain batch with the defect but was not certain of the exact serial numbers affected. That's why they sent me some rotating stock.

ok, that's not normally done so you are lucky :)

I hope the info about Call Home setup helps you; if it does, feel free to rate them :)

 

-Kenny

Thanks for your help!

Yeah we have a good rep.  Still a nightmare getting those spare blades switched out with all the serial stuff ported over. 

We have 10 UCS domains all over the world. The plan is to get them all upgraded to 2.2(3d). However, the blades affected by this issue are on code 2.1(3c), which doesn't have the memory error option you described. I did turn it on for the 4 we have on 2.2(3d). Hopefully it will help.

Great, if you feel like the thread can be marked as answered for future users facing the same challenge, please do so, so that they know you found what you were looking for.

Have a very nice day.

 

-Kenny

fcocquyt
Beginner

We had a new Cisco UCSB-B200-M5 blade lock up due to an "uncorrectable ECC" memory error.

VMware did its job with HA configured and restarted ~30 VMs on the surviving cluster members.

 

I've sampled the SEL logs (manually exported) and see the "correctable ECC" memory errors are quite common:

 

SANESX43SEL.txt =========================

106 | 05/24/2018 13:52:36 | CIMC | Memory DDR4_P1_D1_ECC #0x8c | read 1 correctable ECC errors on CPU1 DIMM D1  | Asserted

107 | 05/28/2018 05:42:41 | CIMC | Memory DDR4_P1_D1_ECC #0x8c | read 1 correctable ECC errors on CPU1 DIMM D1  | Asserted

108 | 06/01/2018 18:27:56 | CIMC | Memory DDR4_P1_D1_ECC #0x8c | read 1 correctable ECC errors on CPU1 DIMM D1  | Asserted

109 | 06/02/2018 12:17:33 | CIMC | Memory DDR4_P1_D1_ECC #0x8c | read 1 correctable ECC errors on CPU1 DIMM D1  | Asserted

10a | 06/02/2018 23:42:28 | CIMC | Memory DDR4_P1_D1_ECC #0x8c | read 1 correctable ECC errors on CPU1 DIMM D1  | Asserted

SANESX44SEL.txt =========================

b7 | 02/23/2018 15:39:27 | CIMC | Memory DDR4_P1_B2_ECC #0x87 | read 6 correctable ECC errors on CPU1 DIMM B2  | Asserted

b8 | 02/23/2018 15:39:27 | CIMC | Memory DDR4_P1_B3_ECC #0x88 | read 9487 correctable ECC errors on CPU1 DIMM B3  | Asserted

b9 | 03/01/2018 03:30:08 | CIMC | Memory DDR4_P1_B3_ECC #0x88 | read 548 correctable ECC errors on CPU1 DIMM B3  | Asserted

ba | 03/01/2018 03:33:09 | CIMC | Memory DDR4_P1_B3_ECC #0x88 | read 11233 correctable ECC errors on CPU1 DIMM B3  | Asserted

bb | 03/01/2018 03:33:31 | CIMC | Memory DDR4_P1_B2_ECC #0x87 | read 14 correctable ECC errors on CPU1 DIMM B2  | Asserted

SANESX45SEL.txt =========================

b2 | 03/20/2018 16:17:29 | CIMC | Memory DDR4_P1_B1_ECC #0x86 | read 128 correctable ECC errors on CPU1 DIMM B1  | Asserted

b3 | 03/20/2018 19:10:32 | CIMC | Memory DDR4_P1_B1_ECC #0x86 | read 136 correctable ECC errors on CPU1 DIMM B1  | Asserted

b4 | 03/23/2018 13:31:39 | CIMC | Memory DDR4_P1_B1_ECC #0x86 | read 128 correctable ECC errors on CPU1 DIMM B1  | Asserted

b5 | 03/23/2018 16:24:37 | CIMC | Memory DDR4_P1_B1_ECC #0x86 | read 128 correctable ECC errors on CPU1 DIMM B1  | Asserted

b6 | 03/23/2018 19:17:35 | BIOS | Memory #0x02 | DURING RUNTIME: Uncorrectable ECC/other uncorrectable memory error | uncorrectable multibit memory error for CPU1 DIMM A1. | Asserted

SANESX46SEL.txt =========================

c4 | 03/13/2019 23:45:40 | CIMC | Memory DDR4_P2_G3_ECC #0x97 | read 256 correctable ECC errors on CPU2 DIMM G3  | Asserted

c5 | 03/13/2019 23:50:42 | CIMC | Memory DDR4_P2_G3_ECC #0x97 | read 509 correctable ECC errors on CPU2 DIMM G3  | Asserted

c6 | 03/13/2019 23:51:04 | CIMC | Memory DDR4_P2_G3_ECC #0x97 | read 30490 correctable ECC errors on CPU2 DIMM G3  | Asserted

c7 | 03/13/2019 23:52:26 | CIMC | Memory DDR4_P2_G3_ECC #0x97 | read 404 correctable ECC errors on CPU2 DIMM G3  | Asserted

c8 | 03/13/2019 23:52:47 | CIMC | Memory DDR4_P2_G3_ECC #0x97 | read 30603 correctable ECC errors on CPU2 DIMM G3  | Asserted

SANESX47SEL.txt =========================

be | 12/06/2018 22:46:22 | CIMC | Memory DDR4_P1_A1_ECC #0x83 | read 1 correctable ECC errors on CPU1 DIMM A1  | Asserted

bf | 12/06/2018 23:29:46 | CIMC | Memory DDR4_P1_A1_ECC #0x83 | read 1 correctable ECC errors on CPU1 DIMM A1  | Asserted

c0 | 12/07/2018 01:39:29 | CIMC | Memory DDR4_P1_A1_ECC #0x83 | read 1 correctable ECC errors on CPU1 DIMM A1  | Asserted

c1 | 12/07/2018 04:02:47 | CIMC | Memory DDR4_P1_A1_ECC #0x83 | read 1 correctable ECC errors on CPU1 DIMM A1  | Asserted

c2 | 12/07/2018 04:32:48 | CIMC | Memory DDR4_P1_A1_ECC #0x83 | read 1 correctable ECC errors on CPU1 DIMM A1  | Asserted

 

Can we have the SEL ECC memory error log messages sent to syslog (Splunk), sent as SNMP traps, or otherwise used to alert us proactively BEFORE the DIMM ECC errors become uncorrectable? That would allow proactive evacuation of the ESX host (vMotion all VMs off) before ESX crashes.

 

We are already pursuing VMware's new predictive DRS for this scenario. We are looking for a better understanding of monitoring and alerting on the transition of ECC memory errors from correctable to uncorrectable.

 

thanks!
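For anyone landing here with the same question, here is a rough Python sketch of one way to flag noisy DIMMs from manually exported SEL text like the samples above. It is not a UCS feature: the regular expression is inferred from the lines in this post, and the function name `flag_noisy_dimms`, the threshold of 1000 errors and the 24-hour window are all arbitrary assumptions to be tuned.

```python
import re
from collections import defaultdict
from datetime import datetime, timedelta

# Inferred from the sample lines in this thread; an assumption, not a spec.
ECC = re.compile(
    r"(\d{2}/\d{2}/\d{4} \d{2}:\d{2}:\d{2}).*?"
    r"read (\d+) correctable ECC errors on (CPU\d+) DIMM (\w+)"
)


def flag_noisy_dimms(lines, threshold=1000, window=timedelta(hours=24)):
    """Return {(cpu, dimm): peak} for DIMMs whose correctable-error total
    inside any sliding `window` reaches `threshold` (both are guesses)."""
    events = defaultdict(list)
    for line in lines:
        m = ECC.search(line)
        if m:
            ts = datetime.strptime(m.group(1), "%m/%d/%Y %H:%M:%S")
            events[(m.group(3), m.group(4))].append((ts, int(m.group(2))))

    flagged = {}
    for key, evs in events.items():
        evs.sort()
        start = running = peak = 0
        for ts, count in evs:
            running += count
            # Drop events that have fallen out of the sliding window.
            while evs[start][0] < ts - window:
                running -= evs[start][1]
                start += 1
            peak = max(peak, running)
        if peak >= threshold:
            flagged[key] = peak
    return flagged


sanesx44 = [
    "b7 | 02/23/2018 15:39:27 | CIMC | Memory DDR4_P1_B2_ECC #0x87 | read 6 correctable ECC errors on CPU1 DIMM B2  | Asserted",
    "b8 | 02/23/2018 15:39:27 | CIMC | Memory DDR4_P1_B3_ECC #0x88 | read 9487 correctable ECC errors on CPU1 DIMM B3  | Asserted",
    "b9 | 03/01/2018 03:30:08 | CIMC | Memory DDR4_P1_B3_ECC #0x88 | read 548 correctable ECC errors on CPU1 DIMM B3  | Asserted",
    "ba | 03/01/2018 03:33:09 | CIMC | Memory DDR4_P1_B3_ECC #0x88 | read 11233 correctable ECC errors on CPU1 DIMM B3  | Asserted",
    "bb | 03/01/2018 03:33:31 | CIMC | Memory DDR4_P1_B2_ECC #0x87 | read 14 correctable ECC errors on CPU1 DIMM B2  | Asserted",
]
noisy = flag_noisy_dimms(sanesx44)
```

With these assumed settings only DIMM B3 on the SANESX44 sample is flagged. Treat this as a heuristic at best: in the SANESX45 sample the uncorrectable error hit DIMM A1 while the earlier correctable errors were all on B1.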
