03-05-2015 06:53 AM - edited 03-01-2019 12:03 PM
Good afternoon, does anyone happen to know how to manage alerting from SEL logs? I can’t get a straight answer from Cisco. For example, we have blades crashing due to ECC errors.
Error Given below:
ff | 03/05/2015 07:40:34 | CIMC | Memory DDR3_P2_E0_ECC #0x7e | | read 406 correctable ECC errors on CPU2 DIMM E0 | Asserted
We occasionally get bad RAM or a bad motherboard in a blade. The alerts start appearing in the blade's SEL log, and the blade eventually crashes; it could take 2 hours or up to 8 hours. We get no alerts from UCS. The ESXi host eventually crashes, which, as expected, causes our VMs to fail over and causes minor outages. We have over 300 blades to manage, and this is becoming a nightmare in our environment. Any thoughts or suggestions? The built-in alerting doesn't seem to go into this level of detail on the SEL logs. I could have caught this proactively when the first ECC errors started happening, moved my VMs, and taken the blade into maintenance. Thanks!
03-05-2015 07:39 AM
I know this is not the answer you are looking for, but it may help.
In Cisco UCS Manager, the state of a Dual In-line Memory Module (DIMM) is based on SEL event records. When the BIOS encounters a noncorrectable memory error during memory test execution, the DIMM is marked as faulty. A faulty DIMM is considered a nonfunctional device.
If you enable DIMM blacklisting, Cisco UCS Manager monitors the memory test execution messages and blacklists any DIMMs that encounter memory errors in the DIMM SPD data, allowing the host to map out any DIMMs that encounter uncorrectable ECC errors.
For Cisco B-Series blade servers, the server firmware must be Release 2.2(1) or later.
Row Hammer Issue
- Industry-wide problem affecting all vendors
- Mostly seen in high-performance systems frequently accessing a shared memory location
- A row hammer pattern means accessing one or a pair of target rows many times within a single refresh period
- Results in significant charge loss and possible data loss due to increased leakage in victim rows
Row Hammer Fix
- Increase the refresh rate (default is 64 ms): reduces the amount of charge loss, effectively reducing the correctable memory error rate; 2x (32 ms) is recommended
- Decrease the patrol scrub interval: frequently checks each memory location to detect and correct errors (20 min)
03-05-2015 09:43 AM
Thank you for your response. We ordered 32 brand-new B200 M3 blades, and almost half had a manufacturing defect (Cisco verified) with slot E0, yet Cisco is only giving us a rolling stock of 4 blades to keep onsite for when these blades crash. I cannot believe that Cisco has no way of alerting on these errors.
03-05-2015 09:59 AM
Hi Bryan
Have you seen this
https://www.ciscolive.com/online/connect/sessionDetail.ww?SESSION_ID=8209&tclass=popup
Walter.
03-10-2015 05:34 AM
Bryan,
I have seen TAC cases for memory errors; I think you just need to use the correct filtering to make it work the way you want.
In regards to the errors, are you seeing these ECC errors very frequently on all your B200 M3 servers? How do you correct the failure? With a memory replacement, or do the issues clear by themselves after a reboot and then move over to another slot?
-Kenny
03-10-2015 05:34 AM
Fixing the errors has always ended up being either replacing the RAM or, when it's a bad DIMM slot on the motherboard, replacing the motherboard.
I can't find a way to make UCS filter only ECC errors and alert me when they occur.
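In the meantime, one workaround is to watch exported SEL entries yourself. Below is a minimal sketch; the threshold value is a made-up example, and the pattern is based only on the pipe-delimited line format quoted earlier in this thread:

```python
import re

# Hypothetical alert threshold; tune for your environment.
CECC_THRESHOLD = 100

# Matches CIMC correctable-ECC SEL entries such as:
# ff | 03/05/2015 07:40:34 | CIMC | Memory DDR3_P2_E0_ECC #0x7e | | read 406 correctable ECC errors on CPU2 DIMM E0 | Asserted
CECC_RE = re.compile(r"read (\d+) correctable ECC errors on (CPU\d+ DIMM \w+)")

def check_sel_line(line):
    """Return (dimm, count) when a SEL line reports correctable ECC
    errors at or above the threshold, otherwise None."""
    m = CECC_RE.search(line)
    if m and int(m.group(1)) >= CECC_THRESHOLD:
        return m.group(2), int(m.group(1))
    return None

line = ("ff | 03/05/2015 07:40:34 | CIMC | Memory DDR3_P2_E0_ECC #0x7e | | "
        "read 406 correctable ECC errors on CPU2 DIMM E0 | Asserted")
print(check_sel_line(line))  # → ('CPU2 DIMM E0', 406)
```

You could run something like this on a schedule against periodically exported SEL files and feed the hits into whatever paging system you already have.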
03-10-2015 05:53 AM
Bryan,
I don't think you can filter on regular ECC errors, as they are correctable and expected behavior for DIMMs. An ECC error takes place when there is a single-bit error on a DIMM, meaning a 0 was turned into a 1 (or vice versa); the error is corrected and is gone (that is normal behavior). You could be seeing hundreds of those errors a day. There is an ECC limit; when it is reached, uncorrectable ECC (UECC) errors take place, and that, most of the time, is when the DIMMs become inoperable.
When the DIMMs become inoperable, you normally get a message like this:
"DIMM 1/13 on server 5/3 operability: inoperable"
The above message is the one you can use to set up Call Home so it can alert you; I would use something like the images attached.
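If you also forward UCSM faults or SEL entries to a log collector, that inoperable message can be matched with a simple pattern. A minimal sketch; the field layout is an assumption based only on the single example quoted above:

```python
import re

# Pattern for the quoted fault text; "DIMM slot/number" and
# "chassis/slot" positions are assumed from that one example.
INOP_RE = re.compile(
    r"DIMM (?P<dimm>\d+/\d+) on server (?P<server>\d+/\d+) operability: inoperable"
)

m = INOP_RE.search("DIMM 1/13 on server 5/3 operability: inoperable")
print(m.group("dimm"), m.group("server"))  # → 1/13 5/3
```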
I hope that helps you move your VMs before the server crashes.
Now, in regards to the failed DIMMs and motherboards, have you had another issue with any of the motherboards replaced so far?
-Kenny
03-10-2015 06:02 AM
What version of firmware are you on? I don't see the screenshot mem-err.png
Yes, I have had to replace almost 16 motherboards due to a manufacturing defect. Our Cisco rep sent us 4 blades as hot spares, as the issue rears its head after the blade bakes for several days or weeks. It's not consistent.
03-10-2015 07:18 AM
Running 2.2(3d).
Have you sent any of those blades for an Engineering Failure Analysis (EFA) so TAC can get to the root cause of the issue? I mean, we can get into the logs and see if there is any known issue taking place against the blade(s), and if nothing matches, we can send it for a hardware analysis.
-Kenny
03-10-2015 07:26 AM
Well, Cisco has identified a certain batch with the defect but was not certain of the exact serial numbers affected. That's why they sent me some rotating stock.
03-10-2015 08:04 AM
OK, that's not normally done, so you are lucky :)
I hope the info about the Call Home setup helps you; if it does, feel free to rate the posts :)
-Kenny
03-10-2015 08:21 AM
Thanks for your help!
Yeah, we have a good rep. It's still a nightmare getting those spare blades swapped out with all the serial-number stuff ported over.
We have 10 UCS domains all over the world. The plan is to get them all upgraded to 2.2(3d); however, the blades affected by this issue are on 2.1(3c), which doesn't have the memory-error feature you described. I did turn it on for the 4 we have on 2.2(3d). Hopefully it will help.
03-10-2015 08:27 AM
Great. If you feel the thread can be marked as answered for future users facing the same challenge, please do so, so that they know you found what you were looking for.
Have a very nice day.
-Kenny
04-09-2019 08:34 PM
We had a new Cisco UCSB-B200-M5 blade lock up due to an "uncorrectable ECC" memory error.
VMware HA did its job and restarted ~30 VMs on the surviving cluster members.
I've sampled the SEL logs (manually exported) and see that "correctable ECC" memory errors are quite common:
SANESX43SEL.txt =========================
106 | 05/24/2018 13:52:36 | CIMC | Memory DDR4_P1_D1_ECC #0x8c | read 1 correctable ECC errors on CPU1 DIMM D1 | Asserted
107 | 05/28/2018 05:42:41 | CIMC | Memory DDR4_P1_D1_ECC #0x8c | read 1 correctable ECC errors on CPU1 DIMM D1 | Asserted
108 | 06/01/2018 18:27:56 | CIMC | Memory DDR4_P1_D1_ECC #0x8c | read 1 correctable ECC errors on CPU1 DIMM D1 | Asserted
109 | 06/02/2018 12:17:33 | CIMC | Memory DDR4_P1_D1_ECC #0x8c | read 1 correctable ECC errors on CPU1 DIMM D1 | Asserted
10a | 06/02/2018 23:42:28 | CIMC | Memory DDR4_P1_D1_ECC #0x8c | read 1 correctable ECC errors on CPU1 DIMM D1 | Asserted
SANESX44SEL.txt =========================
b7 | 02/23/2018 15:39:27 | CIMC | Memory DDR4_P1_B2_ECC #0x87 | read 6 correctable ECC errors on CPU1 DIMM B2 | Asserted
b8 | 02/23/2018 15:39:27 | CIMC | Memory DDR4_P1_B3_ECC #0x88 | read 9487 correctable ECC errors on CPU1 DIMM B3 | Asserted
b9 | 03/01/2018 03:30:08 | CIMC | Memory DDR4_P1_B3_ECC #0x88 | read 548 correctable ECC errors on CPU1 DIMM B3 | Asserted
ba | 03/01/2018 03:33:09 | CIMC | Memory DDR4_P1_B3_ECC #0x88 | read 11233 correctable ECC errors on CPU1 DIMM B3 | Asserted
bb | 03/01/2018 03:33:31 | CIMC | Memory DDR4_P1_B2_ECC #0x87 | read 14 correctable ECC errors on CPU1 DIMM B2 | Asserted
SANESX45SEL.txt =========================
b2 | 03/20/2018 16:17:29 | CIMC | Memory DDR4_P1_B1_ECC #0x86 | read 128 correctable ECC errors on CPU1 DIMM B1 | Asserted
b3 | 03/20/2018 19:10:32 | CIMC | Memory DDR4_P1_B1_ECC #0x86 | read 136 correctable ECC errors on CPU1 DIMM B1 | Asserted
b4 | 03/23/2018 13:31:39 | CIMC | Memory DDR4_P1_B1_ECC #0x86 | read 128 correctable ECC errors on CPU1 DIMM B1 | Asserted
b5 | 03/23/2018 16:24:37 | CIMC | Memory DDR4_P1_B1_ECC #0x86 | read 128 correctable ECC errors on CPU1 DIMM B1 | Asserted
b6 | 03/23/2018 19:17:35 | BIOS | Memory #0x02 | DURING RUNTIME: Uncorrectable ECC/other uncorrectable memory error | uncorrectable multibit memory error for CPU1 DIMM A1. | Asserted
SANESX46SEL.txt =========================
c4 | 03/13/2019 23:45:40 | CIMC | Memory DDR4_P2_G3_ECC #0x97 | read 256 correctable ECC errors on CPU2 DIMM G3 | Asserted
c5 | 03/13/2019 23:50:42 | CIMC | Memory DDR4_P2_G3_ECC #0x97 | read 509 correctable ECC errors on CPU2 DIMM G3 | Asserted
c6 | 03/13/2019 23:51:04 | CIMC | Memory DDR4_P2_G3_ECC #0x97 | read 30490 correctable ECC errors on CPU2 DIMM G3 | Asserted
c7 | 03/13/2019 23:52:26 | CIMC | Memory DDR4_P2_G3_ECC #0x97 | read 404 correctable ECC errors on CPU2 DIMM G3 | Asserted
c8 | 03/13/2019 23:52:47 | CIMC | Memory DDR4_P2_G3_ECC #0x97 | read 30603 correctable ECC errors on CPU2 DIMM G3 | Asserted
SANESX47SEL.txt =========================
be | 12/06/2018 22:46:22 | CIMC | Memory DDR4_P1_A1_ECC #0x83 | read 1 correctable ECC errors on CPU1 DIMM A1 | Asserted
bf | 12/06/2018 23:29:46 | CIMC | Memory DDR4_P1_A1_ECC #0x83 | read 1 correctable ECC errors on CPU1 DIMM A1 | Asserted
c0 | 12/07/2018 01:39:29 | CIMC | Memory DDR4_P1_A1_ECC #0x83 | read 1 correctable ECC errors on CPU1 DIMM A1 | Asserted
c1 | 12/07/2018 04:02:47 | CIMC | Memory DDR4_P1_A1_ECC #0x83 | read 1 correctable ECC errors on CPU1 DIMM A1 | Asserted
c2 | 12/07/2018 04:32:48 | CIMC | Memory DDR4_P1_A1_ECC #0x83 | read 1 correctable ECC errors on CPU1 DIMM A1 | Asserted
Can we have the SEL ECC memory error log messages sent to syslog (Splunk), or sent as SNMP traps, or otherwise proactively alert us BEFORE a DIMM's ECC errors become uncorrectable? That would allow proactive evacuation of the ESXi host (vMotion all VMs off) before it crashes.
We are already pursuing VMware's new predictive DRS for this scenario; we're looking for a better understanding of monitoring/alerting on the correctable -> uncorrectable ECC memory transition.
thanks!
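As a stopgap while waiting for a supported alerting path, the manually exported SEL files above can be aggregated per DIMM with a short script. A sketch assuming the pipe-delimited line format shown (file reading omitted; feed it the lines from each export):

```python
import re
from collections import defaultdict

# Based on the correctable-ECC line format in the exports above.
CECC_RE = re.compile(r"read (\d+) correctable ECC errors on (CPU\d+ DIMM \w+)")

def tally(lines):
    """Sum the reported correctable-ECC counts per DIMM across SEL lines."""
    totals = defaultdict(int)
    for line in lines:
        m = CECC_RE.search(line)
        if m:
            totals[m.group(2)] += int(m.group(1))
    return dict(totals)

sample = [
    "b7 | 02/23/2018 15:39:27 | CIMC | Memory DDR4_P1_B2_ECC #0x87 | read 6 correctable ECC errors on CPU1 DIMM B2 | Asserted",
    "b8 | 02/23/2018 15:39:27 | CIMC | Memory DDR4_P1_B3_ECC #0x88 | read 9487 correctable ECC errors on CPU1 DIMM B3 | Asserted",
    "bb | 03/01/2018 03:33:31 | CIMC | Memory DDR4_P1_B2_ECC #0x87 | read 14 correctable ECC errors on CPU1 DIMM B2 | Asserted",
]
print(tally(sample))  # → {'CPU1 DIMM B2': 20, 'CPU1 DIMM B3': 9487}
```

Sorting that output by total gives a quick worst-first list of DIMMs to watch or proactively evacuate.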