12-22-2015 03:12 PM - edited 03-01-2019 12:31 PM
Hi,
After upgrading my UCS Manager, a minor fault appeared on one of our B200 M4 blade servers (running 2.2(5c)) once the chassis and Fabric Interconnects had been upgraded to 2.2(5c).
The full message is:
Health LED of server 1/6 shows error. Reason: DDR4_P2_H1_ECC:Sensor Threshold Crossed;
ID: 712960
Type: equipment
Cause: health-led-amber
Code: F1237
Does anyone have any suggestions for resolving this fault? I searched the forums and documentation and googled like crazy with no success. I'm hoping this doesn't just mean we have a bad memory stick. I've also rebooted this blade twice to make sure this wasn't a transient error.
-Yahpri Maxwell
12-23-2015 08:02 AM
You are on late enough firmware that I don't think you are hitting any of the earlier firmware issues that triggered false positives.
The error line you mentioned is for DIMM slot H1 for Processor #2.
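As a minimal sketch of that decoding, assuming the sensor name always follows the pattern seen in this one fault (`DDR4_P<processor>_<slot>_ECC`). The regex and the function name `decode_dimm_sensor` are illustrative only and are not part of any Cisco tool:

```python
import re

# Assumed pattern, inferred only from the single sensor name in this thread.
SENSOR_RE = re.compile(r"^(?P<type>DDR\d)_P(?P<cpu>\d+)_(?P<slot>[A-Z]\d+)_ECC$")


def decode_dimm_sensor(name):
    """Return (memory type, processor number, DIMM slot), or None if no match."""
    match = SENSOR_RE.match(name.strip())
    if match is None:
        return None
    return match.group("type"), int(match.group("cpu")), match.group("slot")


# The "Reason" text from the fault; the sensor name is the part before the colon.
reason = "DDR4_P2_H1_ECC:Sensor Threshold Crossed;"
sensor = reason.split(":", 1)[0]
print(decode_dimm_sensor(sensor))  # ('DDR4', 2, 'H1')
```

Names that do not fit the assumed pattern return `None` rather than a guess.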
Because these are correctable memory errors (sometimes called single-bit errors), the system detects them and moves on without impact to the running OS. Generally, the only time you need to be concerned is with an uncorrectable (multi-bit) error.
You can likely clear the errors for that DIMM from the Equipment tab: go to Inventory, then Memory, and click on the H1 DIMM slot. You should see the 'reset memory errors' button.
Rebooting the CIMC from the 'recover server' context in the equipment view will also likely clear the error condition. The CIMC takes about 3-4 minutes to come back up and then does a shallow discovery on the server, which stays up and running throughout.
I have seen a few occasions where vSphere continues to poll the SEL (which also logs the correctable memory events) and treats the historical SEL events as a current memory condition. If you run into this, clearing the SEL should remove the alert in vSphere.
If you are running vSphere and hit a PSOD during the time frame when correctable (ECC) error counters were being logged, please see http://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=2140848. This seems to be something VMware is still working on.
Thanks,
Kirk
12-23-2015 09:51 AM
Kirk - Thanks for breaking down the error, possible causes, affected hardware, and offering solutions! Unfortunately, resetting memory errors for DIMM slot H1 didn't clear the minor fault. I waited about 15 minutes in case the system needed time to update the status. I then logged out of UCS Manager, logged back in, and tried the same reset memory errors option, to no avail.
I also rebooted the CIMC from the Recover Server menu, and that seems to have made things worse: I now see this critical warning and the major warning below.
Health LED of server 1/6 shows error. Reason: DDR4_P2_H1_ECC:Sensor Threshold Crossed;
ID: 733702
Type: equipment
Cause: health-led-amber-blinking
Code: F1236
Original severity: Critical
--------------------
Server 1/6 (service profile:org-root/ls-ileq1esx07_06p) health:inoperable
ID: 733306
Type: equipment
Cause: equipment-inoperable
Code: F0317
Original severity: Major
Any more ideas on how I should proceed?
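For anyone comparing fault dumps like the two above, here is a minimal sketch that splits pasted fault text into one record per fault. It assumes the layout shown in this post (a description line, `Key: value` fields, and a dashed separator between faults); the function name `parse_faults` and the `Description` key are made up for illustration:

```python
# Field names as they appear in the pasted faults above.
FIELDS = {"ID", "Type", "Cause", "Code", "Original severity"}


def parse_faults(text):
    """Split pasted fault text into a list of dicts, one per fault."""
    faults, current = [], {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if set(line) == {"-"}:  # a dashed line separates two faults
            if current:
                faults.append(current)
            current = {}
            continue
        key, sep, value = line.partition(": ")
        if sep and key in FIELDS:
            current[key] = value
        else:
            # Anything that is not a known field is treated as the description.
            current["Description"] = line
    if current:
        faults.append(current)
    return faults


pasted = """\
Health LED of server 1/6 shows error. Reason: DDR4_P2_H1_ECC:Sensor Threshold Crossed;
ID: 733702
Type: equipment
Cause: health-led-amber-blinking
Code: F1236
Original severity: Critical
--------------------
Server 1/6 (service profile:org-root/ls-ileq1esx07_06p) health:inoperable
ID: 733306
Type: equipment
Cause: equipment-inoperable
Code: F0317
Original severity: Major
"""

for fault in parse_faults(pasted):
    print(fault["Code"], fault["Original severity"], fault["Cause"])
```

This only reshapes text that has already been copied out of UCS Manager; it does not query the system.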
12-23-2015 04:38 PM
Hello,
It appears you have a bad DIMM. I would open a TAC case or have the DIMM replaced in order to clear this error.
Let me know if you have any questions.
--Wes
03-21-2016 11:58 AM
I had this exact issue as well, even after replacing the faulty DIMM. I would recommend running show health-led expand via SSH to get more details on the fault, which will display the sensor ID, name, etc. Then run the commands listed below.
SSH into the virtual management IP:
scope server x/y
chassis/server # reset-all-memory-errors
chassis/server # commit-buffer
This cleared the error for me. Hope this helps. Thanks
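If you need to repeat this for several blades, a minimal sketch like the following can generate the command sequence above for a given chassis and server. It only builds and prints the text to paste into the SSH session; it does not connect to anything, and the function name `reset_memory_error_commands` is hypothetical:

```python
def reset_memory_error_commands(chassis, server):
    """Build the UCS Manager CLI lines from this post for one blade."""
    if chassis < 1 or server < 1:
        raise ValueError("chassis and server IDs must be positive")
    return [
        f"scope server {chassis}/{server}",  # x/y is chassis/server
        "reset-all-memory-errors",
        "commit-buffer",                     # apply the pending change
    ]


# Server 1/6 is the blade from the original fault message.
for line in reset_memory_error_commands(1, 6):
    print(line)
```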