Solved: M4 F1237 error, sensor threshold crossed

ymaxwell3940 · ‎12-22-2015

Hi,

After upgrading my UCS Manager, a minor fault appeared on one of our B200 M4 blade servers (running 2.2(5c)). The chassis and Fabric Interconnects were also upgraded to 2.2(5c).

The full message is:

Health LED of server 1/6 shows error. Reason: DDR4_P2_H1_ECC:Sensor Threshold Crossed;

ID: 712960

Type: equipment

Cause: health-led-amber

Code: F1237

Does anyone have any suggestions to resolve this fault? I searched the forums and the documentation and googled like crazy with no success. I'm hoping this doesn't just mean we have a bad memory stick. I've also rebooted this blade twice to make sure this wasn't a transient error.

-Yahpri Maxwell

Wes Austin · ‎12-23-2015

Hello,

It appears you have a bad DIMM. I would open a TAC case or have the DIMM replaced in order to clear this error.

Let me know if you have any questions.

--Wes


Kirk J · ‎12-23-2015

You are on late enough firmware that I don't think you would be hitting the earlier firmware issues that triggered false positives.

The error line you mentioned is for DIMM slot H1 for Processor #2.
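To make that reading of the sensor name concrete, here is a minimal Python sketch that pulls the processor number and DIMM slot out of the fault reason. The naming pattern (memory type, P plus processor number, slot, sensor kind) is an assumption inferred only from the single DDR4_P2_H1_ECC name in this thread, and decode_dimm_sensor is a hypothetical helper, not part of any Cisco tool.

```python
import re

# Assumed pattern, inferred only from the one sensor name in this thread:
# DDR4_P2_H1_ECC -> memory type, processor number, DIMM slot, sensor kind.
SENSOR_RE = re.compile(
    r"(?P<mem>DDR\d)_P(?P<cpu>\d+)_(?P<slot>[A-Z]\d+)_(?P<kind>[A-Z]+)"
)


def decode_dimm_sensor(reason):
    """Return the processor and DIMM slot named in a fault reason, or None."""
    match = SENSOR_RE.search(reason)
    if match is None:
        return None
    return {
        "memory_type": match.group("mem"),
        "processor": int(match.group("cpu")),
        "slot": match.group("slot"),
        "sensor": match.group("kind"),
    }


reason = "DDR4_P2_H1_ECC:Sensor Threshold Crossed;"
print(decode_dimm_sensor(reason))
```

For the reason string from the fault above, this reports processor 2 and slot H1, which matches the slot to look for under Memory in the inventory.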

Because these are correctable memory errors (sometimes called single-bit errors), the system detects them and moves on without impact to the running OS. The only time you generally need to be concerned is when the error is uncorrectable (multi-bit).

You can likely clear the errors for that DIMM via the Equipment tab: go to Inventory, then Memory, and click on the H1 DIMM slot. You should see the 'reset memory errors' button.

Rebooting the CIMC from the "Recover Server" context in the Equipment view will also likely clear the error condition. The CIMC takes about 3-4 minutes to come back up and then does a shallow discovery on the server, while the server itself stays up and running.

I have seen a few occasions where vSphere continues to poll the SEL (which also logs the correctable memory events) and treats the historical SEL events as a current memory condition. If you run into this, clearing the SEL should clear the alert in vSphere.

If you are running vSphere and hit a PSOD during the time frame when correctable (ECC) errors were being logged, please see http://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=2140848. This seems to be something VMware is still working on.

Thanks,

Kirk

ymaxwell3940 · ‎12-23-2015

Kirk - Thanks for breaking down the error, possible causes, affected hardware, and offering solutions! Unfortunately, resetting memory errors for DIMM slot H1 didn't clear the minor fault. I waited about 15 minutes in case the system needed time to update the status. I also logged out of UCS Manager, logged back in, and tried the same reset memory errors option, to no avail.

I also rebooted the CIMC from the Recover Server menu, and that seems to have made things worse: I now see the critical fault and the major fault below.

Health LED of server 1/6 shows error. Reason: DDR4_P2_H1_ECC:Sensor Threshold Crossed;

ID: 733702

Type: equipment

Cause: health-led-amber-blinking

Code: F1236

Original severity: Critical

--------------------

Server 1/6 (service profile:org-root/ls-ileq1esx07_06p) health:inoperable

ID: 733306

Type: equipment

Cause: equipment-inoperable

Code: F0317

Original severity: Major

Any more ideas on how I should proceed?

david.duhaney · ‎03-21-2016

I had this exact issue as well, even after replacing the faulty DIMM. I would recommend running show health-led expand via SSH to get more details on the fault code; it will display the sensor ID, name, etc. Then run the commands listed below.

SSH into virtual management IP

scope server x/y

chassis/server # reset-all-memory-errors

chassis/server # commit-buffer

This cleared the error for me. Hope this helps. Thanks
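For anyone repeating this across several blades, the sequence above can be sketched as a small Python helper that only builds the CLI lines for a given chassis/blade pair. The function name memory_error_reset_commands is hypothetical, and placing show health-led expand inside the server scope is an assumption, since the post does not say which scope it was run from. The helper does not connect to anything; paste the output into your own SSH session.

```python
def memory_error_reset_commands(chassis, blade):
    """Build the UCS Manager CLI lines from the post above for one blade."""
    if chassis < 1 or blade < 1:
        raise ValueError("chassis and blade numbers start at 1")
    return [
        f"scope server {chassis}/{blade}",
        # Assumption: run from the server scope to inspect the sensor first.
        "show health-led expand",
        "reset-all-memory-errors",
        "commit-buffer",
    ]


# Server 1/6 is the blade from the original fault message.
for line in memory_error_reset_commands(1, 6):
    print(line)
```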
