DIMM becoming inoperable

mepancha
Level 1

Hi,

On UCS B200-M2 blades, DIMMs are becoming inoperable/degraded with cause "equipment-inoperable". The server running on the affected blade then goes into a hung/degraded state.

When I reset the DIMMs physically or from the command line, they come back to an operable state. But in either case the server needs to be restarted.

Two questions:

How can I find the root cause of the DIMMs becoming inoperable?

And how can I clear this kind of error without affecting the server running on that blade?

6 Replies

Mathew Lewit
Cisco Employee

You may want to review this field notice and make sure it is not the problem.

http://www.cisco.com/en/US/ts/fn/633/fn63387.html

Thank you,

Hi Mathew,

Thanks a lot for replying. It looks like I may have to upgrade the UCS software. Let me try the solution provided at that link.

Mehul

kevin.goodman
Level 1

What version of firmware are you running? We ran into "Uncorrectable ECC/other uncorrectable memory error" errors when we were on v1.2. The servers were running Red Hat Linux and had no downtime due to the errors. It turned out to be an alert threshold for the low-power DIMMs on the blades. What we ran into is described here:

http://blog.colovirt.com/2010/06/16/hardware-cisco-ucs-memory-bug-b250-blades/

Luckily the issue was resolved in version 1.3. We have seen a few different DIMM errors in our UCS systems, but so far they have not caused us any downtime.

Kevin Goodman

http://blog.colovirt.com

kevin@colovirt.com

Hi,

We are running 1.3(1c). In our case, when we get these memory errors, I am able to clear them. But sometimes, while the errors are being cleared, the operating system running on the blade becomes inaccessible and restarts automatically as the service profile restarts. When that happens we have no control over it. We are running RHEL 5.

I have contacted TAC support and sent them some logs for investigation. I have not heard back yet, and we will probably upgrade to a higher firmware version this weekend if we do not get a reply from TAC.

Thanks,

Mehul

The B250 issues you experienced are likely different from what has been seen on the B200. After reviewing the blog post, I think you received either incomplete information, information that was thought to be true at the time but was later proven incorrect, or completely incorrect information altogether. The bug CSCtg34032 is for voltage errors only, not DIMM inoperable errors. You may also have run into voltage errors that caused the TAC engineer to point out CSCtg34032, but it should not have been mentioned as the cause of a DIMM inoperable issue. Most likely the DIMM inoperable errors you saw were really due to CSCtd37817, which is also corrected in 1.3(1c). Prior to this version on the B250, if the OS does a warm reboot, full channels or half channels of DIMMs may go into an inoperable state, and the SEL (system event log) will show them all as having uncorrectable ECC or other uncorrectable errors.

mtimm
Cisco Employee

As an update, Mehul disabled the C6 option in the BIOS, which is the workaround for the field notice Mat mentioned, and the DIMM errors have not returned so far, according to a short instant message conversation I had with him. However, it seems he would prefer not to mark this post "Answered" until he updates the firmware/BIOS to get the fix for the Intel C6 processor issue and verifies that the DIMM errors do not return.
