10-13-2023 09:19 AM
For unknown reasons, a blade in a slot has gone to status "Needs Resolution" with the UCSM errors shown below. The hardware is no longer on SmartNet support, so we can't open a new TAC case.
The warning is raised and cleared every 20 minutes, resulting in continuous alarms.
What will happen if I click "OK" (see the screenshot below)?
To resolve this, do I need to decommission/re-acknowledge the blade or otherwise force a discovery reset?
What are the consequences of leaving this unresolved? There are active workloads on this blade that are scheduled to be migrated to newer hardware in two weeks. It's just bad timing that we can't walk away from this without a continuous alarm.
Thanks.
Affected object: fabric/server/chassis-2/slot-6
10-13-2023 09:58 AM
It'll depend on what's going on with the CIMC from a log-review standpoint. Usually, if it's flapping like that, the CIMC is losing its connection to the IOM (memory leak, stalled CIMC process, hardware issue, etc.).
From an SSH session to the FI, this might provide some additional info if you let it tail the CIMC's message log until the next alarm:
connect cimc 2/6
messages follow
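Spelled out, the round trip looks roughly like this (the [ help ]# prompt is the CIMC debug shell as I recall it, and the UCS-A prompt is generic; Ctrl-C stops the tail, and exit returns you to the UCSM CLI):

UCS-A# connect cimc 2/6
[ help ]# messages follow <- tails the CIMC message log until interrupted
[ help ]# exit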
You can search any critical-looking log entries in the Cisco Bug Search Tool to see whether they match a known bug/workaround.
The consequences are hard to predict without a TAC review of the logs to understand why it's in that condition. I've seen the host OS be perfectly fine with the CIMC unresponsive, but I've also seen host OSes/virtual machines hang when they had a dependency on the CIMC (mapped virtual media or SD cards, for example) and could no longer reach it.
10-13-2023 10:59 AM - edited 10-13-2023 11:01 AM
Thanks for the quick response, Brian. Unfortunately, that CIMC is not responding to the connect command:
GHC-1N-FI-A-A# connect cimc 2/6
P.S. I was able to connect to blade 5 in the same chassis, which has no error condition.
10-13-2023 01:26 PM
I have looked through a few cases for this issue, and about half the time it is resolved by a procedure along these lines (a rough CLI sketch follows the list):
Disassociate the service profile
Decommission the server
Reset the server slot <- important
Reseat the blade
Reacknowledge the server
Associate the server profile
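For reference, the decommission and reacknowledge steps look roughly like this from the UCSM CLI. The generic UCS-A prompt is illustrative and the syntax is from the UCS Manager CLI guides as I remember them, so verify it against the guide for your release; the slot reset and physical reseat happen between the two blocks.

UCS-A# scope chassis 2
UCS-A /chassis # decommission server 6 <- removes the blade from the UCSM inventory
UCS-A /chassis # commit-buffer

(reset the slot, then physically reseat the blade)

UCS-A# acknowledge server 2/6 <- kicks off rediscovery of the reseated blade
UCS-A# commit-buffer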
The other half of the time, the issue is resolved by replacing the motherboard, which includes the CIMC.
Given the CIMC appears unresponsive to CLI connection attempts, I would try resetting the CIMC during a maintenance window. The procedure is in the UCS Manager Server Management Guide for your UCSM release.
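From memory of that guide, the CLI version of the CIMC reset looks something like this; confirm the exact syntax for your release before running it.

UCS-A# scope server 2/6
UCS-A /chassis/server # scope cimc
UCS-A /chassis/server/cimc # reset
UCS-A /chassis/server/cimc* # commit-buffer <- the CIMC reboots; the host OS should stay up, but schedule a window anyway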
Because the CIMC is unresponsive to CLI connection attempts, I don't think clicking OK to resolve the slot issue will help. The problem is with the CIMC-to-UCS Manager communication on the blade, not with the slot.
I personally don't think there is high risk in leaving this unresolved for a short period until the workloads are migrated, but it's impossible to predict whether this is the beginning of a cascading failure on the motherboard. Try to prioritize the workload migration if possible.