Solved: Impact of Resolve Slot Issue Mismatch Identity Unestablishable

GeorgePerkins0204 · ‎10-13-2023

For unknown reasons, a blade's slot has gone to status "Needs Resolution" with the UCSM errors shown below. The hardware is no longer covered by SmartNet support, so we can't open a new case.

N20-C6508
UCS B200 M4

The warning is raised and cleared every 20 minutes, resulting in continuous alarms.

What will happen if I click "OK" (see screenshot below)?

To resolve this situation do I need to Decommission/Re-Acknowledge or otherwise force a discovery reset?

What are the consequences of leaving this situation unresolved? There are active workloads on this blade which are scheduled in two weeks to be migrated to newer hardware. It is just bad timing that we can't walk away from this without a continuous alarm.

Thanks.

fabric/server/chassis-2/slot-6

Description:[FSM:STAGE:REMOTE-ERROR]: Result: unidentified-fail Code: ERR-IBMC-fru-retrieval-error Message: Could not get Fru from 7f060206, dn=fabric/server/chassis-2/slot-6(sam:dme:FabricComputeSlotEpIdentify:ExecutePeer)

ID:5339560

Type:fsm

Cause:execute-local-failed

Created at:2023-10-12 17:50:09

Code:F77959

Number of Occurrences:104

Original severity:Warning

Affected object:fabric/server/chassis-2/slot-6

Description:[FSM:STAGE:RETRY:]: identifying a server in 2/6 via CIMC(FSM-STAGE:sam:dme:FabricComputeSlotEpIdentify:ExecutePeer)

ID:5340051

Type:fsm

Cause:execute-local-failed

Created at:2023-10-12 17:50:49

Code:F16519

Number of Occurrences:104

Original severity:Warning

Previous severity:Cleared
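For anyone who wants to track how often this pair of faults flaps, the fault details above can be split into fields with a short script. This is only a sketch: it assumes the details are pasted as `Key:value` lines exactly as shown, and `parse_fault` is a hypothetical helper name, not part of any Cisco tool.

```python
def parse_fault(text):
    """Split pasted UCSM fault details into a dict of field -> value.

    Assumes one "Key:value" pair per line, as in the paste above.
    Only the first colon separates key from value, so values that
    contain colons (descriptions, timestamps) stay intact.
    """
    fields = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or ":" not in line:
            continue  # skip blank lines and lines without a field separator
        key, value = line.split(":", 1)
        fields[key.strip()] = value.strip()
    return fields


# Field values copied from the first fault in this post.
sample = """\
ID:5339560
Type:fsm
Cause:execute-local-failed
Created at:2023-10-12 17:50:09
Code:F77959
Number of Occurrences:104
Original severity:Warning
"""

fault = parse_fault(sample)
print(fault["Code"], fault["Created at"], fault["Number of Occurrences"])
```

Comparing the "Created at" and "Number of Occurrences" fields between two pastes taken some time apart gives a rough idea of the flapping rate.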

bflowers# · ‎10-13-2023

I have looked through a few cases for this issue and it appears it is resolved about half the time by a procedure including:

Disassociate the service profile

Decommission the server

Reset the server slot <- important

Reseat the blade

Reacknowledge the server

Associate the service profile

The other half of the time, the issue is resolved by replacing the motherboard, which includes the CIMC.

Given the CIMC appears unresponsive to CLI connection attempts, I think I would try resetting the CIMC during a maintenance window. The procedure for that is in the UCS Manager Server Management Guide for your UCSM release.

Because the CIMC is unresponsive to the CLI connection attempts, I don't think clicking OK to resolve the slot issue will help. The problem is with the CIMC to UCS Manager communication on the blade, not with the slot.

I don't personally think there is a high risk in leaving this issue unresolved for a short period of time until the workloads are migrated, but it's impossible to predict whether this is the beginning of a cascade failure on the motherboard. Try to prioritize the workload migration if possible.


Brian Morrissey · ‎10-13-2023

It'll depend on what's going on with the CIMC from a log review standpoint. Usually, if it's flapping like that, the CIMC might be losing its connection to the IOM (memory leak, CIMC process stalls, hardware issue, etc.).

From an SSH session to the FI this might provide some additional info if you let it tail the CIMC's message log until the next alarm:
connect cimc 2/6
messages follow
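
The `2/6` argument is the chassis and slot from the affected object `fabric/server/chassis-2/slot-6`. If you are working from a list of fault DNs, a minimal sketch of that mapping could look like this. It assumes the DN always contains `chassis-<n>/slot-<n>`, and `cimc_target` is a made-up helper name, not a UCSM command:

```python
import re


def cimc_target(dn):
    """Return the "chassis/slot" string for a slot DN such as
    "fabric/server/chassis-2/slot-6".

    Assumption: the DN contains "chassis-<number>/slot-<number>".
    """
    match = re.search(r"chassis-(\d+)/slot-(\d+)", dn)
    if match is None:
        raise ValueError(f"no chassis/slot found in DN: {dn!r}")
    chassis, slot = match.groups()
    return f"{chassis}/{slot}"


# Build the command line to type on the FI for the slot in this thread.
print("connect cimc " + cimc_target("fabric/server/chassis-2/slot-6"))
```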

You can try searching on any critical-looking log entries in the bug search toolkit to see if they match an existing bug/workaround.

The consequences are a bit tough to predict without a TAC review of the logs to understand why it's in that condition. I've seen the host OS be perfectly okay with the CIMC unresponsive, but I've also seen host OS/virtual machines hang if they had a dependency on the CIMC, like mapped virtual media or SD cards, and couldn't access it anymore.

GeorgePerkins0204 · ‎10-13-2023

Thanks for the quick response Brian. Unfortunately, that CIMC is not responding to the connect command:

GHC-1N-FI-A-A# connect cimc 2/6
Trying 127.5.2.6...
telnet: Unable to connect to remote host: No route to host
GHC-1N-FI-A-A#

P.S. I was able to connect to the non-error-condition blade 5 in the same chassis.

bflowers# · ‎10-13-2023