Impact of Resolve Slot Issue Mismatch Identity Unestablishable

For unknown reasons, a blade has entered the "Needs Resolution" state with the UCSM faults shown below. The hardware is no longer covered by SmartNet support, so I can't open a new case.

  • N20-C6508
  • UCS B200 M4 

The warning is raised and cleared every 20 minutes, resulting in continuous alarms.

What will happen if I click "OK" (see the screenshot below)?

To resolve this situation, do I need to Decommission/Re-Acknowledge or otherwise force a discovery reset?

What are the consequences of leaving this situation unresolved? There are active workloads on this blade which are scheduled in two weeks to be migrated to newer hardware. It is just bad timing that we can't walk away from this without a continuous alarm.

Thanks.

 

fabric/server/chassis-2/slot-6
Description:[FSM:STAGE:REMOTE-ERROR]: Result: unidentified-fail Code: ERR-IBMC-fru-retrieval-error Message: Could not get Fru from 7f060206, dn=fabric/server/chassis-2/slot-6(sam:dme:FabricComputeSlotEpIdentify:ExecutePeer)
ID:5339560
Type:fsm
Cause:execute-local-failed
Created at:2023-10-12 17:50:09
Code:F77959
Number of Occurrences:104
Original severity:Warning
Affected object:fabric/server/chassis-2/slot-6
Description:[FSM:STAGE:RETRY:]: identifying a server in 2/6 via CIMC(FSM-STAGE:sam:dme:FabricComputeSlotEpIdentify:ExecutePeer)
ID:5340051
Type:fsm
Cause:execute-local-failed
Created at:2023-10-12 17:50:49
Code:F16519
Number of Occurrences:104
Original severity:Warning
Previous severity:Cleared
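
For anyone who wants to track these faults outside the UCSM GUI, here is a minimal Python sketch that parses the pasted fault details into records. It assumes each record starts with a `Description:` line and uses the `Key:Value` layout shown above; the helper names `parse_faults` and `summarize` are made up for illustration and are not part of any Cisco tool.

```python
import re
from datetime import datetime

# Fault details copied from this post (the "Affected object" lines are left out).
FAULT_DUMP = """\
Description:[FSM:STAGE:REMOTE-ERROR]: Result: unidentified-fail Code: ERR-IBMC-fru-retrieval-error Message: Could not get Fru from 7f060206, dn=fabric/server/chassis-2/slot-6(sam:dme:FabricComputeSlotEpIdentify:ExecutePeer)
ID:5339560
Type:fsm
Cause:execute-local-failed
Created at:2023-10-12 17:50:09
Code:F77959
Number of Occurrences:104
Original severity:Warning
Description:[FSM:STAGE:RETRY:]: identifying a server in 2/6 via CIMC(FSM-STAGE:sam:dme:FabricComputeSlotEpIdentify:ExecutePeer)
ID:5340051
Type:fsm
Cause:execute-local-failed
Created at:2023-10-12 17:50:49
Code:F16519
Number of Occurrences:104
Original severity:Warning
Previous severity:Cleared
"""


def parse_faults(dump):
    """Split a pasted fault dump into one dict per fault.

    Assumption: every record begins with a 'Description:' line.
    """
    faults = []
    for line in dump.splitlines():
        # Split on the first colon only; descriptions contain more colons.
        key, sep, value = line.partition(":")
        if not sep:
            continue  # not a Key:Value line
        if key == "Description":
            faults.append({})
        if faults:
            faults[-1][key.strip()] = value.strip()
    return faults


def summarize(fault):
    """Pull out the fields most useful for tracking a recurring fault."""
    err = re.search(r"Code: (\S+)", fault["Description"])
    return {
        "code": fault["Code"],
        "error": err.group(1) if err else None,
        "occurrences": int(fault["Number of Occurrences"]),
        "created": datetime.strptime(fault["Created at"], "%Y-%m-%d %H:%M:%S"),
    }


if __name__ == "__main__":
    for fault in parse_faults(FAULT_DUMP):
        print(summarize(fault))
```

Running it on a fresh paste after each alarm cycle would show whether the occurrence count keeps climbing.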

[Screenshots: NeedsResolutionSlot6.jpg, ResolveSlot6Issue.jpg]

 

 

Accepted Solutions

bflowers#
Cisco Employee

I have looked through a few cases for this issue, and it appears to be resolved about half the time by the following procedure:

Disassociate the service profile

Decommission the server

Reset the server slot <- important

Reseat the blade

Reacknowledge the server

Associate the service profile

The other half of the time, the issue is resolved by replacing the motherboard, which includes the CIMC.

Given that the CIMC appears unresponsive to CLI connection attempts, I would try resetting the CIMC during a maintenance window. The procedure for that is in the UCS Manager Server Management Guide for your UCSM release.

Because the CIMC is unresponsive to the CLI connection attempts, I don't think clicking OK to resolve the slot issue will help. The problem is with the CIMC to UCS Manager communication on the blade, not with the slot.

I don't personally think there is a high risk in leaving this issue unresolved for a short period until the workloads are migrated, but it's impossible to predict whether this is the beginning of a cascade failure on the motherboard. Try to prioritize the workload migration if possible.
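
If it helps to keep the recovery procedure in this answer in order during a maintenance window, it can be tracked as a simple ordered checklist. This is only an illustrative Python sketch: the step names come from the list in this answer, while `STEPS` and `next_step` are hypothetical names, and nothing here talks to UCS Manager.

```python
# Recovery steps in the order they are listed in this answer.
STEPS = [
    "Disassociate the service profile",
    "Decommission the server",
    "Reset the server slot",  # flagged as important in the answer
    "Reseat the blade",
    "Reacknowledge the server",
    "Associate the service profile",
]


def next_step(completed):
    """Return the next step to perform, or None when all steps are done.

    Raises ValueError if the completed steps do not follow the listed order.
    """
    if completed != STEPS[:len(completed)]:
        raise ValueError("steps were not completed in the listed order")
    if len(completed) < len(STEPS):
        return STEPS[len(completed)]
    return None


if __name__ == "__main__":
    done = STEPS[:2]
    print("Next:", next_step(done))
```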


3 Replies

Brian Morrissey
Cisco Employee

It will depend on what's going on with the CIMC from a log review standpoint. Usually, if it's flapping like that, the CIMC might be losing its connection to the IOM (memory leak, CIMC process stalls, hardware issue, etc.).

From an SSH session to the FI, this might provide some additional info if you let it tail the CIMC's message log until the next alarm:
connect cimc 2/6
messages follow

You can try searching for any critical-looking log entries in the Bug Search Tool to see whether they match an existing bug or workaround.

The consequences are a bit tough to predict without a TAC review of the logs to understand why it's in that condition. I've seen the host OS be perfectly okay with the CIMC unresponsive, but I've also seen host OS/virtual machines hang if they had a dependency on the CIMC, such as mapped virtual media or SD cards, and couldn't access it anymore.

Thanks for the quick response Brian. Unfortunately, that CIMC is not responding to the connect command: 

GHC-1N-FI-A-A# connect cimc 2/6
Trying 127.5.2.6...
telnet: Unable to connect to remote host: No route to host
GHC-1N-FI-A-A#

P.S. I was able to connect to the non-error-condition blade 5 in the same chassis.
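
To compare blades quickly, the pasted `connect cimc` output can be classified with a few lines of Python. This is a rough sketch that only inspects text like the transcript above; the function name `cimc_connect_status` is hypothetical, and it assumes the failure always contains the string "Unable to connect to remote host".

```python
import re

# Transcript copied from this reply.
TRANSCRIPT = """\
GHC-1N-FI-A-A# connect cimc 2/6
Trying 127.5.2.6...
telnet: Unable to connect to remote host: No route to host
GHC-1N-FI-A-A#
"""


def cimc_connect_status(transcript):
    """Return (address, reachable) parsed from pasted 'connect cimc' output.

    address is the IP from the 'Trying ...' line, or None if absent.
    reachable is True only when an address was tried and no failure
    message appears (assumed wording, taken from the transcript above).
    """
    match = re.search(r"Trying (\d+\.\d+\.\d+\.\d+)\.\.\.", transcript)
    address = match.group(1) if match else None
    failed = "Unable to connect to remote host" in transcript
    return address, (address is not None and not failed)


if __name__ == "__main__":
    print(cimc_connect_status(TRANSCRIPT))
```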

