Both Fabric Interconnect rebooted during firmware upgrade

gtlalpac
Level 1

Hello,

A colleague of mine performed a UCS firmware upgrade a few days ago and ran into an odd situation.

He was upgrading from UCS 2.0(1w) to 2.1(1f).

He followed the guide posted here, manual option.

He activated UCS Manager.

Then he updated the IOM firmware with the "Set Startup Version Only" option enabled, as indicated in the guide. After this he lost access to the UCSM GUI and found the following errors in the logs:

2013 Oct  2 22:40:54 MCM-ACS-UCS01-A %$ VDC-1 %$ %SYSMGR-2-SERVICE_CRASHED: Service "Ldap Daemon" (PID 4825) hasn't caught signal 9 (no core).

2013 Oct  2 22:41:19 MCM-ACS-UCS01-A %$ VDC-1 %$ %SYSMGR-2-SERVICE_CRASHED: Service "PMon" (PID 4842) hasn't caught signal 9 (no core).

2013 Oct  2 22:41:33 MCM-ACS-UCS01-A %$ VDC-1 %$ %SYSMGR-2-SERVICE_CRASHED: Service "snmpd" (PID 4829) hasn't caught signal 11 (core will be saved).

2013 Oct  2 22:41:50 MCM-ACS-UCS01-A %$ VDC-1 %$ %SYSMGR-2-SERVICE_CRASHED: Service "snmpd" (PID 27869) hasn't caught signal 9 (no core).

2013 Oct  2 22:42:29 MCM-ACS-UCS01-A %$ VDC-1 %$ %UCSM-2-VERSION_INCOMPATIBLE: [F0430][critical][version-incompatible][sys/mgmt-entity-A] Fabric Interconnect A, management services, incompatible versions

2013 Oct  2 22:42:29 MCM-ACS-UCS01-A %$ VDC-1 %$ %UCSM-2-VERSION_INCOMPATIBLE: [F0430][critical][version-incompatible][sys/mgmt-entity-B] Fabric Interconnect B, management services, incompatible versions

2013 Oct  2 22:42:29 MCM-ACS-UCS01-A %$ VDC-1 %$ %UCSM-2-MANAGEMENT_SERVICES_UNRESPONSIVE: [F0452][critical][management-services-unresponsive][sys/mgmt-entity-B] Fabric Interconnect B, management services are unresponsive

2013 Oct  2 22:43:04 MCM-ACS-UCS01-A %$ VDC-1 %$ %UCSM-2-MANAGEMENT_SERVICES_FAILURE: [F0451][critical][management-services-failure][sys/mgmt-entity-B] Fabric Interconnect B, management services have failed

2013 Oct  2 22:43:04 MCM-ACS-UCS01-A %$ VDC-1 %$ %UCSM-2-VERSION_INCOMPATIBLE: [F0430][cleared][version-incompatible][sys/mgmt-entity-A] Fabric Interconnect A, management services, incompatible versions

2013 Oct  2 22:43:04 MCM-ACS-UCS01-A %$ VDC-1 %$ %UCSM-2-VERSION_INCOMPATIBLE: [F0430][cleared][version-incompatible][sys/mgmt-entity-B] Fabric Interconnect B, management services, incompatible versions

2013 Oct  2 22:43:04 MCM-ACS-UCS01-A %$ VDC-1 %$ %UCSM-2-MANAGEMENT_SERVICES_UNRESPONSIVE: [F0452][cleared][management-services-unresponsive][sys/mgmt-entity-B] Fabric Interconnect B, management services are unresponsive

2013 Oct  2 23:04:09 MCM-ACS-UCS01-A %$ VDC-1 %$ %PFMA-2-PFM_SYSTEM_RESET: Manual system restart from Command Line Interface

2013 Oct  2 23:04:11 MCM-ACS-UCS01-A %$ VDC-1 %$ %PFMA-2-FEX_STATUS: Fex 2 is offline

2013 Oct  2 23:04:11 MCM-ACS-UCS01-A %$ VDC-1 %$ %NOHMS-2-NOHMS_ENV_FEX_OFFLINE: FEX-2 Off-line (Serial Number QCI1548A020)

2013 Oct  2 23:04:11 MCM-ACS-UCS01-A %$ VDC-1 %$ %NOHMS-2-NOHMS_ENV_FEX_OFFLINE: FEX-1 Off-line (Serial Number QCI1547A0WQ)

2013 Oct  2 23:04:11 MCM-ACS-UCS01-A %$ VDC-1 %$ %PFMA-2-FEX_STATUS: Fex 1 is offline

2013 Oct  2 23:04:13 MCM-ACS-UCS01-A %$ VDC-1 %$ Oct  2 23:04:13 %KERN-0-SYSTEM_MSG: Shutdown Ports.. - kernel

2013 Oct  2 23:04:13 MCM-ACS-UCS01-A %$ VDC-1 %$ Oct  2 23:04:13 %KERN-0-SYSTEM_MSG:  writing reset reason 9,  - kernel
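
To make the timeline in the excerpt above easier to read, here is a minimal Python sketch that pulls out the SERVICE_CRASHED events and measures the gap to the PFM_SYSTEM_RESET message. The names `SAMPLE`, `LINE_RE`, `CRASH_RE`, `parse_events` and `summarize` are made up for this illustration, and the regular expressions are only an assumption based on the lines quoted in this post, not an official description of the syslog format.

```python
import re
from datetime import datetime

# Three lines copied verbatim from the log excerpt in this post.
SAMPLE = """\
2013 Oct  2 22:40:54 MCM-ACS-UCS01-A %$ VDC-1 %$ %SYSMGR-2-SERVICE_CRASHED: Service "Ldap Daemon" (PID 4825) hasn't caught signal 9 (no core).
2013 Oct  2 22:41:50 MCM-ACS-UCS01-A %$ VDC-1 %$ %SYSMGR-2-SERVICE_CRASHED: Service "snmpd" (PID 27869) hasn't caught signal 9 (no core).
2013 Oct  2 23:04:09 MCM-ACS-UCS01-A %$ VDC-1 %$ %PFMA-2-PFM_SYSTEM_RESET: Manual system restart from Command Line Interface
"""

# Assumed layout: "<year> <mon> <day> <time> <host> ... %FACILITY-SEV-MNEMONIC: message"
LINE_RE = re.compile(
    r"^(?P<ts>\d{4} \w{3}\s+\d{1,2} \d{2}:\d{2}:\d{2}) "
    r"(?P<host>\S+) .*?%(?P<tag>[A-Z0-9]+-\d-[A-Z0-9_]+): (?P<msg>.*)$"
)
CRASH_RE = re.compile(r'Service "([^"]+)" \(PID (\d+)\) .*?signal (\d+)')


def parse_events(text):
    """Return (timestamp, tag, message) for every line that matches LINE_RE."""
    events = []
    for line in text.splitlines():
        m = LINE_RE.match(line)
        if not m:
            continue
        # Collapse the padded day field ("Oct  2") before parsing.
        ts = datetime.strptime(" ".join(m.group("ts").split()), "%Y %b %d %H:%M:%S")
        events.append((ts, m.group("tag"), m.group("msg")))
    return events


def summarize(text):
    """Collect service crashes and the time from the last crash to the reset."""
    crashes, reset = [], None
    for ts, tag, msg in parse_events(text):
        if tag.endswith("SERVICE_CRASHED"):
            c = CRASH_RE.search(msg)
            if c:
                crashes.append((ts, c.group(1), int(c.group(3))))
        elif tag.endswith("PFM_SYSTEM_RESET"):
            reset = ts
    gap = reset - crashes[-1][0] if reset and crashes else None
    return crashes, reset, gap


crashes, reset, gap = summarize(SAMPLE)
for ts, service, signal in crashes:
    print(ts, service, "signal", signal)
print("last crash to reset:", gap)
```

Run against the full log from a tech support bundle, the same approach would show whether any other service crashed closer to the reset than the ones quoted here.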

It seems that the FIs were rebooted after hitting some condition. The only indicator we have is the "SERVICE_CRASHED" messages that occurred roughly 20 minutes before the FI reset. I found in the 2.1 release notes that a similar condition causes similar behavior (both FIs reset) and is fixed in 2.1(2a)A.

CSCug20103

The FIs will no longer reset with the following error message:

%SYSMGR-2-SERVICE_CRASHED: Service "monitor" (PID XXXX) hasn't caught signal 6 (core will be saved).
%KERN-0-SYSTEM_MSG: writing reset reason 16, monitor hap reset - kernel

1.4(1j)A

2.1(2a)A

Could the situation we faced be a variant of the bug?

Unfortunately, we don't have any dump.

We have reviewed the logs several times and couldn't find any additional information about what happened to the FIs. Is there a specific file in the tech support bundle where we should look for more information?

Any advice will be appreciated.

Thanks.

3 Replies

Keny Perez
Level 8

Hello Gabriel,

Any chance you can connect to nxos a/b and run "show system reset-reason"? (That is the command off the top of my head, at this time of day.)

-Kenny

Thank you Kenny,

From the tech support file:

`show system reset-reason`

----- reset reason for Supervisor-module 1 (from Supervisor in slot 1) ---

1) At 59578 usecs after Wed Oct  2 23:04:19 2013

    Reason: Reset Requested by CLI command reload

    Service:

    Version: 5.0(3)N2(2.1w)

2) No time

    Reason: Unknown

    Service:

    Version: 5.0(3)N2(2.1w)

3) No time

    Reason: Unknown

    Service:

    Version: 5.0(3)N2(2.1w)

4) At 215436 usecs after Thu Apr 12 09:38:19 2012

    Reason: Reset Requested by CLI command reload

    Service:

    Version: 5.0(3)N2(2.1q)

The person performing the upgrade told me that he lost access to both FIs, so there was no way anyone could have reset them from the CLI.
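
For comparing this output between the two FIs, a small parser can help. The following is a rough Python sketch, assuming the output always follows the layout pasted above (a numbered "At ... usecs after ..." or "No time" line followed by Reason, Service and Version lines). The names `RESET_OUTPUT`, `ENTRY_RE`, `FIELD_RE` and `parse_reset_reasons` are hypothetical, chosen only for this example.

```python
import re
from datetime import datetime

# The first two entries from the output above, with the blank lines removed.
RESET_OUTPUT = """\
----- reset reason for Supervisor-module 1 (from Supervisor in slot 1) ---
1) At 59578 usecs after Wed Oct  2 23:04:19 2013
    Reason: Reset Requested by CLI command reload
    Service:
    Version: 5.0(3)N2(2.1w)
2) No time
    Reason: Unknown
    Service:
    Version: 5.0(3)N2(2.1w)
"""

ENTRY_RE = re.compile(r"^(\d+)\)\s+(?:At (\d+) usecs after (.+)|No time)\s*$")
FIELD_RE = re.compile(r"^\s*(Reason|Service|Version):\s*(.*)$")


def parse_reset_reasons(text):
    """Turn the reset-reason listing into a list of dicts, one per entry."""
    entries = []
    for line in text.splitlines():
        m = ENTRY_RE.match(line.strip())
        if m:
            when = None
            if m.group(3):
                # Collapse the padded day field ("Oct  2") before parsing.
                stamp = " ".join(m.group(3).split())
                when = datetime.strptime(stamp, "%a %b %d %H:%M:%S %Y")
            entries.append({"index": int(m.group(1)), "time": when})
            continue
        f = FIELD_RE.match(line)
        if f and entries:
            entries[-1][f.group(1).lower()] = f.group(2).strip()
    return entries


entries = parse_reset_reasons(RESET_OUTPUT)
for e in entries:
    print(e["index"], e["time"], e["reason"], e["version"])
```

Feeding it the output captured from each FI would give two lists that can be compared entry by entry.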

Gabriel,

Does Wed Oct  2 23:04:19 2013 match with the time the reboot took place?
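
One way to check this is to compare the reset-reason timestamp with the PFM_SYSTEM_RESET syslog line quoted earlier in the thread. A short Python sketch using only those two timestamps follows; the 60 second tolerance is an arbitrary assumption for illustration, not a documented value.

```python
from datetime import datetime, timedelta

# Timestamps quoted in this thread.
syslog_reset = datetime(2013, 10, 2, 23, 4, 9)    # %PFMA-2-PFM_SYSTEM_RESET
reset_reason = datetime(2013, 10, 2, 23, 4, 19)   # show system reset-reason, entry 1

# Assumed tolerance for treating the two records as the same event.
TOLERANCE = timedelta(seconds=60)

delta = abs(reset_reason - syslog_reset)
same_event = delta <= TOLERANCE
print(f"difference: {delta.total_seconds():.0f} s, same event: {same_event}")
```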

Also, that output is from the perspective of one FI. Do you see the same on the other FI? You can choose which FI to connect to with either connect nxos a or connect nxos b.

By checking the logs we can determine whether the behavior was expected from the update. However, given the message below, we might need to look at the process in more depth:

UCSM-2-MANAGEMENT_SERVICES_FAILURE: [F0451][critical][management-services-failure][sys/mgmt-entity-B] Fabric Interconnect B, management services have failed

I recommend opening a TAC case so we can gather a show tech and analyze this further; otherwise I might end up asking for a long list of commands here in the community.

You are now supposed to be able to open cases from these threads; try it with this one.

-Kenny

Cisco Support Community is also present in Spanish:

https://supportforums.cisco.com/community/spanish/data_center
