01-04-2012 02:07 PM - edited 03-01-2019 10:12 AM
Environment:
One chassis, 6 blades, B200-M2
Two Fabric interconnects, 6120XP
Old firmware: 1.3.1n
New firmware: 1.4.3s
Problem:
When we activated the UCS Manager firmware (going from 1.3 to 1.4), all 6 blades rebooted unexpectedly, which does not match what the document says.
We did these steps:
1.) Update Adapter firmware
2.) Activate Adapter firmware one by one (downtime, but we can put the ESXi hosts in maintenance mode)
3.) Update CIMC firmware
4.) Activate CIMC firmware
5.) Update IO Module firmware
6.) Activate IO module firmware, but "Set Startup Version Only"
7.) Activate UCS Manager firmware <-- the problem occurs
We want to keep the VMs running during the firmware upgrade.
Downtime for one blade/ESXi host is acceptable, but downtime for all the blades at once is not.
Does anyone know what causes the blades to reboot when activating the UCS Manager firmware? According to the release documentation, only GUI and CLI sessions should be affected.
Thanks a lot and appreciate your help!
--Vincent
01-04-2012 02:55 PM
Unless I am missing something, you read the wrong doc. You should have used the one covering 1.3 to 1.4, but your link points to the 1.4 to 1.4 guide. Please clarify whether what we are reading is correct.
Sent from Cisco Technical Support iPad App
01-04-2012 03:13 PM
Thanks Reginald
Do we have to upgrade from 1.3.1n to 1.4.1 first and then to 1.4.3?
We upgraded directly from 1.3.1n to 1.4.3s; sorry if we missed that information in one of the documents.
So we will try 1.3.1n -> 1.4.1m -> 1.4.3s.
Is this correct?
Thanks
Vincent
01-04-2012 03:42 PM
Vincent,
You can upgrade directly from one release to another. Each release has its own steps to follow; see the link below for each release. You will want to use the 1.3 to 1.4 guide and follow it step by step.
http://www.cisco.com/en/US/products/ps10281/prod_installation_guides_list.html
Hope this helps.
01-04-2012 05:23 PM
I have now run into another issue: only one fabric interconnect got the new version successfully, while the other is still running the old version. As a result, the cluster IP is not pingable and UCS Manager is not accessible.
Previous firmware: 1.4.3s
New firmware: 1.4.1m
After activating UCS Manager, from the CLI of a fabric interconnect:
--------------------------------------
sdeucs-B# show cluster state
Cluster Id: 0xcfa2f2725b8811xxxxxxxxxx00059b790004
Incompatible versions:
local: 1.4(1m), peer: 1.4(3.0)
B: UP, ELECTION IN PROGRESS (Management services: UP)
A: UP, ELECTION IN PROGRESS (Management services: UNRESPONSIVE)
HA NOT READY
Management services are unresponsive on peer Fabric Interconnect
No device connected to this Fabric Interconnect
--------------------------------------
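For anyone scripting health checks around an upgrade, output like the transcript above can be parsed mechanically. Here is a minimal Python sketch; the dictionary keys and the specific substrings it looks for are my own assumptions based on the pasted output, not an official Cisco API:

```python
import re

def parse_cluster_state(output: str) -> dict:
    """Parse `show cluster state` text into a small status dict (illustrative)."""
    state = {
        # The sample output above flags a version mismatch explicitly.
        "incompatible_versions": "Incompatible versions" in output,
        # Healthy clusters report "HA READY"; the broken one above says "HA NOT READY".
        "ha_ready": "HA NOT READY" not in output and "HA READY" in output,
        "election_in_progress": "ELECTION IN PROGRESS" in output,
    }
    m = re.search(r"local:\s*(\S+),\s*peer:\s*(\S+)", output)
    if m:
        state["local_version"], state["peer_version"] = m.group(1), m.group(2)
    return state

# The transcript from the post above, used as a sample input.
sample = """\
Cluster Id: 0xcfa2f2725b8811xxxxxxxxxx00059b790004
Incompatible versions:
local: 1.4(1m), peer: 1.4(3.0)
B: UP, ELECTION IN PROGRESS (Management services: UP)
A: UP, ELECTION IN PROGRESS (Management services: UNRESPONSIVE)
HA NOT READY
"""

status = parse_cluster_state(sample)
print(status["incompatible_versions"], status["ha_ready"])  # True False
```

A check like this, run before activating UCSM, would have surfaced the version mismatch without needing the GUI.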
Only one fabric interconnect downgraded successfully; the other did not. This caused us to lose both connectivity and management.
Any hints here?
01-05-2012 08:20 AM
I would open a TAC case to get some visibility on this.
Sent from Cisco Technical Support iPad App
01-05-2012 09:10 AM
There is a full video guide for the 1.3.x to 1.4.x upgrade at the site below.
Updating UCSM certainly should not cause any disruption other than having to restart your user session, assuming your FIs were correctly clustered and HA was in an operational state prior to the UCSM upgrade.
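That pre-check is easy to make concrete: from either FI's CLI, the same `show cluster state` command used earlier in this thread should report HA READY before you activate UCSM. A sketch of healthy output (cluster ID and roles are illustrative, not taken from this system):

```
UCS-A# show cluster state
Cluster Id: 0x...
A: UP, PRIMARY
B: UP, SUBORDINATE

HA READY
```

If it shows ELECTION IN PROGRESS or HA NOT READY instead, sort that out before touching firmware.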
Regards
Colin
01-05-2012 11:37 AM
Hi Reginald
I have opened a TAC case: # 620261865
Can you help?
Thanks
01-08-2012 09:55 AM
Yong,
Was this resolved with TAC?
Sent from Cisco Technical Support iPhone App
01-10-2012 03:03 PM
Almost done.
The first issue (the unexpected reboots) was caused by a bug:
http://cdetsweb-prd.cisco.com/apps/dumpcr?identifier=CSCtu17091&parentprogram=QDDTS
Brief summary of the bug:
If you upgrade firmware from 1.3.x directly to 1.4.3s, the blades reboot unexpectedly when you activate UCS Manager.
The workaround is to upgrade from 1.3.x to 1.4.3r first, and then to 1.4.3s.
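That workaround can be written down as a tiny pre-flight rule. The helper below is hypothetical (there is no such Cisco tool); it just encodes the path described above, inserting the 1.4.3r hop when coming from 1.3.x:

```python
# Hypothetical pre-flight helper encoding the CSCtu17091 workaround:
# a direct 1.3.x -> 1.4.3s activation triggers the blade reboots, so
# hop through 1.4.3r first. Version strings are illustrative.

def upgrade_path(current: str, target: str) -> list:
    """Return the sequence of versions to activate, in order."""
    if current.startswith("1.3") and target == "1.4.3s":
        # Affected path: insert the intermediate release.
        return [current, "1.4.3r", "1.4.3s"]
    return [current, target]

print(upgrade_path("1.3.1n", "1.4.3s"))  # ['1.3.1n', '1.4.3r', '1.4.3s']
print(upgrade_path("1.4.3r", "1.4.3s"))  # ['1.4.3r', '1.4.3s']
```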
The second weird issue (only FI-B got activated while FI-A was always stuck) was caused, according to the TAC engineer, by a corrupted management database on FI-A. We had to rebuild FI-A and its cluster peer FI-B from scratch (erase all configuration and re-initialize the system) to fix it. Now it works fine.
According to the TAC engineer, they cannot explain why the corruption happened or how to monitor for it.