UCS Firmware-2.#(##)-should I or wait ... or ...

K P SIM · ‎10-09-2011

Hi, Cisco:

I have heard a lot of horror stories after UPGRADING to Cisco UCS 2.#(##).

My customer is running 1.4(2B) and I have to perform an upgrade, so I am pondering whether I SHOULD upgrade to 1.4(3#) or 2.#(##). It is a production environment: SAN booting, UCS M81KR, ESXi 4.1, etc.

I want to hear some candid feedback from Cisco or others.

Thanks.

SiM

Jeremy Waldrop · ‎10-10-2011

SiM, the only big issue I know about is the PCI re-ordering issue, where the PCI index of the vNICs/vHBAs changes when you update the BIOS of a blade to 2.0. The workaround is to manually modify the esx.conf file so that the vmnic number matches up to what you have configured in VMware.
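To see which entries would need editing, a saved copy of esx.conf from before the BIOS update can be compared with the file afterwards. This is only a minimal sketch: it assumes the device aliases appear as lines of the form `/device/<PCI address>/vmkname = "vmnicN"`, and the sample PCI addresses and the helper names `parse_aliases` and `diff_aliases` are made up for illustration.

```python
import re

# Assumed esx.conf line shape: /device/<pci-address>/vmkname = "vmnicN"
VMKNAME_RE = re.compile(
    r'^/device/(?P<pci>[^/]+)/vmkname\s*=\s*"(?P<name>vm(?:nic|hba)\d+)"$'
)


def parse_aliases(conf_text):
    """Return {device name: PCI address} for every vmkname line found."""
    aliases = {}
    for line in conf_text.splitlines():
        match = VMKNAME_RE.match(line.strip())
        if match:
            aliases[match.group("name")] = match.group("pci")
    return aliases


def diff_aliases(before_text, after_text):
    """Return {name: (old PCI, new PCI)} for names that were added, removed or moved."""
    before = parse_aliases(before_text)
    after = parse_aliases(after_text)
    return {
        name: (before.get(name), after.get(name))
        for name in sorted(set(before) | set(after))
        if before.get(name) != after.get(name)
    }


# Hypothetical snapshots; real files contain many other lines, which are ignored.
SAMPLE_BEFORE = '''/device/000:008:00.0/vmkname = "vmnic0"
/device/000:009:00.0/vmkname = "vmnic1"
'''
SAMPLE_AFTER = '''/device/000:008:00.0/vmkname = "vmnic0"
/device/000:009:00.0/vmkname = "vmnic1"
/device/000:010:00.0/vmkname = "vmnic2"
'''

if __name__ == "__main__":
    for name, (old, new) in diff_aliases(SAMPLE_BEFORE, SAMPLE_AFTER).items():
        print(f"{name}: {old} -> {new}")
```

The script only reports differences. The actual edit to esx.conf is still done by hand, following the advisory.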

I would update everything to 2.0(1m) except for the BIOS so that you don't have to modify the esx.conf file on all of your hosts.

Take a look at this advisory for more info -

http://www.cisco.com/web/software/datacenter/ucs/UCS_2_0_Software_Advisory.htm

Robert Burns · ‎10-10-2011

Agreed with Jeremy.

The 2.0(1o) - [tentative patch version] is slated for the end of the month. I've tested it in my lab and it avoids the PCI re-enumeration bug in 2.0(1m). Sit tight until it finishes passing our QA process and then upgrade from 1.4.x onward.

Regards,

Robert

K P SIM · ‎10-10-2011

Hi, Rob & Jeremy:

Thanks.

Now the customer EXPECTS our firmware upgrade to have ZERO DOWNTIME. Is that possible, and are there any proven steps that work?

Assuming ESXi 4.1 MPIO is working on FC Fabric

Assuming SAN Booting

Assuming all Service Profiles with Enable Failover feature checked

Assuming 2 Fabric Interconnects, 2 IOMs, M81KR, firmware 1.4(##) (I forget the exact version), 6 vmnics in each VM

The customer plans to vMotion VMs from one chassis to the other (say chassis 1 to chassis 2), so they can upgrade the firmware on all the blades (Adapter, CIMC) and the IOMs in chassis 1 first. After this chassis is done with the firmware upgrade and all components have rebooted, they WILL vMotion the VMs back to chassis 1, perform the upgrade on chassis 2, etc.

How much actual downtime, if any, is there with the above scenario? I understand UCS is not a fault-redundant system.

Appreciate it again!

SiM,

Robert Burns · ‎10-10-2011

UCS is a fully redundant system. You can certainly do an entire system upgrade without downtime, assuming you're using VMotion to move VMs around from blade to blade.

You can follow the documented upgrade guides on CCO, which will ensure you have no downtime assuming you have a host cluster (such as VMware & VMotion).

The best way to do it with minimal impact, in a nutshell, is:

Download latest FW packages

Take a Full & All Config backup

Disable Smart Call home if enabled

Update Adapter, CIMC, IOM

Activate all adapters (set startup only)

Full activate all CIMC (zero impact to host)

Activate IOM (set startup version only)

Activate UCSM

Activate Secondary FI (allow it to come 100% back in sync with HA status)

Activate Primary FI (allow it to come 100% back in sync with HA status)

Create FW Package with new blade BIOS

Evacuate first Host (Vmotion guest VMs)

Shut down first host, & associate update FW Policy with this host

[this will also reboot the adapter at the same time, activating the adapter FW]

Bring back online and repeat for each subsequent host.
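The order matters here, so it can help to keep the list as a runbook that refuses to skip ahead. A minimal sketch, with the step names paraphrased from the list above and a hypothetical helper `next_step`; it does not talk to UCSM at all, it only tracks what has been ticked off.

```python
# Step names are paraphrased from the list above; nothing here calls UCSM.
UPGRADE_STEPS = [
    "download latest FW packages",
    "take Full and All Config backups",
    "disable Smart Call Home if enabled",
    "update adapter, CIMC, IOM",
    "activate adapters (set startup only)",
    "activate CIMC",
    "activate IOM (set startup only)",
    "activate UCSM",
    "activate secondary FI and wait for HA sync",
    "activate primary FI and wait for HA sync",
    "create FW package with new blade BIOS",
    "per host: evacuate, shut down, associate FW policy, bring back online",
]


def next_step(completed):
    """Return the next step to run, or None when the runbook is finished.

    Raises ValueError if `completed` is not an exact prefix of UPGRADE_STEPS,
    i.e. a step was skipped or done out of order.
    """
    if completed != UPGRADE_STEPS[:len(completed)]:
        raise ValueError("steps were completed out of order")
    if len(completed) == len(UPGRADE_STEPS):
        return None
    return UPGRADE_STEPS[len(completed)]


if __name__ == "__main__":
    done = UPGRADE_STEPS[:8]  # everything up to and including UCSM
    print(next_step(done))
```

With the first eight steps ticked off, the helper points at the secondary FI, which is the one that has to be activated and back in HA sync before the primary is touched.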

Total time from start to finish depends on how many VMs and hosts you're updating (BIOS). You should be able to easily do two full chassis within 2 hours - all without downtime, assuming all HA is functional.

Regards,

Robert

K P SIM · ‎10-12-2011

Rob:

I want to know the honest answer: is there any downtime at all? I am answerable to the customer.

Example:

1.

Chassis 1 - upgrade first - PALO, CIMC of all blades in this chassis, IOM1 followed by IOM2. No BIOS upgrade at all. vMotion all VMs from Chassis 1 to Chassis 2 BEFORE activating all the relevant firmware on THESE END POINTS.

Make sure Chassis 1 is UP. vMotion all VMs from Chassis 2 to Chassis 1 now.

2.

Chassis 2 - upgrade next. Activate all the relevant firmware, reboot, etc. vMotion all VMs back from Chassis 1 now.

3.

UCSM Firmware Upgrade. Re-login; no downtime here.

4.

Subordinate FI-Upgrade Firmware, Activate Firmware, Rebooting this FI. Should NOT have downtime to those VMs, I presume. MAKE THIS THE PRIMARY FI.

Primary FI-Upgrade Firmware, Activate Firmware. This will become the SUBORDINATE FI.

I have read all publicly available Cisco docs and there is NO MENTION OF ZERO DOWNTIME for UCS firmware upgrade.

Thanks.

SiM

Jeremy Waldrop · ‎10-12-2011

SiM, as with most things in IT there aren't any 100% guarantees. I usually have the customer do the upgrade during a maintenance window or at least send out an email to the business owners that there shouldn't be an outage but there may be a small blip.

As for your upgrade procedure:

You can't really go chassis by chassis because the IOM activation for all chassis doesn't happen until the connected Fabric Interconnect is activated and rebooted.

The CIMC and UCSM are hitless activations.
When activating the adapters make sure you leave the "set startup version only" checked.
Make sure you have a user-acknowledgement maintenance policy associated with your service profiles. This will save you if you were to update the BIOS or the adapters using a host firmware policy, because the blade will not reboot until you acknowledge it.
When activating the IOM make sure you leave the "set startup version only" checked.

As long as everything is dual-homed to both Fabrics you shouldn't have any downtime. You may notice a small hit when the FIs get rebooted, as ESX will have to fail over vNICs and SAN paths, but in my experience it hasn't been a noticeable hit.

And most of all BE PATIENT!! The FI activation can take 15 minutes, so don't freak out when it seems like it isn't working.

When we do UCS installs, one of the last things we do before the system goes into production is to test LAN/SAN resiliency by shutting ports.

K P SIM · ‎10-12-2011

Jeremy:

I need some clarification here. You mentioned that I CAN'T do an upgrade chassis by chassis. To the best of my knowledge, I CAN do an IOM reset RATHER THAN waiting for the big bang reboot of the FI, and hence the IOM.

Even a chassis CAN BE reset, am I right, Jeremy/Rob? I know you guys are techies; we have to be absolutely sure here.

Please clarify while I am working out the detailed steps. Even the L3 switch to which the 2 FIs connect needs to be taken care of, I believe, with port fail-fast settings, etc. Isn't that so?

I am on the ground facing the customer directly, guys, so let's be 100% sure.

Thanks.

SiM

Jeremy Waldrop · ‎10-13-2011

SiM, the IO module firmware version must always match the connected Fabric Interconnect version. The software forces this version consistency. It works just like the Nexus 5000/7000 and Nexus 2000 (FEX) relationship: a FEX will always match the version of firmware that is on the parent switch.

If you were to uncheck the "set startup version" box on the IOM activation and reset it, the IOM would reboot and then its firmware would get reset back to the version running on the FI. Then when the FI activation is performed it would also update all the connected IOMs to the same version and reboot them. So there is no way around having to do it the documented way.

The way that you are proposing would actually cause more downtime because the IOM would reboot twice.

Have you read through this guide? -

http://www.cisco.com/en/US/docs/unified_computing/ucs/sw/upgrading/from1.4/to2.0/b_UpgradingCiscoUCSFrom1.4To2.0.html

In this section it talks about the IOM activation and how it must match the FI version -

http://www.cisco.com/en/US/docs/unified_computing/ucs/sw/upgrading/from1.4/to2.0/UpgradingCiscoUCSFrom1.4To2.0_chapter4.html#task_F0C09BC50D2048A1B0C495F7F6E6093A

It states:

"Important:

When you configure Set Startup Version Only for an I/O module, the I/O module is rebooted when the fabric interconnect in its data path is rebooted. If you do not configure Set Startup Version Only for an I/O module, the I/O module reboots and disrupts traffic. In addition, if Cisco UCS Manager detects a protocol and firmware version mismatch between the fabric interconnect and the I/O module, Cisco UCS Manager automatically updates the I/O module with the firmware version that matches the firmware in the fabric interconnect and then activates the firmware and reboots the I/O module again."

On your northbound L2/L3 switch "spanning-tree portfast trunk" for Catalyst or "spanning-tree port type edge trunk" for Nexus should be configured on the interfaces and port-channel interfaces.

As long as you follow the upgrade guide and have all Service Profiles configured with vNICs/vHBAs in both Fabrics and have a user-acknowledged maintenance policy configured, you shouldn't have any noticeable downtime.
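Those two conditions are easy to check before the maintenance window if the service profile details are written down somewhere. A minimal sketch, assuming each profile has been exported by hand into a plain dict; the dict layout, the `"user-ack"` label and the helper `preflight_problems` are illustrative assumptions, not a UCSM API.

```python
# Hypothetical, hand-exported view of a service profile; not a UCSM API object.
EXAMPLE_PROFILE = {
    "name": "esx-host-01",
    "vnics": [("eth0", "A"), ("eth1", "B")],
    "vhbas": [("fc0", "A"), ("fc1", "B")],
    "maintenance_policy": "user-ack",
}


def preflight_problems(profile):
    """Return a list of reasons this profile could take a hit during the upgrade."""
    problems = []
    for kind in ("vnics", "vhbas"):
        fabrics = {fabric for _name, fabric in profile.get(kind, [])}
        missing = sorted({"A", "B"} - fabrics)
        if missing:
            problems.append(f"{kind}: nothing in fabric {', '.join(missing)}")
    if profile.get("maintenance_policy") != "user-ack":
        problems.append("maintenance policy is not user-acknowledged")
    return problems


if __name__ == "__main__":
    issues = preflight_problems(EXAMPLE_PROFILE)
    print(EXAMPLE_PROFILE["name"], "OK" if not issues else issues)
```

An empty list means the profile is dual-homed on both the LAN and SAN side and will wait for an acknowledgement before rebooting.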

K P SIM · ‎11-20-2011

Hi, Jeremy and Robert:

Thanks for the very detailed, and hence confidence-boosting, steps for the firmware upgrade. The client decided to stay at 1.4(3S), and the firmware upgrade went without a glitch, though the ESXi Management Agents for some blades had to be restarted.

So it is truly 0 downtime.

Great Cisco gurus' inputs.

Thanks a trillion!

SiM

gballard · ‎11-22-2011

We just went to 2.0(1s), which was released literally three or four days before we upgraded. So far, so good. We needed the vSphere 5 support.

I think a good thing to do before any upgrade is to get a full backup of the UCS config.

Jason Benedicic · ‎11-23-2011

One thing I would like to point out regarding upgrades (minor or major releases), as I don't see it mentioned too often in people's upgrade plans:

When using Palo adapters, always remember to upgrade the VMware enic & fnic drivers to the supported versions from the compatibility matrix.

Those people still running 4.1/4.1u1 of vSphere will find they are a number of revisions behind the currently supported version, even from a clean install.

These drivers can be found on the Cisco site on the B-series Driver CD image, and they need to be installed using the appropriate "no signature" checking method depending on how you update (VUM, vSphere CLI, etc.). Sometimes you'll find the latest version on the VMware site under the Driver CDs section; however, this doesn't appear to be updated as quickly as Cisco's releases are.


kg6itcraig · ‎10-11-2011

Wait until the next release comes out.

https://supportforums.cisco.com/message/3454321#3454321

Also, on another UCS system (lab) we tried rolling back from a complete 2.0. It bricked the UCS completely. We had to restore a config backup.

Stay away from 2.0. Wait until it is fixed. I don't believe all the issues have even been found yet.

Craig

My UCS Blog http://realworlducs.com