Catalyst 6500 VSS - FSU Upgrade Procedure Query

jbekk
Level 1

Hi,

 

I have a requirement to establish the best path to upgrading a VSS pair of Catalyst 6500s in such a way that:

  1. We will reboot only one chassis and/or supervisor at a time
  2. We will maintain control over when and if the upgrade proceeds on the second chassis

Downtime of a single chassis isn't the primary concern; instead, we are trying to manage the risk associated with this field notice (https://www.cisco.com/c/en/us/support/docs/field-notices/637/fn63743.html), which has caused us major problems in the past (total loss of the primary DC).

 

Understanding that the fault is "fix-on-fail", we need to plan for it not working during the upgrade and for leaving the site running on a single chassis while any failed units are replaced.

 

Specifically I've been looking at the FSU and eFSU upgrade processes. These are described here: https://www.cisco.com/c/en/us/td/docs/switches/lan/catalyst6500/ios/12-2SX/configuration/guide/book/vss.html#wp1169328 

 

eFSU looks great for a situation where I could trust the hardware to behave (i.e. it upgrades one chassis and then the other automatically). Given the field notice, though, that isn't the case here. FSU seems to be the way to go for a more manual upgrade/failover method, and that's where I'm currently looking. Does anyone have specific guidance here?
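For reference, my understanding of the eFSU sequence is the four-phase ISSU command set below. The slot numbers and image name are placeholders on my part (Sup720 in slot 5 of each chassis), so treat this as a sketch from the docs rather than a tested procedure:

  ! Copy the new image to the standby and have it reload on the new code
  issu loadversion 1/5 disk0:s72033-ipservicesk9_wan-mz.new.bin 2/5 slavedisk0:s72033-ipservicesk9_wan-mz.new.bin
  ! Switch over so the chassis running the new image becomes active
  issu runversion
  ! Stop the automatic rollback timer once things look healthy
  issu acceptversion
  ! Upgrade the former active chassis and complete the upgrade
  issu commitversion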

 

Regarding FSU:

 

Performing a Fast Software Upgrade of a VSS
The FSU of a VSS is similar to the RPR-based standalone chassis FSU described in the "Performing a Fast Software Upgrade" section. While the standalone chassis upgrade is initiated by reloading the VSS standby supervisor engine, the VSS upgrade is initiated by reloading the VSS standby chassis. During the FSU procedure, a software version mismatch between the VSS active and the VSS standby chassis causes the system to boot in RPR redundancy mode, which is stateless and causes a hard reset of all modules. As a result, the FSU procedure requires system downtime corresponding to the RPR switchover time.
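My reading of that section boils down to roughly the following CLI flow (the image name is a placeholder from my side, so please correct me if this isn't the documented sequence):

  ! Point the supervisors at the new image (filename is an example) and save
  configure terminal
   no boot system
   boot system flash sup-bootdisk:s72033-ipservicesk9_wan-mz.new.bin
   end
  copy running-config startup-config

  ! Reload the VSS standby chassis; it comes back in RPR due to the version mismatch
  redundancy reload peer

  ! Once the standby has finished booting, fail over to it
  redundancy force-switchover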

 

I am trying to get clarification on a few things mentioned in the quoted text above:

  • What is meant by "all modules"? Is that all modules in the secondary chassis, or all modules in both chassis?
  • How can we estimate/gauge the RPR switchover time?

 

7 Replies

Leo Laohoo
Hall of Fame

@jbekk wrote:

Downtime of a single chassis isn't the primary concern; instead, we are trying to manage the risk associated with this field notice (https://www.cisco.com/c/en/us/support/docs/field-notices/637/fn63743.html), which has caused us major problems in the past (total loss of the primary DC).


I am very familiar with this FN, and whatever method you use will be of no help.  If the line card fails, it will fail.

The only thing I can think of is to prepare for the worst-case scenario:

1.  Spare line card handy; 

2.  Config and IOS exported to a CF which can be used by the spare line card (see the copy sketch below);

3.  4x8 SNT maintenance contract. 
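For point 2, something along these lines will do (file and device names are just examples, adjust for your image and CF slot):

  ! Save the running config and the current IOS image onto an external CF card
  copy running-config disk0:backup-config.cfg
  copy sup-bootdisk:s72033-ipservicesk9_wan-mz.current.bin disk0: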

We are working under the assumption that "it won't work". Support contracts are already in place, and we'll be lodging pre-emptive TAC cases (got started on that process yesterday). So yes, I agree with you totally...

 

Regarding the questions I floated around the upgrade process... I've not done a 6500 VSS upgrade before, and I know I'm in for fun regardless of my best efforts, due to the FN. I've been told by others that the upgrade path generally follows this process when using the FSU method:

  • Upgrade the second chassis and reload it and its modules. Ports on the second chassis stay offline.
  • Second chassis/supervisor syncs with the primary (redundancy state can be checked as sketched below).
  • Force a failover of the "master" supervisor from the primary chassis to the second chassis. Modules on the primary chassis reload at this point. Traffic is disrupted on both chassis... ports on the second chassis become active.
  • Upgrade the primary chassis (reloading its supervisor and modules again). Ports on the second chassis remain active.
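For the second step, I was planning to confirm whether the standby actually syncs (SSO) or drops back to RPR with something like the following; happy to be corrected if there are better commands for this:

  ! Redundancy mode and state of the VSS standby (SSO vs RPR)
  show switch virtual redundancy
  ! Overall redundancy state as seen from the active supervisor
  show redundancy states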

Any guidance welcome. Thanks.


@jbekk wrote:
  • Upgrade the second chassis and reload it and its modules. Ports on the second chassis stay offline.
  • Second chassis/supervisor syncs with the primary.

Won't work.  The minute the second chassis boots up, it will go into ROMmon because the VSS pair are running very different IOS versions.


@jbekk wrote:
  • Force a failover of the "master" supervisor from the primary chassis to the second chassis. Modules on the primary chassis reload at this point. Traffic is disrupted on both chassis... ports on the second chassis become active.
  • Upgrade the primary chassis (reloading its supervisor and modules again). Ports on the second chassis remain active.

Won't work, because once the secondary goes into ROMmon, the only way to recover is by human intervention.  The primary will boot up in a semi-VSS state.
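And by human intervention I mean someone on the console booting it by hand from ROMmon, something like the line below (the file system name depends on the supervisor and where you keep the image, so this is only indicative):

  rommon 1 > boot disk0:s72033-ipservicesk9_wan-mz.new.bin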

Look, if all the prep work is in place (in anticipation that the supervisor card will fail), then why not just bite the bullet and reboot the entire lot?

The bit that concerns me (mainly from a political/bureaucratic perspective) is that there doesn't seem to be a Cisco published guide that describes or recommends this "both at the same time" VSS upgrade process.

 

Every guide says to use FSU or eFSU for VSS upgrades. eFSU (i.e. ISSU) is easy enough to rule out based on the compatibility matrix that Cisco publishes (basically, unless you are upgrading within the same release family it is not worth looking at). But FSU compatibility or rationale isn't really discussed anywhere.

 

Beyond your own personal experience, how did you make the assessment to do the VSS upgrade "both at once" instead of following the documented FSU path? (I've asked TAC the same question, BTW.) It just irks me that everyone recommends a path that isn't documented... and it makes justifying the outage to stakeholders an even harder task...

From my own experience, I have had a spate of bad luck when dealing with FSU/eFSU and ISSU. This is why I'm using my own method of VSS upgrade.

The fact that TAC agrees that using FSU/eFSU and ISSU is the "lesser of two evils" is surprising.  

NOTE:  One thing I forgot to ask:  what is the config-register value currently set to?  Is it 0x2101 or 0x2102?

I'm aware of that little gotcha as well. 0x2102 is used everywhere.
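For anyone following along, this is how I've been checking and, if needed, setting it; nothing exotic:

  ! The current value is shown on the last line of show version
  show version | include register
  ! Set 0x2102 so the switch boots the image named in the boot variable
  configure terminal
   config-register 0x2102
   end
  copy running-config startup-config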

 

Will respond once I get a finalized word from Cisco on the upgrade method. 

jbekk
Level 1 (Accepted Solution)

We did an FSU upgrade, with a 50% success rate. The supervisors can get stuck in a boot cycle which requires a full power-off and power-on of the chassis to recover from.

 

Basically the process is:

  • On 12.2SY and above you reload the standby supervisor. On 12.2SX and below you just force a failover to the standby chassis per the last bullet point below.
  • It reboots, recognizes that different IOS versions are running, and goes into something called RPR mode. It's basically an "I've booted but I can't sync with the active supervisor" mode. The line cards in the standby chassis won't start to boot while the chassis supervisor is in RPR mode. In some cases the supervisor gets stuck in a boot loop (monitor the console logging). If it does... do a physical reload of the standby chassis.
  • You then force a failover to the standby supervisor. The active chassis and supervisor reload, and the standby chassis line cards start booting. You get about 10 minutes of downtime waiting for the line cards to come good.

You validate whether line cards have failed by looking at show module outputs.
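For reference, these are the sort of checks I mean (the switch numbers are just our pair):

  ! Module status for both chassis; anything not "Ok" after the failover gets investigated
  show module switch all
  ! Or one chassis at a time
  show module switch 1
  show module switch 2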

 

Honestly, the FSU process is only going to make a minute or two of difference. You still have about 10 minutes of downtime for the line cards.
