04-03-2018 04:43 PM - edited 03-01-2019 05:30 AM
Hi,
I have a requirement to establish the best path for upgrading a VSS pair of Catalyst 6500s in such a way that:
- Downtime of a single chassis isn't the primary concern; instead we are trying to manage the risk associated with this field notice (https://www.cisco.com/c/en/us/support/docs/field-notices/637/fn63743.html), which has historically caused major problems (total primary DC loss).
- Understanding that the fault is "fix-on-fail", we need to plan for it "not working" during the upgrade and for leaving the site running on a single chassis while any failed units are replaced.
Specifically, I've been looking at the FSU and eFSU upgrade processes. These are described here: https://www.cisco.com/c/en/us/td/docs/switches/lan/catalyst6500/ios/12-2SX/configuration/guide/book/vss.html#wp1169328
eFSU appears to be great for a situation where I could trust hardware operation (i.e. it upgrades one chassis, then the other). Given the field notice, though, that isn't the case here. FSU seems to be the way to go for a more manual upgrade/failover method, and it is where I am currently looking. Does anyone have specific guidance here?
Regarding FSU:
Performing a Fast Software Upgrade of a VSS
The FSU of a VSS is similar to the RPR-based standalone chassis FSU described in the "Performing a Fast Software Upgrade" section. While the standalone chassis upgrade is initiated by reloading the standby supervisor engine, the VSS upgrade is initiated by reloading the VSS standby chassis. During the FSU procedure, a software version mismatch between the VSS active and the VSS standby chassis causes the system to boot in RPR redundancy mode, which is stateless and causes a hard reset of all the modules. As a result, the FSU procedure requires system downtime corresponding to the RPR switchover time.
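If I'm reading that right, the practical thing to watch during the change window is the redundancy mode. Something like the below is what I'd expect to be checking (just my own sketch; exact output wording varies by release):

! Before starting: confirm chassis roles and that the pair is currently in SSO
show switch virtual role
show redundancy states
! During the FSU, while the images are mismatched, the standby chassis should come
! up in RPR rather than SSO; once both chassis run the same image it should return to SSO
show redundancy states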
I am trying to get clarification on a few things mentioned in the above.
04-04-2018 12:50 AM
@jbekk wrote:
Downtime of a single chassis isn't the primary concern; instead we are trying to manage the risk associated with this field notice (https://www.cisco.com/c/en/us/support/docs/field-notices/637/fn63743.html), which has historically caused major problems (total primary DC loss).
I am very familiar with this FN, and whatever method you end up using will be of no help against it. If the line card fails, it will fail.
The only thing I can think of is to prepare for the worst-case scenario:
1. Spare line card handy;
2. Config and IOS exported to a CF (which can be used by the spare line card);
3. 4x8 SNT maintenance contract.
04-04-2018 05:42 PM
We are working under the assumption that "it won't work". Support contracts and the like are already in place, and we'll be lodging pre-emptive TAC cases (we got started on that process yesterday). So yes, I agree with you totally...
Regarding the questions I floated around the upgrade process... I've not done a 6500 VSS upgrade before, and I know I am in for some fun regardless of my best efforts due to the FN. I've been told by others that the upgrade path generally follows this process when using the FSU method (I've sketched my assumed command mapping below the list):
- Upgrade second chassis and reload the second chassis and its modules. Ports on second chassis stay offline.
- Second chassis/supervisor syncs with primary.
- Force failover of "master" supervisor from primary chassis to second chassis. Modules on primary chassis reload at this point. Traffic is disrupted on both chassis... ports on second chassis become active.
- Upgrade primary chassis (reloading the supervisor there and its modules again). Ports on second chassis remains active.
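In command terms I'm assuming those steps map to roughly the following (my own sketch only, not a validated procedure; the image filename and TFTP server below are placeholders):

! Stage the new image on both supervisors (the standby's filesystem name may
! differ, e.g. slavedisk0:, depending on the release and setup)
copy tftp://<tftp-server>/s72033-ipservicesk9_wan-mz.new.bin disk0:
copy tftp://<tftp-server>/s72033-ipservicesk9_wan-mz.new.bin slavedisk0:
! Point the boot variable at the new image and save
configure terminal
 no boot system
 boot system flash disk0:s72033-ipservicesk9_wan-mz.new.bin
end
copy running-config startup-config
! Reload the standby chassis so it boots the new image (RPR expected while mismatched)
redundancy reload peer
! Once the standby is back up, force the switchover; the old active chassis then
! reloads and comes back on the new image as the new standby
redundancy force-switchover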
Any guidance welcome. Thanks.
04-05-2018 12:05 AM
@jbekk wrote:
- Upgrade second chassis and reload the second chassis and its modules. Ports on second chassis stay offline.
- Second chassis/supervisor syncs with primary.
Won't work. The minute the second chassis boots up it will go into ROMmon because the VSS pair are running very different IOS versions.
@jbekk wrote:
- Force failover of "master" supervisor from primary chassis to second chassis. Modules on primary chassis reload at this point. Traffic is disrupted on both chassis... ports on second chassis become active.
- Upgrade primary chassis (reloading the supervisor there and its modules again). Ports on second chassis remains active.
Won't work, because once the secondary goes into ROMmon the only way to recover is by human intervention. The primary will boot up in a semi-VSS state.
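And by "human intervention" I mean console access to the stuck supervisor; recovery is essentially this from ROMmon (a rough sketch, the image name is only an example):

rommon 1 > confreg 0x2102
rommon 2 > boot disk0:s72033-ipservicesk9_wan-mz.new.bin

After it boots you still need to check the boot statements and the config-register before letting it rejoin the VSS.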
Look, if all the prep work is already in place (in anticipation that the supervisor card will fail), then why not just bite the bullet and reboot the entire lot?
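For what it's worth, once the image is staged and the boot statements are set on both chassis (same prep as for FSU), my run sheet for the "whole lot at once" option is short (sketch only):

! Confirm both supervisors point at the new image and the register is 0x2102
show bootvar
! Then take the hit once and reload both chassis together
reload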
04-09-2018 05:35 PM
The bit that concerns me (mainly from a political/bureaucratic perspective) is that there doesn't seem to be a Cisco published guide that describes or recommends this "both at the same time" VSS upgrade process.
Every guide says to use FSU or eFSU for VSS upgrades. eFSU (i.e. ISSU) is easy enough to rule out based on the compatibility matrix that Cisco publishes (basically, unless you are doing an upgrade within the same release family it is not worth looking at). But FSU compatibility or rationale isn't really discussed anywhere.
Beyond your own personal experiences, how have you made the assessment to do the VSS upgrade by doing "both at once" instead of following the documented FSU path? (I've asked TAC the same question, BTW.) It just irks me that everyone recommends a path that isn't documented... and it makes justifying the outage to stakeholders an even harder task...
04-09-2018 06:39 PM - edited 04-09-2018 06:43 PM
From my own experience, I have had a spate of bad luck when dealing with FSU/eFSU and ISSU. This is why I use my own method of VSS upgrade.
The fact that TAC agrees that using FSU/eFSU and ISSU is the "lesser of two evils" is surprising.
NOTE: One thing I forgot to ask: what is the config-register value currently set to? Is it 0x2101 or 0x2102?
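The reason I ask: as far as I recall, with 0x2101 the supervisor boots the first image it finds in flash and ignores the boot system statements, so a freshly staged image may never actually get loaded. The check and the fix are quick (sketch):

! The configuration register is reported on the last line of show version
show version
! If it isn't 0x2102, set it and save before any reload
configure terminal
 config-register 0x2102
end
copy running-config startup-config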
04-09-2018 08:45 PM
I'm aware of that little gotcha as well. 0x2102 is used everywhere.
Will respond once I get a finalized word from Cisco on the upgrade method.
04-02-2019 01:02 AM
We did an FSU upgrade; 50% success rate. The supervisors can get stuck in a boot cycle, which requires a full power-off and power-on to recover from.
Basically the process is:
You validate whether line cards have failed by looking at show module outputs.
Honestly, the FSU process is only going to make a minute or two of difference. You still have around 10 minutes of downtime for the line cards.
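For the line card check it's nothing more sophisticated than this after each chassis comes back (exact columns vary by release):

! Any module not showing Ok after the reload is a candidate for the field notice
show module switch all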