Solved: During disruptive upgrade

gnijs · ‎12-24-2015

Hi all,

Just wanted to let you know we experienced the following problems when upgrading a 5548UP to version 7.2 (no ISSU, plain disruptive reboot):

From 5.2(1)N1(4) --> to 7.2(1)N1(1)

NOTE: power sequencer module 1 is upgraded from 2.0 -> 3.0 during this process

Images will be upgraded according to following table:
Module       Image         Running-Version             New-Version Upg-Required
------ ---------- ---------------------- ---------------------- ------------
     1      system             5.2(1)N1(4)             7.2(1)N1(1)           yes
     1   kickstart             5.2(1)N1(4)             7.2(1)N1(1)           yes
     1        bios      v3.6.0(05/09/2012)      v3.6.0(05/09/2012)            no
     1      SFP-uC                v1.1.0.0                v1.0.0.0            no
   110       fexth             5.2(1)N1(4)             7.2(1)N1(1)           yes
     1   power-seq                    v2.0                    v3.0           yes
     3   power-seq                    v2.0                    v2.0            no
     1          uC                v1.2.0.1                v1.2.0.1            no

Install is in progress, please wait.

Performing runtime checks.
SUCCESS

Setting boot variables.
SUCCESS

Performing configuration copy.
SUCCESS

Module 1: Refreshing compact flash and upgrading bios/loader/bootrom/power-seq.
Warning: please do not remove or power off the module at this time.
Note: Power-seq upgrade needs a power-cycle to take into effect.
Use command 'reload power-cycle'
Note: Micro-controller upgrade needs a power-cycle to take into effect.
On success of micro-controller upgrade, SWITCH OFF THE POWER to the system and then, power it up.
SUCCESS

Pre-loading modules.
SUCCESS

Finishing the upgrade, switch will reboot in 10 seconds.

After the upgrade, the switch was unreachable (we don't have console output of the reboot).

We power-cycled the switch (for the power-sequencer upgrade).

After the power cycle, the switch was still unreachable.

We sent out a local technician with console access:

- the mgmt0 interface came up in "shutdown" state after the upgrade, which is why we couldn't connect to the switch anymore

- all FEX config was gone

- the mgmt0 port had the same description as the first FEX server port

Needless to say, it took a lot of time to recover the switches. Config parsing bug?

regards,

Geert

Rajeshkumar Gatti · ‎12-24-2015

You are most likely hitting CSCul22703 because you did a disruptive upgrade between incompatible images.

The possibility of hitting this issue is documented in the 7.x release notes.

The bug's release note mentions a workaround, which is worth knowing for any future upgrades you have on the radar.

-Raj

P.S.: Please rate threads that help, as it saves time for your peers with similar issues.

gnijs · ‎01-08-2016

That might be right, but the documentation is very unclear on how to avoid this bug, if that is possible at all. (The workaround is only a way to quickly restore the config; it doesn't avoid the problem, and multiple reboots are still required to restore the FC config, for example.)

Isn't there any way to avoid it altogether by upgrading first to a specific 7.0 or 7.1 release and then up to 7.2 without losing anything? However, there is still no clear upgrade matrix from Cisco (and I am not even talking about ISSU upgrades, but plain "reload" upgrades).

This bug is a nasty one and has led to a lot of downtime, especially because after the upgrade the interfaces come back online but without configuration (trunk config and VLANs lost). Even shutting them down before the upgrade doesn't help (they come up default, unshut). Also, the mgmt interface getting shut down prevents remote upgrades and sometimes requires intervention by a local engineer. (By the way: this can be avoided by configuring an in-band mgmt IP address, which I highly recommend before doing this upgrade remotely with no console access.)

I hope never to encounter it again in future upgrades. Throwing away 50% of the config and shutting down the mgmt port at the same time: I have never experienced that in my 20+ years in networking. Although it can get even worse, I guess: there is even a "bug" (?) where you simply brick your Nexus if you do a 'reload' upgrade on a specific model. Oh horror.

(Although, when I think about it, bricking might actually be better than coming online with "some" config.)
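Since the switch comes back up with only part of its config, one way to see what went missing is to save the running-config before the upgrade and diff it against the running-config afterwards. Below is a rough Python sketch of that idea. The function name `lost_lines` and the sample config text are made up for illustration; this is not a Cisco tool and says nothing about how the bug works internally.

```python
import difflib


def lost_lines(before: str, after: str) -> list[str]:
    """Return config lines that were present before the upgrade but not after."""
    diff = difflib.ndiff(before.splitlines(), after.splitlines())
    # ndiff marks lines that exist only in the first input with "- "
    return [line[2:] for line in diff if line.startswith("- ")]


# Hypothetical saved configs (illustrative only, not taken from a real switch)
before = """interface Ethernet110/1/1
  switchport mode trunk
  switchport trunk allowed vlan 10,20
interface mgmt0
  ip address 192.0.2.10/24
"""

after = """interface Ethernet110/1/1
interface mgmt0
  ip address 192.0.2.10/24
"""

for line in lost_lines(before, after):
    print("LOST:", line)
```

Keeping the pre-upgrade copy somewhere off the switch (for example on a TFTP/SCP server) means the comparison can still be done when the switch itself is unreachable.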

Rajeshkumar Gatti · ‎01-08-2016

We do not recommend a plain reload upgrade using the traditional method on the Nexus 5k/6k platform, due to issues with loss of FEX config and other hardware programming issues. A warning was added for when a customer attempts this change:

N5K-5672.76(config)# boot system bootflash:/n6000-uk9-kickstart.7.0.4.N1.1.bin
Warning: Changing bootvariables and reloading is not recommended on this platform. Use install all command for NX-OS upgrades/downgrades.

The warning was added under CSCuo34379.

The usual method is to always upgrade using "install all" between compatible images, which can sometimes be a two-step process depending on how old the code is.

I agree that this can be a painful exercise if you already end up in a broken state.

My recommendation is always to review the release notes to understand any known caveats and, if anything is unclear, to open a TAC case to get it clarified.

-Raj

gnijs · ‎01-08-2016

First, it might not have been clear from my previous posts, but we always used the "install all" method. The only warning you get is "a reboot is required because of incompatible software". It always says that for major version upgrades, and a reboot was planned for.

I can understand that some versions are incompatible for ISSU upgrades, but now there seem to be incompatible versions even for disruptive upgrades.

Second, I still believe there is no way to upgrade from 5.x to 7.2.x without hitting the bug. The release notes seem to indicate (but not clearly) that the bug happens somewhere between 7.0 and 7.2. I read all the release notes (5.x, 6.x, 7.x, 7.2.x).

The upgrade/downgrade path tables in the release notes suggest that there is compatibility from

5.2(1)N1(4) to 7.0(5)N1(1)

and from 7.0(6)N1(1) to 7.2(1)N1(1)

but from 7.0(5)N1(1) to 7.0(6)N1(1): no info???

So I assume the bug is located between 7.0(5) and 7.0(6).

We also upgraded two N5672s from 7.0(5)N1(1) to 7.2(1)N1(1) (with install all). They were also affected, so this experience seems to support that.
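To make the gap explicit, here is a small Python sketch that treats the documented upgrade pairs as a directed graph and searches for a chain of supported hops. The table holds only the two pairs quoted above, not Cisco's full upgrade matrix, and the names `SUPPORTED` and `find_path` are made up for this example.

```python
from collections import deque

# Only the upgrade pairs quoted in this thread; NOT the full release-notes matrix.
SUPPORTED = {
    "5.2(1)N1(4)": ["7.0(5)N1(1)"],
    "7.0(6)N1(1)": ["7.2(1)N1(1)"],
}


def find_path(table, start, target):
    """Breadth-first search for the shortest chain of supported upgrade hops."""
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        if path[-1] == target:
            return path
        for nxt in table.get(path[-1], []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None  # no documented chain of hops reaches the target


print(find_path(SUPPORTED, "5.2(1)N1(4)", "7.2(1)N1(1)"))
```

With just these two entries the search finds no path, which is exactly the missing 7.0(5) to 7.0(6) hop. If that hop were documented as supported (an assumption, the release notes don't say so), the result would be the four-step chain 5.2(1)N1(4), 7.0(5)N1(1), 7.0(6)N1(1), 7.2(1)N1(1).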

Rajeshkumar Gatti · ‎01-08-2016

Let me review the release notes again with regard to your query and get back to you on this thread:

"I still believe there is no way to upgrade from 5.x to 7.2.x without hitting the bug. The release notes seem to indicate (but not clearly) that the bug happens somewhere from 7.0 -> 7.2. I read all the release notes (5.x 6.x 7.x 7.2.x)."

teracomcco · ‎05-27-2016

I hit the same problem. I used the install all command and ran all the pre-checks.
I upgraded from n5000-uk9-kickstart.7.0.2.N1.1.bin to 7.3.0.

The mgmt interface had config from another FEX interface, with full duplex and speed settings, so I lost access to the switch.

The config was totally wrong, with VLANs missing on interfaces.

Many VLANs were missing, and new VLANs which we never had before on our switches appeared instead.

Because of that, the secondary switch, which was supposed to provide redundancy but was in the same VDC domain, went down too and the whole network went down, I suppose because of a lot of suspended VLANs and spanning-tree recalculations.

It took me almost 3 hours to find all the errors, and then I had to involve a lot of on-call people because they needed to restart some database servers, and there were a lot of incident reports to write.

Completely unacceptable issue/bug. I have never seen this kind of error in my 16 years of working with Cisco and other products. This issue should be escalated and fixed once and for all. All NX-OS images which are affected by this issue should be removed from the Cisco website.

redouan.boussebaa · ‎11-29-2016

During a disruptive upgrade from 5.2.1 to 7.0.5 I hit the same bug: lost connectivity to the FEX and saw the mgmt interface forced into shutdown.

The upgrade/downgrade path tables in the release notes suggest that there is compatibility from

5.2(1)N1(3) to 7.0(5)N1(1)

Upgrade to 6.x instead of 7 until this bug is fixed.

Cisco Nexus 5548UP upgrade fail - version 7.2