
Dual-Sup 9410 ISSU upgrade gone bad: "critical boot tasks failed"

RVTim
Level 1

One of my three 9410s did not complete a full ISSU upgrade. The image was copied to the device, the hash was verified, and I used this command to kick off the upgrade:

install add file flash:cat9k_iosxe.17.09.04a.SPA.bin activate issu commit

When the standby sup reloaded, it went into a boot loop where this all repeats:

=================================================================================

Initializing Hardware......

System Bootstrap, Version 17.8.1r[FC1], RELEASE SOFTWARE (P)
Compiled Tue 02/01/2022 13:16:47.55 by rel

Current ROMMON image : Primary
Last reset cause : SoftwareResetTrig
C9400-SUP-1 platform with 16777216 Kbytes of main memory

Preparing to autoboot. [Press Ctrl-C to interrupt] 0
boot: attempting to boot from [bootflash:packages.conf]
boot: reading file packages.conf
#


Oct 26 00:11:38.624: %BOOT-3-SYSD_STARTFAIL: R1/0: Failed to launch boot task mount_packages.service ( exit-code )
Oct 26 00:11:39.289: %BOOT-0-BOOT_COMPLETE_FAIL: R1/0: Critical boot tasks failed: * *

=================================================================================
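For anyone who captures the console during a boot loop like this, here is a minimal Python sketch that pulls the BOOT failure messages out of the captured text. It assumes the `%FACILITY-SEVERITY-MNEMONIC: location: text` layout seen in the two lines above; the function name `find_boot_failures` and the severity threshold are illustrative choices, not part of any Cisco tool.

```python
import re

# Assumed layout, based on the two console lines in this post:
#   <timestamp>: %FACILITY-SEVERITY-MNEMONIC: <location>: <text>
SYSLOG_RE = re.compile(
    r"%(?P<facility>[A-Z0-9_]+)-(?P<severity>[0-7])-(?P<mnemonic>[A-Z0-9_]+):\s*"
    r"(?P<location>\S+):\s*(?P<text>.*)"
)


def find_boot_failures(console_text, max_severity=3):
    """Return (mnemonic, location, text) for BOOT messages with severity <= max_severity."""
    failures = []
    for line in console_text.splitlines():
        match = SYSLOG_RE.search(line)
        if not match:
            continue  # banner lines, ROMMON output, blank lines
        if match["facility"] == "BOOT" and int(match["severity"]) <= max_severity:
            failures.append(
                (match["mnemonic"], match["location"], match["text"].strip())
            )
    return failures


sample = """\
Oct 26 00:11:38.624: %BOOT-3-SYSD_STARTFAIL: R1/0: Failed to launch boot task mount_packages.service ( exit-code )
Oct 26 00:11:39.289: %BOOT-0-BOOT_COMPLETE_FAIL: R1/0: Critical boot tasks failed: * *
"""

for mnemonic, location, text in find_boot_failures(sample):
    print(f"{location} {mnemonic}: {text}")
```

Run against a longer capture, this makes it easy to see whether the same task (here `mount_packages.service`) fails on every pass of the loop.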

Since the standby wouldn't boot, I dropped to ROMMON and tried booting manually via packages.conf, but that had the same result. Luckily I had a USB drive inserted with the software on it, so I booted the .bin file using "boot usbflash0: <filename>" and that brought the switch up. Once the standby (slot 6) was booted, the ISSU process kept going: it upgraded the slot 5 SUP, and that one auto-booted successfully. At that point I had slot 6 as ACTIVE and slot 5 as STANDBY HOT. With both booted, I copied the packages.conf file from the good sup to the other one, in case that file was bad. Then I did a force-switchover, and it still failed to boot.

At this point the switch works fine, but I know I'm not in a good place yet. The ISSU process hasn't fully completed: "show install summary" shows slot 6 as "Activated & Committed" but slot 5 (the good one) as "Activated & Uncommitted". My guess is that I could run the commit command on that one and it would show as finished.

Here's some info:

==================================================================================================

MY-9410#sh redund
Redundant System Information :
------------------------------
Available system uptime = 1 year, 35 weeks, 1 hour, 58 minutes
Switchovers system experienced = 10
Standby failures = 1
Last switchover reason = active unit removed

Hardware Mode = Duplex
Configured Redundancy Mode = sso
Operating Redundancy Mode = sso
Maintenance Mode = Disabled
Communications = Up

Current Processor Information :
-------------------------------
Active Location = slot 5
Current Software state = ACTIVE
Uptime in current state = 18 hours, 41 minutes
Image Version = Cisco IOS Software [Cupertino], Catalyst L3 Switch Software (CAT9K_IOSXE), Version 17.9.4a, RELEASE SOFTWARE (fc3)
Technical Support: http://www.cisco.com/techsupport
Copyright (c) 1986-2023 by Cisco Systems, Inc.
Compiled Fri 20-Oct-23 10:44 by mcpre
BOOT = bootflash:packages.conf;
CONFIG_FILE =
Fast Switchover = Disabled
Initial Garp = Disabled

Peer Processor Information :
----------------------------
Standby Location = slot 6
Current Software state = STANDBY HOT
Uptime in current state = 18 hours, 8 minutes
Image Version = Cisco IOS Software [Cupertino], Catalyst L3 Switch Software (CAT9K_IOSXE), Version 17.9.4a, RELEASE SOFTWARE (fc3)
Technical Support: http://www.cisco.com/techsupport
Copyright (c) 1986-2023 by Cisco Systems, Inc.
Compiled Fri 20-Oct-23 10:44 by mcpre
BOOT = bootflash:packages.conf;
CONFIG_FILE =

 

MY-9410#sh issu state det
Current ISSU Status: In Progress
Previous ISSU Operation: Successful
=======================================================
System Check Status
-------------------------------------------------------
Platform ISSU Support Yes
Standby Online Yes
Autoboot Enabled Yes
SSO Mode Yes
Install Boot Yes
Valid Boot Media Yes
Operational Mode HA-STANDALONE
=======================================================
Added Image:
Name Compatible
-------------------------------------------------------
17.09.04a.0.6 Yes

Operation type: One-shot ISSU
Install type : Image installation using ISSU
Current state : Activated state
Last operation: Switchover

Completed operations:

Operation Start time
-------------------------------------------------------
Activate location standby R1 2023-10-25:19:04:16
Activate location active R0 2023-10-25:19:39:31
Switchover 2023-10-25:19:40:43

State transition: Added -> Standby activated -> Active switched-over

Auto abort timer: inactive
Abort Reason: N/A
Running image: bootflash:packages.conf
Operating mode: sso, terminal state reached

 


MY-9410#show install summ
[ R0 ] Installed Package(s) Information:
State (St): I - Inactive, U - Activated & Uncommitted,
C - Activated & Committed, D - Deactivated & Uncommitted
--------------------------------------------------------------------------------
Type St Filename/Version
--------------------------------------------------------------------------------
IMG U 17.09.04a.0.6


[ R1 ] Installed Package(s) Information:
State (St): I - Inactive, U - Activated & Uncommitted,
C - Activated & Committed, D - Deactivated & Uncommitted
--------------------------------------------------------------------------------
Type St Filename/Version
--------------------------------------------------------------------------------
IMG C 17.09.04a.0.6

--------------------------------------------------------------------------------
Auto abort timer: inactive
--------------------------------------------------------------------------------

==================================================================================================
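To make the committed/uncommitted check repeatable across several chassis, the "show install summary" text can be parsed with a few lines of Python. This is only a sketch: it assumes the output layout pasted above, and the function names `parse_install_summary` and `uncommitted` are made up for illustration.

```python
import re


def parse_install_summary(text):
    """Map each route-processor label (e.g. 'R0') to a list of (type, state, version) rows."""
    result = {}
    current = None
    for line in text.splitlines():
        header = re.match(r"\[\s*(\S+)\s*\]\s+Installed Package", line)
        if header:
            current = header.group(1)
            result[current] = []
            continue
        # Assumed row layout: an upper-case type, a one-letter state, a version.
        row = re.match(r"\s*([A-Z]+)\s+([IUCD])\s+(\S+)\s*$", line)
        if row and current is not None:
            result[current].append(row.groups())
    return result


def uncommitted(summary):
    """Return the labels that have at least one package not in state 'C'."""
    return sorted(
        rp for rp, rows in summary.items()
        if any(state != "C" for _, state, _ in rows)
    )


sample = """\
[ R0 ] Installed Package(s) Information:
State (St): I - Inactive, U - Activated & Uncommitted,
C - Activated & Committed, D - Deactivated & Uncommitted
--------------------------------------------------------------------------------
Type St Filename/Version
--------------------------------------------------------------------------------
IMG U 17.09.04a.0.6

[ R1 ] Installed Package(s) Information:
State (St): I - Inactive, U - Activated & Uncommitted,
C - Activated & Committed, D - Deactivated & Uncommitted
--------------------------------------------------------------------------------
Type St Filename/Version
--------------------------------------------------------------------------------
IMG C 17.09.04a.0.6
"""

summary = parse_install_summary(sample)
print(uncommitted(summary))  # ['R0']
```

With the output above, only R0 is reported, which matches the "U" state shown for that supervisor.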

 

 

The question is, where do I go from here? I could commit the upgrade. But the two errors, "R1/0: Failed to launch boot task mount_packages.service ( exit-code )" and "R1/0: Critical boot tasks failed: * *", suggest that either some service couldn't mount something or some file wouldn't boot. I'm running in Install mode, not Bundle mode (although the standby sup is booted from the .bin file, which is like Bundle mode).

Do I commit this upgrade, do an "install remove inactive" to free some drive space, copy the .bin file back to bootflash: and run the upgrade command again? Maybe that would re-run the whole upgrade process and fix whatever isn't happy. Do I wipe the bootflash and then expand the files back onto it? Do I copy all of the files one by one from the good SUP to the other SUP and then try to boot? If I had a spare chassis I'd put the sup in it, wipe it, install the same software on it, and then put it back in the original chassis and let it pair up again, but I don't have that option right now.

Any good ideas on how to go on from here so that I can be confident that the next upgrade goes well? Or, do I just ride it out until the next inevitable upgrade and cross my fingers during that one?

7 Replies

Leo Laohoo
Hall of Fame
install abort issu

NOTE:
Personally, I neither like nor recommend ISSU or FSU/eFSU/xFSU. My reason is that I have personally seen too many "code brown" moments (where I work). And in this forum, a few of us have been fixing other people's "code brown" too. I have always maintained that ISSU and FSU/eFSU/xFSU only work in "corner cases", a lab environment or a demo.

RVTim
Level 1

I understand. I do feel that way about the 4500 platform. On the 6500 platform I did ISSU upgrades for years and found that *if* you take some additional steps, the upgrade can go just fine, but you have to manage the upgrade rather than let it run by itself. I hadn't seen it go bad on the 9400s before this. I'm not sure whether this problem was caused by the ISSU process or was just a fluke. On these, if you don't do ISSU, what method do you like to use? Do you just remove "issu" from the command and then do a complete reload? If so, I don't know that I'd have had more success. My gut feeling is that something went wrong when it installed the package, which seems like it could happen in Install mode anyway, though probably not when booted in Bundle mode. I've never run a 9300/9400 series in Bundle mode, though, so I don't have experience with that.

I developed my own method of upgrading the firmware of the 9500, ASR & ISR routers, 9800 controllers (without using PI or DNAC) because the Cisco "recommended" method does not give me the flexibility to reboot the 9500 on a later date.  The method I have developed is called One-Hit-Wonder (NSFW version) (see attachment).   NSFW because it is not a Cisco recommended procedure.  

I have been testing and polishing this process for 4 years since developing it, and I have not "lost" a single appliance to a fault in the process. It works. I unpack the packages at any time during business hours and, for example, schedule a reboot at 7am the next morning.

RVTim
Level 1

Leo,

The process looks simple enough. I see the benefit, because you can then do "reload at xx:xx" and schedule it. I must admit that, after doing this for 20+ years, once Cisco switched to the package files I never really understood what was going on during the boot process with all those files. So, when you rename the packages.conf file and then obviously you're booting the firmware.conf file, I don't know what ramifications that has. I don't know the difference between the 2. What would have happened if you had just left it to boot packages.conf? Shouldn't that also reboot the switch, or would it get upset and not continue because you booted that file? Also, are you in effect booting it similarly to booting the .bin file? I did read that booting the .bin file requires more memory because it has to load the whole file into memory. Not sure if that matters a lot though. Anyway, it's nice to see that the process works for you. If I had anything that wasn't in production, I'd try it some day.


@RVTim wrote:
So, when you rename the packages.conf file and then obviously you're booting the firmware.conf file, I don't know what ramifications that has.  I don't know the difference between the 2.

Same reason as people's option/choice to use Bundle Mode. Every time one has to upgrade the firmware in Bundle Mode, one of the steps is to replace the boot variable string so that it points to the new BIN file. If the boot variable string syntax is wrong, the platform either boots the wrong firmware or, worse, boots into ROMMON (CSCvg37458).

But if the boot variable string always points to "packages.conf", I can just rename the "firmware.conf" into "packages.conf".  It is simpler but, most importantly, minimizes the risk of having the wrong boot variable string syntax.  
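As a small illustration of that point, here is a Python sketch that checks whether every "BOOT =" line in a "show redundancy" capture points to packages.conf. It assumes the "BOOT = bootflash:packages.conf;" format shown earlier in this thread, with entries separated by ";". The function names `boot_targets` and `all_point_to` are hypothetical helpers, not part of any Cisco tooling.

```python
import re


def boot_targets(show_redundancy_text):
    """Return one list of boot entries per 'BOOT = ...' line, in order of appearance."""
    targets = []
    for line in show_redundancy_text.splitlines():
        match = re.match(r"\s*BOOT\s*=\s*(.*)$", line)
        if match:
            # Assumption: multiple boot entries are separated by ';'.
            entries = [e.strip() for e in match.group(1).split(";") if e.strip()]
            targets.append(entries)
    return targets


def all_point_to(targets, expected="bootflash:packages.conf"):
    """True only if at least one BOOT line exists and each one lists `expected` first."""
    return bool(targets) and all(entries[:1] == [expected] for entries in targets)


sample = """\
BOOT = bootflash:packages.conf;
CONFIG_FILE =
BOOT = bootflash:packages.conf;
CONFIG_FILE =
"""

print(all_point_to(boot_targets(sample)))
```

A capture from a switch whose boot variable points at a .bin file would make `all_point_to` return False, which is the condition worth catching before a scheduled reload.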


@RVTim wrote:
I did read that booting the .bin file requires more memory because it has to load the whole file into memory.   Not sure if that matters a lot though. 

IOS-XE leaks memory like there is no tomorrow. Anything that minimizes memory utilization is good in my book.

It is also not possible to apply an SMU if the appliance has booted in Bundle Mode. Take, for example, the SMU to fix CSCwh87343 (Software Fix Availability for Cisco IOS XE Software Web UI Privilege Escalation Vulnerability - CVE-2023-20198). If the appliance boots in Bundle Mode, the SMU cannot be applied.

And one last, very important thing: there is a bug feature that I want to share. Notice the "gotcha" sections? It does not matter which method is used (Cisco-recommended or One-Hit-Wonder), there will be occasions where the system refuses, or is unable, to rename the existing packages.conf file. The Cisco-recommended method does not have this check, and this causes the appliance to reboot into the current version rather than the intended one. The process I have developed incorporates a manual, albeit archaic, method to double-check.

Leo Laohoo
Hall of Fame

@RVTim

Although CSCwh76420 only applies to Catalyst 9800 WLC, it is still worth noting.  

CSCwe62246 is a bug for 9400 and ISSU. 

RVTim
Level 1

I just wanted to follow up on this: I did get the system back to working fine. I did a few things, and somewhere along the way the sup became able to boot unassisted.

1) I did an "install remove inactive".

2) I copied the .bin file again to the active bootflash:  (slot 5)

3) I ran the install command again with the word "force" appended.  It didn't seem to do much.

The slot 6 Sup still wouldn't boot on its own.

4) I manually booted slot 6 SUP

5) I did a force-switchover to make slot 6 active

6) I ran the install command again with force. It didn't seem to do anything.

7) I began the CPLD updates by updating the standby (Slot 5)

8) I failed over and updated the CPLD on slot 6. This time it booted itself fine.

9) I reloaded the entire chassis with "redundancy reload shelf"

Everything came up just fine and works well now. Just wanted to throw that all out there in case someone reads this.
