
C9800-80 in HA cannot install APSP6

revision6420
Level 1

Hello everyone,

I'm trying to figure out a frustrating issue. 

Attempting an install of APSP6 keeps bombing out and I'm unsure what steps to take next. 

Software: Version 17.09.04a

ROM: 17.3(3r)

Error in question:

 

 

Dec 13 21:21:01 Eastern: %INSTALL-3-OPERATION_ERROR_MESSAGE: Chassis 1 R0/0: install_engine: Failed to install_add package bootflash:C9800-universalk9_wlc.17.09.04a.CSCwh93727.SPA.apsp.bin, Error: FAILED: install_add /bootflash/C9800-universalk9_wlc.17.09.04a.CSCwh93727.SPA.apsp.bin: Improper State./bootflash/C9800-universalk9_wlc.17.09.03.CSCwe97460.SPA.apsp.bin not present. Please restore file for stability.

 

 

 

C9800-universalk9_wlc.17.09.03.CSCwe97460.SPA.apsp.bin (APSP1) is not on the controller at all when I try to install C9800-universalk9_wlc.17.09.04a.CSCwh93727.SPA.apsp.bin (APSP6) 

I have tried from the webui and cli to install the APSP and it fails each time. 

C9800-universalk9_wlc.17.09.03.CSCwe97460.SPA.apsp.bin is mentioned in an error each time, so clearly there is something deeper in the controller that is wrong. 

I have attempted failovers and reloads and during bootup, I also get this error. 

 

 

RSA Signed RELEASE Image Signature Verification Successful.

Image validated

Dec 14 01:48:55.694: %BOOT-3-BOOTTIME_SMU_MISSING_DETECTED: R0/0: install_engine: SMU file /bootflash/C9800-universalk9_wlc.17.09.03.CSCwe97460.SPA.apsp.bin missing and system impact will be unknown
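For anyone collecting these messages from a syslog export, here is a minimal Python sketch that pulls out the file names the controller reports as missing. The two patterns are based only on the messages quoted in this post; the function name `missing_smu_files` is made up for illustration, and other releases may word the messages differently.

```python
import re

# Patterns derived only from the two messages quoted in this thread.
# Treat them as assumptions; other IOS XE releases may differ.
MISSING_SMU = re.compile(
    r"%BOOT-3-BOOTTIME_SMU_MISSING_DETECTED:.*?SMU file (\S+) missing"
)
NOT_PRESENT = re.compile(r"Improper State\.\s*(\S+) not present")


def missing_smu_files(log_text):
    """Return the set of SMU/APSP file paths the log reports as missing."""
    found = set()
    for pattern in (MISSING_SMU, NOT_PRESENT):
        found.update(pattern.findall(log_text))
    return found
```

Run against both the boot-time message and the install_add error above, it returns the same single path (the 17.09.03 APSP1 file), which suggests both errors come from the same stale reference.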

 

 

 

This is my show install summary 

 

 

 

show install summary
[ Chassis 1/R0 2/R0 ] Installed Package(s) Information:
State (St): I - Inactive, U - Activated & Uncommitted,
            C - Activated & Committed, D - Deactivated & Uncommitted
--------------------------------------------------------------------------------
Type  St   Filename/Version
--------------------------------------------------------------------------------
IMG   C    17.09.04a.0.6

--------------------------------------------------------------------------------
Auto abort timer: inactive
--------------------------------------------------------------------------------
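To make the point of that output explicit: the only row in the table is the committed base image, so the summary shows no record of APSP1 even though the boot process still references it. A rough Python sketch for checking this across several controllers is below. The row pattern is an assumption based solely on the table layout pasted above, and `parse_install_summary` is a hypothetical helper name.

```python
import re

# Matches table rows such as "IMG   C    17.09.04a.0.6".
# Layout assumed from the output pasted in this post.
ROW = re.compile(r"^(?P<type>[A-Z]+)\s+(?P<state>[IUCD])\s+(?P<name>\S+)\s*$")


def parse_install_summary(text):
    """Return a list of (type, state, filename_or_version) tuples."""
    rows = []
    for line in text.splitlines():
        m = ROW.match(line.strip())
        if m:
            rows.append((m["type"], m["state"], m["name"]))
    return rows


def non_image_rows(text):
    """Return every row that is not the base image (type IMG)."""
    return [row for row in parse_install_summary(text) if row[0] != "IMG"]
```

For the output above, `non_image_rows` returns an empty list: nothing besides the image is recorded as installed.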

 

 

 

I have even tried reinstalling APSP1. 

Below are the webui logs of that attempt. 

 

 

INSTALL ADD OPERATION:

--- Analyzing file C9800-universalk9_wlc.17.09.03.CSCwe97460.SPA.apsp.bin ---
Package Type is APSP
Initiating INSTALL_ADD operation for the package C9800-universalk9_wlc.17.09.03.CSCwe97460.SPA.apsp.bin
install_add: START Wed Dec 13 21:28:41 Eastern 2023
install_add: Adding SMU
install_add: Checking whether new add is allowed ....
install_add: install-add is allowed.

--- Starting initial file syncing ---
[1]: Copying bootflash:C9800-universalk9_wlc.17.09.03.CSCwe97460.SPA.apsp.bin from chassis 1/R0 to chassis 2/R0
[2]: Finished copying to chassis 2/R0
Info: Finished copying bootflash:C9800-universalk9_wlc.17.09.03.CSCwe97460.SPA.apsp.bin to the selected chassis
Finished initial file syncing

--- Starting SMU Add operation ---
Performing SMU_ADD on all members
[1] SMU_ADD package(s) on chassis 1/R0
FAILED: install_add /bootflash/C9800-universalk9_wlc.17.09.03.CSCwe97460.SPA.apsp.bin: Invalid installation package. Version 17.09.03 does not match with 17.09.04a.0.6.
[1] Finished SMU_ADD on chassis 1/R0
[2] SMU_ADD package(s) on chassis 2/R0

FAILED: install_add /bootflash/C9800-universalk9_wlc.17.09.03.CSCwe97460.SPA.apsp.bin: Invalid installation package. Version 17.09.03 does not match with 17.09.04a.0.6.
[2] Finished SMU_ADD on chassis 2/R0
Checking status of SMU_ADD on [1/R0 2/R0]
SMU_ADD: Passed on []. Failed on [1/R0 2/R0]
Finished SMU Add operation

FAILED: install_add exit(1) Wed Dec 13 21:29:53 Eastern 2023
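The log shows why re-adding APSP1 cannot work as a fix: the package was built for 17.09.03 and both chassis reject it because they are running 17.09.04a.0.6. If you want to pull that out of a saved log automatically, here is a small Python sketch. The pattern is taken from the FAILED lines above and nothing else, and `version_mismatches` is a hypothetical helper name.

```python
import re

# Pattern copied from the FAILED lines in the log above; an assumption
# for any other release or package type.
MISMATCH = re.compile(
    r"FAILED: install_add (?P<path>\S+): Invalid installation package\. "
    r"Version (?P<pkg>\S+) does not match with (?P<running>\S+?)\.?$"
)


def version_mismatches(log_text):
    """Return (package_path, package_version, running_version) per failure."""
    results = []
    for line in log_text.splitlines():
        m = MISMATCH.search(line.strip())
        if m:
            results.append((m["path"], m["pkg"], m["running"]))
    return results
```

On the log above this yields two entries, one per chassis, each pairing package version 17.09.03 with running version 17.09.04a.0.6.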

 

 

 

Any help at this point would be greatly appreciated! 

 

 

 

 

 

 

28 Replies

@Leo Laohoo we can't just say "Do not upgrade to 17.9.4/17.9.4 or 17.9.5" without any explanation or alternatives.  17.9.5 is the current TAC recommended release which means it's working fine for most customers.  We've been running stable (no major issues) on 17.9.4 with SMUs & APSP6 since November last year.  Rather provide a specific warning on who should be wary of those releases and why.

The TAC recommended doc (link below) does include this warning:
"If you have 9162 APs, be aware of CSCwj45141 
which is an issue that started in 17.9.4APSP8"
That also affects 17.9.5 and the bug currently has 5 customer TAC cases attached to it.  So that is certainly one bug TAC is warning about in those specific releases.  Note that CSCwj45141 does not affect 17.12.3 so if CSCwj45141 is a concern then 17.12.3 is an alternative.

Apologies, let me expand further. 

Currently, I have two pairs of 9800-80 (HA SSO) on 17.9.5.  I have another two standalone pairs of 9800-80 in 17.9.5.  I have two pairs of 9800-L on 17.9.5 and another two 9800-L on 17.12.3.

Out of all of these, I have more than seven (7) TAC cases, all about 17.9.5:

1.  A 9800-L just went into "dark mode":  One evening, it stopped.  All the LEDs, including the uplinks, just went dark.  The console port stopped responding.  We had to cold-reboot the 9800-L TWICE.   No crash logs found.  

2.  One pair of 9800-80 (HA SSO), with 5450 APs, has a memory leak.  Several processes are the culprits and TAC is trying to find out what is triggering the leak.  The memory leak case started with 17.9.4, continued on 17.9.4a and is still present on 17.9.5.

3.  Every 21 days, a pair of 9800-80 (HA SSO) stops passing traffic to random APs.  Wireless clients are stuck in the Authenticating state without any IP addresses.


@Rich R wrote:
Any of those problems seen on 17.12.3?

We have not seen anything sinister crop up in 17.12.3; however, it is still too early (uptime-wise) to tell.  Ask me again 4 weeks from now.

17.12.3: if you have 9130s check the output of:

show ap dot11 5ghz cleanair summary | i Down

and if you have a large number of 9130s with cleanair down on 5ghz, weirdly enough the solution is to disable cleanair on 2.4ghz globally and reboot the affected 9130s. Affected 9130s will also periodically disjoin from the WLC. 

other than that 17.12.3 has been pretty stable for me. Also running ROMMON 17.3(3r) on my 9800-80 pair. 

This issue is specific to 9130s.


@jasonm002 wrote:

and if you have a large number of 9130s with cleanair down on 5ghz, weirdly enough the solution is to disable cleanair on 2.4ghz globally and reboot the affected 9130s. Affected 9130s will also periodically disjoin from the WLC. 


I think this is CSCwh49406 and we've hit this with our 9130 on 17.12.2.  It took one week for the one particular 9130 to fully saturate an mGIG port to 100%.  (https://imgur.com/Odfo467)

According to the Release Notes, this bug is fixed in 17.12.3, hence, we upgraded to 17.12.3.  So far, we have not seen this bug re-appear (yet!).  

 


@jasonm002 wrote:

Also running ROMMON 17.3(3r) on my 9800-80 pair. 


17.12(2r) is out.

Had ROMMON 17.12(2r) on lab since it came out in December.  Will include with next IOS upgrade on production.

We had to cold-reboot one of our 9800-80s to get ROMMON 17.12(2r) to install.

CSCwh49406 was just for the syslog spam but it appears they never fixed the underlying issue that was causing the syslog spam - they just fixed the syslog spam. So even in 17.12.3 cleanair sensors are still getting disabled on 9130s in my deployment (and apparently 9130s are disjoining the WLC periodically if they become affected) - they're just not spamming syslog about it anymore.

So I would still recommend checking if you have cleanair sensors down on 9130s and if so, the workaround to that issue is the same as the workaround in CSCwh49406 even though this is apparently a different bug (disable cleanair on 2.4ghz and reboot affected 9130s).


@jasonm002 wrote:
CSCwh49406 was just for the syslog spam

Not for us. 

CSCwh49406 was chewing up 100% of an mGIG port (5 Gbps) and slowed down DHCP and DNS at a site.  If we put the AP in an isolation VLAN, everything normalizes.   

The workaround does not work 100%.  I even had TAC confirm what I was doing but the problem persisted until we upgraded, from 17.12.2, to 17.12.3.

Interesting that that bug (CSCwh49406) is a "moderate" severity, and pretty much every AP model and IOS/AireOS code version is indicated as affected, when it seems to only affect 9130s on 17.12.1 and .2. Documentation's really on point here.


@eglinsky2012 wrote:
Documentation's really on point here.

Boy, do I have a "face-palm" moment for you.  Make that a "triple face-palm".

Look at CSCwh74663 (https://imgur.com/a/FjDwF0o).  Look at the Known Affected Releases.  

Next, read the "Conditions" section, which specifically states: 

SW version (Please note, current Bug Search Tool shows wrong Affected Releases. Followings are the all releases affected by this defect): AireOS 008.10.185.3, 008.10.190.0, 008.5.182.11, 008.5.182.108(IRCM) IOS-XE 17.12.1, 17.9.4, 17.9.4a, 17.6.6, 17.6.6a, 17.3.8, 17.3.8a

WTF, dude?  Someone, from Cisco, knew that the Known Affected Releases contain "questionable" details but did not "give an f" to fix it but, rather, put this note down.  Da fuq?  Maybe the "left hand" and the "right hand" are not on speaking terms?

Oh, and look at the "Frequency" section:  "rarely seen".  I would say that describing 28 cases (as of 26 April 2024) as "rarely seen" is an understatement, wouldn't you agree?


CSCwj45141 has been updated with 17.9.5 added to the Known Affected Releases.

jasonm002
Level 1

TAC is filing a bug report on how 9300s behave when an SMU referenced in a rollback point is missing from flash, and we're working with them on a separate case about the same behavior on 9800s. I have a feeling most of this is broken generic IOS XE, so hopefully fixes on one platform will propagate to the others.
