9800-80 active and standby configuration out of sync - 17.12.4

eglinsky2012
Level 4

I am attempting to upgrade our 9800-80 HA pairs to version 17.12.4 to resolve a certain bug. The upgrade from 17.9.5 went smoothly in the lab, but upgrading a pre-production pair from 17.9.4a has resulted in the standby being stuck in a boot loop with the following message, which appears after a successful bulk sync:

Chassis 2 reloading, reason - Active and Standby configuration out of sync

This was a normal install-mode upgrade by GUI, not an ISSU upgrade.

I have notified TAC, but meanwhile, has anyone else experienced this and know of a resolution?

I have attached the console log from the bootup sequence to the point at which the sync issue and reboot occur.
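For reference, the usual IOS-XE commands for watching the HA/sync state from the active unit are below (nothing exotic, and the exact output varies by release):

show chassis                               ! chassis members, roles, and current state
show redundancy                            ! SSO progression and peer status
show redundancy config-sync failures mcl   ! any mismatched commands blocking the sync
show logging | include SYNC|RELOAD         ! recent sync and reload events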

2 Accepted Solutions


@Rich R wrote:

Frankly I'm dubious that deleting the binary config (which file is that by the way?) will make any difference. 


The instructions were to run the following commands on both the active and the standby. I was able to do this via the console port since we have a console server; otherwise, they suggested SSH to the RMI IP.

 

delete /force /recursive bootflash:.dbpersist/persistent-config.tar.gz

delete /force /recursive bootflash:.dbpersist/persistent-config.meta-

 

Then reload the stack, both units together ("reload" command).
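For completeness, the full sequence on each unit looked roughly like this. The second filename above appears truncated here, so list the directory first and delete whatever persistent-config files are actually present:

dir bootflash:.dbpersist/           ! confirm which persistent-config files exist
delete /force /recursive bootflash:.dbpersist/persistent-config.tar.gz
! repeat the delete for the .meta file(s) shown in the dir output
reload                              ! once both units are cleaned, reload them together
! after bootup:
show redundancy                     ! confirm the peer reaches STANDBY HOT (SSO)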


eglinsky2012
Level 4

@Rich R, @Leo Laohoo - APSP2 for 17.12.4 is now out: https://software.cisco.com/download/home/286321396/type/286325254/release/17.12.4

Not sure what happened to APSP1... and I didn't even get an email for this release, even though I've checked and re-checked my email notifications for each category of 9800 updates! Grrr.

In other news, here's the final verdict from TAC on my 17.12.4 (or perhaps, not software version-related) config sync issue:

I think the scenario you faced was just unfortunate [a fluke - EG]. Of course, it is certainly a best practice to delete the persistent database before upgrading a pair of WLC, so if you plan to upgrade another pair in the future you can follow the same steps to make sure everything will go smoothly.


39 Replies

Leo Laohoo
Hall of Fame

Was the upgrade performed using ISSU?

Leo Laohoo
Hall of Fame

@eglinsky2012 wrote:
*Aug 21 20:29:41.154: %IOSXE-0-PLATFORM: Chassis 2 R0/0: kernel: mce: [Hardware Error]: CPU 0: Machine Check: 0 Bank 9: ee2000000003110a
*Aug 21 20:29:41.154: %IOSXE-0-PLATFORM: Chassis 2 R0/0: kernel: mce: [Hardware Error]: TSC 0 ADDR ff007f00 MISC 228aa040101086 
*Aug 21 20:29:41.154: %IOSXE-0-PLATFORM: Chassis 2 R0/0: kernel: mce: [Hardware Error]: PROCESSOR 0:50654 TIME 1724272172 SOCKET 0 APIC 0 microcode 2006b06
*Aug 21 20:29:41.154: %IOSXE-0-PLATFORM: Chassis 2 R0/0: kernel: mce: [Hardware Error]: CPU 0: Machine Check: 0 Bank 10: ee2000000003110a
*Aug 21 20:29:41.154: %IOSXE-0-PLATFORM: Chassis 2 R0/0: kernel: mce: [Hardware Error]: TSC 0 ADDR ff007fc0 MISC 228aa040101086 
*Aug 21 20:29:41.154: %IOSXE-0-PLATFORM: Chassis 2 R0/0: kernel: mce: [Hardware Error]: PROCESSOR 0:50654 TIME 1724272172 SOCKET 0 APIC 0 microcode 2006b06
*Aug 21 20:29:41.154: %IOSXE-0-PLATFORM: Chassis 2 R0/0: kernel: mce: [Hardware Error]: CPU 0: Machine Check: 0 Bank 11: ee2000000003110a
*Aug 21 20:29:41.154: %IOSXE-0-PLATFORM: Chassis 2 R0/0: kernel: mce: [Hardware Error]: TSC 0 ADDR ff007f80 MISC 228aa040101086 
*Aug 21 20:29:41.154: %IOSXE-0-PLATFORM: Chassis 2 R0/0: kernel: mce: [Hardware Error]: PROCESSOR 0:50654 TIME 1724272172 SOCKET 0 APIC 0 microcode 2006b06

Can WLC-2 be cold-rebooted?

NOTE:  MCE stands for "Machine Check Exception".

@Leo Laohoo This was a normal install-mode upgrade by GUI, not an ISSU upgrade.

I have not tried power cycling yet. I want to get TAC to look at it first in case they need to pull any logs or debugs off it. I'll be opening a new case for it tomorrow and will update after things progress. This WLC pair isn't in use yet (no APs joined), so no big deal in the meantime.

Not related, but was the ROMMON upgraded to 17.12(2r)?

@Leo Laohoo Yes, it was, several weeks back while still on 17.9.4a software.
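For anyone wanting to verify their own ROMMON level, I believe these commands show it (double-check the syntax on your hardware/release):

show rom-monitor chassis active r0     ! ROMMON version on the active chassis
show rom-monitor chassis standby r0    ! same for the standby
show platform                          ! also lists a firmware version per slot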

Nice!

Let us know if WLC-2 can be cold rebooted. I think this could be the answer to the issue. 

marce1000
VIP

 

From your attachment (file):
>... %IOSXE-0-PLATFORM: Chassis 2 R0/0: kernel: mce: [Hardware Error]: CPU 0: Machine Check: 0 Bank 9: ee2000000003110a
*Aug 21 20:29:41.154: %IOSXE-0-PLATFORM: Chassis 2 R0/0: kernel: mce: [Hardware Error]: TSC 0 ADDR ff007f00 MISC 228aa040101086

- Errors like that are very worrisome and point to hardware problems on the controller that reported them; it should certainly be forwarded to TAC too!
The following commands can be useful (these are CLI commands only, not links):
    show logging profile hardware-diagnostics
    show facility-alarm status
    show platform hardware slot R0 led status
    show platform hardware slot R0 alarms visual
    show platform software system all
    show platform resources
    show environment chassis active r0
    show environment
    show environment summary
    show platform hardware slot R0 dram statistics
    show logging onboard dram
    show logging onboard slot 0 dram
    show logging onboard slot 0 uptime
    show logging onboard slot 0 voltage
 M.

-- Each morning when I wake up and look into the mirror I always say 'Why am I so brilliant?'
    The mirror then always responds with 'The only thing that exceeds your brilliance is your beauty!'

Rich R
VIP

We first reported those errors to TAC in 2021.  They even RMA'd a 9800-80 for EFA because of them.  Then after about 6 months of "investigation" by the BU: "I just wanted to inform you that the BU is still checking this issue, nevertheless they confirmed that the error messages you are seeing are just cosmetic and there is no impact on the WLC operations."
then: "We filed this bug: https://bst.cloudapps.cisco.com/bugsearch/bug/CSCwa98628
Since it’s a cosmetic bug it will take some time to have a fix on it."
CSCwa98628 is dup'd to https://bst.cloudapps.cisco.com/bugsearch/bug/CSCvy53719

As you can see, there are already 112 TAC cases attached to those 2 bugs since we originally raised it, but the BU apparently has no intention of fixing the issue (which should be really easy to fix, right?). It's very irritating because those errors cause critical alerts on the GUI after a reboot/upgrade, but you can just clear and ignore them! So the original problem has nothing to do with those errors.

Back to the original problem - I think the root cause is:
Aug 21 20:34:32.931: %SPA_OIR-6-OFFLINECARD: SPA (C9800-2X40GE) offline in subslot 0/1
Aug 21 20:34:32.936: %IOSXE_OIR-6-INSCARD: Card (fp) inserted in slot F0TSM Hook for PRE PLUGIN ANALYZE failed for slot/bay (00), status = 17
Aug 21 20:34:33.402: %IOSXE_OIR-3-SPA_INTF_ID_ALLOC_FAILED: Failed to allocate interface identifiers forSPA(BUILT-IN-6X10G/2X1G) in slot/bay: 0/0TSM Hook for PRE PLUGIN ANALYZE failed for slot/bay (01), status = 17
Aug 21 20:34:33.428: %IOSXE_OIR-3-SPA_INTF_ID_ALLOC_FAILED: Failed to allocate interface identifiers forSPA(C9800-2X40GE) in slot/bay: 0/1
So the hardware does not match and therefore the config cannot match - that is the cause of "out of sync".
The only bug which looks similar is https://bst.cloudapps.cisco.com/bugsearch/bug/CSCwj07316 but that is only on routers (according to bug DB) and is fixed in 17.12.4 <smile>
I suspect a power cycle will solve it.
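If the power cycle does bring it back, I'd still sanity-check that both chassis enumerate the same hardware before trusting the sync (standard exec commands, nothing special):

show platform      ! slots/subslots and their state
show inventory     ! PIDs/serials, including the built-in ports and the C9800-2X40GE module
show chassis       ! both members present with the expected roles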

jasonm002
Level 1

Looks like you're headed for an RMA with this one. I'd TAC it and note what Rich said about the interface error messages on boot.

Also note that 17.12.4 has a regression that will crash most ax/6/6E APs. There is an APSP1 available for it, but it's not on the public downloads page yet, so ask TAC for it. See https://bst.cloudapps.cisco.com/bugsearch/bug/CSCwj77042

@jasonm002 Thank you for bringing that up! I'm not going to pull the trigger on 17.12.4 in production in the immediate future since the original issue at hand has not been resolved. Not much to report on that. TAC told me last Monday to delete the persistent binary config files from both active and standby, reboot both units, and compare the running config on both. I got delayed by other issues but did that Friday afternoon and sent the config files for analysis. The files seemed to match (except for the RMI IPs, and the active having the shared IP while the standby did not, which seems normal). I asked TAC how to proceed but still haven't heard back, so I followed up again this morning (Tuesday). We'll see what happens.
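If anyone wants to run the same comparison, a generic way to pull the configs is just an output redirect on each unit and a diff offline (a sketch, not the exact steps TAC gave me):

show running-config | redirect bootflash:run-cfg.txt   ! save a copy of the running config
copy bootflash:run-cfg.txt tftp:                       ! copy it off-box, then diff the two files side by side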

@Rich R, @Leo Laohoo, @marce1000 - I forgot to update on this sooner, but both units reverted to 17.9.4a overnight after the failed upgrade attempt. I know this is because I never committed 17.12.4. I never power-cycled them, but they re-synced successfully once back on 17.9.4a and remained stable for over a week until I did the aforementioned persistent binary config deletion and subsequent reboot, after which they once again synced successfully. I suspect TAC will have me try 17.12.4 again, and hopefully the persistent binary config clear will allow it to work this time, but I'm awaiting instructions from them.
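For anyone who hits the same overnight rollback: that is the expected install-mode behavior when the new image is activated but never committed, so after a successful activation you would normally confirm and commit it, roughly:

show install summary     ! the new image shows as activated but uncommitted (state "U")
install commit           ! make 17.12.4 permanent so the uncommitted activation doesn't get rolled back
show install summary     ! state should now read activated & committed ("C")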

I'll be very interested in the outcome because I'm poised to upgrade to 17.12.4 to resolve a bug which has crashed one of mine 6 times in the last week.  Fortunately HA-SSO worked correctly so no noticeable impact to APs or clients.

Frankly I'm dubious that deleting the binary config (which file is that by the way?) will make any difference. 

It actually looks suspiciously like the ill-fated 17.4.1r ROMMON https://bst.cloudapps.cisco.com/bugsearch/bug/CSCvz25229 which the BU eventually withdrew after not believing for months that it bricked 9800-40/80 WLCs.  Questions about how much it had been tested before release got ambiguous and evasive answers ...

I've got memory leak problems with 17.12.3 and we are preparing to move to 17.12.4.

The memory leak in the control plane is due to DNA Spaces (CSCwj93876) and TAC/developers are prepared to release an SMU for 17.12.4; however, I will not upgrade now and will wait for the SMU to be released in three weeks' time. This bug is not just present in 17.12.3 but also in 17.12.4.

Same goes for an APSP to fix an issue where our APs are continuously spamming the logs of our switches with "duplex mismatch" errors (CSCwj66264).
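When the SMU and APSP are released, applying them should be the usual install-mode workflow (a generic sketch; the actual filenames will differ):

install add file bootflash:<smu-or-apsp-file> activate commit   ! one-shot add/activate/commit
show install summary                                            ! verify it shows as committed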


@Rich R wrote:
Is there a bug ID for the duplex issue @Leo Laohoo ?

It is CSCwj66264.
