9800-80 active and standby configuration out of sync - 17.12.4

eglinsky2012
Level 4

I am attempting to upgrade our 9800-80 HA pairs to version 17.12.4 to resolve a certain bug. The upgrade from 17.9.5 went smoothly in the lab, but upgrading a pre-production pair from 17.9.4a has resulted in the standby being stuck in a boot loop with the following message, which appears after a successful bulk sync:

Chassis 2 reloading, reason - Active and Standby configuration out of sync

This was a normal install-mode upgrade by GUI, not an ISSU upgrade.

I have notified TAC, but meanwhile, has anyone else experienced this and know of a resolution?

I have attached the console log from the bootup sequence to the point at which the sync issue and reboot occur.
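For reference, the usual IOS-XE commands for watching the HA/sync state from the active unit are below (nothing exotic, and the exact output varies by release):

show chassis                               ! chassis members, roles, and current state
show redundancy                            ! SSO progression and peer status
show redundancy config-sync failures mcl   ! any mismatched commands blocking the sync
show logging | include SYNC|RELOAD         ! recent sync and reload events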

2 Accepted Solutions


@Rich R wrote:

Frankly I'm dubious that deleting the binary config (which file is that by the way?) will make any difference. 


The instructions were to run the following commands on both the active and the standby. I was able to do this via the console port since we have a console server; otherwise, they suggested SSH to the RMI IP.

 

delete /force /recursive bootflash:.dbpersist/persistent-config.tar.gz

delete /force /recursive bootflash:.dbpersist/persistent-config.meta-

 

Then reload the stack, both units together ("reload" command).
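For completeness, the full sequence on each unit looked roughly like this. The second filename above appears truncated here, so list the directory first and delete whatever persistent-config files are actually present:

dir bootflash:.dbpersist/           ! confirm which persistent-config files exist
delete /force /recursive bootflash:.dbpersist/persistent-config.tar.gz
! repeat the delete for the .meta file(s) shown in the dir output
reload                              ! once both units are cleaned, reload them together
! after bootup:
show redundancy                     ! confirm the peer reaches STANDBY HOT (SSO)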


eglinsky2012
Level 4

@Rich R, @Leo Laohoo - APSP2 for 17.12.4 is now out: https://software.cisco.com/download/home/286321396/type/286325254/release/17.12.4

Not sure what happened to APSP1... and I didn't even get an email for this release, even though I've checked and re-checked my email notifications for each category of 9800 updates! Grrr.

In other news, here's the final verdict from TAC on my 17.12.4 (or perhaps, not software version-related) config sync issue:

I think the scenario you faced was just unfortunate [a fluke - EG]. Of course, it is certainly a best practice to delete the persistent database before upgrading a pair of WLC, so if you plan to upgrade another pair in the future you can follow the same steps to make sure everything will go smoothly.


39 Replies

Leo Laohoo
Hall of Fame

Was the upgrade performed using ISSU?

Leo Laohoo
Hall of Fame

@eglinsky2012 wrote:
*Aug 21 20:29:41.154: %IOSXE-0-PLATFORM: Chassis 2 R0/0: kernel: mce: [Hardware Error]: CPU 0: Machine Check: 0 Bank 9: ee2000000003110a
*Aug 21 20:29:41.154: %IOSXE-0-PLATFORM: Chassis 2 R0/0: kernel: mce: [Hardware Error]: TSC 0 ADDR ff007f00 MISC 228aa040101086 
*Aug 21 20:29:41.154: %IOSXE-0-PLATFORM: Chassis 2 R0/0: kernel: mce: [Hardware Error]: PROCESSOR 0:50654 TIME 1724272172 SOCKET 0 APIC 0 microcode 2006b06
*Aug 21 20:29:41.154: %IOSXE-0-PLATFORM: Chassis 2 R0/0: kernel: mce: [Hardware Error]: CPU 0: Machine Check: 0 Bank 10: ee2000000003110a
*Aug 21 20:29:41.154: %IOSXE-0-PLATFORM: Chassis 2 R0/0: kernel: mce: [Hardware Error]: TSC 0 ADDR ff007fc0 MISC 228aa040101086 
*Aug 21 20:29:41.154: %IOSXE-0-PLATFORM: Chassis 2 R0/0: kernel: mce: [Hardware Error]: PROCESSOR 0:50654 TIME 1724272172 SOCKET 0 APIC 0 microcode 2006b06
*Aug 21 20:29:41.154: %IOSXE-0-PLATFORM: Chassis 2 R0/0: kernel: mce: [Hardware Error]: CPU 0: Machine Check: 0 Bank 11: ee2000000003110a
*Aug 21 20:29:41.154: %IOSXE-0-PLATFORM: Chassis 2 R0/0: kernel: mce: [Hardware Error]: TSC 0 ADDR ff007f80 MISC 228aa040101086 
*Aug 21 20:29:41.154: %IOSXE-0-PLATFORM: Chassis 2 R0/0: kernel: mce: [Hardware Error]: PROCESSOR 0:50654 TIME 1724272172 SOCKET 0 APIC 0 microcode 2006b06

Can WLC-2 be cold-rebooted?

NOTE:  MCE stands for "Machine Check Exception".

@Leo Laohoo This was a normal install-mode upgrade by GUI, not an ISSU upgrade.

I have not tried power cycling yet. I want to get TAC to look at it first in case they need to pull any logs or debugs off it. I'll be opening a new case for it tomorrow and will update after things progress. This WLC pair isn't in use yet (no APs joined), so no big deal in the meantime.

Not related, but was the ROMMON upgraded to 17.12(2r)?

@Leo Laohoo Yes, it was, several weeks back while still on 17.9.4a software.
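For anyone wanting to verify their own ROMMON level, I believe these commands show it (double-check the syntax on your hardware/release):

show rom-monitor chassis active r0     ! ROMMON version on the active chassis
show rom-monitor chassis standby r0    ! same for the standby
show platform                          ! also lists a firmware version per slot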

Nice!

Let us know if WLC-2 can be cold rebooted. I think this could be the answer to the issue. 

marce1000
VIP

 

From your attachment (file):
>... %IOSXE-0-PLATFORM: Chassis 2 R0/0: kernel: mce: [Hardware Error]: CPU 0: Machine Check: 0 Bank 9: ee2000000003110a
*Aug 21 20:29:41.154: %IOSXE-0-PLATFORM: Chassis 2 R0/0: kernel: mce: [Hardware Error]: TSC 0 ADDR ff007f00 MISC 228aa040101086

- Errors like that are very worrisome and point to hardware problems on the controller that reported them; it should certainly be forwarded to TAC too!
The following commands can be useful (these are CLI commands only, not links):
    show logging profile hardware-diagnostics
    show facility-alarm status
    show platform hardware slot R0 led status
    show platform hardware slot R0 alarms visual
    show platform software system all
    show platform resources
    show environment chassis active r0
    show environment
    show environment summary
    show platform hardware slot R0 dram statistics
    show logging onboard dram
    show logging onboard slot 0 dram
    show logging onboard slot 0 uptime
    show logging onboard slot 0 voltage
 M.

-- Each morning when I wake up and look into the mirror I always say 'Why am I so brilliant?'
    The mirror then always responds with 'The only thing that exceeds your brilliance is your beauty!'

Rich R
VIP

We first reported those errors to TAC in 2021.  They even RMA'd a 9800-80 for EFA because of them.  Then after about 6 months of "investigation" by the BU: "I just wanted to inform you that the BU is still checking this issue, nevertheless they confirmed that the error messages you are seeing are just cosmetic and there is no impact on the WLC operations."
then: "We filed this bug: https://bst.cloudapps.cisco.com/bugsearch/bug/CSCwa98628
Since it’s a cosmetic bug it will take some time to have a fix on it."
CSCwa98628 is dup'd to https://bst.cloudapps.cisco.com/bugsearch/bug/CSCvy53719

As you can see, there are already 112 TAC cases attached to those 2 bugs since we originally raised it, but the BU apparently has no intention of fixing the issue (which should be really easy to fix, right?). It's very irritating because those errors cause critical alerts on the GUI after a reboot/upgrade, but you can just clear and ignore them! So the original problem has nothing to do with those errors.

Back to the original problem - I think the root cause is:
Aug 21 20:34:32.931: %SPA_OIR-6-OFFLINECARD: SPA (C9800-2X40GE) offline in subslot 0/1
Aug 21 20:34:32.936: %IOSXE_OIR-6-INSCARD: Card (fp) inserted in slot F0TSM Hook for PRE PLUGIN ANALYZE failed for slot/bay (00), status = 17
Aug 21 20:34:33.402: %IOSXE_OIR-3-SPA_INTF_ID_ALLOC_FAILED: Failed to allocate interface identifiers forSPA(BUILT-IN-6X10G/2X1G) in slot/bay: 0/0TSM Hook for PRE PLUGIN ANALYZE failed for slot/bay (01), status = 17
Aug 21 20:34:33.428: %IOSXE_OIR-3-SPA_INTF_ID_ALLOC_FAILED: Failed to allocate interface identifiers forSPA(C9800-2X40GE) in slot/bay: 0/1
So the hardware does not match and therefore the config cannot match - that is the cause of "out of sync".
The only bug which looks similar is https://bst.cloudapps.cisco.com/bugsearch/bug/CSCwj07316 but that is only on routers (according to bug DB) and is fixed in 17.12.4 <smile>
I suspect a power cycle will solve it.
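If the power cycle does bring it back, I'd still sanity-check that both chassis enumerate the same hardware before trusting the sync (standard exec commands, nothing special):

show platform      ! slots/subslots and their state
show inventory     ! PIDs/serials, including the built-in ports and the C9800-2X40GE module
show chassis       ! both members present with the expected roles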

jasonm002
Level 1

Looks like you're headed for an RMA with this one. I'd TAC it and note what Rich said about the interface error messages on boot.

Also note that 17.12.4 has a regression that will crash most ax/6/6E APs. There is an APSP1 available for it, but it's not on the public downloads page yet, so ask TAC for it. See https://bst.cloudapps.cisco.com/bugsearch/bug/CSCwj77042

@jasonm002 Thank you for bringing that up! I'm not going to pull the trigger on 17.12.4 in production in the immediate future since the original issue at hand has not been resolved. Not much to report on that. TAC told me last Monday to delete the persistent binary config files from both active and standby, reboot both units, and compare the running config on both. I got delayed by other issues but did that Friday afternoon and sent the config files for analysis. The files seemed to match (except for the RMI IPs, and the active having the shared IP while the standby did not, which seems normal). I asked TAC how to proceed but still haven't heard back, so I followed up again this morning (Tuesday). We'll see what happens.
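If anyone wants to run the same comparison, a generic way to pull the configs is just an output redirect on each unit and a diff offline (a sketch, not the exact steps TAC gave me):

show running-config | redirect bootflash:run-cfg.txt   ! save a copy of the running config
copy bootflash:run-cfg.txt tftp:                       ! copy it off-box, then diff the two files side by side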

@Rich R, @Leo Laohoo, @marce1000 - I forgot to update on this sooner, but both units reverted to 17.9.4a overnight after the failed upgrade attempt. I know this is because I never committed 17.12.4. I never power-cycled them, but they re-synced successfully once back on 17.9.4a and remained stable for over a week until I did the aforementioned persistent binary config deletion and subsequent reboot, after which they once again synced successfully. I suspect TAC will have me try 17.12.4 again, and hopefully the persistent binary config clear will allow it to work this time, but I'm awaiting instructions from them.
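For anyone who hits the same overnight rollback: that is the expected install-mode behavior when the new image is activated but never committed, so after a successful activation you would normally confirm and commit it, roughly:

show install summary     ! the new image shows as activated but uncommitted (state "U")
install commit           ! make 17.12.4 permanent so the uncommitted activation doesn't get rolled back
show install summary     ! state should now read activated & committed ("C")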

I'll be very interested in the outcome because I'm poised to upgrade to 17.12.4 to resolve a bug which has crashed one of mine 6 times in the last week.  Fortunately HA-SSO worked correctly so no noticeable impact to APs or clients.

Frankly I'm dubious that deleting the binary config (which file is that by the way?) will make any difference. 

It actually looks suspiciously like the ill-fated 17.4.1r ROMMON https://bst.cloudapps.cisco.com/bugsearch/bug/CSCvz25229 which the BU eventually withdrew after not believing for months that it bricked 9800-40/80 WLCs.  Questions about how much it had been tested before release got ambiguous and evasive answers ...

I've got memory leak problems with 17.12.3 and we are preparing to move to 17.12.4.

The memory leak in the control plane is due to DNA Spaces (CSCwj93876) and TAC/developers are prepared to release an SMU for 17.12.4; however, I will not upgrade now and will wait for the SMU to be released in three weeks' time. This bug is not just present in 17.12.3 but also in 17.12.4.

Same goes for an APSP to fix an issue where our APs are continuously spamming the logs of our switches with "duplex mismatch" errors (CSCwj66264).
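When the SMU and APSP are released, applying them should be the usual install-mode workflow (a generic sketch; the actual filenames will differ):

install add file bootflash:<smu-or-apsp-file> activate commit   ! one-shot add/activate/commit
show install summary                                            ! verify it shows as committed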


@Rich R wrote:
Is there a bug ID for the duplex issue @Leo Laohoo ?

It is CSCwj66264.
