
9800-80 active and standby configuration out of sync - 17.12.4

eglinsky2012
Spotlight

I am attempting to upgrade our 9800-80 HA pairs to version 17.12.4 to resolve a certain bug. The upgrade from 17.9.5 went smoothly in the lab, but upgrading a pre-production pair from 17.9.4a has resulted in the standby being in a boot loop with the following message which occurs after a successful bulk sync:

Chassis 2 reloading, reason - Active and Standby configuration out of sync

This was a normal install-mode upgrade by GUI, not an ISSU upgrade.

I have notified TAC, but meanwhile, has anyone else experienced this, and does anyone know of a resolution?

I have attached the console log from the bootup sequence to the point at which the sync issue and reboot occur.
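For anyone comparing their own captures, here is a minimal Python sketch that counts how often each chassis logged this reload reason in a saved console log. The regular expression is built only from the single message quoted above, so it is an assumption that other releases word the line the same way. The function name `count_out_of_sync_reloads` is hypothetical, not part of any Cisco tooling.

```python
import re
from collections import Counter

# Pattern derived from the one message quoted in this thread.
# Assumption: the wording is identical in your release and log capture.
OUT_OF_SYNC = re.compile(
    r"Chassis (\d+) reloading, reason - "
    r"Active and Standby configuration out of sync"
)


def count_out_of_sync_reloads(log_text: str) -> Counter:
    """Count out-of-sync reload messages per chassis number in log_text."""
    return Counter(int(m.group(1)) for m in OUT_OF_SYNC.finditer(log_text))
```

To use it, read the saved console capture into a string and pass it in, for example `count_out_of_sync_reloads(open("console.log", errors="replace").read())`, where `console.log` is a placeholder file name. A count above one for the standby chassis would be consistent with the boot loop described here.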

39 Replies


@Rich R wrote:

Frankly I'm dubious that deleting the binary config (which file is that by the way?) will make any difference. 


The instructions were to run the following commands on both the active and the standby. I was able to do this via the console port since we have a console server; otherwise, they suggested SSH to the RMI IP.

 

delete /force /recursive bootflash:.dbpersist/persistent-config.tar.gz

delete /force /recursive bootflash:.dbpersist/persistent-config.meta-

 

Then reload the stack, both units together ("reload" command).

Thanks for the info about CSCwj77042, @jasonm002. I will ask for that before going for 17.12.4.

PS: I've already queried why some of the fixes in 17.9.5 APSP5 are missing from 17.12.4! And I've already confirmed that some are fixed but they haven't updated the bug database, which seems to be a fairly frequent thing these days!

Leo Laohoo
Hall of Fame

Just want to add that we've also hit this bug when control-plane memory utilization is north of 45%: CSCwi78109

We've observed this bug to be present in 17.9.X and 17.12.3.

 

Yes indeed... I had that one pointed out to me by our SE because they have published a SMU for CSCwi78109 on 17.12.4. I tried testing the SMU on a 9800-CL in the lab on Monday and it left the WLC unbootable! I recovered it from the console by reverting to the golden image, then deleting the SMU, then clearing the install state! I haven't had a chance to try again, or to try it on a 9800-80, but decided to just use the workaround since we don't need nmsp enabled. So proceed cautiously with that SMU.

Thanks for the tip!

eglinsky2012
Spotlight

@Leo Laohoo, @Rich R, where does this leave us? Is 17.12.4 worth a try, with the CSCwi78109 SMU and APSP1 for CSCwj77042 applied? Any other known debilitating issues?

I flat-out asked TAC what to do, since 17.9.5/APSP5 seems to have made the CSCwj45141 / CSCwk48338 issue worse. We've even had to reboot a bunch of 2800s over the last couple weeks.

Yet we have all these issues on 17.12. I'm flat-out scared of either 17.12.3 or 17.12.4 at this point and seriously contemplating going back to 17.9.4a/APSP8. We had to reboot our high-density 9100 series on a schedule, otherwise we didn't really have any other client-affecting issues.


@eglinsky2012 wrote:
is 17.12.4 worth a try

Unfortunately, we do not have a choice. 

At the end of the day, it all boils down to poor coding and nonexistent quality control. And both factors are outside our control.

 


@eglinsky2012 wrote:
We've even had to reboot a bunch of 2800s over the last couple weeks.

If this works, a daily/weekly reboot of the 2800/3800/4800/1560 (a cold reboot is better) and a bi-yearly/yearly reboot of the controllers would be ideal. At the very best, the bugs become "familiar" and everyone has a known method to perform the workaround. Going to 17.12.X is going to be a big risk because everyone will have to "help Cisco find bugs".

Agreed - I'm working on the assumption that 17.12.4 is the best of the bad lot at present.
We've had 91xx 5GHz radios silently stop responding (no errors, no logs, WLC still thinks the radio is up and working but zero clients) on 17.9.4 APSP6, requiring reboots, so you can't win either way.

I read the daily bug reports and cry myself to sleep.  

 

@Leo Laohoo - I do the same for my government's tax bills - LOL!

 M.



-- Each morning when I wake up and look into the mirror I always say ' Why am I so brilliant ? '
    When the mirror will then always respond to me with ' The only thing that exceeds your brilliance is your beauty! '


@Rich R wrote:
We've had 91xx  5GHz radios silently stop responding (no errors, no logs, WLC still thinks the radio is up and working but zero clients) 

Smells like CSCvx56223.

Yes except that CSCvx56223 *should* be fixed in 17.9.4 ...

eglinsky2012
Spotlight

@Rich R @Leo Laohoo A couple updates.

Still haven't heard back from TAC on next steps, but I proceeded upgrading the pre-production controller pair from 17.9.4a to 17.9.5/APSP5/the first 3 published SMUs (I now see that there are 2 more I wasn't aware of). Then I went to 17.12.4 and the CSCwi78109 NVGEN error SMU. No more config sync issues so far on that one, however, there were a couple surprises:

 

1. Apparently I also forgot to commit 17.12.4 in the lab. When I went to install the SMU there, I found it back on 17.9.5. Upgrading again to 17.12.4 yielded a familiar issue: Standby rebooting due to config sync issue! Just like the other, pre-production controller. That didn't happen after the first time I upgraded it, at least not that I realized... perhaps I missed it. Anyway, I issued the below commands on both the active and the standby (in the very short window between CLI availability and the config sync issue/reboot on the standby):

delete /force /recursive bootflash:.dbpersist/persistent-config.tar.gz

delete /force /recursive bootflash:.dbpersist/persistent-config.meta-

... then rebooted the active while the standby was also starting to reboot, and the units synced successfully upon booting back up together. However, after rebooting the units for the SMU (more on that below), the standby once again reloaded due to "Active and Standby configuration out of sync". I did not intervene this time. When the standby started rebooting, it rebooted again early in the boot process ("system requested reload"), after the chassis discovery. (See the attached abridged console output, from the initial config sync issue to the second reboot.) After the second reboot, they once again synced and stayed running. I'll see if it remains stable overnight and do some test reloads tomorrow to see if they stay stable.

 

2. The CSCwi78109 NVGEN error SMU for 17.12.4 is NOT hitless, it's a reload SMU! The software downloads page lies:

[Screenshot: eglinsky2012_0-1726173974532.png]

Per the WLC, it is reload (and my WLCs did in fact reload):

[Screenshot: eglinsky2012_1-1726174091800.png]


@eglinsky2012 wrote:
2. The ... SMU for 17.12.4 is NOT hitless, 

1.  More evidence that developers do not test their code.

2.  Always assume an SMU is NEVER "hitless".

When you tested the lab WLC, did you watch the bootup on the console?


@Leo Laohoo wrote:

When you tested the lab WLC, did you see the bootup in console?  

Yes, that snippet I attached in my previous message was from the console. I can provide the full output from active and standby if there's interest, but it's much the same as the one I posted at the beginning of this thread.
