05-11-2021 03:59 AM - edited 07-05-2021 01:17 PM
Hi board,
assume you want to replace a failed cluster member in a WLC C9800 SSO cluster with RMI (SW version 17.3.3). What is the preferred way to do it?
I tried the following and it was a disaster:
Assuming chassis 2 has failed completely and its replacement needs to be integrated into the cluster:
1.) Make sure the SW version on the new chassis is the same (install mode)
2.) Cable the new factory default chassis to the network (channel uplink and RP)
3.) Assign the correct chassis number to the new chassis
# Exec mode
chassis 2 priority 1
reload
==> Chassis boots up with chassis number 2
4.) Base SSO configuration:
interface Vlan<MANAGEMENT-VLAN-ID>
 ip address 192.168.0.1 255.255.255.0
 no shutdown
!
redun-management interface Vlan<MANAGEMENT-VLAN-ID> chassis 1 address <RMI-IPv4-CHASSIS1> chassis 2 address <RMI-IPv4-CHASSIS2>
!
end
!
write memory
After this, the following log came up on the new (factory-default) chassis:
WLC#wr
Building configuration...
[OK]
WARNING: Reload HA Chassis for RMI configuration to take effect
WLC#
*May 11 10:46:16.688: %SYS-6-PRIVCFG_ENCRYPT_SUCCESS: Successfully encrypted private config file
Chassis 2 reloading, reason - stack merge
May 11 10:46:20.850: %PMAN-5-EXITACTION: C0/0: pvp: Process manager is exiting:
May 11 10:46:21.194: %PMAN-5-EXITACTION: F0/0: pvp: Process manager is exiting:
May 11 10:46:42.457: %PMAN-5-EXITACTIONvp: Process manager is exiting: process exit with reload fru code
May 11 10:46:54.985: %PMAN-3-PROCESS_NOTIFICATION: R0/0: pvp: System report core/WLC_2_RP_0-system-report_20210511-104649-UTC.tar.gz (size: 12529 KB) generated and System report info at core/WLC_2_RP_0-system-report_20210511-104649-UTC-info.txt
Initializing Hardware ...
==> Chassis 2 reboots
However (and that's the problem), chassis 1 reboots as well!
==> Wireless service disruption, because both chassis are booting at the same time.
I would expect only chassis 2 to reboot and integrate itself into the cluster...
Am I doing something wrong here, or might I have hit a bug?
05-11-2021 05:50 AM
Yes, it's expected that the existing primary/active WLC reboots as well during the first HA pairing, so the setup should be performed in a change window. Replacing a unit with a new or different WLC is similar to the initial pairing; otherwise it will get stuck in maintenance mode, which requires manual intervention (i.e., rebooting both WLCs at the same time) anyway.
05-11-2021 07:41 AM
Wow ... needing a maintenance window to replace a failed chassis .... .... .. I lost my faith in current products ....
Thank you for the answer!!!
05-11-2021 07:56 AM
Just out of curiosity, did you verify that chassis 1 had its priority set to 2? If I recall correctly, chassis 2 should come up and become the standby, and then chassis 1 should reboot and chassis 2 should become active with no interruptions. There is a chance that chassis 2 could restart, as happened in your case, but it is not supposed to.
You can also verify or test this by bringing up a couple of 9800-CLs and seeing what happens.
05-12-2021 12:32 AM
Hey Scott,
from my point of view, the priorities are only relevant in the election process, when both WLCs are booting.
But in my case, I set the priorities as recommended in the HA paper:
myWLC#show chassis
Chassis/Stack Mac Address : f4bd.abcd.f660 - Local Mac Address
Mac persistency wait time: Indefinite
Local Redundancy Port Type: Twisted Pair
                                             H/W   Current
Chassis#   Role    Mac Address     Priority Version  State                 IP
-------------------------------------------------------------------------------------
*1       Active   f4bd.abcd.f660     2      V02     Ready                169.254.54.130
 2       Standby  f4bd.abcd.f5a0     1      V02     Ready                169.254.54.131
Either way: if my chassis#1 had failed and I replaced it (with prio: 2), I would not expect the new chassis#1 to take over. I would expect it to integrate as the secondary.
The SSO paper has a nice list of how the active WLC is chosen:
1. The wireless controller that is currently the active wireless controller.
2. The wireless controller with the highest priority value.
3. The wireless controller with the shortest start-up time.
4. The wireless controller with the lowest MAC address.
So based on the list, the currently active WLC should keep its role in any case (unless it fails).
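To make that reading of the list concrete, here is a minimal Python sketch that applies the four rules in order as a sort key. It is only a model of the list quoted from the SSO paper, not of what IOS-XE actually does during a stack merge (as this thread shows, real behavior can differ). The `Controller` class, its field names, and the start-up times in the example are hypothetical; the MAC addresses are the ones from my `show chassis` output.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Controller:
    chassis: int
    is_active: bool         # rule 1: currently the active controller
    priority: int           # rule 2: highest priority value wins
    startup_seconds: float  # rule 3: shortest start-up time wins
    mac: str                # rule 4: lowest MAC wins, e.g. "f4bd.abcd.f660"


def election_key(c: Controller):
    # Tuples compare element by element, so earlier rules dominate later ones.
    # "not is_active" sorts the active unit first; "-priority" sorts high first.
    # Comparing MACs as strings assumes they all use the same fixed-width format.
    return (not c.is_active, -c.priority, c.startup_seconds, c.mac.lower())


def elect_active(controllers):
    """Return the controller that wins under the four documented rules."""
    return min(controllers, key=election_key)


# A surviving active unit with the LOWER priority still wins via rule 1.
survivor = Controller(1, True, 1, 300.0, "f4bd.abcd.f660")
replacement = Controller(2, False, 2, 60.0, "f4bd.abcd.f5a0")
print(elect_active([survivor, replacement]).chassis)
```

If neither unit is active (both booting), rule 1 is a tie and the priority decides, which matches my point that priorities matter mainly when both WLCs boot together.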
05-12-2021 12:47 AM
05-12-2021 12:55 AM
05-11-2021 11:22 PM
https://www.cisco.com/c/dam/en/us/td/docs/wireless/controller/9800/17-1/deployment-guide/c9800-ha-sso-deployment-guide-rel-17-1.pdf
On C9800-40 and C9800-80 wireless controller, enable High Availability SSO using the following command on each of the two wireless controller units
chassis redundancy ha-interface local-ip <local IP> <local IP subnet> remote-ip <remote IP>
Reload both wireless controllers by executing the command reload from the CLI
Note: It is recommended to configure HA using the Redundancy Management Interface (RMI) starting Release 17.1. To see configuration using RMI please see the Redundancy Management Interface section.
05-12-2021 01:49 AM
I have always rebooted both WLCs as part of the initial bring-up or when adding a replacement, to avoid frustration, particularly when the RPs are connected across L2.
It appears that when the new WLC tries to add itself to the HA stack as standby-hot for the first time (election process), the existing active WLC has to reboot so that the initial sync happens at boot-up; all the other config databases are synced once both are fully booted. This initial scenario is different from a failure scenario, where both WLCs were already synced in the past. There are many documentation references for that scenario. I was unable to find a Cisco document stating that a new or replaced WLC will sync with the existing active WLC without the active unit rebooting; please point it out if you find one.
01-29-2025 04:59 PM - edited 01-29-2025 06:41 PM
Found this video which explains the replacement procedure for a hardware unit in HA setup, https://video.cisco.com/detail/video/6341318688112
Even though it is not 100% correct (the renumbering of the chassis is shown as done with the same #..) and the audio quality could be better, it seems straightforward. I'm not sure why it connects the network first and then reloads, rather than reloading and immediately connecting the network and RP. We also customised the initial config to suit our scenario: physical Ten interfaces in an LACP port-channel (configuring only the main VLAN is not enough), even though this may not be required if the config is pushed only via the RP.
There's no timestamp showing when the video was recorded; however, the version shown is 17.9.4a, which is pretty recent. Hopefully it works the same way with older versions. We have 17.6.5, which we were upgrading to 17.9.6, but we got stuck when one chassis died, courtesy of FN74160. We'll test the procedure today and let you know how it went.
UPDATE: it worked as expected, no impact. The only thing we did differently was to have the replacement unit off, connecting both network and RP and then power on. Good luck, and check if your unit is affected by this FN, both our units were (with one actually impacted).
PS: as mentioned above, our standby WLC didn't come back after warm reload triggered by ISSU.
We couldn't abort ISSU until standby unit was up, even though abort timer had expired.
After bringing the standby unit back, with the old IOS-XE (as we didn't want to proceed with upgrade at this time), we still had to abort ISSU.
01-29-2025 10:11 PM
Sorry, lost track on this one
Key point is (like @Scott Fella assumed), that the still active chassis should get a priority of "2" before integrating the new replacement chassis with priority 1:
#! Still active chassis:
chassis <ACTIVE-CHASSIS-ID> priority 2
#! Replacement chassis (factory default)
chassis <REPLACEMENT-CHASSIS-ID> priority 1
If desired for any reason, the priorities can be changed back to their previous state (before the failure) once the SSO cluster is fully functional again.
I guess that Cisco published somewhere an updated guide for this (some time after my initial post and a TAC case)
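For anyone who wants to script the priority check before plugging in the replacement, here is a rough Python sketch. It only parses text shaped like the `show chassis` output posted earlier in this thread; the function names and the regular expression are my own assumptions, not a Cisco tool, and the column layout may differ on other releases.

```python
import re

# Matches data rows such as "*1  Active  f4bd.abcd.f660  2  V02  Ready  ..."
ROW = re.compile(
    r"^\*?(?P<chassis>\d+)\s+(?P<role>\S+)\s+"
    r"(?P<mac>[0-9a-f]{4}\.[0-9a-f]{4}\.[0-9a-f]{4})\s+(?P<priority>\d+)\s",
    re.IGNORECASE,
)


def parse_show_chassis(text):
    """Return {chassis_number: {"role": str, "priority": int}} from CLI text."""
    rows = {}
    for line in text.splitlines():
        m = ROW.match(line.strip())
        if m:
            rows[int(m["chassis"])] = {
                "role": m["role"],
                "priority": int(m["priority"]),
            }
    return rows


def active_outranks_others(rows):
    """True if exactly one unit is Active and it has the highest priority."""
    active = [n for n, r in rows.items() if r["role"].lower() == "active"]
    if len(active) != 1:
        return False
    top = rows[active[0]]["priority"]
    return all(top > r["priority"] for n, r in rows.items() if n != active[0])


# Sample taken from the "show chassis" output earlier in this thread.
SAMPLE = """\
Chassis#   Role    Mac Address     Priority Version  State   IP
*1       Active   f4bd.abcd.f660     2      V02     Ready   169.254.54.130
 2       Standby  f4bd.abcd.f5a0     1      V02     Ready   169.254.54.131
"""

print(active_outranks_others(parse_show_chassis(SAMPLE)))
```

How you collect the CLI text (console capture, SSH session, and so on) is left open here on purpose.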