C9800 SSO RMI pair: Chassis hardware replacement procedure

Johannes Luther · ‎05-11-2021

Hi board,

Assume you have a C9800 WLC SSO cluster with RMI (SW version 17.3.3) and need to replace a failed cluster member. What is the preferred way to do it?

I tried the following and it was a disaster:

Assume chassis 2 has failed completely and its replacement needs to be integrated into the cluster:

1.) Make sure the SW version on the new chassis is the same (install mode)

2.) Cable the new factory default chassis to the network (channel uplink and RP)

3.) Assign the correct chassis number to the new chassis

# Exec mode
chassis 2 priority 1
reload

==> Chassis boots up with chassis number 2

4.) Base SSO configuration:

interface Vlan<MANAGEMENT-VLAN-ID>
 ip address 192.168.0.1 255.255.255.0
 no shutdown
!
redun-management interface Vlan<MANAGEMENT-VLAN-ID> chassis 1 address <RMI-IPv4-CHASSIS1> chassis 2 address <RMI-IPv4-CHASSIS2>
!
end
!
write memory

After this, the following log came up on the new (factory default) chassis:

WLC#wr
Building configuration...
[OK]

WARNING: Reload HA Chassis for RMI configuration to take effect

WLC#
*May 11 10:46:16.688: %SYS-6-PRIVCFG_ENCRYPT_SUCCESS: Successfully encrypted private config file
Chassis 2 reloading, reason - stack merge
May 11 10:46:20.850: %PMAN-5-EXITACTION: C0/0: pvp: Process manager is exiting:
May 11 10:46:21.194: %PMAN-5-EXITACTION: F0/0: pvp: Process manager is exiting:
May 11 10:46:42.457: %PMAN-5-EXITACTION: R0/0: pvp: Process manager is exiting: process exit with reload fru code
May 11 10:46:54.985: %PMAN-3-PROCESS_NOTIFICATION: R0/0: pvp: System report core/WLC_2_RP_0-system-report_20210511-104649-UTC.tar.gz (size: 12529 KB) generated and System report info at core/WLC_2_RP_0-system-report_20210511-104649-UTC-info.txt




Initializing Hardware ...

==> Chassis 2 reboot

However (and that's the problem), chassis 1 reboots as well!

=> Wireless service disruption, because both chassis are booting at the same time.

I would have expected only chassis 2 to reboot and integrate itself into the cluster...

Am I doing something wrong here, or have I hit a bug?
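
For anyone collecting console logs from such an event, the reload reason line shown in the log above can be pulled out with a few lines of Python. This is only a minimal sketch: the pattern is taken from the single line quoted in this post ("Chassis 2 reloading, reason - stack merge"), the function name is made up for illustration, and whether other releases word this line the same way is an assumption.

```python
import re

# Pattern modelled on the one line quoted in the post:
#   "Chassis 2 reloading, reason - stack merge"
# Other wordings are not covered (assumption: same format elsewhere).
RELOAD_RE = re.compile(r"Chassis (\d+) reloading, reason - (.+)")


def find_reload_reasons(log_text):
    """Return a list of (chassis_number, reason) for every reload line found."""
    return [
        (int(match.group(1)), match.group(2).strip())
        for match in RELOAD_RE.finditer(log_text)
    ]
```

Running it over the captures from both chassis would show which members logged a reload and the reason each one gave.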

saravlak · ‎05-11-2021

Yes, it's expected for the existing Primary/Active WLC to reboot as well during first-time HA pairing, so the setup should be performed in a change window. Replacing a WLC with a new or different unit is similar to initial pairing; otherwise it will get stuck in Maintenance mode, which requires manual intervention (i.e., rebooting both WLCs at the same time) anyway.

Johannes Luther · ‎05-11-2021

Wow... needing a maintenance window to replace a failed chassis... I've lost my faith in current products...

Thank you for the answer!!!

Scott Fella · ‎05-11-2021

Just out of curiosity, did you verify that chassis 1 had its priority set to 2? If I recall correctly, chassis 2 should come up and become the standby, and then chassis 1 should reboot and chassis 2 becomes active with no interruptions. Now there is a chance that chassis 2 could restart, like what happened in your case, but it's not supposed to.

You can also verify and/or test this by bringing up a couple of 9800-CLs and seeing what happens.

-Scott
*** Please rate helpful posts ***

Johannes Luther · ‎05-12-2021

Hey Scott,

From my point of view, the priorities are only relevant during the election process, when both WLCs are booting.

But in my case, I set the priorities as recommended in the HA paper:

myWLC#show chassis
Chassis/Stack Mac Address : f4bd.abcd.f660 - Local Mac Address
Mac persistency wait time: Indefinite
Local Redundancy Port Type: Twisted Pair
                                             H/W   Current
Chassis#   Role    Mac Address     Priority Version  State                 IP
-------------------------------------------------------------------------------------
*1       Active   f4bd.abcd.f660     2      V02     Ready                169.254.54.130
 2       Standby  f4bd.abcd.f5a0     1      V02     Ready                169.254.54.131
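
If you want to check the priorities on many controllers, the `show chassis` output above can be parsed with a short script. The following is a minimal sketch that assumes the column layout is exactly as pasted here (chassis number, role, MAC, priority, H/W version, single-word state, IP); the names `ChassisRow` and `parse_show_chassis` are made up for illustration.

```python
import re
from typing import NamedTuple


class ChassisRow(NamedTuple):
    number: int
    is_local: bool   # True for the row marked with "*"
    role: str
    mac: str
    priority: int
    state: str


# Matches data rows only, e.g.
#   "*1  Active  f4bd.abcd.f660  2  V02  Ready  169.254.54.130"
# Assumption: the state is a single word, as in the output above.
ROW_RE = re.compile(
    r"^\s*(?P<local>\*)?\s*(?P<num>\d+)\s+(?P<role>\S+)\s+"
    r"(?P<mac>[0-9a-f]{4}\.[0-9a-f]{4}\.[0-9a-f]{4})\s+"
    r"(?P<prio>\d+)\s+\S+\s+(?P<state>\S+)",
    re.IGNORECASE,
)


def parse_show_chassis(text):
    """Return one ChassisRow per chassis line; header lines are skipped."""
    rows = []
    for line in text.splitlines():
        m = ROW_RE.match(line)
        if m:
            rows.append(ChassisRow(
                number=int(m["num"]),
                is_local=m["local"] is not None,
                role=m["role"],
                mac=m["mac"].lower(),
                priority=int(m["prio"]),
                state=m["state"],
            ))
    return rows
```

For the output above this gives two rows: chassis 1 (local, Active, priority 2) and chassis 2 (Standby, priority 1).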

Either way: if chassis 1 had failed and I replaced it (with priority 2), I would not expect the new chassis 1 to take over. I would expect it to integrate as chassis 1 in the standby role.

The SSO paper has a nice list of how the active WLC is chosen:

1. The wireless controller that is currently the active wireless controller
2. The wireless controller with the highest priority value.
3. The wireless controller with the shortest start-up time.
4. The wireless controller with the lowest MAC Address.
4. The wireless controller with the lowest MAC Address.

So, based on the list, the currently active WLC should keep its role in any case (unless it fails).
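
To make that reading of the list concrete, here is a small Python sketch of the four rules as an ordered tie-break. It models only the ordering quoted above, not the stack-merge or reboot behaviour seen in practice; the `Controller` class and its field names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Controller:
    name: str
    currently_active: bool
    priority: int            # rule 2: highest value wins
    startup_seconds: float   # rule 3: shortest start-up time wins
    mac: str                 # rule 4: lowest MAC wins, e.g. "f4bd.abcd.f660"


def election_key(c):
    """Sort key implementing rules 1-4 in order; the smallest key wins."""
    mac_value = int(c.mac.replace(".", ""), 16)
    return (not c.currently_active, -c.priority, c.startup_seconds, mac_value)


def elect_active(controllers):
    """Return the controller that the quoted list would pick as active."""
    return min(controllers, key=election_key)
```

With this ordering, a running active unit with priority 1 still beats a freshly booted replacement with priority 2, because rule 1 is evaluated first. The priority only decides the outcome when neither unit is currently active.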

saravlak · ‎05-12-2021

There is no pre-empt functionality with SSO meaning that when the previous Active wireless controller resumes operation, it will not take back the role as an Active wireless controller but will negotiate its state with the current Active wireless controller and transition to Hot-Standby state.

https://www.cisco.com/c/dam/en/us/td/docs/wireless/controller/9800/17-1/deployment-guide/c9800-ha-sso-deployment-guide-rel-17-1.pdf

Scott Fella · ‎05-12-2021

I understand… what I was trying to call out is something that I have run into. The controller isn’t supposed to reboot, but there is always a chance that it will. For example, if you have experience with AireOS and SSO, there were problems with that too, in which the client controller could reboot itself and/or go into maintenance mode. So try to lab it out with 9800-CLs and see whether you are successful with a hardware replacement (new VM) or not.

-Scott

saravlak · ‎05-11-2021

https://www.cisco.com/c/dam/en/us/td/docs/wireless/controller/9800/17-1/deployment-guide/c9800-ha-sso-deployment-guide-rel-17-1.pdf
On C9800-40 and C9800-80 wireless controller, enable High Availability SSO using the following command on each of the two wireless controller units

chassis redundancy ha-interface local-ip <local IP> <local IP subnet> remote-ip <remote IP>

Reload both wireless controllers by executing the command reload from the CLI

Note: It is recommended to configure HA using the Redundancy Management Interface (RMI) starting Release 17.1. To see configuration using RMI please see the Redundancy Management Interface section.

saravlak · ‎05-12-2021

I have always rebooted both WLCs as part of initial bring-up or when adding a replacement, to avoid frustration, particularly when the RPs are connected across L2.
It appears that when a new WLC tries to add itself to the HA stack as standby-hot for the first time (election process), the existing ACTIVE WLC has to reboot so that the initial sync can happen at boot-up; all the other config databases are synced once it has fully booted. This initial scenario is different from a failure scenario in which both WLCs were already synced in the past. There are many doc references for this scenario. I was unable to find a Cisco doc reference stating that a new or replaced WLC will sync with the existing Active WLC without the ACTIVE rebooting; please point it out if you find one.

Feds · ‎01-29-2025

I found this video, which explains the replacement procedure for a hardware unit in an HA setup: https://video.cisco.com/detail/video/6341318688112
Even though it is not 100% correct (the chassis renumbering is shown as done with the same number) and the audio quality could be better, the procedure seems straightforward. I'm not sure why it connects the network first and then reloads, rather than reloading and immediately connecting network and RP. We also customised the initial config to suit our scenario, with physical Ten interfaces in an LACP port-channel (configuring only the main VLAN is not enough), even though this may not be required if the config is pushed only via RP.
There's no timestamp showing when the video was recorded, but the version shown is 17.9.4a, which is pretty recent. Hopefully it works the same way with older versions: we have 17.6.5, which we were upgrading to 17.9.6 when we got stuck because one chassis died, courtesy of FN74160. We'll test the procedure today and let you know how it went.

UPDATE: it worked as expected, no impact. The only thing we did differently was to keep the replacement unit powered off, connect both network and RP, and then power it on. Good luck, and check whether your unit is affected by this FN; both our units were (with one actually impacted).

PS: as mentioned above, our standby WLC didn't come back after warm reload triggered by ISSU.
We couldn't abort ISSU until standby unit was up, even though abort timer had expired.
After bringing the standby unit back, with the old IOS-XE (as we didn't want to proceed with upgrade at this time), we still had to abort ISSU.

Johannes Luther · ‎01-29-2025

Sorry, I lost track of this one. Some time ago I found a workaround procedure that prevents both chassis from rebooting when one chassis is replaced.

The key point is (as @Scott Fella assumed) that the still-active chassis should get a priority of 2 before the new replacement chassis is integrated with priority 1:

#! Still active chassis:
chassis <ACTIVE-CHASSIS-ID> priority 2
 
#! Replacement chassis (factory default)
chassis <REPLACEMENT-CHASSIS-ID> priority 1

If desired, the priorities can be changed back to their previous state (before the failure) once the SSO cluster is fully functional again.
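
For anyone scripting this, the two commands above can be generated from the chassis numbers. This is just a minimal sketch: the function name is hypothetical, it assumes a two-member pair numbered 1 and 2 as in this thread, and it only builds the command strings without connecting to any device.

```python
def priority_commands(active_chassis_id, replacement_chassis_id):
    """Build the two exec-mode commands from the workaround above.

    Assumption: a two-member SSO pair with chassis numbers 1 and 2.
    """
    ids = {active_chassis_id, replacement_chassis_id}
    if ids != {1, 2}:
        raise ValueError("expected the two distinct chassis numbers 1 and 2")
    return {
        # Run on the still-active chassis before integrating the new one
        "active": f"chassis {active_chassis_id} priority 2",
        # Run on the factory-default replacement chassis
        "replacement": f"chassis {replacement_chassis_id} priority 1",
    }
```

For example, with chassis 1 still active and chassis 2 being replaced, `priority_commands(1, 2)` returns `chassis 1 priority 2` for the active unit and `chassis 2 priority 1` for the replacement.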

I guess Cisco has published an updated guide for this somewhere (some time after my initial post and a TAC case).