Re: C9800 SSO RMI pair: Chassis hardware replacement procedure

Johannes Luther · ‎05-11-2021

Hi board,

Assume you have a C9800 WLC SSO cluster with RMI (SW version 17.3.3) and need to replace a failed cluster member. What is the preferred way to do it?

I tried the following and it was a disaster:

Assume chassis 2 has failed completely and its replacement needs to be integrated into the cluster:

1.) Make sure the SW version on the new chassis is the same (install mode)

2.) Cable the new factory default chassis to the network (channel uplink and RP)

3.) Assign the correct chassis number to the new chassis

# Exec mode
chassis 2 priority 1
reload

==> Chassis boots up with chassis number 2

4.) Base SSO configuration:

interface Vlan<MANAGEMENT-VLAN-ID>
 ip address 192.168.0.1 255.255.255.0
 no shutdown
!
redun-management interface Vlan<MANAGEMENT-VLAN-ID> chassis 1 address <RMI-IPv4-CHASSIS1> chassis 2 address <RMI-IPv4-CHASSIS2>
!
end
!
write memory
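
The base SSO configuration above can also be generated from a few parameters, which makes it easier to catch a mistyped RMI address before pasting it into the new chassis. This is only a minimal Python sketch: the function name `render_rmi_config` is hypothetical, the CLI lines are copied from the snippet above, and the check that both RMI addresses sit in the management VLAN subnet is an assumption about the intended addressing, not something the controller output confirms.

```python
import ipaddress


def render_rmi_config(vlan_id, mgmt_ip, netmask, rmi_chassis1, rmi_chassis2):
    """Render the management SVI and RMI lines shown in step 4.

    Raises ValueError if an RMI address is outside the management subnet
    (assumed requirement) or if both RMI addresses are identical.
    """
    mgmt_if = ipaddress.ip_interface(f"{mgmt_ip}/{netmask}")
    for label, addr in (("chassis 1", rmi_chassis1), ("chassis 2", rmi_chassis2)):
        if ipaddress.ip_address(addr) not in mgmt_if.network:
            raise ValueError(f"RMI address for {label} ({addr}) is not in {mgmt_if.network}")
    if rmi_chassis1 == rmi_chassis2:
        raise ValueError("RMI addresses for chassis 1 and chassis 2 must differ")

    return "\n".join([
        f"interface Vlan{vlan_id}",
        f" ip address {mgmt_ip} {netmask}",
        " no shutdown",
        "!",
        f"redun-management interface Vlan{vlan_id} "
        f"chassis 1 address {rmi_chassis1} chassis 2 address {rmi_chassis2}",
        "!",
        "end",
    ])


if __name__ == "__main__":
    # Example values only; VLAN 100 and the .2/.3 addresses are made up.
    print(render_rmi_config(100, "192.168.0.1", "255.255.255.0",
                            "192.168.0.2", "192.168.0.3"))
```

The rendered text still has to be applied and saved by hand, and as the log below shows, saving it can trigger a reload.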

After this, the following log came up on the new (factory default) chassis:

WLC#wr
Building configuration...
[OK]

WARNING: Reload HA Chassis for RMI configuration to take effect

WLC#
*May 11 10:46:16.688: %SYS-6-PRIVCFG_ENCRYPT_SUCCESS: Successfully encrypted private config file
Chassis 2 reloading, reason - stack merge
May 11 10:46:20.850: %PMAN-5-EXITACTION: C0/0: pvp: Process manager is exiting:
May 11 10:46:21.194: %PMAN-5-EXITACTION: F0/0: pvp: Process manager is exiting:
May 11 10:46:42.457: %PMAN-5-EXITACTIONvp: Process manager is exiting: process exit with reload fru code
May 11 10:46:54.985: %PMAN-3-PROCESS_NOTIFICATION: R0/0: pvp: System report core/WLC_2_RP_0-system-report_20210511-104649-UTC.tar.gz (size: 12529 KB) generated and System report info at core/WLC_2_RP_0-system-report_20210511-104649-UTC-info.txt




Initializing Hardware ...

==> Chassis 2 reboot

However (and that's the problem), chassis 1 reboots as well!

=> Wireless service disruption, because both chassis are booting at the same time.

I would have expected only chassis 2 to reboot and integrate itself into the cluster.

Am I doing something wrong here, or did I hit a bug?

saravlak · ‎05-11-2021

Yes, it's expected that the existing Primary/Active WLC reboots as well during first-time HA pairing, so the setup should be performed in a change window. Replacing a member with a new/different WLC is similar to initial pairing. Otherwise it will get stuck in Maintenance mode, which requires manual intervention (i.e., rebooting both WLCs at the same time) anyway.

Johannes Luther · ‎05-11-2021

Wow ... needing a maintenance window to replace a failed chassis ... I've lost my faith in current products.

Thank you for the answer!!!

Scott Fella · ‎05-11-2021

Just out of curiosity, did you verify that chassis 1 had its priority set to 2? If I recall correctly, chassis 2 should come up and become the standby, and then chassis 1 should reboot and chassis 2 becomes active with no interruptions. Now, there is a chance that chassis 2 could restart like what happened in your case, but it's not supposed to.

You can also verify and/or test this by bringing up a couple of 9800-CLs and seeing what happens.

-Scott
*** Please rate helpful posts ***

Johannes Luther · ‎05-12-2021

Hey Scott,

From my point of view, the priorities are only relevant in the election process, when both WLCs are booting.

But in my case, I set the priorities as recommended in the HA paper:

myWLC#show chassis
Chassis/Stack Mac Address : f4bd.abcd.f660 - Local Mac Address
Mac persistency wait time: Indefinite
Local Redundancy Port Type: Twisted Pair
                                             H/W   Current
Chassis#   Role    Mac Address     Priority Version  State                 IP
-------------------------------------------------------------------------------------
*1       Active   f4bd.abcd.f660     2      V02     Ready                169.254.54.130
 2       Standby  f4bd.abcd.f5a0     1      V02     Ready                169.254.54.131
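
To check roles and priorities in a script rather than by eye, the `show chassis` table above can be parsed. The following is a minimal Python sketch, not a Cisco tool: the function name `parse_show_chassis` and the regular expression are assumptions based only on the column layout shown above, and other releases may print the table differently.

```python
import re

# One table row: optional "*" (local chassis), chassis number, role, MAC,
# priority, H/W version, state, and an optional IP column.
ROW = re.compile(
    r"^\s*(?P<local>\*)?\s*(?P<chassis>\d+)\s+(?P<role>\S+)\s+"
    r"(?P<mac>[0-9a-fA-F]{4}(?:\.[0-9a-fA-F]{4}){2})\s+"
    r"(?P<priority>\d+)\s+(?P<hw>\S+)\s+(?P<state>\S+)"
    r"(?:\s+(?P<ip>\S+))?\s*$"
)

SAMPLE = """\
Chassis#   Role    Mac Address     Priority Version  State                 IP
-------------------------------------------------------------------------------------
*1       Active   f4bd.abcd.f660     2      V02     Ready                169.254.54.130
 2       Standby  f4bd.abcd.f5a0     1      V02     Ready                169.254.54.131
"""


def parse_show_chassis(text):
    """Return one dict per chassis row; header and separator lines are skipped."""
    rows = []
    for line in text.splitlines():
        m = ROW.match(line)
        if not m:
            continue
        rows.append({
            "chassis": int(m.group("chassis")),
            "local": m.group("local") is not None,
            "role": m.group("role"),
            "mac": m.group("mac").lower(),
            "priority": int(m.group("priority")),
            "state": m.group("state"),
            "ip": m.group("ip"),
        })
    return rows


if __name__ == "__main__":
    for row in parse_show_chassis(SAMPLE):
        print(row["chassis"], row["role"], "priority", row["priority"], row["state"])
```

A check such as "exactly one row has role Active and every row is in state Ready" can then be run before and after the replacement.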

Either way: if my chassis#1 had failed and I replaced it with a new chassis#1 (with prio:2), I would not expect the new chassis#1 to take over. I would expect it to integrate as chassis#1 in the standby (secondary) role.

The SSO paper has a nice list of how the active WLC is chosen:

1. The wireless controller that is currently the active wireless controller
2. The wireless controller with the highest priority value.
3. The wireless controller with the shortest start-up time.
4. The wireless controller with the lowest MAC Address.
4. The wireless controller with the lowest MAC Address.

So based on the list, the currently active WLC should keep its role in any case (unless it fails).
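
The four criteria can be read as an ordered tie-break. Here is a minimal Python sketch of that reading, purely to illustrate the list: the `Controller` class, its field names and `elect_active` are hypothetical, and this models only the documented order, not what the 17.3.3 software does during a stack merge.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Controller:
    name: str
    currently_active: bool   # criterion 1
    priority: int            # criterion 2: higher value wins
    startup_seconds: float   # criterion 3: shorter wins
    mac: str                 # criterion 4: lower wins, e.g. "f4bd.abcd.f660"


def election_key(c):
    """Sort key: the smallest tuple wins the election."""
    return (
        not c.currently_active,               # False (already active) sorts first
        -c.priority,                          # highest priority first
        c.startup_seconds,                    # shortest start-up time first
        int(c.mac.replace(".", ""), 16),      # lowest MAC address first
    )


def elect_active(controllers):
    """Pick the active controller according to the four listed criteria."""
    return min(controllers, key=election_key)


if __name__ == "__main__":
    running = Controller("chassis1", True, 1, 300.0, "f4bd.abcd.f660")
    replacement = Controller("chassis2", False, 2, 10.0, "f4bd.abcd.f5a0")
    # The running active unit keeps its role despite the lower priority.
    print(elect_active([running, replacement]).name)
```

Under this reading, priority only matters when neither unit is already active, for example when both boot at the same time.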

saravlak · ‎05-12-2021

There is no pre-empt functionality with SSO meaning that when the previous Active wireless controller resumes operation, it will not take back the role as an Active wireless controller but will negotiate its state with the current Active wireless controller and transition to Hot-Standby state.

https://www.cisco.com/c/dam/en/us/td/docs/wireless/controller/9800/17-1/deployment-guide/c9800-ha-sso-deployment-guide-rel-17-1.pdf

Scott Fella · ‎05-12-2021

I understand… what I was trying to call out is something that I have run into. The controller isn’t supposed to reboot, but there is always a chance that it will. If you have experience with AireOS and SSO, for example, there were problems with that too, in which a controller could reboot itself and/or go into maintenance mode. So try to lab it out with 9800-CLs and see whether you are successful with a hardware replacement (new VM) or not.

-Scott

saravlak · ‎05-11-2021

https://www.cisco.com/c/dam/en/us/td/docs/wireless/controller/9800/17-1/deployment-guide/c9800-ha-sso-deployment-guide-rel-17-1.pdf
On C9800-40 and C9800-80 wireless controller, enable High Availability SSO using the following command on each of the two wireless controller units

chassis redundancy ha-interface local-ip <local IP> <local IP subnet> remote-ip <remote IP>

Reload both wireless controllers by executing the command reload from the CLI

Note: It is recommended to configure HA using the Redundancy Management Interface (RMI) starting Release 17.1. To see configuration using RMI please see the Redundancy Management Interface section.

saravlak · ‎05-12-2021

I have always rebooted both WLCs as part of initial bring-up or when adding a replacement, to avoid frustration, particularly when the RPs are connected across L2.
It appears that when the new WLC tries to add itself to the HA stack as standby-hot for the first time (election process), the existing ACTIVE WLC has to reboot so that the initial sync happens at boot-up; the rest of the config database is synced once both are fully booted. This initial scenario is different from a failure scenario in which both WLCs were already synced in the past. There are many doc references for that scenario. I have been unable to find a Cisco doc reference stating that a new/replaced WLC will sync with the existing Active WLC without an ACTIVE reboot; please point it out if you find one.