08-19-2025 06:07 AM
We have been using a pair of VMware-based Cisco C9800 WLCs in HA-SSO for a few years now.
Apart from the fact that the VMs don't like being vMotioned or having backup snapshots taken, they have been functioning quite well.
Our server guys recently built a new VMware cluster and tried to move the WLC VMs to the new cluster, which failed. The WLC they moved responded to SSH CLI and Web GUI but had somehow lost all its configuration. Our whole Wi-Fi environment went offline.
After restoring the VMs on the old cluster things came back online, but when checking the status of the HA pair it turned out that the primary WLC VM is doing what it should be doing, but it reports its standby member as "Removed" with MAC address 0000.0000.0000 (show chassis).
The standby VM console shows it is carrying the hostname of the primary and reports that it is a standalone WLC. Clearly the 2 VMs ended up in a split-brain situation or something similar. The server guys told me that at the moment they moved the first WLC, the network connecting the 2 clusters might not have been fully configured yet.
This creates 3 problems for me:
I wonder if it might be a lot easier to revert the primary to a standalone WLC, move that, and then add a newly created WLC VM as standby?
08-19-2025 06:28 AM
- @Redguy A lot of details are left undefined, which may be leading to trouble, such as:
>...and tried to move the WLC VMs to the new cluster,
But technical details on what was tried are left out. It's much the same with
>...After restoring the VM's on the old cluster
No technical details provided here.
Skipping to: What is the best procedure to move the WLC (cluster) to another VMware cluster
You probably can't do that in a transparent manner because officially HA SSO
is only supported on a 'single cell' VMware cluster.
I would indeed look more into building the new cluster on the new VMware environment and using that
as an N+1 'HA partner' for the current environment, giving APs the ability to fall back
to the new environment when the current cluster is abandoned. Then you can also
prepare the new environment at a relaxed pace first and check it out.
Appendix 1): Always validate a new environment (controller) with the CLI command
show tech wireless and feed its output into Wireless Config Analyzer
Appendix 2): Below are some useful CLI commands for troubleshooting HA SSO:
test wireless redundancy rping
show redundancy | i ptime|Location|Current Software state|Switchovers
show chassis
show chassis detail
show chassis ha-status local
show chassis ha-status active
show chassis ha-status standby
show chassis rmi
show redundancy
show redundancy history
show redundancy switchover history
show tech wireless redundancy
show redundancy states
08-19-2025 06:36 AM - edited 08-19-2025 06:51 AM
>I wonder if it might be a lot easier to revert the primary to a standalone WLC, move that, and then add a newly created WLC VM as standby?
This is good but needs some correction:
Break SSO and move the secondary WLC.
Then force the APs to join the secondary WLC, one by one or in groups.
Then move the primary to the other site once you are sure all APs have joined the secondary WLC.
Finally, reconfigure SSO.
MHM
08-19-2025 11:23 PM
Additional information:
From what I understand from the server guys: they moved the primary WLC VM (S901) by doing a storage move to the new VMware cluster (with its own vCenter etc.), so no vMotion. They were counting on the standby WLC to take over while the primary was being moved.
However, the networks needed for the S901 VM were not yet fully configured on the new cluster 😞 So the moved S901 booted without being able to contact its standby peer S902 or the APs. They completed the network settings a few minutes later. By then our Wi-Fi environment was already offline. Most of our SSIDs need authentication via RADIUS, which did not work since that goes via the WLC first.
At that point the server guys finally told me what they were doing, because trouble tickets were pouring into our service desk. (We had a very stern discussion about why the <bleep> they did this during regular office hours and without talking to "networks" first. But too little, too late.)
I logged in to the S901 (I thought, but in hindsight I think it was the S902) and noticed that it was running but all APs etc. were gone. I wanted to restore the config from backup, but the server guys beat me to it by restoring the S901 WLC VM on the old cluster and booting that (and killing the moved S901). This restore action worked and our users were able to work again. Panic over.
Later, while trying to figure out what exactly had happened and checking whether everything was okay for now, I noticed that the HA was broken:
xxx-xxx-S901#show chassis
Chassis/Stack Mac Address : xxxx.xxxx.xxxx - Local Mac Address
Mac persistency wait time: Indefinite
                                             H/W      Current
Chassis#  Role     Mac Address      Priority  Version  State     IP
-------------------------------------------------------------------------------------
*1        Active   x.x.x.x          2         V02      Ready     169.254.24.7
 2        Member   0000.0000.0000   0         V02      Removed   169.254.24.8
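As a rough aid for spotting this state automatically, here is a minimal Python sketch that parses the table rows of "show chassis" output like the one above and flags a member that is not Ready or that has an all-zero MAC address. The function names and the regular expression are illustrative assumptions, not a Cisco tool, and the parser assumes single-word values in the State column, so multi-word states would be skipped.

```python
import re

# Sample rows copied from the output above (MAC of chassis 1 is redacted in the post).
SHOW_CHASSIS = """\
Chassis#  Role     Mac Address      Priority  Version  State     IP
-------------------------------------------------------------------
*1        Active   x.x.x.x          2         V02      Ready     169.254.24.7
 2        Member   0000.0000.0000   0         V02      Removed   169.254.24.8
"""

# One table row: optional '*' (local chassis), chassis number, then six
# whitespace-separated fields. Assumption: State is a single word.
ROW = re.compile(
    r"^\s*(?P<local>\*)?(?P<chassis>\d+)\s+(?P<role>\S+)\s+(?P<mac>\S+)\s+"
    r"(?P<priority>\d+)\s+(?P<version>\S+)\s+(?P<state>\S+)\s+(?P<ip>\S+)\s*$"
)


def parse_show_chassis(text):
    """Return one dict per chassis row; header and separator lines are ignored."""
    rows = []
    for line in text.splitlines():
        match = ROW.match(line)
        if match:
            row = match.groupdict()
            row["local"] = row["local"] is not None
            row["chassis"] = int(row["chassis"])
            row["priority"] = int(row["priority"])
            rows.append(row)
    return rows


def ha_problems(rows):
    """List human-readable findings that suggest the HA pair is not healthy."""
    problems = []
    if len(rows) < 2:
        problems.append("only one chassis listed: no HA peer visible")
    for row in rows:
        if row["state"] != "Ready":
            problems.append(f"chassis {row['chassis']} state is {row['state']}")
        if row["mac"] == "0000.0000.0000":
            problems.append(f"chassis {row['chassis']} has an all-zero MAC address")
    return problems


if __name__ == "__main__":
    for finding in ha_problems(parse_show_chassis(SHOW_CHASSIS)):
        print(finding)
```

Run against the S901 output, this reports both symptoms for chassis 2. Run against the S902 console output described in this post (a single chassis row), it would report that no HA peer is visible.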
The S902 VM was not responding via the network at all, so I checked the console via VMware.
This showed that the S902 is now called the S901. The show chassis command showed that chassis #2 is in the Active role and in the Ready state, with IP 169.254.24.8. But it showed only the "S902" chassis in the list; no HA partner was visible at all.
So that is the status at the moment. The S901 VM is running on the "old" cluster, doing what it should be doing. No HA standby though. The S902 VM (also still on the old cluster) can be booted, but it comes up as the S901 and thinks it is a standalone WLC even though "show chassis" shows it as chassis# *2.
I keep the S902 VM offline to make sure we get no duplicate IP or a split-brain fight between the 2 VMs.
08-20-2025 01:41 AM
- @Redguy There is too much going on beyond your control. Mistakes cannot be ruled out.
I can only advise checking the logs on the controllers when things go wrong (show logging).
Also take a look at : https://www.cisco.com/c/en/us/support/docs/wireless/catalyst-9800-wireless-controllers-cloud/218438-verify-support-vmware-vsphere-vmotion-wi.html
It contains a few tables about what is allowed or not when moving controllers between platforms
Also from the same document :
>...Recommendation: For best results, it is recommended to configure RP port keepalives to at least twice the default 100 ms keepalive (set it to 200 ms). If the network between storage and hosts can become busy and increase latency, consider to set the keepalives timer to 300 ms. To configure the keepalive timer on the GUI, go to Administration > Device > Redundancy:
>...
C9800-SSO#chassis redundancy keep-alive timer 3
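The quoted recommendation pairs a 300 ms keepalive with a timer value of 3, which suggests the timer is expressed in units of 100 ms (the stated default keepalive). Under that assumption, a small Python sketch of the arithmetic could look like this; the helper names are hypothetical, and the accepted value range on your release should be checked with the CLI help before applying anything.

```python
import math

# Assumption inferred from the quoted text: one timer unit equals 100 ms,
# so "timer 3" corresponds to 300 ms.
KEEPALIVE_UNIT_MS = 100


def keepalive_timer_value(target_ms):
    """Convert a desired keepalive interval in ms to a timer value, rounding up."""
    if target_ms <= 0:
        raise ValueError("target_ms must be positive")
    return math.ceil(target_ms / KEEPALIVE_UNIT_MS)


def keepalive_command(target_ms):
    """Build the CLI line shown in the post for a desired keepalive interval."""
    return f"chassis redundancy keep-alive timer {keepalive_timer_value(target_ms)}"


if __name__ == "__main__":
    for ms in (200, 300):
        print(ms, "ms ->", keepalive_command(ms))
```

For 300 ms this reproduces the command shown above; 200 ms, the "twice the default" recommendation, maps to a timer value of 2.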
M.
08-20-2025 02:55 AM
I am totally with @Mark Elsen
This is beyond your control.
And as for vMotion: how does that relate to migrating the WLC from one VM environment to another?
MHM
08-20-2025 05:42 AM
I think there's a good chance they removed the WLC-required network changes (Promiscuous mode and Forged Transmits) for the S902 and probably also forgot to configure them for the new VMs (our lab team did this when moving things around recently even after I had reminded them).
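To make that check repeatable before any move, the port-group security settings on both clusters can be written down and compared. The following is a minimal Python sketch over hand-entered data; it does not call any VMware API, and the port-group names and dictionary keys are hypothetical placeholders.

```python
# Settings the post names as required for the WLC VMs; both must be "Accept".
REQUIRED_ACCEPT = ("promiscuous_mode", "forged_transmits")


def missing_security_settings(port_group):
    """Return the required settings that are absent or not set to Accept."""
    return [
        setting
        for setting in REQUIRED_ACCEPT
        if str(port_group.get(setting, "")).lower() != "accept"
    ]


# Hypothetical inventory, filled in by hand from the vSphere client.
port_groups = {
    "old-cluster/wlc-trunk": {"promiscuous_mode": "Accept", "forged_transmits": "Accept"},
    "new-cluster/wlc-trunk": {"promiscuous_mode": "Reject", "forged_transmits": "Accept"},
}

report = {name: missing_security_settings(pg) for name, pg in port_groups.items()}

if __name__ == "__main__":
    for name, missing in report.items():
        print(name, "OK" if not missing else f"needs fixing: {', '.join(missing)}")
```

An empty list for every port group the WLC VMs attach to, on both the old and the new cluster, is the state to aim for before moving anything.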
https://www.cisco.com/c/en/us/td/docs/wireless/controller/9800/technical-reference/c9800-best-practices.html#C9800CLconsiderations
Check the Best Practices guide for the rest of the 9800-CL considerations and also make sure everything in the installation and setup guide is covered too.
You should have been able to move them if it had all been set up correctly in advance and the timing of the moves had been right. But given your current situation, agreed: you would be best off reverting to a standalone WLC (break HA-SSO), moving the standalone server (during a maintenance window for a planned outage), and then re-creating the backup and re-establishing HA-SSO.
Or if you don't want the outage, then build a new WLC as N+1. Move the APs to the new WLC; the advantage is that you can move 1 or 2 APs first and test to make sure it's working. When you're happy it's good, move the rest of the APs. Then either move the old one or build a new one to join the other new one as an HA-SSO pair.
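The "pilot first, then the rest" approach can be planned up front. Here is a minimal Python sketch that splits an AP list into a small pilot wave followed by larger waves; the AP names, the wave size, and the function name are hypothetical, and how each wave is actually pointed at the new WLC is left out.

```python
def migration_waves(ap_names, pilot_size=2, wave_size=20):
    """Split APs into a pilot wave of pilot_size, then waves of wave_size."""
    if pilot_size < 1 or wave_size < 1:
        raise ValueError("pilot_size and wave_size must be at least 1")
    aps = list(ap_names)
    # The first wave is the small pilot group that is tested before continuing.
    waves = [aps[:pilot_size]] if aps else []
    rest = aps[pilot_size:]
    for start in range(0, len(rest), wave_size):
        waves.append(rest[start:start + wave_size])
    return waves


if __name__ == "__main__":
    example_aps = [f"AP-{n:02d}" for n in range(1, 8)]  # hypothetical AP names
    for number, wave in enumerate(migration_waves(example_aps, wave_size=3), start=1):
        print(f"wave {number}: {', '.join(wave)}")
```

Only move on to the next wave once the previous one has joined the new WLC and clients on those APs are working.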
Depending on what version you're running now you might be better off building new VMs with larger bootflash partitions to accommodate the larger disk size required on newer versions.