
C9800 HA SSO pair broken

Redguy
Visitor

We have been using a pair of VMware-based Cisco C9800 WLCs in HA-SSO for a few years now.
Apart from the fact that the VMs don't like being vMotioned or having backup snapshots taken, they have been functioning quite well.

Our server guys recently built a new VMware cluster and tried to move the WLC VMs to the new cluster, which failed. The WLC they moved responded to SSH CLI and Web GUI but had somehow lost all of its configuration. Our whole Wi-Fi environment went offline.

After restoring the VMs on the old cluster, things came back online. But when I checked the status of the HA pair, it turned out that the primary WLC VM is doing what it should be doing, yet it reports its standby member as "Removed" with MAC address 0000.0000.0000 (show chassis).

The standby VM console shows it is carrying the hostname of the primary and reports that it is a standalone WLC. Clearly the two VMs ended up in a split-brain situation or something similar. The server guys told me that at the moment they moved the first WLC, the network connecting the two clusters might not have been fully configured yet.

This creates 3 problems for me:

  1. How do I repair the HA SSO pair without interrupting our Wi-Fi networks (if possible)?
  2. What is the best procedure to move the WLC (cluster) to another VMware cluster?
  3. For some reason the WLC VMs (installed by an external vendor) still have the .ISO mounted. Is this normal, and should it stay that way? Or can we just remove the mounted .ISO from the VMs?

I wonder if it might be a lot easier to revert the primary to a standalone WLC, move that, and then add a newly created WLC VM as standby ?

 

6 Replies

Mark Elsen
Hall of Fame

 

  - @Redguy   Several parameters are left undefined, which may well be where the trouble comes from, such as:
                       >...and tried to move the WLC VMs to the new cluster,
                       The technical details of what was tried are left out. The same goes for
                       >...After restoring the VM's on the old cluster
                       No technical details are provided here either.

  Skipping to:  What is the best procedure to move the WLC (cluster) to another VMware cluster
                       You probably can't do that in a transparent manner, because officially HA SSO
                       is only supported on a 'single cell' VMware cluster.
                       I would indeed rather look into building the new cluster in the new VMware environment and using that
                       as an N+1 'HA partner' for the current environment. That gives the APs the ability to fall back
                       to the new environment when the current cluster is abandoned. You can then also
                       prepare the new environment at a relaxed pace first and check it out.

 Appendix 1):  Always validate a new environment (controller) with the CLI command
                       show tech wireless and feed its output into Wireless Config Analyzer

 Appendix 2): below are some useful CLI commands for troubleshooting HA SSO
test wireless redundancy rping  
show redundancy | i ptime|Location|Current Software state|Switchovers
show chassis
show chassis detail
show chassis ha-status local
show chassis ha-status active
show chassis ha-status standby
show chassis rmi
show redundancy
show redundancy history
show redundancy switchover history
show tech wireless redundancy
show redundancy states


 



-- Let everything happen to you  
       Beauty and terror
      Just keep going    
       No feeling is final
Rainer Maria Rilke (1899)

I wonder if it might be a lot easier to revert the primary to a standalone WLC, move that, and then add a newly created WLC VM as standby ?

This is good, but it needs some correction:

Break SSO and move the secondary WLC.

Then force the APs to join the secondary WLC, one by one or in groups.

Then move the primary to the other site once you are sure all APs have joined the secondary WLC.

Finally, reconfigure SSO.

https://www.cisco.com/c/en/us/support/docs/wireless/catalyst-9800-series-wireless-controllers/213915-configure-catalyst-9800-wireless-control.html#toc-hId--121408849

MHM

Redguy
Visitor

Additional information : 

From what I understand from the server guys: they moved the primary WLC VM (S901) by doing a storage move to the new VMware cluster (with its own vCenter etc.), so no vMotion. They were counting on the standby WLC to take over while the primary was being moved.

However, the networks needed for the S901 VM were not yet fully configured on the new cluster 😞 So the moved S901 booted without being able to contact its standby peer S902 or the APs. They completed the network settings a few minutes later. By then our Wi-Fi environment was already offline. Most of our SSIDs need authentication via RADIUS, which did not work since that goes via the WLC first.

At that point the server guys finally told me what they were doing, because trouble tickets were pouring into our service desk. (We had a very stern discussion about why the <bleep> they did this during regular office hours and without talking to "networks" first. But too little, too late.)

I logged in to the S901 (I thought, but in hindsight I think it was the S902) and noticed that it was running but all APs etc. were gone. I wanted to restore the config from backup, but the server guys beat me to it by restoring the S901 WLC VM on the old cluster and booting that (and killing the moved S901). This restore action worked, and our users were able to work again. Panic over.

Later, while trying to figure out what exactly had happened and checking whether everything was okay for now, I noticed that the HA was broken:

xxx-xxx-S901#show chassis
Chassis/Stack Mac Address : xxxx.xxxx.xxxx - Local Mac Address
Mac persistency wait time: Indefinite
                                            H/W       Current
Chassis#   Role    Mac Address     Priority Version   State     IP
-------------------------------------------------------------------------------------
*1         Active  x.x.x.x         2        V02       Ready     169.254.24.7
 2         Member  0000.0000.0000  0        V02       Removed   169.254.24.8
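
For anyone who wants to watch for this state automatically, the check on the output above can be scripted. This is only a minimal Python sketch. It assumes the `show chassis` text has already been captured somehow, and that the rows are whitespace-separated in the column order shown in this post (the layout may differ between releases). The names `parse_show_chassis` and `ha_problems` are made up for illustration and are not part of any Cisco tooling.

```python
import re

# One chassis row, e.g. "*1 Active x.x.x.x 2 V02 Ready 169.254.24.7".
# Assumption: single-word State column, as in the output posted in this thread.
# Multi-word states would need a looser pattern.
ROW = re.compile(
    r"^\s*(?P<local>\*)?(?P<num>\d+)\s+(?P<role>\S+)\s+(?P<mac>\S+)\s+"
    r"(?P<priority>\d+)\s+(?P<version>\S+)\s+(?P<state>\S+)\s+(?P<ip>\S+)\s*$"
)


def parse_show_chassis(text):
    """Return one dict per chassis row found in captured 'show chassis' text."""
    rows = []
    for line in text.splitlines():
        match = ROW.match(line)
        if match:
            row = match.groupdict()
            row["local"] = row["local"] is not None  # '*' marks the local chassis
            row["num"] = int(row["num"])
            rows.append(row)
    return rows


def ha_problems(rows):
    """List human-readable reasons why the pair does not look healthy."""
    problems = []
    if len(rows) < 2:
        problems.append("only one chassis listed, no HA peer visible")
    for row in rows:
        if row["state"].lower() != "ready":
            problems.append(f"chassis {row['num']} is in state {row['state']}")
        if row["mac"] == "0000.0000.0000":
            problems.append(f"chassis {row['num']} has an all-zero MAC address")
    return problems


# Sample text copied from the (redacted) output in this post.
SAMPLE = """\
Chassis# Role Mac Address Priority Version State IP
-------------------------------------------------------------------------------------
*1 Active x.x.x.x 2 V02 Ready 169.254.24.7
2 Member 0000.0000.0000 0 V02 Removed 169.254.24.8
"""

if __name__ == "__main__":
    for problem in ha_problems(parse_show_chassis(SAMPLE)):
        print(problem)
```

Run against the sample, this flags chassis 2 twice: once for the Removed state and once for the all-zero MAC address.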


The S902 VM was not responding via the network at all, so I checked the console via VMware.

This showed that the S902 is now called the S901. The show chassis command showed that chassis #2 is in the Active role and in the Ready state, with IP 169.254.24.8. But it listed only the "S902" chassis; no HA partner was visible at all.

 

So that is the status at the moment. The S901 VM is running on the "old" cluster, doing what it should be doing, but without an HA standby. The S902 VM (also still on the old cluster) can be booted, but it comes up as the S901 and thinks it is a standalone WLC, even though "show chassis" shows it as chassis# *2.

I keep the S902 VM offline to make sure we get no duplicate IP or a split-brain fight between the two VMs.

 

  - @Redguy   There is too much going on beyond your direct control, so mistakes cannot be ruled out.
                      I can only advise checking the logs on the controllers when things go wrong (show logging).

                      Also take a look at: https://www.cisco.com/c/en/us/support/docs/wireless/catalyst-9800-wireless-controllers-cloud/218438-verify-support-vmware-vsphere-vmotion-wi.html
                      It contains a few tables about what is and is not allowed when moving controllers between platforms.

                      Also from the same document:
                      >...Recommendation: For best results, it is recommended to configure RP port keepalives to at least twice the default 100 ms keepalive (set it to 200 ms). If the network between storage and hosts can become busy and increase latency, consider to set the keepalives timer to 300 ms. To configure the keepalive timer on the GUI, go to Administration > Device > Redundancy:
                      >...

C9800-SSO#chassis redundancy keep-alive timer 3 
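
The quoted recommendation speaks in milliseconds while the command takes a small integer. Since the quote pairs a 100 ms default with `timer 3` for the 300 ms case, the sketch below assumes the argument counts multiples of 100 ms. That unit is an inference from the quote, so verify it (and the accepted range) with the CLI help on your release. The helper name `keepalive_command` is hypothetical; it only builds the command string and does not talk to a controller.

```python
def keepalive_command(interval_ms):
    """Build the keep-alive timer command for a desired interval in ms.

    Assumption: the timer argument is expressed in units of 100 ms.
    """
    if interval_ms <= 0 or interval_ms % 100 != 0:
        raise ValueError("interval must be a positive multiple of 100 ms")
    return f"chassis redundancy keep-alive timer {interval_ms // 100}"


if __name__ == "__main__":
    # The two values mentioned in the quoted recommendation.
    for ms in (200, 300):
        print(ms, "ms ->", keepalive_command(ms))
```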

          M.




I am totally with @Mark Elsen 

This is beyond your control.

And as for vMotion: how does it relate to migrating the WLC from one VM to another?

MHM

Rich R
VIP

I think there's a good chance they removed the WLC-required network changes (Promiscuous mode and Forged Transmits) for the S902 and probably also forgot to configure them for the new VMs (our lab team did this when moving things around recently even after I had reminded them).
https://www.cisco.com/c/en/us/td/docs/wireless/controller/9800/technical-reference/c9800-best-practices.html#C9800CLconsiderations

Check the Best Practices guide for the rest of the 9800-CL considerations and also make sure everything in the installation and setup guide is covered too.

You should have been able to move them if it had all been set up correctly in advance and the moves had been timed right. But given your current situation, I agree that you would be best off reverting to a standalone WLC (break HA-SSO), moving the standalone server (during a maintenance window for a planned outage), and then re-creating the backup and re-establishing HA-SSO.

Or if you don't want the outage then build a new WLC as N+1. Move the APs to the new WLC - the advantage being that you can move 1 or 2 APs first and test to make sure it's working.  When you're happy it's good, move the rest of the APs.  Then either move the old one or build a new one to join to the other new one as an HA-SSO pair.
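
The staged move described above (1 or 2 pilot APs first, then the rest) can be planned with a small helper. This is a rough Python sketch only: the function name `plan_ap_batches`, the AP names and the batch size are all hypothetical, and it just produces the order of the batches. How the APs are actually pointed at the new WLC is up to you and is not shown here.

```python
def plan_ap_batches(ap_names, pilot_count=2, batch_size=10):
    """Split APs into a small pilot batch followed by fixed-size batches."""
    if pilot_count < 0 or batch_size < 1:
        raise ValueError("pilot_count must be >= 0 and batch_size >= 1")
    aps = list(ap_names)
    batches = []
    pilot = aps[:pilot_count]
    if pilot:
        batches.append(pilot)  # move and test these before anything else
    rest = aps[pilot_count:]
    for start in range(0, len(rest), batch_size):
        batches.append(rest[start:start + batch_size])
    return batches


if __name__ == "__main__":
    example_aps = [f"ap{n}" for n in range(1, 8)]  # made-up AP names
    for number, batch in enumerate(plan_ap_batches(example_aps, 2, 3), start=1):
        print(f"batch {number}: {', '.join(batch)}")
```

Keeping the pilot batch separate makes it easy to stop after the first step if the test APs do not come up cleanly on the new WLC.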

Depending on what version you're running now you might be better off building new VMs with larger bootflash partitions to accommodate the larger disk size required on newer versions.
