Solved: HA SSO switchover isn't triggered when active WLC loses default gateway

Shiden · ‎02-22-2023

Hello all,

I am currently configuring HA SSO with RMI+RP on a Catalyst 9800-L (Firmware 17.06.04) Wireless controller.

The peering works perfectly:

WLC#sh chassis Rmi
Chassis/Stack Mac Address : 0845.d117.c840 - Local Mac Address
Mac persistency wait time: Indefinite
Local Redundancy Port Type: Twisted Pair
                                             H/W      Current
Chassis#  Role     Mac Address     Priority  Version  State  IP            RMI-IP
--------------------------------------------------------------------------------------------------------
*1        Active   0845.d117.c840  2         V02      Ready  169.254.0.13  10.10.0.13
2         Standby  0845.d117.0960  1         V02      Ready  169.254.0.14  10.10.0.14

The switchover also works perfectly when the active WLC powers off. Now I want the switchover to be triggered as well when the active WLC loses connectivity to the default gateway. Here is what I configured:

management gateway-failover enable

ip default-gateway <ip>

Here is the redundancy state of the WLC:

WLC#sh redundancy states
my state = 13 -ACTIVE
peer state = 8 -STANDBY HOT
Mode = Duplex
Unit = Primary
Unit ID = 1

Redundancy Mode (Operational) = sso
Redundancy Mode (Configured) = sso
Redundancy State = sso
Maintenance Mode = Disabled
Manual Swact = enabled
Communications = Up

client count = 150
client_notification_TMR = 30000 milliseconds
RF debug mask = 0x0
Gateway Monitoring = Enabled
Gateway monitoring interval = 8 secs
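
For anyone who wants to check this output in a script rather than by eye, here is a minimal Python sketch. It only assumes that the relevant lines have the "key = value" shape shown above; the function names are made up for illustration and are not part of any Cisco tooling.

```python
def parse_redundancy_states(text):
    """Collect the 'key = value' lines of `show redundancy states` into a dict."""
    fields = {}
    for line in text.splitlines():
        if "=" not in line:
            continue  # skip prompts and blank lines
        key, _, value = line.partition("=")
        fields[key.strip()] = value.strip()
    return fields


def gateway_failover_preconditions_ok(fields):
    """True when monitoring is enabled, the peer is hot, and the pair is talking."""
    return (
        fields.get("Gateway Monitoring") == "Enabled"
        and fields.get("Communications") == "Up"
        and fields.get("peer state", "").endswith("STANDBY HOT")
    )


# Excerpt of the output posted above.
SAMPLE = """\
my state = 13 -ACTIVE
peer state = 8 -STANDBY HOT
Redundancy Mode (Operational) = sso
Communications = Up
Gateway Monitoring = Enabled
Gateway monitoring interval = 8 secs
"""

if __name__ == "__main__":
    parsed = parse_redundancy_states(SAMPLE)
    print(gateway_failover_preconditions_ok(parsed))
```

A passing check only says the feature is switched on and the standby is ready; it does not prove that the gateway probes are actually being sent or answered.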

Now my problem: when I unplug the uplink (the RP is still plugged in), nothing happens and I don't know why. According to the Cisco documentation, the switchover should be triggered.

https://www.cisco.com/c/dam/en/us/td/docs/wireless/controller/9800/17-1/deployment-guide/c9800-ha-sso-deployment-guide-rel-17-1.pdf (page 30).

The access points go down because the active WLC is no longer reachable. I also see logs on both WLCs (active and standby) saying that the RMI link is no longer reachable. The RMI links don't have to be up for the switchover to be triggered, right? Otherwise, what could it be?

Thanks in advance to everyone who takes the time to answer this post.

Arshad Safrulla · ‎02-23-2023

Make sure that the mobility MAC address is configured. Is the RMI IP part of the same subnet as the WMI interface? (The recommendation is that it must be part of the same subnet.)

ip default-gateway must be configured and it should be the gateway of the RMI Interface. (In your case 10.10.0.0 network)
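
A quick way to sanity-check the addressing is a few lines of Python with the standard ipaddress module. This is only a sketch: the /24 prefix length and the gateway address 10.10.0.1 are assumptions for illustration (the thread does not state the mask), while the RMI and RP addresses are the ones from the posted output.

```python
import ipaddress


def all_in_subnet(network, *addresses):
    """Return True if every address belongs to the given network."""
    net = ipaddress.ip_network(network)
    return all(ipaddress.ip_address(addr) in net for addr in addresses)


# Assumed values: prefix length and gateway are hypothetical.
RMI_NETWORK = "10.10.0.0/24"
GATEWAY = "10.10.0.1"
RMI_IPS = ("10.10.0.13", "10.10.0.14")   # from `sh chassis Rmi`
RP_IPS = ("169.254.0.13", "169.254.0.14")  # RP link addresses, not routable

if __name__ == "__main__":
    # The gateway being pinged must sit in the same subnet as both RMI IPs.
    print(all_in_subnet(RMI_NETWORK, GATEWAY, *RMI_IPS))
    # The RP addresses live in a separate link-local range.
    print(all_in_subnet(RMI_NETWORK, *RP_IPS))
```

If the first check fails with your real mask and gateway, the gateway probes sourced from the RMI IP have no on-link target, so fix the addressing before looking anywhere else.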

Post the outputs below if you need further assistance:

show run all | i redun
show run | i redun
show run interface Vlan <WMI interface VLAN>

Most importantly, make sure that GARP is enabled where the gateway resides and that the upstream switchports connecting to the WLC are properly configured (great if you can post the config; recommendations: no native VLAN, only allow the wireless VLANs, and spanning-tree portfast edge on the ports).

___________________________________________
TAC recommended codes for AireOS WLC's
Best Practices for AireOS WLC's
TAC recommended codes for 9800 WLC's
Best Practices for 9800 WLC's
Cisco Wireless compatibility matrix
___________________________________________
Arshad Safrulla


Mark Elsen · ‎02-23-2023

>.... Now I want when the active WLC loses connectivity to the default-gateway,...
- In general, HA SSO is not designed for that. It is designed to provide wireless service on a 'box failure'; with RMI+RP you may have failover for a local link failure too, but not for a default gateway; that is an external network problem, so to speak.

M.

-- Let everything happen to you
   Beauty and terror
      Just keep going
     No feeling is final
Rainer Maria Rilke (1899)

Shiden · ‎03-08-2023

Hello @Arshad Safrulla,

Sorry for my late reply, I have been on vacation for almost 2 weeks. When I came back, I checked the config again and found I had basically the same configuration that you described. I tried configuring the default gateway as a static route, "ip route 0.0.0.0 0.0.0.0 10.10.0.1", because I saw on another forum that this could fix the problem. I unplugged the uplinks again, and it finally worked. To be sure this was the reason, I removed the route and tried the same test, but it worked that time too. Honestly, I am a bit confused by HA SSO; it feels like it works if you are lucky that day. I don't understand what the issue was before, but it seems to work now. So I know what you mean, @Scott Fella. Additionally, the WLC sometimes freezes after a switchover and has to be restarted manually.

Thank you all for your answers. I will accept this one because it is excellent advice for HA SSO.

Rich R · ‎02-24-2023

Actually @Mark Elsen - the feature is supported from 17.1 (and 17.4 for IPv6) and designed to work exactly that way:
https://www.cisco.com/c/en/us/td/docs/wireless/controller/9800/17-4/config-guide/b_wl_17_4_cg/m_vewlc_high_availability.html#id_109520
https://www.cisco.com/c/dam/en/us/td/docs/wireless/controller/9800/17-6/deployment-guide/c9800-ha-sso-deployment-guide-rel-17-6.pdf

"Default Gateway check is done by periodically sending Internet Control Message Protocol (ICMP) ping to
the gateway. Both the active and the standby controllers use the RMI IP as the source IP. These messages
are sent at 1 second interval. If there are 8 consecutive failures in reaching the gateway, the controller will
declare the gateway as non-reachable.
After 4 ICMP Echo requests fail to get ICMP Echo responses, ARP requests are attempted. If there is no
response for 8 seconds (4 ICMP Echo Requests followed by 4 ARP Requests), the gateway is assumed to
be non-reachable.
IPv6 default gateway detection is supported starting release 17.4. Instead of ICMP and ARP in IPv4, IPv6
shall use ICMP to detect gateway failure."
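
To make the timing in that quote concrete, here is a small Python sketch of the counting logic as I read it. It is a simplified model, not Cisco's implementation: the function names are made up, and I am assuming that any reply resets the failure counter (which is what "consecutive failures" implies).

```python
def probe_type(failures_so_far, icmp_probes=4):
    """Which probe is sent next: ICMP for the first 4 failures, then ARP."""
    return "ICMP" if failures_so_far < icmp_probes else "ARP"


def seconds_until_gateway_down(replies, icmp_probes=4, arp_probes=4):
    """
    replies: one boolean per 1-second probe (True = gateway answered).
    Returns the second at which the gateway is declared non-reachable,
    or None if the threshold of consecutive failures is never reached.
    """
    threshold = icmp_probes + arp_probes  # 4 ICMP + 4 ARP = 8 seconds
    failures = 0
    for second, got_reply in enumerate(replies, start=1):
        if got_reply:
            failures = 0  # assumption: a single reply restarts the count
            continue
        failures += 1
        if failures == threshold:
            return second
    return None


if __name__ == "__main__":
    # Uplink pulled and left out: declared down after 8 probes.
    print(seconds_until_gateway_down([False] * 8))
    # One reply after 5 losses restarts the count.
    print(seconds_until_gateway_down([False] * 5 + [True] + [False] * 8))
```

So when testing by pulling the uplink, leave it unplugged for clearly longer than the monitoring interval shown in "show redundancy states" before concluding that nothing happens.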

------------------------------
Please click Helpful if this post helped you and Accept as Solution (drop down menu at top right of this reply) if this answered your query.
------------------------------
TAC recommended codes for AireOS WLC's and TAC recommended codes for 9800 WLC's
Best Practices for AireOS WLC's, Best Practices for 9800 WLC's and Cisco Wireless compatibility matrix
Check your 9800 WLC config with Wireless Config Analyzer using "show tech wireless" output or "config paging disable" then "show run-config" output on AireOS and use Wireless Debug Analyzer to analyze your WLC client debugs
Field Notice: FN63942 APs and WLCs Fail to Create CAPWAP Connections Due to Certificate Expiration
Field Notice: FN72424 Later Versions of WiFi 6 APs Fail to Join WLC - Software Upgrade Required
Field Notice: FN72524 IOS APs stuck in downloading state after 4 Dec 2022 due to Certificate Expired
- Fixed in 8.10.196.0, latest 9800 releases, 8.5.182.12 (8.5.182.13 for 3504) and 8.5.182.109 (IRCM, 8.5.182.111 for 3504)
Field Notice: FN70479 AP Fails to Join or Joins with 1 Radio due to Country Mismatch, RMA needed
How to avoid boot loop due to corrupted image on Wave 2 and Catalyst 11ax Access Points (CSCvx32806)
Field Notice: FN74035 - Wave2 APs DFS May Not Detect Radar After Channel Availability Check Time
Leo's list of bugs affecting 2800/3800/4800/1560 APs
Default AP console baud rate from 17.12.x is 115200 - introduced by CSCwe88390

Scott Fella · ‎02-24-2023

Does the primary ever reboot, allowing the secondary unit to take over? With a hardware failure or when just powering down the primary, the secondary moves in right away, but not in this scenario. If the primary never reboots, I would suspect a configuration issue or something broken in the back end. You might also try to rebuild the SSO pair.

I was never a fan of SSO. I have always tested it and have run into production issues, so I now stick with N+1. By no means am I saying SSO stinks; N+1 is manageable to me, and your environment might be different.

Open a TAC case since I would think that you have support on this and let us know how it was fixed.

-Scott
*** Please rate helpful posts ***

Rich R · ‎02-25-2023

Agree with @Scott Fella - if you're sure you've followed the config guide correctly and it's not working then time for a TAC case.
We've generally found SSO very reliable. The only thing we have had occasional trouble with is the gateway reachability test failing and triggering a switchover when it shouldn't. Then different Cisco BUs fight over who lost the checks: WLC or router. I don't think we've seen that yet with the 9800 though, so maybe it is only an AireOS problem.
