08-04-2023 01:40 AM - edited 08-04-2023 01:42 AM
Hey Guys,
I'm running a pair of 9800-40 WLCs in HA SSO mode (with RMI) that are directly connected to the upstream L3 switches that handle the default gateway for the WMI subnet.
Due to the geographical location of the server room, I had to connect the RP to the same switches that handle the default gateway and span the RP's L2 VLAN to the other server room.
I noticed that in a failure scenario where the HSRP Active distribution switch goes down (during an IOS upgrade) and HSRP fails over to the secondary distro switch (severing all connections to the previously Active WLC, both RP and data ports), the secondary WLC should take over the Active role, and indeed it does. However, it lost IP connectivity to the default gateway. I was not able to ping the default gateway at all until I cleared the ARP entry for it. From then on, L3 connectivity was restored and the WLC was reachable again via the WMI.
Is this expected behavior? Could it be caused by the fact that I'm running the L2 RP connection through the same switches that handle the L3 gateway for the WLCs?
From my perspective it should not matter, because if one of the distro switches fails, it should basically be the same scenario as if the whole controller shut down (all links down, RP and data ports).
08-04-2023 04:22 AM
Hi @mqontt
I can't confirm that the design is the problem, but if we check the Cisco doc, your design matches the third one. I would try to follow the Cisco-recommended design and test. After all, if you open a TAC case today to fix this, they will probably request that you change it.
08-05-2023 07:02 AM - edited 08-09-2023 07:17 AM
I agree that it should work.
Although it's not 100% compliant with the Cisco design guide, I don't think that's the problem, because the issue is with the default gateway, not the RP connection.
1. What version of software are you using? (refer to the TAC-recommended releases) There have been a lot of fixes and enhancements to HA in successive releases.
2. Are you using the burned-in MAC for HSRP, so that the default gateway MAC needs to change on switchover? If the MAC doesn't change (i.e. you use the HSRP virtual MAC) there's no reason why it should be a problem. If you are using the hardware MAC you'll need to set up EEM to clear the ARP on switchover. HSRP should send a gratuitous ARP on switchover, but any device which misses it will not update its cache. What does the ARP cache look like before and after clearing?
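If you do end up needing that EEM workaround, a minimal sketch would be something like the applet below (applet name and action numbering are my own; verify the syslog pattern against what your platform actually logs):

```
event manager applet HSRP-CLEAR-ARP
 event syslog pattern "%HSRP-5-STATECHANGE.*-> Active"
 action 1.0 cli command "enable"
 action 2.0 cli command "clear arp-cache"
```

This fires whenever an HSRP group on the box transitions to Active and flushes the ARP cache so neighbours re-learn the gateway MAC.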
08-05-2023 01:31 PM
"Due to the geographical location of the server room I had to connect the RP to the same switches that handle the default gateway and span the L2 VLAN for the RP to the other server room."
I think your RP connection also goes through the same HSRP switches, so when the HSRP active goes down, your HA link goes down too. That is a problem. I would establish the HA link over separate L2 switches and test it.
HTH
Rasika
*** Pls rate all useful responses ***
08-06-2023 04:01 AM
But @Rasika Nayanajith I think his point was that a failed RP connection (while that switch upgrade happens) is no different from a completely failed primary WLC (e.g. lost power), so the HA should still work, right? Or are you thinking (and maybe the Cisco devs did) that the RP port must go hard down? Even assuming that is the case, why does clearing the ARP cache cure the problem?
08-08-2023 02:43 AM
Thanks guys for the info.
From the info I gathered, the timeline was something like this:
What I don't get is why the previously Active member went into active-recovery, where it shut its ports. That should only happen if RP connectivity is lost.
I tried to replicate this gracefully (without reloading the distro switch) just by shutting down the data ports and RP towards the standby WLC.
And this worked just fine: once the ports were unshut, the standby took the Active role, sent a GARP for the WMI to the distro switch, and the previously Active WLC rebooted and rejoined the cluster as standby.
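For reference, the switch-side part of that graceful test amounts to something like this (interface names are illustrative, not from my actual setup):

```
! Shut the data port-channel and the RP link towards one WLC
interface range Port-channel1 , TenGigabitEthernet1/0/1
 shutdown
!
! ...observe the switchover, then bring the links back:
interface range Port-channel1 , TenGigabitEthernet1/0/1
 no shutdown
```

The key difference from the real outage is that the inter-switch crosslink stays up the whole time.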
1. The WLCs are running the TAC-recommended 17.9.3 release.
2. When I checked the IP ARP table on the WLC, it was using the virtual MAC of the HSRP group, and the table looked exactly the same before and after the clear (the default gateway IP had the same HSRP virtual MAC). I think the problem was that the WLC that became Active did not send the GARP to the switch, and maybe I just forced it to re-ARP when I cleared the entry from the CLI. Or maybe it did send the GARP, but it never arrived at the switch because it took some time to bring up the port-channel after the distro switch reboot? Hard to say.
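For anyone wanting to check the same thing, these are the kinds of commands involved (the IP address is an example, not from my network):

```
! On the 9800 (IOS-XE): check the ARP entry for the default gateway
show ip arp 10.0.10.1

! On the distro switches: confirm HSRP state and the virtual MAC
! (HSRPv1 uses 0000.0c07.acXX, where XX is the group number)
show standby brief

! Force the WLC to re-ARP for the gateway if the entry looks stale
clear arp-cache
```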
08-09-2023 04:30 AM
Yes, I agree with your analysis. I think it's basically a race condition which might be aggravated by running that RP connection through the switches. It would be interesting to know whether you'd still see it with a direct p2p RP cable.
08-09-2023 04:45 AM - edited 08-09-2023 04:46 AM
IMO it should not happen with a direct back-to-back RP connection between the WLCs, because if one of the upstream distro switches gets rebooted or shut down, the WLCs will still be able to communicate over the RP.
So it shouldn't come to the situation it did before, where the RP is also severed, the secondary WLC takes the Active role (since it no longer sees the other WLC via the RP) and sends a GARP that cannot reach the HSRP active distro switch (due to the crosslink coming up later than the data ports).
If the WLCs are connected back to back for the RP, then only the following scenarios can occur, and it should never reach the active-recovery state.
So in conclusion: do not run the L2 RP connection through the same switches you use as the L3 HSRP gateway. Either use a back-to-back connection or a separate L2 switch.
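For completeness, the back-to-back pairing looks roughly like this on a 9800 appliance. Treat this as a sketch only: the exact syntax varies by release, and all addresses here are examples, so verify against the HA SSO configuration guide for your version:

```
! RP: cable the dedicated RP ports directly between the two 9800-40s
chassis redundancy ha-interface local-ip 169.254.10.1 255.255.255.0 remote-ip 169.254.10.2

! RMI for gateway reachability checks (one address per chassis)
redun-management interface Vlan10 chassis 1 address 10.0.10.5 chassis 2 address 10.0.10.6
```

With the RP directly cabled, a distro switch reload can no longer take the RP down at the same time as the data ports.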
Anyway, thanks a lot @Rich R
08-08-2023 02:58 AM - edited 08-08-2023 03:21 AM
BTW, I can see the crosslink between the switches came up a few seconds after the data ports/RP to WLC2.
That probably caused the active-recovery state, since the RP connection (from the WLCs' point of view) was down while the crosslink was still down.
WLC2 took the Active role and sent the GARP to the Distro2 switch (before the crosslink was up). The HSRP active was still on Distro1, so it never received the GARP from WLC2 and still had the WMI IP mapped to the physical address of WLC1 (which in the meantime went into active-recovery and shut its ports, because the RP was down due to the crosslink being down).
Even when I gracefully tested the failover by shutting the RP/data ports to the WLC, failover worked perfectly fine, because I didn't touch the crosslink between the distro switches, so the GARP could be exchanged between them. So I think the problem mainly occurs when the crosslink comes up later than the data ports to the WLC.
Anyway, if I stick to the recommended back-to-back connectivity between the WLC RPs I should be able to avoid this, because then the WLC cluster will not go into active-recovery during a distro switch reboot/shutdown/failure.
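If the RP has to stay on the distro switches for a while, one partial mitigation (my own suggestion, not something from this thread) is to stop the rebooted switch from reclaiming the HSRP Active role until its crosslink and uplinks have settled, using a preempt delay. Group number, SVI and timer below are illustrative:

```
interface Vlan10
 standby 10 ip 10.0.10.1
 standby 10 preempt delay minimum 180
```

This keeps HSRP Active on the surviving switch during the race window, so the WLC's GARP lands on the switch that actually owns the virtual MAC.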
08-09-2023 04:32 AM
Yes, that pretty much agrees with what I suggested in my previous answer.
So the moral of the story is: stick with the back-to-back RP connection, because the behaviour can be unpredictable and undesirable without it...