08-04-2023 01:40 AM - edited 08-04-2023 01:42 AM
Hey Guys,
I'm running a pair of 9800-40 WLCs in HA SSO mode (with RMI) that are directly connected to the upstream L3 switches that handle the default gateway for the WMI subnet.
Due to the geographical location of the server room, I had to connect the RP to the same switches that handle the default gateway and span the RP's L2 VLAN to the other server room.
I noticed that in a failure scenario where the HSRP Active distribution switch goes down (during an IOS upgrade) and HSRP fails over to the secondary distro switch (severing all connections to the previously Active WLC, both RP and data ports), the secondary WLC should take over the Active role, and indeed it does. However, it lost IP connectivity to the default gateway. I was not able to ping the default gateway at all until I cleared the ARP entry for it. From then on, L3 connectivity was restored and the WLC was reachable again via the WMI.
Is this expected behavior? Could it be caused by the fact that I'm running the L2 RP connection through the same switches that handle the L3 gateway for the WLCs?
From my perspective it should not matter, because if one of the distro switches fails, it should basically be the same scenario as if the whole controller shut down (all links down, RP and data ports).
08-04-2023 04:22 AM
Hi @mqontt
I can't confirm that the design is the problem, but if we check the Cisco doc, your design matches the third one. I would try to follow the Cisco-recommended design and test. After all, if you open a TAC case today to fix this, they will probably request that you change it.
08-05-2023 07:02 AM - edited 08-09-2023 07:17 AM
I agree that it should work.
Although it's not 100% compliant with the Cisco design guide, I don't think that's the problem, because the issue is with the default gateway, not the RP connection.
1. What version of software are you using? (refer to the TAC-recommended releases) There have been a lot of fixes and enhancements to HA in successive releases.
2. Are you using the burned-in MAC for HSRP, so that the default gateway MAC needs to change on switchover? If the MAC doesn't change (i.e. you use the HSRP virtual MAC) there's no reason why it should be a problem. If you are using the hardware MAC you'll need to set up EEM to clear the ARP on switchover. HSRP should send a gratuitous ARP on switchover, but any device which misses it will not update its cache. What does the ARP cache look like before and after clearing?
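If you do end up needing that EEM workaround, a minimal sketch would be something like the applet below (applet name and action numbering are my own; verify the syslog pattern against what your platform actually logs):

```
event manager applet HSRP-CLEAR-ARP
 event syslog pattern "%HSRP-5-STATECHANGE.*-> Active"
 action 1.0 cli command "enable"
 action 2.0 cli command "clear arp-cache"
```

This fires whenever an HSRP group on the box transitions to Active and flushes the ARP cache so neighbours re-learn the gateway MAC.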
08-05-2023 01:31 PM
"Due to the geographical location of the server room I had to connect the RP to the same switches that handle the default gateway and span the L2 VLAN for the RP to the other server room."
I think your RP connection also goes through the same HSRP switches, so when the HSRP active goes down, your HA link goes down too. That is a problem. I would establish the HA link over separate L2 switches and test it.
HTH
Rasika
*** Pls rate all useful responses ***
08-06-2023 04:01 AM
But @Rasika Nayanajith I think his point was that a failed RP connection (while that switch upgrade happens) is no different from a completely failed primary WLC (e.g. lost power), so the HA should still work, right? Or are you thinking (and maybe the Cisco devs did) that the RP port must go hard down? Even assuming that is the case, why does clearing the ARP cache cure the problem?
08-08-2023 02:43 AM
Thanks guys for the info.
From the info I gathered, the timeline was something like this:
What I don't get is why the previously Active member went into active-recovery, where it shut its ports. That should only happen if RP connectivity is lost.
I tried to replicate this gracefully (without reloading the distro switch) just by shutting down the data ports and RP towards the standby WLC.
And this worked just fine: once the ports were unshut, the standby took the Active role, sent a GARP for the WMI to the distro switch, and the previously Active WLC rebooted and rejoined the cluster as standby.
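For reference, the switch-side part of that graceful test amounts to something like this (interface names are illustrative, not from my actual setup):

```
! Shut the data port-channel and the RP link towards one WLC
interface range Port-channel1 , TenGigabitEthernet1/0/1
 shutdown
!
! ...observe the switchover, then bring the links back:
interface range Port-channel1 , TenGigabitEthernet1/0/1
 no shutdown
```

The key difference from the real outage is that the inter-switch crosslink stays up the whole time.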
1. The WLCs are running the TAC-recommended 17.9.3 release.
2. When I checked the IP ARP table on the WLC, it was using the virtual MAC of the HSRP group, and the table looked exactly the same before and after the clear (the default gateway IP had the same HSRP virtual MAC). I think the problem was that the WLC that became Active did not send the GARP to the switch, and maybe I just forced it to re-ARP when I cleared the entry from the CLI. Or maybe it did send the GARP, but it never arrived at the switch because it took some time to bring up the port-channel after the distro switch reboot? Hard to say.
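For anyone wanting to check the same thing, these are the kinds of commands involved (the IP address is an example, not from my network):

```
! On the 9800 (IOS-XE): check the ARP entry for the default gateway
show ip arp 10.0.10.1

! On the distro switches: confirm HSRP state and the virtual MAC
! (HSRPv1 uses 0000.0c07.acXX, where XX is the group number)
show standby brief

! Force the WLC to re-ARP for the gateway if the entry looks stale
clear arp-cache
```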
08-09-2023 04:30 AM
Yes, I agree with your analysis. I think it's basically a race condition which might be aggravated by running that RP connection through the switches. It would be interesting to know whether you'd still see it with a direct p2p RP cable.
08-09-2023 04:45 AM - edited 08-09-2023 04:46 AM
IMO it should not happen with a direct back-to-back RP connection between the WLCs, because if one of the upstream distro switches gets rebooted or shut down, the WLCs will still be able to communicate over the RP.
So it shouldn't come to the situation it did before, where the RP is also severed, the secondary WLC takes the Active role (since it no longer sees the other WLC via the RP) and sends a GARP that cannot reach the HSRP active distro switch (due to the crosslink coming up later than the data ports).
If the WLCs are connected back to back for the RP, then only the following scenarios can occur, and it should never reach the active-recovery state.
So in conclusion: do not run the L2 RP connection through the same switches you use as the L3 HSRP gateway. Either use a back-to-back connection or a separate L2 switch.
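For completeness, the back-to-back pairing looks roughly like this on a 9800 appliance. Treat this as a sketch only: the exact syntax varies by release, and all addresses here are examples, so verify against the HA SSO configuration guide for your version:

```
! RP: cable the dedicated RP ports directly between the two 9800-40s
chassis redundancy ha-interface local-ip 169.254.10.1 255.255.255.0 remote-ip 169.254.10.2

! RMI for gateway reachability checks (one address per chassis)
redun-management interface Vlan10 chassis 1 address 10.0.10.5 chassis 2 address 10.0.10.6
```

With the RP directly cabled, a distro switch reload can no longer take the RP down at the same time as the data ports.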
Anyway, thanks a lot @Rich R
08-08-2023 02:58 AM - edited 08-08-2023 03:21 AM
BTW, I can see the crosslink between the switches came up a few seconds after the data ports/RP to WLC2.
That probably caused the active-recovery state, since the RP connection (from the WLCs' point of view) was down while the crosslink was still down.
WLC2 took the Active role and sent the GARP to the Distro2 switch (before the crosslink was up). The HSRP active was still on Distro1, so it never received the GARP from WLC2 and still had the WMI IP mapped to the physical address of WLC1 (which in the meantime went into active-recovery and shut its ports, because the RP was down due to the crosslink being down).
Even when I gracefully tested the failover by shutting the RP/data ports to the WLC, failover worked perfectly fine, because I didn't touch the crosslink between the distro switches, so the GARP could be exchanged between them. So I think the problem mainly occurs when the crosslink comes up later than the data ports to the WLC.
Anyway, if I stick to the recommended back-to-back connectivity between the WLC RPs I should be able to avoid this, because then the WLC cluster will not go into active-recovery during a distro switch reboot/shutdown/failure.
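If the RP has to stay on the distro switches for a while, one partial mitigation (my own suggestion, not something from this thread) is to stop the rebooted switch from reclaiming the HSRP Active role until its crosslink and uplinks have settled, using a preempt delay. Group number, SVI and timer below are illustrative:

```
interface Vlan10
 standby 10 ip 10.0.10.1
 standby 10 preempt delay minimum 180
```

This keeps HSRP Active on the surviving switch during the race window, so the WLC's GARP lands on the switch that actually owns the virtual MAC.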
08-09-2023 04:32 AM
Yes, that pretty much agrees with what I suggested in my previous answer.
So the moral of the story is: stick with the back-to-back RP connection, because the behaviour can be unpredictable and undesirable without it...