09-05-2019 07:16 AM - edited 07-05-2021 10:57 AM
We have a customer with a pair of 5520 WLCs in HA SSO; these are connected using the redundancy ports, & have redundancy management interfaces configured on a shared VLAN.
My expectation was that, as long as there was IP connectivity between the redundancy management interfaces, disconnecting the RP would stop SSO but not cause a failover. When we tested this, however, disconnecting the RP caused a split brain condition with both WLCs active.
Has anyone else experienced this behaviour? As we only have a single RP per device, I'm struggling to see how we can get HA working in this scenario.
Software is 8.8.125.0
thanks
09-05-2019 12:08 PM
This thread may help you
HTH
Rasika
*** Pls rate all useful responses ***
09-05-2019 12:09 PM
Hi
It seems to me that the behavior you are describing is the expected behavior. A WLC pair in HA SSO uses the RP connection to send keepalives so that each unit knows the other is up. If you interrupt this, they will go split brain.
Which behavior do you expect to see when disconnecting the RP?
-If I helped you somehow, please, rate it as useful.-
09-06-2019 03:21 AM
What I expect to see (& have seen for other customers) is that when the RP is disconnected the HA process on the secondary checks for connectivity over the redundancy-management interface & stays in standby if this connectivity exists; see the quote below from the SSO config guide:
The IP address on this interface should be configured in the same subnet as the management interface. This interface will check the health of the Active WLC via network infrastructure once the Active WLC does not respond to Keepalive messages on the Redundant Port. This provides an additional health check of the network and Active WLC, and confirms if switchover should or should not be executed.
A redundancy solution which depends on a single connection would be broken by design. I expect to lose SSO when the RP fails, due to lack of synchronisation, but not to end up with a split brain.
What am I missing?
09-06-2019 02:03 AM
@JON SHORTEN wrote:
but when we tested this disconnecting the RP causes a split brain condition with both WLCs active.
The only way for two WLCs to "see" each other is through the Redundancy Port (RP).
When you disconnect the RP, the secondary immediately goes active. The primary will immediately "think" the secondary has failed.
This is normal behaviour.
09-06-2019 03:22 AM
Can you then please explain the following from the SSO config guide:
The IP address on this interface should be configured in the same subnet as the management interface. This interface will check the health of the Active WLC via network infrastructure once the Active WLC does not respond to Keepalive messages on the Redundant Port. This provides an additional health check of the network and Active WLC, and confirms if switchover should or should not be executed.
09-06-2019 04:38 AM
@JON SHORTEN wrote:
This interface will check the health of the Active WLC via network infrastructure once the Active WLC does not respond to Keepalive messages on the Redundant Port.
This line explains it all.
The interface in question is the RP.
The RP is like routing protocols: They exchange "Hello" packets. If I don't receive the "Hello" packet, then my peer is down.
09-06-2019 05:50 AM
@Leo Laohoo I think you're getting confused by the interface names. The Redundancy Management interface isn't the same thing as the Redundancy Port; see the definitions from the SSO config guide below:
Redundancy Management Interface
The IP address on this interface should be configured in the same subnet as the management interface. This interface will check the health of the Active WLC via network infrastructure once the Active WLC does not respond to Keepalive messages on the Redundant Port. This provides an additional health check of the network and Active WLC, and confirms if switchover should or should not be executed. Also, the Standby WLC uses this interface in order to source ICMP ping packets to check gateway reachability. This interface is also used in order to send notifications from the Active WLC to the Standby WLC in the event of Box failure or Manual Reset. The Standby WLC will use this interface in order to communicate to Syslog, the NTP server, and the TFTP server for any configuration upload.
Redundancy Port
This interface has a very important role in the new HA architecture. Bulk configuration during boot up and incremental configuration are synced from the Active WLC to the Standby WLC using the Redundant Port. WLCs in a HA setup will use this port to perform HA role negotiation. The Redundancy Port is also used in order to check peer reachability sending UDP keep-alive messages every 100 msec (default timer) from the Standby WLC to the Active WLC. Also, in the event of a box failure, the Active WLC will send notification to the Standby WLC via the Redundant Port. If the NTP server is not configured, a manual time sync is performed from the Active WLC to the Standby WLC on the Redundant Port. This port in case of standalone controller will be assigned an auto generated IP Address where last 2 octets are picked from the last 2 octets of Redundancy Management Interface (the first 2 octets are always 169.254).
Note the screenshot showing these as 2 different interfaces.
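To make the addressing rule in the Redundancy Port definition concrete, here is a minimal Python sketch of how the auto-generated RP address relates to the Redundancy Management Interface address: the first two octets are 169.254 and the last two are copied from the RMI. The function name and the sample address are hypothetical and for illustration only; this is not Cisco code.

```python
import ipaddress


def rp_address_from_rmi(rmi_ip: str) -> str:
    """Derive the auto-generated RP address from an RMI address.

    Follows the rule quoted from the config guide: the first two octets
    are always 169.254, the last two are taken from the RMI address.
    """
    # .packed gives the four octets as bytes and validates the input.
    octets = ipaddress.IPv4Address(rmi_ip).packed
    rp = ipaddress.IPv4Address(bytes([169, 254, octets[2], octets[3]]))
    return str(rp)


# Hypothetical RMI address, chosen only as an example.
print(rp_address_from_rmi("10.10.20.5"))  # prints 169.254.20.5
```

The different address ranges are one quick way to tell the two interfaces apart in the controller's output.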
09-06-2019 06:36 AM
I'll stick with the fact that without a working RP link, the WLCs will split brain as normal behavior. The redundancy management interface, which is a logical link, can be seen as a double check to avoid an unnecessary split brain, but by no means will HA stay alive if the RP link is broken. This is true for the WLC and any other HA system I know of. Without this physical connection, there will be no HA.
09-06-2019 07:11 AM
So what do you think the point of a redundant solution is if a single link failure can cause a complete outage?
The table below (from the SSO deployment guide) shows what should happen by design; I'm trying to find out why this particular customer is seeing different behaviour.
Network Issues
RP Port Status | Peer Reachable via Redundant Management | Gateway Reachable from Active | Gateway Reachable from Standby | Switchover | Results |
Up | Yes | Yes | Yes | No | No Action |
Up | Yes | Yes | No | No | Standby will reboot and check for gateway reachability. Will go into maintenance mode if still not reachable. |
Up | Yes | No | Yes | Yes | Switchover happens |
Up | Yes | No | No | No | No Action |
Up | No | Yes | Yes | No | No Action |
Up | No | Yes | No | No | Standby will reboot and check for gateway reachability. Will go into maintenance mode if still not reachable. |
Up | No | No | Yes | Yes | Switchover happens |
Up | No | No | No | No | No Action |
Down | Yes | Yes | Yes | No | Standby will reboot and check for gateway reachability. Will go into maintenance mode if still not reachable. |
Down | Yes | Yes | No | No | Standby will reboot and check for gateway reachability. Will go into maintenance mode if still not reachable. |
Down | Yes | No | Yes | No | Standby will reboot and check for gateway reachability. Will go into maintenance mode if still not reachable. |
Down | Yes | No | No | No | Standby will reboot and check for gateway reachability. Will go into maintenance mode if still not reachable. |
Down | No | Yes | Yes | Yes | Switchover happens and this may result in Network Conflict |
Down | No | Yes | No | No | Standby will reboot and check for gateway reachability. Will go into maintenance mode if still not reachable. |
Down | No | No | Yes | Yes | Switchover happens |
Down | No | No | No | No | Standby will reboot and check for gateway reachability. Will go into maintenance mode if still not reachable. |
Check the 9th line, which shows the scenario in question: the standby should reboot to maintenance mode without switchover.
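For anyone who finds the table hard to scan, the sketch below encodes it as a Python lookup so a given combination of conditions can be checked directly. It is only a transcription of the table rows above; the names (`TABLE`, `expected_outcome`, `Outcome`) are hypothetical and it does not represent how the controller software is implemented.

```python
from typing import NamedTuple


class Outcome(NamedTuple):
    switchover: bool
    result: str


NO_ACTION = "No Action"
REBOOT = ("Standby will reboot and check for gateway reachability. "
          "Will go into maintenance mode if still not reachable.")
SWITCHOVER = "Switchover happens"
CONFLICT = "Switchover happens and this may result in Network Conflict"

# Key: (rp_up, peer_reachable_via_rmi, gw_from_active, gw_from_standby)
# Rows are in the same order as the deployment guide table.
TABLE = {
    (True, True, True, True): Outcome(False, NO_ACTION),
    (True, True, True, False): Outcome(False, REBOOT),
    (True, True, False, True): Outcome(True, SWITCHOVER),
    (True, True, False, False): Outcome(False, NO_ACTION),
    (True, False, True, True): Outcome(False, NO_ACTION),
    (True, False, True, False): Outcome(False, REBOOT),
    (True, False, False, True): Outcome(True, SWITCHOVER),
    (True, False, False, False): Outcome(False, NO_ACTION),
    (False, True, True, True): Outcome(False, REBOOT),
    (False, True, True, False): Outcome(False, REBOOT),
    (False, True, False, True): Outcome(False, REBOOT),
    (False, True, False, False): Outcome(False, REBOOT),
    (False, False, True, True): Outcome(True, CONFLICT),
    (False, False, True, False): Outcome(False, REBOOT),
    (False, False, False, True): Outcome(True, SWITCHOVER),
    (False, False, False, False): Outcome(False, REBOOT),
}


def expected_outcome(rp_up: bool, peer_reachable: bool,
                     gw_active: bool, gw_standby: bool) -> Outcome:
    """Look up the documented behaviour for one combination of conditions."""
    return TABLE[(rp_up, peer_reachable, gw_active, gw_standby)]


# 9th row: RP down, peer still reachable via redundancy management.
print(expected_outcome(False, True, True, True))
# 13th row: RP down AND peer unreachable, both gateways reachable.
print(expected_outcome(False, False, True, True))
```

Comparing the 9th and 13th rows shows that, per the table, a switchover with RP down only occurs when the peer is also unreachable over redundancy management.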
I say again, I've done this many times without seeing split brain when the RP fails, just not with this controller / code combo.
09-06-2019 04:15 PM
@JON SHORTEN wrote:
think you're getting confused by the interface names, the Redundancy Management interface isn't the same thing as the Redundancy Port,
Spelling-wise, RP and Redundancy Management are different. Function-wise, they are the same. Redundancy Port is a physical port. Redundancy Management is a management port (think IP address).
Look at the IP address of both.
We can debate all year long about this.
HA SSO has the same "mechanics" as VSS: there is a link that joins the two chassis together, and this link does nothing but send and receive "Hello" packets. Take out that link and both units will go active simultaneously.
09-13-2019 07:37 AM
Replying to my own post to confirm that HA failover works as detailed in the config guide.
The issue I initially posted was due to weird behaviour from the gateway (a clustered Juniper firewall), which caused both WLCs to think they had a reachable gateway for redundancy-management when there was no connectivity between them. (The firewalls went split brain, which caused the WLCs to do the same.)
With correct gateway behavior, removing the RP between HA WLCs does NOT cause split brain. Cisco are better at designing HA than that.