
HA Failover when RP Disconnected

JON SHORTEN
Level 1

We have a customer with a pair of 5520 WLCs in HA SSO; these are connected using the redundancy ports, & have redundancy management interfaces configured on a shared VLAN.

 

My expectation was that, as long as there was IP connectivity between the redundancy management interfaces, disconnecting the RP should stop SSO but not cause failover. When we tested this, however, disconnecting the RP caused a split brain condition with both WLCs active.

 

Has anyone else experienced this behaviour? As we only have a single RP per device, I'm struggling to see how we can get HA working in this scenario.

 

Software is 8.8.125.0

 

thanks

11 Replies

This thread may help you

https://community.cisco.com/t5/wireless-and-mobility/cisco-wlc-8540-ha-sso-maintenance-mode/td-p/2930128

 

HTH

Rasika

*** Pls rate all useful responses ***

Hi

It seems to me that the behavior you are describing is the expected behavior. WLCs in HA SSO use the RP connection to send keepalives to make sure both can see each other. If you interrupt this, they will go split brain.

Which behavior do you expect to see when disconnecting the RP?

 

-If I helped you somehow, please, rate it as useful.-

What I expect to see (& have seen for other customers) is that when the RP is disconnected the HA process on the secondary checks for connectivity over the redundancy-management interface & stays in standby if this connectivity exists; see the quote below from the SSO config guide:

"Redundancy Management Interface

The IP address on this interface should be configured in the same subnet as the management interface. This interface will check the health of the Active WLC via network infrastructure once the Active WLC does not respond to Keepalive messages on the Redundant Port. This provides an additional health check of the network and Active WLC, and confirms if switchover should or should not be executed."

 

A redundancy solution which depends on a single connection would be broken by design. I expect to lose SSO when the RP fails, due to lack of synchronisation, but not to end up with a split brain.

 

What am I missing?

Leo Laohoo
Hall of Fame

@JON SHORTEN wrote:

but when we tested this disconnecting the RP causes a split brain condition with both WLCs active.


The only way for two WLCs to "see" each other is through the Redundancy Port (RP).

When you disconnect the RP, the secondary immediately goes active. The primary will immediately "think" the secondary has failed.

This is normal behaviour.

Can you then please explain the following from the SSO config guide:

Redundancy Management Interface

The IP address on this interface should be configured in the same subnet as the management interface. This interface will check the health of the Active WLC via network infrastructure once the Active WLC does not respond to Keepalive messages on the Redundant Port. This provides an additional health check of the network and Active WLC, and confirms if switchover should or should not be executed.


@JON SHORTEN wrote:

This interface will check the health of the Active WLC via network infrastructure once the Active WLC does not respond to Keepalive messages on the Redundant Port.


This line explains it all. 

The interface in question is the RP.  

The RP is like routing protocols:  They exchange "Hello" packets.  If I don't receive the "Hello" packet, then my peer is down. 

@Leo Laohoo I think you're getting confused by the interface names. The Redundancy Management Interface isn't the same thing as the Redundancy Port; see the definitions from the SSO config guide below:

 

Redundancy Management Interface

The IP address on this interface should be configured in the same subnet as the management interface. This interface will check the health of the Active WLC via network infrastructure once the Active WLC does not respond to Keepalive messages on the Redundant Port. This provides an additional health check of the network and Active WLC, and confirms if switchover should or should not be executed. Also, the Standby WLC uses this interface in order to source ICMP ping packets to check gateway reachability. This interface is also used in order to send notifications from the Active WLC to the Standby WLC in the event of Box failure or Manual Reset. The Standby WLC will use this interface in order to communicate to Syslog, the NTP server, and the TFTP server for any configuration upload.

 

Redundancy Port

This interface has a very important role in the new HA architecture. Bulk configuration during boot up and incremental configuration are synced from the Active WLC to the Standby WLC using the Redundant Port. WLCs in a HA setup will use this port to perform HA role negotiation. The Redundancy Port is also used in order to check peer reachability sending UDP keep-alive messages every 100 msec (default timer) from the Standby WLC to the Active WLC. Also, in the event of a box failure, the Active WLC will send notification to the Standby WLC via the Redundant Port. If the NTP server is not configured, a manual time sync is performed from the Active WLC to the Standby WLC on the Redundant Port. This port in case of standalone controller will be assigned an auto generated IP Address where last 2 octets are picked from the last 2 octets of Redundancy Management Interface (the first 2 octets are always 169.254).

 

Note the screenshot showing these as 2 different interfaces. 
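
The last sentence of the Redundancy Port definition describes how the auto-generated RP address is built: the first two octets are always 169.254 and the last two are copied from the Redundancy Management Interface. As a rough illustration of that rule only, here is a minimal Python sketch. The function name and the example address 10.10.20.31 are hypothetical, not taken from the guide or from any controller.

```python
import ipaddress


def rp_address_from_rmi(rmi: str) -> ipaddress.IPv4Address:
    """Derive the auto-generated Redundancy Port address from the RMI address.

    Rule quoted from the SSO config guide: first two octets are always
    169.254, last two octets are taken from the RMI address.
    """
    octets = ipaddress.IPv4Address(rmi).packed  # 4 raw bytes of the RMI address
    return ipaddress.IPv4Address(bytes([169, 254, octets[2], octets[3]]))


# Hypothetical RMI address, for illustration only.
print(rp_address_from_rmi("10.10.20.31"))  # 169.254.20.31
```

The two addresses sit in different subnets on different interfaces, which is consistent with the guide treating the RP and the Redundancy Management Interface as separate things.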

 

 

I'll stick with the view that without a working RP link, the WLCs will go split brain as normal behavior. The redundancy management interface, which is a logical link, can be seen as a double check to avoid unnecessary split brain, but by no means will HA stay alive if the RP link is broken. This is true for WLCs and any other system I know that uses HA. Without this physical connection, there will be no HA.

 

 

-If I helped you somehow, please, rate it as useful.-

So what do you think the point of a redundant solution is if a single link failure can cause a complete outage?

 

The table below (from the SSO deployment guide) shows what should happen by design; I'm trying to find out why this particular customer is seeing different behaviour.

 

Network Issues

| RP Port Status | Peer Reachable via Redundant Management | Gateway Reachable from Active | Gateway Reachable from Standby | Switchover | Results |
|---|---|---|---|---|---|
| Up | Yes | Yes | Yes | No | No Action |
| Up | Yes | Yes | No | No | Standby will reboot and check for gateway reachability. Will go into maintenance mode if still not reachable. |
| Up | Yes | No | Yes | Yes | Switchover happens |
| Up | Yes | No | No | No | No Action |
| Up | No | Yes | Yes | No | No Action |
| Up | No | Yes | No | No | Standby will reboot and check for gateway reachability. Will go into maintenance mode if still not reachable. |
| Up | No | No | Yes | Yes | Switchover happens |
| Up | No | No | No | No | No Action |
| Down | Yes | Yes | Yes | No | Standby will reboot and check for gateway reachability. Will go into maintenance mode if still not reachable. |
| Down | Yes | Yes | No | No | Standby will reboot and check for gateway reachability. Will go into maintenance mode if still not reachable. |
| Down | Yes | No | Yes | No | Standby will reboot and check for gateway reachability. Will go into maintenance mode if still not reachable. |
| Down | Yes | No | No | No | Standby will reboot and check for gateway reachability. Will go into maintenance mode if still not reachable. |
| Down | No | Yes | Yes | Yes | Switchover happens and this may result in Network Conflict |
| Down | No | Yes | No | No | Standby will reboot and check for gateway reachability. Will go into maintenance mode if still not reachable. |
| Down | No | No | Yes | Yes | Switchover happens |
| Down | No | No | No | No | Standby will reboot and check for gateway reachability. Will go into maintenance mode if still not reachable. |

 

Check the 9th line, which shows the scenario in question: the standby should reboot and go to maintenance mode, without a switchover.
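
To make the table easier to query, here is a minimal sketch that encodes its 16 rows as a Python lookup. It only transcribes the table as posted; it is not how the controller implements the decision, and the names `Outcome`, `FAILOVER_TABLE` and `expected_outcome` are hypothetical.

```python
from typing import NamedTuple


class Outcome(NamedTuple):
    switchover: bool
    result: str


NO_ACTION = "No Action"
REBOOT = ("Standby will reboot and check for gateway reachability. "
          "Will go into maintenance mode if still not reachable.")
SWITCH = "Switchover happens"
CONFLICT = "Switchover happens and this may result in Network Conflict"

# Key: (rp_up, peer_reachable_via_rmi, gw_from_active, gw_from_standby)
# Rows are copied in order from the deployment guide table quoted above.
FAILOVER_TABLE = {
    (True, True, True, True): Outcome(False, NO_ACTION),
    (True, True, True, False): Outcome(False, REBOOT),
    (True, True, False, True): Outcome(True, SWITCH),
    (True, True, False, False): Outcome(False, NO_ACTION),
    (True, False, True, True): Outcome(False, NO_ACTION),
    (True, False, True, False): Outcome(False, REBOOT),
    (True, False, False, True): Outcome(True, SWITCH),
    (True, False, False, False): Outcome(False, NO_ACTION),
    (False, True, True, True): Outcome(False, REBOOT),    # 9th line: RP down only
    (False, True, True, False): Outcome(False, REBOOT),
    (False, True, False, True): Outcome(False, REBOOT),
    (False, True, False, False): Outcome(False, REBOOT),
    (False, False, True, True): Outcome(True, CONFLICT),  # both units can go active
    (False, False, True, False): Outcome(False, REBOOT),
    (False, False, False, True): Outcome(True, SWITCH),
    (False, False, False, False): Outcome(False, REBOOT),
}


def expected_outcome(rp_up: bool, peer_reachable_via_rmi: bool,
                     gw_from_active: bool, gw_from_standby: bool) -> Outcome:
    """Return the documented behaviour for one combination of conditions."""
    return FAILOVER_TABLE[(rp_up, peer_reachable_via_rmi,
                           gw_from_active, gw_from_standby)]


# RP disconnected, everything else healthy (the test described in this thread).
print(expected_outcome(False, True, True, True).switchover)   # False
# RP disconnected AND peer unreachable via redundancy management.
print(expected_outcome(False, False, True, True).result)
```

According to the table, the only combination that can leave both controllers active is the 13th row: RP down, peer unreachable over redundancy management, and the gateway reachable from both units.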

 

I say again, I've done this many times without seeing split brain when the RP fails, just not with this controller / code combo.


@JON SHORTEN wrote:

think you're getting confused by the interface names, the Redundancy Management interface isn't the same thing as the Redundancy Port,


Spelling-wise, RP and Redundancy Management are different.  Function-wise, they are the same.  Redundancy Port is a physical port.  Redundancy Management is a management port (think IP address).  

Look at the IP address of both.

We can debate all year long about this.  

HA SSO has the same "mechanics" as VSS:  There is a link that joins the two chassis together, and this link does nothing but send and receive "Hello" packets.  Take out that link and both units will go active simultaneously.  

JON SHORTEN
Level 1

Replying to my own post to confirm that HA failover works as detailed in the config guide.

 

The issue I initially posted was due to weird behaviour from the gateway (a clustered Juniper firewall), which caused both WLCs to think they had a reachable gateway for redundancy-management when there was no connectivity between them. (The firewalls went split brain, which caused the WLCs to do the same.)

 

With correct gateway behavior, removing the RP between HA WLCs does NOT cause split brain. Cisco are better at designing HA than that.
