Solved: Re: 9300 FTD Cluster S2S VPN Failover

darren-oconnor · ‎09-17-2024

Looking for some clarity, please, around the failover of an S2S VPN terminated on an FTD 9300 cluster.

The FMC configuration guide (Cisco Secure Firewall Management Center Device Configuration Guide, 7.2, "Clustering for the Firepower 4100/9300") states:

"VPN functionality is limited to the control unit and does not take advantage of the cluster high availability capabilities. If the control unit fails, all existing VPN connections are lost, and VPN users will see a disruption in service. When a new control unit is elected, you must reestablish the VPN connections."

So my question is: what happens in the real world when the control node fails? The other node in the cluster should become the active control node after 3 keepalives are lost. So shouldn't the S2S VPN come up on the new active control node, with an expected outage of around XY seconds?

Thank you.

Sheraz.Salim · ‎09-22-2024

Based on the information provided in the Cisco Secure Firewall Management Center Device Configuration Guide, here's what happens when the control unit fails in a Firepower 4100/9300 cluster with site-to-site VPN:

VPN functionality is limited to the control unit only and does not take advantage of cluster high availability capabilities.
When the control unit fails:
- All existing VPN connections are lost
- VPN users will experience a disruption in service
A new control unit is elected after the failure is detected. This typically occurs after 3 missed keepalives, which can take several seconds.
However, the VPN connections do not automatically re-establish on the new control unit. The guide explicitly states that "you must reestablish the VPN connections" when a new control unit is elected.
This means that while the cluster itself may recover relatively quickly (within seconds to a minute), the VPN connections require manual intervention to be restored.

The expected outage duration for the VPN connections would be longer than just the time it takes for a new control unit to be elected. It would include:

- Time for failure detection (3 missed keepalives)
- Time for new control unit election
- Time for manual intervention to reestablish VPN connections
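
As a rough sketch of how these three components add up, the Python snippet below simply sums them. Only the count of 3 missed keepalives comes from this thread; the keepalive interval, election time, and re-establishment time are illustrative assumptions (not values from the Cisco guide), and the class and field names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class OutageEstimate:
    """Back-of-the-envelope VPN outage estimate after a control unit failure."""
    keepalive_interval_s: float  # ASSUMED interval between cluster keepalives
    missed_keepalives: int       # 3, as mentioned in this thread
    election_s: float            # ASSUMED time to elect a new control unit
    reestablish_s: float         # ASSUMED time until the tunnels are brought back up

    def detection_s(self) -> float:
        # Failure is detected once the given number of keepalives is missed.
        return self.keepalive_interval_s * self.missed_keepalives

    def total_s(self) -> float:
        # Total outage = detection + election + re-establishment.
        return self.detection_s() + self.election_s + self.reestablish_s


# Placeholder numbers only; substitute values measured in your own environment.
estimate = OutageEstimate(
    keepalive_interval_s=1.0,
    missed_keepalives=3,
    election_s=5.0,
    reestablish_s=60.0,
)
print(f"Detection: {estimate.detection_s():.1f} s, total: {estimate.total_s():.1f} s")
```

With these placeholder inputs the re-establishment term dominates, which is the point of the breakdown: the cluster may recover quickly while the tunnel stays down.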

In the "real world" scenario you described, even though a new control node becomes active quickly, the S2S VPN does not automatically come up on the new active control node. This is a limitation of how VPN functionality is implemented in the clustering feature for these devices. To minimize downtime in production environments, you would likely need to have procedures in place for quickly reestablishing VPN connections after a control unit failure, or consider alternative designs that don't rely solely on clustering for VPN high availability.
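
If you want to put a number on the outage during a planned failover test, one option is to probe a host behind the remote peer at a fixed interval and work out the longest gap afterwards. The helper below is a minimal sketch of that calculation only; the function name and the `(timestamp, reachable)` sample format are assumptions for illustration, and how you collect the probes (ping, TCP connect, etc.) is up to you.

```python
from typing import Iterable, Optional, Tuple


def longest_outage_s(samples: Iterable[Tuple[float, bool]]) -> float:
    """Return the longest outage, in seconds, from time-ordered probe samples.

    Each sample is (timestamp_seconds, reachable). An outage runs from the
    first failed probe to the next successful one. An outage still in
    progress at the end of the samples is not counted.
    """
    longest = 0.0
    outage_start: Optional[float] = None
    for timestamp, reachable in samples:
        if not reachable and outage_start is None:
            outage_start = timestamp          # outage begins
        elif reachable and outage_start is not None:
            longest = max(longest, timestamp - outage_start)
            outage_start = None               # outage ends
    return longest


# Hypothetical probe results taken once per second during a failover test.
probe_samples = [(0.0, True), (1.0, False), (2.0, False), (3.0, False), (4.0, True), (5.0, True)]
print(f"Longest outage: {longest_outage_s(probe_samples):.1f} s")
```

The resolution of the result is limited by the probe interval, so a shorter interval gives a tighter estimate.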

Please do not forget to rate.

MHM Cisco World · ‎09-20-2024

I think the doc is not considering the use of IPsec keepalives; that is why it says you need to reestablish the VPN connection.

MHM

darren-oconnor · ‎09-23-2024

Thank you @Sheraz.Salim. This also aligns with a discussion I had with an SE.