Re: ASA Dual-ISP Backup VPN Failover Flapping

Jesse Shumaker · 01-22-2018

I have a scenario where, when the primary ISP (the ISP with the tracked route) fails, the backup ISP takes over and the remote L2L tunnels begin flapping between the primary ISP and the backup ISP. This only occurs on failover to the backup ISP. If I fail back to the primary ISP, the tunnel flapping stops.

crypto map VPN_Salas interface Qwest_Backup
crypto map VPN_Salas interface Cox_Primary
crypto isakmp identity address
crypto ikev1 enable Qwest_Backup
crypto ikev1 enable Cox_Primary <<<< If I remove this entry the Backup ISP tunnels establish and the flapping stops.

This must be code related, because we upgraded from a 5505 to a 5506 and started seeing these issues. I'm running 9.2. What's confusing is that I thought that when a route track goes down and the route is pulled from the routing table, all associations to that interface name are removed from service until restoration. So I don't know why the ASA thinks the "Cox_Primary" connection is still in service to initiate tunnel requests.

Ajay Saini · 01-22-2018

Hello,

Can you please share your SLA monitoring configuration? It is possible that SLA monitoring is flapping at the time and causing the links to switch between the two ISPs.

Also, syslogs from that period would help show whether SLA monitoring is working as expected.
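For anyone following along, the state being asked about here can usually be collected with the commands below. This is only a sketch: the SLA ID 100 and the `track 1` keyword both appear in the configuration posted further down the thread, but the assumption that track object 1 is bound to SLA 100 is not confirmed anywhere in the thread.

```
! Show the configured SLA operation and track object
show running-config sla monitor
show running-config track
! Show whether the probe is currently succeeding or timing out
show sla monitor operational-state 100
! Show the track object's reachability state (track ID 1 assumed)
show track 1
! Confirm which default route is actually installed
show route
```

Comparing the timestamps of track state changes against the tunnel teardown syslogs should show whether the two events line up.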

Regards,

AJ

Jesse Shumaker · 01-23-2018

When the failover is active, I have physically shut down the port to the Cox_Primary interface, so the routing table looks like this, with no entries for Cox_Primary. This is what surprised me about the VPN tunnel flapping being attempted through the Cox_Primary interface:

Gateway of last resort is [qwest_gateway_per_ip] to network 0.0.0.0

S* 0.0.0.0 0.0.0.0 [254/0] via [qwest_gateway_per_ip], Qwest_Backup
C [qwest_ip_subnet] 255.255.255.248 is directly connected, Qwest_Backup
L [qwest_cer_ip] 255.255.255.255 is directly connected, Qwest_Backup
C 169.254.1.0 255.255.255.252 is directly connected, nlp_int_tap
L 169.254.1.1 255.255.255.255 is directly connected, nlp_int_tap
C 192.168.1.0 255.255.255.0 is directly connected, LANInterface
L 192.168.1.1 255.255.255.255 is directly connected, LANInterface

route Cox_Primary 0.0.0.0 0.0.0.0 [cox_gateway_per_ip] 1 track 1
route Qwest_Backup 0.0.0.0 0.0.0.0 [qwest_gateway_per_ip] 254

sla monitor 100
type echo protocol ipIcmpEcho 8.8.8.8 interface Cox_Primary
num-packets 3
frequency 10
sla monitor schedule 100 life forever start-time now
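The tracked route above refers to `track 1`, but the matching track object is not shown in the post. On an ASA it would typically look like the line below. Treat this as an assumption about the poster's config, since only the `sla monitor 100` block was shared.

```
! Assumed, not shown in the original post:
! bind track object 1 to SLA operation 100 so the primary default route
! is withdrawn when the echo probe to 8.8.8.8 fails
track 1 rtr 100 reachability
```

Without a binding like this, `track 1` on the route statement would have nothing to follow, so it is worth confirming with `show running-config track`.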

mvsheik123 · 01-23-2018

Hi,

What is the configuration on the other end? If the other end has two crypto map statements/configs with 'Cox_Primary' at the higher priority (lower number), it will try to initiate the tunnel to the primary IP. If that is the case, as a test, remove the related crypto map config when you admin down the primary ISP interface.
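A minimal sketch of that test, assuming it is run on the flapping site and using the crypto map and interface names from the first post. Which end the suggestion refers to is not entirely clear from the wording, so this is one possible reading rather than the intended procedure.

```
! While Cox_Primary is administratively down (test only):
! detach the crypto map and IKEv1 from the primary interface
no crypto map VPN_Salas interface Cox_Primary
no crypto ikev1 enable Cox_Primary

! After Cox_Primary is restored: put both bindings back
crypto map VPN_Salas interface Cox_Primary
crypto ikev1 enable Cox_Primary
```

The first post already reports that removing `crypto ikev1 enable Cox_Primary` alone stops the flapping, so this mainly checks whether the crypto map binding matters as well.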

hth

MS

Jesse Shumaker · 01-23-2018

Here is one of the other sites. There are 5 other sites which have L2L tunnels to this flapping site.

crypto ipsec ikev1 transform-set ESP-3DES-SHA esp-3des esp-sha-hmac
crypto ipsec security-association pmtu-aging infinite
crypto map VPN 1 match address Central_North_VPN_Traffic
crypto map VPN 1 set peer [Cox_Primary_IP] [Qwest_Backup_IP]
crypto map VPN 1 set ikev1 transform-set ESP-3DES-SHA
crypto map VPN interface primary
crypto map VPN interface secondary
crypto ca trustpool policy
crypto isakmp identity address
crypto ikev1 enable primary
crypto ikev1 enable secondary
crypto ikev1 am-disable
crypto ikev1 policy 1
 authentication pre-share
 encryption 3des
 hash sha
 group 2
 lifetime 28800
!
tunnel-group [Cox_Primary_IP] type ipsec-l2l
tunnel-group [Cox_Primary_IP] ipsec-attributes
 ikev1 pre-shared-key *****
!
tunnel-group [Qwest_Backup_IP] type ipsec-l2l
tunnel-group [Qwest_Backup_IP] ipsec-attributes
 ikev1 pre-shared-key *****

"If that is the case, as test remove related crypto map config when you admin down primary ISP interface."

This is a good idea, and you can see how I have the config set up. How would the tunnel request to the [Cox_Primary_IP] ever get serviced if that IP isn't reachable, since I've admin-downed the interface this IP belongs to? What I'm finding is the remote sites actually receive a teardown

Here are the IP SLA stats during failover tonight:

Entry number: 100
Modification time: 16:16:36.033 ARIZONA Sun Jan 21 2018
Number of Octets Used by this Entry: 2056
Number of operations attempted: 16731
Number of operations skipped: 0
Current seconds left in Life: Forever
Operational state of entry: Active
Last time this entry was reset: Never
Connection loss occurred: FALSE
Timeout occurred: TRUE
Over thresholds occurred: FALSE
Latest RTT (milliseconds): NoConnection/Busy/Timeout
Latest operation start time: 14:44:46.035 ARIZONA Tue Jan 23 2018
Latest operation return code: Timeout
RTT Values:
RTTAvg: 0       RTTMin: 0       RTTMax: 0
NumOfRTT: 0     RTTSum: 0       RTTSum2: 0

Here is the data showing that the [Cox_Primary_IP] is still initiating a request. Somehow traffic is being generated from this IP even though the interface is down. As you can see, the site with the VPN config above is receiving a teardown request from this [Cox_Primary_IP].

Phase 2 completes to the Qwest_Backup_IP:

Jan 23 19:48:30 [IKEv1]Group = Qwest_Backup_IP, IP = Qwest_Backup_IP, PHASE 2 COMPLETED (msgid=efe4e93d)

We then see a new phase 1 begin for [Cox_Primary_IP], followed by MM_WAIT_MSG2 timeouts, which indicate an unreachable peer.

Jan 23 19:48:58 [IKEv1]Group = Qwest_Backup_IP, IP = Qwest_Backup_IP, Connection terminated for peer Qwest_Backup_IP. Reason: Peer Terminate Remote Proxy 192.168.1.0, Local Proxy 192.168.2.0
Jan 23 19:48:58 [IKEv1 DEBUG]Group = Qwest_Backup_IP, IP = Qwest_Backup_IP, Active unit receives a delete event for remote peer Qwest_Backup_IP
Jan 23 19:48:58 [IKEv1 DEBUG]Group = Qwest_Backup_IP, IP = Qwest_Backup_IP, IKE Deleting SA: Remote Proxy 192.168.1.0, Local Proxy 192.168.2.0
Jan 23 19:49:06 [IKEv1]IP = Cox_Primary_IP, IKE Initiator: New Phase 1, Intf inside, IKE Peer Cox_Primary_IP local Proxy Address 192.168.2.0, remote Proxy Address 192.168.1.0, Crypto map (VPN)
Jan 23 19:49:38 [IKEv1 DEBUG]IP = Cox_Primary_IP, IKE MM Initiator FSM error history (struct &0x00002aaac232a9f0) <state>, <event>: MM_DONE, EV_ERROR-->MM_WAIT_MSG2, EV_RETRY-->MM_WAIT_MSG2, EV_TIMEOUT-->MM_WAIT_MSG2, NullEvent-->MM_SND_MSG1, EV_SND_MSG-->MM_SND_MSG1, EV_START_TMR-->MM_SND_MSG1, EV_RESEND_MSG-->MM_WAIT_MSG2, EV_RETRY

This cycle of completing phase 2 to the Qwest_Backup_IP and then tearing it down continues until the Cox_Primary_IP is restored.

Jesse Shumaker · 01-23-2018

Very similar to this flapping site:

crypto ipsec ikev1 transform-set ESP-3DES-SHA esp-3des esp-sha-hmac
crypto ipsec security-association pmtu-aging infinite
crypto map VPN 1 match address Central_North_VPN_Traffic
crypto map VPN 1 set peer Cox_Primary_IP Qwest_Backup_IP
crypto map VPN 1 set ikev1 transform-set ESP-3DES-SHA
crypto map VPN interface primary
crypto map VPN interface secondary
crypto ca trustpool policy
crypto isakmp identity address
crypto ikev1 enable primary
crypto ikev1 enable secondary
crypto ikev1 am-disable
crypto ikev1 policy 1
 authentication pre-share
 encryption 3des
 hash sha
 group 2
 lifetime 28800

What's strange is that phase 2 completes to the Qwest_Backup_IP and then there's a teardown request from the Cox_Primary_IP.

Jan 23 19:48:30 [IKEv1]Group = Qwest_Backup_IP, IP = Qwest_Backup_IP, PHASE 2 COMPLETED (msgid=efe4e93d)
Jan 23 19:48:58 [IKEv1]Group = Qwest_Backup_IP, IP = Qwest_Backup_IP, Session is being torn down. Reason: User Requested 
Jan 23 19:48:58 [IKEv1]Ignoring msg to mark SA with dsID 1073152 dead because SA deleted 
Jan 23 19:49:06 [IKEv1 DEBUG]Pitcher: received a key acquire message, spi 0x0 
Jan 23 19:49:06 [IKEv1]IP = Cox_Primary_IP, IKE Initiator: New Phase 1, Intf inside, IKE Peer Cox_Primary_IP local Proxy Address 192.168.2.0, remote Proxy Address 192.168.1.0, Crypto map (VPN)

This cycle of phase 2 completions and teardowns continues until I restore service to the Cox_Primary_IP.

mvsheik123 · 01-25-2018

Hi,

Make sure no phase 1/phase 2 session got stuck when you bring down the primary. Use the 'clear crypto' commands to clear any such connections (it is hard to see that happening, but based on your issue, make sure that's not the case).

Test by enabling 'isakmp keepalive threshold retry' under the tunnel-group commands.
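As a rough sketch of what clearing stuck sessions could look like on a remote site, using the bracketed peer placeholders from the config posted earlier in the thread. The exact keywords accepted after `clear crypto ikev1 sa` can vary by ASA release, so check the command's inline help before relying on the per-peer form.

```
! Clear any IPsec and IKEv1 SAs still associated with the primary peer address
clear crypto ipsec sa peer [Cox_Primary_IP]
clear crypto ikev1 sa [Cox_Primary_IP]

! Confirm what remains and whether the backup peer's SAs are stable
show crypto isakmp sa
show crypto ipsec sa peer [Qwest_Backup_IP]
```

If SAs toward the primary address reappear right after being cleared, that points at something actively re-initiating rather than at a stale session.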

hth

MS

Jesse Shumaker · 01-25-2018

The other tunnels do clear out. I continuously run the command below and see them all clear and begin establishing on the Qwest_Backup_IP. I believe this is occurring because somehow that Cox_Primary_IP is sending traffic. Could the Qwest_Backup_IP tunnels be encapsulating traffic with an inner IP of the Cox_Primary_IP? This is a long stretch; I'm just out of ideas. The default tunnel group has this keepalive enabled. Is that enough?

sh cryp isa sa

tunnel-group DefaultRAGroup ipsec-attributes
 isakmp keepalive threshold 10 retry 2

mvsheik123 · 01-28-2018

Hi,

Try using this under the primary tunnel group.

isakmp keepalive threshold 10 retry 2
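In context, on a remote site with the tunnel groups posted earlier in the thread, that would be entered roughly as follows. Applying it to the backup peer's group as well is an assumption here, not something stated in the suggestion.

```
! Keepalives on the L2L tunnel group for the primary peer
tunnel-group [Cox_Primary_IP] ipsec-attributes
 isakmp keepalive threshold 10 retry 2
!
! Assumed addition: the same setting on the backup peer's group
tunnel-group [Qwest_Backup_IP] ipsec-attributes
 isakmp keepalive threshold 10 retry 2
```

Note that the `DefaultRAGroup` shown in the previous post is the remote-access default group, so a keepalive set there would not normally apply to these peer-specific L2L groups.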

Apart from the VPN behaviour, how is the internet access via the secondary provider? Do you notice any discrepancy? You noticed this only after upgrading the ASA to a 5506 (if I remember the model correctly). If so, did you check for any bugs in the code you are running?

Thx

MS

Jesse Shumaker · 01-29-2018

OK, should this be at all sites including this problem site, or just the problem site?

isakmp keepalive threshold 10 retry 2

"Apart from the VPN behaviour, how is the internet access via the secondary provider? Do you notice any discrepancy? You noticed this only after upgrading the ASA to a 5506 (if I remember the model correctly). If so, did you check for any bugs in the code you are running?"

The internet is working great via the 2nd provider, no issues. Yes, just after upgrading to the 5506. I have the latest code on:

Release 9.6.2 Interim