Re: AWS to ASR vpn FLAPPING : CSCvu57698 - ENH The ablity to adjust the rekey margin time for IPSEC SPIs

issofunky · 08-19-2020

After fighting FLAP AFTER FLAP and calling TAC / AWS Support for WEEKS (months),

I located the true root cause of the flapping between the AWS transit gateway and the ASR.

AWS introduced a concept into VPN design that they label "Rekey Fuzz (percentage)".

They DEFAULT the fuzz to 100 (%).

The ASR's "margin" for phase 2 rekey is 85% of the full lifetime (RFC 6407),

and at the end of the negotiations Cisco WAITS to use the NEW SPI until after the FULL LIFETIME EXPIRES,

whereas at the end of the negotiations AWS REFUSES TO WAIT and begins using the NEW SPI immediately.

This triggers an invalid SPI and a DELETE SPI response (from the Cisco toward AWS).

The fuzz of 100% results in the rekey starting as early as 480 seconds or as LATE as 0 seconds before expiry (guaranteed flap event).

I've tried NO fuzz, and that seems to help (fixed AWS margin of 240 seconds).

I've found that with anything less than 180 seconds, flapping resumes and becomes worse.

At more than 360 seconds you will still see flaps: when the INVALID SPI is triggered, it often WON'T RETRY the rekey, so Phase 2 will fail and STAY FAILED for 59 more minutes. Whoops.

Try 240 seconds with NO fuzz, and set the Cisco end to 1 hour. For me, this was the most stable configuration I've been able to establish.
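
For anyone who wants to sanity-check the numbers above, here is a minimal Python sketch of the rekey window. It models the fuzz only as I have described it in this post (the rekey start varies around the margin by up to fuzz percent of the margin); that model is an assumption on my part and AWS's actual algorithm may differ. The function name `rekey_window` is made up for illustration.

```python
from typing import Tuple


def rekey_window(margin_s: int, fuzz_pct: int) -> Tuple[int, int]:
    """Return (earliest, latest) seconds before SA expiry at which rekey may start.

    Assumption: the start time varies around the margin by up to
    fuzz_pct percent of the margin, as described in the post.
    This is not taken from AWS documentation.
    """
    if margin_s < 0 or not 0 <= fuzz_pct <= 100:
        raise ValueError("margin must be >= 0 and fuzz must be between 0 and 100")
    spread = margin_s * fuzz_pct // 100
    return margin_s + spread, margin_s - spread


# Compare the default fuzz, a reduced fuzz, and no fuzz for a 240 s margin.
for fuzz in (100, 20, 0):
    earliest, latest = rekey_window(240, fuzz)
    print(f"margin=240s fuzz={fuzz}% -> rekey starts {earliest}s to {latest}s before expiry")
```

With fuzz at 0 the window collapses to a single fixed point, which is the whole reason the no-fuzz setting is easier to line up against the Cisco side.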

WOULD LOVE to hear comments on this.

Especially interested to hear from avid ISC2 / RFC / IPsec standard enthusiasts.

CSCvu57698

issofunky · 09-17-2020

Patching to version 16.9.5 resolved this issue.

NOTE: there was another bug affecting our situation: the SPI was keyed with a value that led with xff____________ and Cisco's phase 2 rejects an SPI that leads with hex ff.

The patch to 16.9.5 resolved the xff problem and, with it, most of our flapping.

The OTHER portion of the flapping was resolved by:

a. Lowering AWS's "FUZZ" setting on the phase 2 rekey to 20% or less.

b. Using their NEW feature for DPD timeout action, "Restart": this feature is documented as IKEv2-only, but it does appear to fully resolve issues with DPD timers in IKEv1 as well.

c. DISABLING all the default check boxes for ALL the higher encryption options and DH options, and enabling ONLY what you have statically set on the Cisco side.

AWS will try the highest first, then slowly time out and try again with the next lower option until it finds a match. That is as it should be, but it puts the onus on us to reset their VPN option defaults to match only what we really want.

After this: 12 days and not a single flap since the IOS update to 16.9.5 on an ASR 1001-HX.

Daniel Anderson · 01-05-2021

Hi there

Apologies for resurrecting quite an old thread, but I'm fairly sure we're hitting the same issues as you. We currently have some ASR-1002HX routers that are peering with AWS: Direct Connect, and then VPNs over the top. We're seeing drops on some connections every 8 hours or so, in line with the Phase 1 rekey interval.

From a router perspective, are you able to advise which global settings you had in place for IPsec Phase 1 and 2? We're running 16.9.6, which seems a relatively stable version of code, but we are still hitting the issue. The main thing I've noticed is that on the connections where we're seeing impact, our router is the source of the Phase 1 connectivity; on all the other VPN connections, AWS is the source and they appear to be stable. The main log messages are:

%CRYPTO-4-IKMP_NO_SA: IKE message from *AWSIP* has no SA and is not an initialization offer

We're currently troubleshooting this with our Cisco partner and AWS. AWS have mentioned the fuzzy rekey, but I'm less inclined to make changes on the AWS VPN side, as we have so many VPNs and the majority of them are stable.

issofunky · 01-05-2021

You've got to focus on these items to get to resolution.

STRONG WARNING!!! When changing the FUZZ and REKEY, it got WORSE before I found the setting that got it perfect.
SMALL and CAREFULLY PLANNED ADJUSTMENTS TO RE-KEY ARE ESSENTIAL!!!!

#1. In AWS, open EACH problem tunnel's "MODIFY VPN TUNNEL OPTIONS" section and disable EVERYTHING EXCEPT the Phase 1 / Phase 2 options you have STATICALLY CONFIGURED on the Cisco side.
THIS IS THE KEY TO SOLVING IT: AWS starts at the HIGHEST possible everything and defaults to ALL ON. So the timeouts need to fail 12 different options for encryption standard before arriving at your configured AES/SHA. You can believe the tunnel WILL falter at every rekey.
#2. Set rekey fuzz to 0 and margin to 240. This ASSUMES your Cisco is rekeying phase 2 every 1 hour and phase 1 every 8 hours.
STRONG WARNING!!! When changing the FUZZ and REKEY, it got WORSE before I found the setting that got it perfect.
SMALL and CAREFULLY PLANNED ADJUSTMENTS TO RE-KEY ARE ESSENTIAL!!!!

#3. Check the settings at the bottom for "DPD" timeout actions. Even though their documentation says EXPLICITLY that this is only for IKEv2, I've found that it helped on the IKEv1 tunnels I'm using (unsure if that's just a placebo effect).
DPD Timeout Action: Clear / Restart / None
Startup Action: Add / Start
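
If you have many tunnels and would rather script steps #1 to #3 than click through the console, the sketch below builds the options as a Python dict. The key names follow the EC2 ModifyVpnTunnelOptions API as I understand it, so please check them against the current AWS reference before use. The helper name `build_tunnel_options` and the example proposal (AES256 / SHA2-256 / DH group 14) are hypothetical; use whatever is statically configured on your Cisco side.

```python
import json
from typing import Any, Dict, List


def build_tunnel_options(
    encryption: str,
    integrity: str,
    dh_group: int,
    ike_version: str = "ikev1",
) -> Dict[str, Any]:
    """Build a tunnel-options dict that pins ONE proposal and fixes the rekey timing.

    Key names are assumed to match the EC2 ModifyVpnTunnelOptions API.
    """

    def only(value: Any) -> List[Dict[str, Any]]:
        # A single-entry list means no other option is offered.
        return [{"Value": value}]

    return {
        "IKEVersions": only(ike_version),
        "Phase1EncryptionAlgorithms": only(encryption),
        "Phase2EncryptionAlgorithms": only(encryption),
        "Phase1IntegrityAlgorithms": only(integrity),
        "Phase2IntegrityAlgorithms": only(integrity),
        "Phase1DHGroupNumbers": only(dh_group),
        "Phase2DHGroupNumbers": only(dh_group),
        "Phase1LifetimeSeconds": 8 * 3600,  # phase 1 every 8 hours, as in #2
        "Phase2LifetimeSeconds": 3600,      # phase 2 every 1 hour, as in #2
        "RekeyMarginTimeSeconds": 240,
        "RekeyFuzzPercentage": 0,
        "DPDTimeoutAction": "restart",
    }


# Hypothetical proposal values, for illustration only.
options = build_tunnel_options("AES256", "SHA2-256", 14)
print(json.dumps(options, indent=2))

# Applying it would look roughly like this (IDs are placeholders, not real):
#   boto3.client("ec2").modify_vpn_tunnel_options(
#       VpnConnectionId="vpn-EXAMPLE",
#       VpnTunnelOutsideIpAddress="203.0.113.10",
#       TunnelOptions=options,
#   )
```

Building the dict separately from the API call makes it easy to review the exact change per tunnel and apply it to ONE tunnel at a time, in line with the warning above.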

On the Cisco side, ensure:
set security-association lifetime kilobytes disable
crypto ipsec security-association replay disable
crypto isakmp invalid-spi-recovery

And, as best practice, define a unique profile for each tunnel, so you can adjust ONE and not all at once.
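
As a rough illustration of the one-profile-per-tunnel idea, the Python sketch below renders the three lines above into a config snippet with a separate IPsec profile per tunnel. The profile names (AWS-TUN1, AWS-TUN2) are hypothetical, placing the lifetime lines under `crypto ipsec profile` is my assumption about a tunnel-interface (VTI) setup, and the transform set and tunnel interface binding are deliberately left out.

```python
from typing import Iterable, List

# Global commands from the post; they are not per-tunnel.
GLOBAL_LINES = [
    "crypto ipsec security-association replay disable",
    "crypto isakmp invalid-spi-recovery",
]


def render_profiles(names: Iterable[str], lifetime_s: int = 3600) -> str:
    """Render one IPsec profile per tunnel so each can be tuned independently."""
    lines: List[str] = list(GLOBAL_LINES)
    for name in names:
        lines += [
            "!",
            f"crypto ipsec profile {name}",
            f" set security-association lifetime seconds {lifetime_s}",
            " set security-association lifetime kilobytes disable",
        ]
    return "\n".join(lines)


# Hypothetical profile names, for illustration only.
print(render_profiles(["AWS-TUN1", "AWS-TUN2"]))
```

Review the generated text against your running config before pasting anything; it is only meant to show the layout.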

Get to this code version:
Cisco IOS XE Software, Version 16.09.05
Cisco IOS Software [Fuji], ASR1000 Software (X86_64_LINUX_IOSD-UNIVERSALK9-M), Version 16.9.5, RELEASE SOFTWARE (fc1)
System image file is "bootflash:/asr1000-universalk9.16.09.05.SPA.bin"

THIS GROUP of actions ENDED my problem, but again:

STRONG WARNING!!! When changing the FUZZ and REKEY, it got WORSE before I found the setting that got it perfect.
SMALL and CAREFULLY PLANNED ADJUSTMENTS TO RE-KEY ARE ESSENTIAL!!!!

Best of luck on planning and maintaining your windows/roll back plan.
