After fighting flap after flap and calling TAC / AWS Support for weeks (months),
I located the true root cause of the flapping between the AWS transit gateway and the ASR.
AWS introduced a concept into VPN design that they label "Rekey Fuzz (percentage)".
They default the fuzz to 100 (%).
The ASR's "margin" for the phase 2 rekey is 85% of the full lifetime ( RFC 6407 ),
and at the end of the negotiation Cisco waits to use the new SPI until after the full lifetime expires.
AWS, at the end of the negotiation, refuses to wait and begins using the new SPI immediately.
That triggers an invalid SPI and a DELETE SPI response (from the Cisco toward AWS).
A fuzz of 100% results in the rekey starting as early as 480 seconds or as late as 0 seconds before expiry (guaranteed flap event).
I've tried no fuzz, and that seems to help (fixed AWS margin of 240 seconds).
I've found that with anything less than 180 seconds, flapping resumes and becomes worse.
Above 360 seconds you will see flaps when the invalid SPI is triggered, and it often won't retry the rekey, so phase 2 will fail and stay failed for 59 more minutes. Whoops.
Try 240 seconds with no fuzz, and set the Cisco end to 1 hour. For me, this was the most stable setup I've been able to establish.
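To make the arithmetic above concrete, here is a minimal Python sketch of the rekey window. It assumes the reading used in this post, namely that the fuzz widens the margin symmetrically (margin plus or minus fuzz%), which is what gives 0 to 480 seconds for a 240 second margin at 100% fuzz. AWS may implement the randomization differently, and the 180/360 second limits are only what was observed on this one setup. The function names are made up for illustration.

```python
def rekey_window(margin_s: int, fuzz_pct: int) -> tuple[int, int]:
    """Return (latest, earliest) rekey start, in seconds before the SA expires.

    ASSUMPTION: fuzz widens the margin symmetrically (margin +/- fuzz%),
    matching the 0..480 s figures quoted in the post. Not taken from AWS docs.
    """
    latest = margin_s * (100 - fuzz_pct) // 100
    earliest = margin_s * (100 + fuzz_pct) // 100
    return latest, earliest


def within_observed_band(margin_s: int, fuzz_pct: int,
                         low_s: int = 180, high_s: int = 360) -> bool:
    """True if the whole window stays inside the 180-360 s range.

    The 180/360 limits are the thresholds reported in the post for one
    ASR / AWS pairing; treat them as anecdotal, not as a specification.
    """
    latest, earliest = rekey_window(margin_s, fuzz_pct)
    return low_s <= latest and earliest <= high_s


if __name__ == "__main__":
    # Compare the AWS default fuzz with a reduced fuzz and with no fuzz.
    for fuzz in (100, 20, 0):
        print(fuzz, rekey_window(240, fuzz), within_observed_band(240, fuzz))
```

Under that assumption, a 240 second margin only stays inside the observed band when the fuzz is 25% or lower.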
Would love to hear comments on this.
Especially interested to hear from avid ISC2 / RFC / IPsec standard enthusiasts.
Patching to version 16.9.5 resolved this issue.
NOTE: there was another bug affecting our situation: the SPI was keyed with a value that led with xff____________, and Cisco's phase 2 rejects an SPI that leads with hex ff.
The 16.9.5 patch resolved the xff bug and, with it, most of our flapping.
The other portion of the flapping was resolved by:
a. Lowering AWS's "fuzz" feature on the phase 2 rekey to 20% or less.
b. Using their new feature for the DPD timeout action, "Restart". This feature is claimed to apply only to IKEv2, but it does appear to fully resolve issues with DPD timers in IKEv1 as well.
c. Disabling all the default check boxes for all the higher encryption options and DH options, and enabling only what you have statically set on the Cisco side.
AWS will try the highest first, slowly time out, and try again with the next lower option until it finds a match. That is as it should be, but it puts the onus on us to reset their VPN option defaults to match only what we really want.
After this: 12 days and not a single flap since the IOS update to 16.9.5
on ASR 1001 HX.
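For anyone scripting items a to c rather than clicking through the console, here is a rough Python sketch of the tunnel options as they might be passed to boto3's `modify_vpn_tunnel_options`. The algorithm, DH group and IKE version values are placeholders only (this post does not state the actual proposal); swap in exactly what your Cisco side is statically configured for. The helper names and the example IDs are hypothetical.

```python
def build_tunnel_options(fuzz_pct: int = 20) -> dict:
    """Build a TunnelOptions dict reflecting items a, b and c above.

    PLACEHOLDERS: AES256 / SHA2-256 / DH group 14 / ikev1 are illustrative
    assumptions, not values from the post. Match them to the router config.
    """
    if not 0 <= fuzz_pct <= 100:
        raise ValueError("fuzz_pct must be between 0 and 100")
    return {
        # a. lower the phase 2 rekey fuzz from the default of 100
        "RekeyFuzzPercentage": fuzz_pct,
        # b. DPD timeout action "Restart"
        "DPDTimeoutAction": "restart",
        # c. offer only one proposal instead of every default option
        "IKEVersions": [{"Value": "ikev1"}],
        "Phase1EncryptionAlgorithms": [{"Value": "AES256"}],
        "Phase2EncryptionAlgorithms": [{"Value": "AES256"}],
        "Phase1IntegrityAlgorithms": [{"Value": "SHA2-256"}],
        "Phase2IntegrityAlgorithms": [{"Value": "SHA2-256"}],
        "Phase1DHGroupNumbers": [{"Value": 14}],
        "Phase2DHGroupNumbers": [{"Value": 14}],
    }


def apply_tunnel_options(vpn_connection_id: str, tunnel_outside_ip: str,
                         options: dict) -> dict:
    """Push the options to one tunnel. Requires boto3 and AWS credentials."""
    import boto3  # imported here so the builder above runs without boto3

    ec2 = boto3.client("ec2")
    return ec2.modify_vpn_tunnel_options(
        VpnConnectionId=vpn_connection_id,
        VpnTunnelOutsideIpAddress=tunnel_outside_ip,
        TunnelOptions=options,
    )


if __name__ == "__main__":
    # Only prints the dict; call apply_tunnel_options() yourself, per tunnel.
    print(build_tunnel_options())
```

Each VPN connection has two tunnels, so the call would need to be made once per tunnel outside IP address.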
Apologies for resurrecting quite an old thread, but I'm fairly sure we're hitting the same issues as you. We currently have some ASR-1002HX routers that are peering with AWS: Direct Connect, and then VPNs over the top. We're seeing drops on some connections every 8 hours or so, in line with the Phase 1 rekey interval.
From a router perspective, are you able to advise which global settings you had in place for IPsec Phase 1 and 2? We're running 16.9.6, which seems a relatively stable version of code, but we're still hitting the issue. The main thing I've noticed is that on the connections where we're seeing impact, our router is the source of the Phase 1 connectivity; on all the other VPN connections, AWS is the source and they appear to be stable. Main log messages are:
%CRYPTO-4-IKMP_NO_SA: IKE message from AWSIP has no SA and is not an initialization offer
We're currently troubleshooting this with our Cisco partner and AWS. AWS have mentioned the rekey fuzz, but I'm less inclined to make changes on the AWS VPN side, as we have so many VPNs and the majority of them are stable.