We built a Transit VPC environment with Cisco CSR routers; two CSRs are deployed in the Ireland region. I have a production VPC in the same region, connected to the Transit VPC over a VPN connection (dynamic VPN with BGP). We observe that IKE and Phase 2 rekeying sometimes takes more than 20 minutes, which brings the tunnel down for a while and resets the BGP neighborship. When this occurs, both tunnels show as Down in the AWS VPN console. We used the exact configuration downloaded from the Amazon console, and the issue persists for tunnels to the AWS environment. We raised a ticket with AWS, and they told us that the VPN goes down after the VPN crypto timers expire and that Phase 1 and Phase 2 are re-established only after some delay. We need to do something urgently to avoid these blackout minutes, as they impact production. The BGP summary output is shown below:
euiepcmcsr001#sh ip bgp all sum
For address family: VPNv4 Unicast
BGP router identifier 172.31.210.225, local AS number 65000
BGP table version is 37300, main routing table version 37300
113 network entries using 28928 bytes of memory
163 path entries using 19560 bytes of memory
26/10 BGP path/bestpath attribute entries using 6864 bytes of memory
6 BGP AS-PATH entries using 208 bytes of memory
7 BGP extended community entries using 252 bytes of memory
0 BGP route-map cache entries using 0 bytes of memory
0 BGP filter-list cache entries using 0 bytes of memory
BGP using 55812 total bytes of memory
34 received paths for inbound soft reconfiguration
BGP activity 726/604 prefixes, 32531/32355 paths, scan interval 60 secs
Neighbor        V    AS MsgRcvd MsgSent TblVer InQ OutQ Up/Down  State/PfxRcd
10.126.0.2      4 65000 1835964 1837238  37300   0    0 28w6d    5
10.126.1.2      4 65001    1674    1675  37300   0    0 04:24:03 7
169.254.20.129  4  9059    8335    8752  37300   0    0 23:08:48 1
169.254.20.173  4  9059    2907    3058  37300   0    0 08:04:03 1
169.254.20.197  4  9059    8730    9178  37300   0    0 1d00h    1
169.254.22.145  4  9059   13366   14064  37300   0    0 1d13h    1
169.254.22.149  4  9059    6990    7023  37300   0    0 18:33:35 1
169.254.23.17   4  9059   13400   14091  37300   0    0 1d13h    1
169.254.23.97   4  9059   14367   14633  37300   0    0 1d14h    1
169.254.23.209  4  9059    2145    2179  37300   0    0 05:46:01 1
169.254.23.229  4  9059    6080    6108  37300   0    0 16:08:48 1
169.254.23.249  4  9059    5548    5827  37300   0    0 15:24:19 1
As you can see, for most of the connections the AWS neighborship does not last more than a day. Only the first connection is stable; it is a normal GRE tunnel to the other CSR router within the VPC. It looks like the rekey process delays re-forming the neighborship.
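To check the uptime claim against the summary above, here is a minimal Python sketch that converts the Up/Down column to seconds and lists the AS 9059 (AWS) neighbors that have been up for less than 24 hours. The function names and the parser are illustrative assumptions; the parser only handles the three time formats that appear in this output (`HH:MM:SS`, `NdNNh`, `NwNd`).

```python
import re

# Data rows copied from the "sh ip bgp all sum" output above.
SUMMARY = """\
10.126.0.2 4 65000 1835964 1837238 37300 0 0 28w6d 5
10.126.1.2 4 65001 1674 1675 37300 0 0 04:24:03 7
169.254.20.129 4 9059 8335 8752 37300 0 0 23:08:48 1
169.254.20.173 4 9059 2907 3058 37300 0 0 08:04:03 1
169.254.20.197 4 9059 8730 9178 37300 0 0 1d00h 1
169.254.22.145 4 9059 13366 14064 37300 0 0 1d13h 1
169.254.22.149 4 9059 6990 7023 37300 0 0 18:33:35 1
169.254.23.17 4 9059 13400 14091 37300 0 0 1d13h 1
169.254.23.97 4 9059 14367 14633 37300 0 0 1d14h 1
169.254.23.209 4 9059 2145 2179 37300 0 0 05:46:01 1
169.254.23.229 4 9059 6080 6108 37300 0 0 16:08:48 1
169.254.23.249 4 9059 5548 5827 37300 0 0 15:24:19 1
"""


def uptime_seconds(text):
    """Convert an Up/Down value ('04:24:03', '1d13h', '28w6d') to seconds."""
    m = re.fullmatch(r"(\d+):(\d{2}):(\d{2})", text)
    if m:
        hours, minutes, seconds = map(int, m.groups())
        return hours * 3600 + minutes * 60 + seconds
    m = re.fullmatch(r"(\d+)d(\d+)h", text)
    if m:
        days, hours = map(int, m.groups())
        return days * 86400 + hours * 3600
    m = re.fullmatch(r"(\d+)w(\d+)d", text)
    if m:
        weeks, days = map(int, m.groups())
        return weeks * 604800 + days * 86400
    raise ValueError(f"unrecognized Up/Down value: {text!r}")


def neighbors_up_less_than(summary, asn, limit_seconds):
    """Return neighbors in the given AS whose session is younger than the limit."""
    recent = []
    for line in summary.splitlines():
        fields = line.split()
        if len(fields) != 10 or fields[2] != asn:
            continue  # skip blank lines and neighbors in other ASes
        if uptime_seconds(fields[8]) < limit_seconds:
            recent.append(fields[0])
    return recent


recent = neighbors_up_less_than(SUMMARY, "9059", 24 * 3600)
print(len(recent), "of 10 AWS neighbors up for less than 24 hours:")
for neighbor in recent:
    print(" ", neighbor)
```

Running the same check periodically would show whether the sessions reset roughly in step with the crypto timers, though that correlation is something to verify rather than assume.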
When the issue occurs, the CSR logs show the following:
*Jun 13 07:09:12.051: %BGP-3-NOTIFICATION: sent to neighbor 169.254.20.129 4/0 (hold time expired) 0 bytes
*Jun 13 07:09:12.051: %BGP-5-NBR_RESET: Neighbor 169.254.20.129 reset (BGP Notification sent)
*Jun 13 07:09:12.052: %BGP-5-ADJCHANGE: neighbor 169.254.20.129 vpn vrf cz-vpc Down BGP Notification sent
*Jun 13 07:09:12.052: %BGP_SESSION-5-ADJCHANGE: neighbor 169.254.20.129 IPv4 Unicast vpn vrf cz-vpc topology base removed from session BGP Notification sent
*Jun 13 07:59:53.063: %BGP-5-ADJCHANGE: neighbor 169.254.20.129 vpn vrf cz-vpc Up
*Jun 13 14:34:45.023: %CRYPTO-6-ISAKMP_MANUAL_DELETE: IKE SA manually deleted. Do 'clear crypto sa peer 126.96.36.199' to manually clear IPSec SA's covered by this IKE SA.
*Jun 13 14:35:38.596: %BGP-3-NOTIFICATION: sent to neighbor 169.254.20.197 4/0 (hold time expired) 0 bytes
*Jun 13 14:35:38.596: %BGP-5-NBR_RESET: Neighbor 169.254.20.197 reset (BGP Notification sent)
*Jun 13 14:35:38.596: %BGP-5-ADJCHANGE: neighbor 169.254.20.197 vpn vrf eu-derd-vpc Down BGP Notification sent
*Jun 13 14:35:38.596: %BGP_SESSION-5-ADJCHANGE: neighbor 169.254.20.197 IPv4 Unicast vpn vrf eu-derd-vpc topology base removed from session BGP Notification sent
AWS technical support reported the following from their logs:
VGW IP: "188.8.131.52"
2017-06-13T14:35:04 UTC Phase1 expired
2017-06-13T14:35:26 UTC Phase2 expired
2017-06-13T14:35:36 UTC New Phase1 came up
2017-06-13T14:48:36 UTC New phase2 came up SPI 0x37e16be8
VGW IP: "184.108.40.206"
2017-06-13T14:31:54 UTC Phase1 expired
2017-06-13T14:32:19 UTC Phase2 expired
2017-06-13T14:32:29 UTC New Phase1 came up
2017-06-13T14:58:36 UTC New Phase2 came up SPI 0xa4bdc670
There is a delay between Phase 1 and Phase 2 coming up.
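The size of that gap can be worked out directly from the AWS timestamps quoted above. The following Python sketch is only illustrative arithmetic; the labels "first VGW" and "second VGW" and the function name are assumptions used to tell the two log excerpts apart.

```python
from datetime import datetime

# (new Phase 1 up, new Phase 2 up) per VGW, copied from the AWS log lines above.
EVENTS = {
    "first VGW": ("2017-06-13T14:35:36", "2017-06-13T14:48:36"),
    "second VGW": ("2017-06-13T14:32:29", "2017-06-13T14:58:36"),
}

TIMESTAMP_FORMAT = "%Y-%m-%dT%H:%M:%S"


def phase2_delay(phase1_up, phase2_up):
    """Return the time between the new Phase 1 and the new Phase 2 coming up."""
    start = datetime.strptime(phase1_up, TIMESTAMP_FORMAT)
    end = datetime.strptime(phase2_up, TIMESTAMP_FORMAT)
    return end - start


for name, (phase1_up, phase2_up) in EVENTS.items():
    print(name, phase2_delay(phase1_up, phase2_up))
# prints:
# first VGW 0:13:00
# second VGW 0:26:07
```

So Phase 1 is re-established within seconds of expiry on both tunnels, but the new Phase 2 SA takes 13 minutes on one tunnel and just over 26 minutes on the other.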
No one has logged in and manually cleared any SAs on any of the routers. Please help us find the root cause of this issue.
To mitigate this issue, allow incoming Phase 1 traffic into the CSR from the AWS VPN device. That will resolve the issue, because the AWS VPN endpoint now initiates the Phase 1 rekey a few minutes before the 8-hour expiry, instead of the Cisco router.
We are also looking into this bug and will address it with a bug fix and a new image to be published on AWS.
What version does this affect? I am experiencing the same issue with the version information below.
Cisco IOS XE Software, Version 16.05.01b
Cisco IOS Software [Everest], Virtual XE Software (X86_64_LINUX_IOSD-UNIVERSALK9-M), Version 16.5.1b, RELEASE SOFTWARE (fc1)
Technical Support: http://www.cisco.com/techsupport
Copyright (c) 1986-2017 by Cisco Systems, Inc.
Compiled Tue 11-Apr-17 16:41 by mcpre
Cisco IOS-XE software, Copyright (c) 2005-2017 by cisco Systems, Inc.
All rights reserved. Certain components of Cisco IOS-XE software are
licensed under the GNU General Public License ("GPL") Version 2.0. The
software code licensed under GPL Version 2.0 is free software that comes
with ABSOLUTELY NO WARRANTY. You can redistribute and/or modify such
GPL code under the terms of GPL Version 2.0. For more details, see the
documentation or "License Notice" file accompanying the IOS-XE software,
or the applicable URL provided on the flyer accompanying the IOS-XE
ROM: IOS-XE ROMMON
uptime is 2 weeks, 3 days, 23 hours, 0 minutes
Uptime for this control processor is 2 weeks, 3 days, 23 hours, 2 minutes
System returned to ROM by reload
System image file is "bootflash:csr1000v-universalk9.16.05.01b.SPA.bin"
Cisco IOS XE Software, Version 03.16.04a.S - Extended Support Release
Cisco IOS Software, CSR1000V Software (X86_64_LINUX_IOSD-UNIVERSALK9-M), Version 15.5(3)S4a, RELEASE SOFTWARE (fc1)
An update on this issue: this is not a bug. It is due to differing IPsec implementations between the CSR and the AWS VGW (Virtual Private Gateway). It only applies to IKEv1, and we are working with AWS to see how to enable IKEv2 to resolve it.
In the meantime, you can open UDP 500 and UDP 4500 in the SG (Security Group) that is applied to the CSR.
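As a rough sketch of that workaround, the Python below builds inbound rules for UDP 500 (IKE) and UDP 4500 (NAT traversal) in the structure that boto3's `authorize_security_group_ingress` call accepts. The helper names, the security group ID, and the peer address `203.0.113.10/32` are hypothetical placeholders; substitute the outside IP addresses of your own AWS VPN tunnels and the SG attached to the CSR.

```python
def ike_ingress_permissions(peer_cidrs):
    """Build ingress rules for UDP 500 (IKE) and UDP 4500 (NAT-T) from each peer."""
    permissions = []
    for port in (500, 4500):
        permissions.append(
            {
                "IpProtocol": "udp",
                "FromPort": port,
                "ToPort": port,
                "IpRanges": [
                    {"CidrIp": cidr, "Description": "AWS VPN tunnel endpoint"}
                    for cidr in peer_cidrs
                ],
            }
        )
    return permissions


def apply_rules(group_id, peer_cidrs):
    """Apply the rules to a security group (requires boto3 and AWS credentials)."""
    import boto3  # third-party; imported here so the rest runs without it

    ec2 = boto3.client("ec2")
    ec2.authorize_security_group_ingress(
        GroupId=group_id,
        IpPermissions=ike_ingress_permissions(peer_cidrs),
    )


# Hypothetical placeholder peer; replace with the real tunnel outside IPs.
rules = ike_ingress_permissions(["203.0.113.10/32"])
for rule in rules:
    print(rule["IpProtocol"], rule["FromPort"], rule["IpRanges"][0]["CidrIp"])
```

Calling `apply_rules("sg-...", [...])` with your own group ID would push the rules. Restricting the source to the tunnel endpoint addresses, rather than opening the ports to any source, keeps the exposure narrow.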
I have pretty much the same issue, but with an ISR4431. I'm getting an error message telling me to manually clear ISAKMP on the IPsec tunnels to AWS:
%CRYPTO-6-ISAKMP_MANUAL_DELETE: IKE SA manually deleted. Do 'clear crypto sa peer
Is this cosmetic? The tunnel interfaces are not going down, and neither are the BGP sessions.
What exactly should I modify in the configuration AWS provides for a Cisco IOS device?