
DMVPN IPsec tunnel bouncing about every hour

Hi Everyone,

I need some help here. We have several spoke sites showing the same behavior: the tunnels bounce every 50-55 minutes. The ISP side looks good; multiple tests have been run and we did not see any issues. I noticed ISAKMP errors in my debug output but still cannot find the culprit. Other sites are configured the same way, but their tunnels are stable.
Here are the notable errors I noticed (my debug is also attached):

Jul 4 16:39:24.174: ISAKMP-ERROR: (1976):My ID configured as IPv4 Addr, but Addr not in Cert!
Jul 4 16:39:24.174: ISAKMP-ERROR: (1976):Using FQDN as My ID

Jul 4 16:44:55.959: ISAKMP-ERROR: (1975):DPD incrementing error counter (5/5)
Jul 4 16:44:55.959: ISAKMP-ERROR: (1975):Peer 203.5.x.x not responding!
Jul 4 16:44:55.959: ISAKMP: (1975):peer does not do paranoid keepalives.
Jul 4 16:44:55.960: ISAKMP: (1975):deleting SA reason "End of ipsec tunnel" state (I) QM_IDLE (peer 203.5.x.x)
Jul 4 16:44:55.960: ISAKMP: (1975):Input = IKE_MESG_FROM_TIMER, IKE_TIMER_PEERS_ALIVE
Jul 4 16:44:55.960: ISAKMP: (1975):Old State = IKE_P1_COMPLETE New State = IKE_P1_COMPLETE
Jul 4 16:44:55.960: ISAKMP-ERROR: (0):Failed to find peer index node to update peer_info_list
Jul 4 16:44:55.962: %LINEPROTO-5-UPDOWN: Line protocol on Interface Tunnel2, changed state to down

Spoke:  ISR4321/K9 16.03.01
Hub: C8500-12X4QC 17.03.02

Config 
Spoke:

crypto isakmp policy 20
encr aes
group 2
crypto isakmp keepalive 65 25
crypto isakmp nat keepalive 20
crypto isakmp profile PROFILE-ISAKMP-4G
vrf INTERNET
ca trust-point dmvpn-sub1-tp
ca trust-point dmvpn-sub2-tp
match identity host domain cmltd.net.au
isakmp authorization list default
!
!
crypto ipsec transform-set AES esp-aes 256 esp-md5-hmac
mode transport require
!
crypto ipsec profile PROFILE-IPSEC-DMVPN-TU1
set transform-set AES
set isakmp-profile PROFILE-ISAKMP-INTERNET
!
crypto ipsec profile PROFILE-IPSEC-DMVPN-TU2
set transform-set AES
set isakmp-profile PROFILE-ISAKMP-INTERNET
!

Hub:

crypto isakmp policy 10
encryption aes
authentication pre-share
group 2
!
crypto isakmp policy 20
encryption aes
group 2
crypto isakmp keepalive 65 25
crypto isakmp profile PROFILE-ISAKMP-DMVPN
vrf INTERNET
keyring KEY-DMVPN
ca trust-point dmvpn-sub1-tp
ca trust-point dmvpn-sub2-tp
match identity host domain cmltd.net.au
!
!
crypto ipsec transform-set AES esp-aes 256 esp-md5-hmac
mode transport require
!
crypto ipsec profile PROFILE-IPSEC-DMVPN-LIQ
set transform-set AES
set isakmp-profile PROFILE-ISAKMP-DMVPN
!
!
!
crypto call admission limit ike in-negotiation-sa 200

Debug attached

 

1 Accepted Solution


I agree, there are issues with DPD propagation between the hub and spoke, which will tear down the tunnel any time the hub fails to receive 5 DPD-ACK messages from the spoke. If it occurs roughly every 50-55 minutes, it may also line up with a rekey failure: with the default one-hour IPsec SA lifetime, which your devices are configured to use, the devices rekey at roughly the 50-minute mark.

I also want to mention that DPD flows initiated from the hub to the spoke are independent of DPD flows initiated from the spoke to the hub. I believe the logs were taken from the spoke, since I can see the device sending the very first phase 1 initiation packet, which is characteristic of spokes in a DMVPN network. The logs show that the router is sending DPD R-U-THERE requests and not getting any DPD-ACK packets back. So running ISAKMP debugs on the hub would be very helpful in telling us whether it receives these DPD messages (or even rekey requests) from the spoke when the failure occurs. For this you would need to enable conditional crypto debugs for the spoke in question, so that only debugs for that spoke are shown.

ON THE HUB:
debug crypto condition peer ipv4 <spoke public IP>
debug crypto isakmp
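
As a rough sketch of the housekeeping around the conditional debug above (not part of the original reply), the commands below check that the condition is in place and clean up afterwards. The order shown is an assumption based on common practice: turn the debugs off first, then reset the condition.

```
! Confirm the peer condition is registered before waiting for the failure
show crypto debug-condition
!
! ... wait for the tunnel to bounce and collect the output ...
!
! Stop the debugs, then remove the condition
undebug all
debug crypto condition reset
```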

In addition to running the IP SLA/ICMP probes suggested in this thread, you can tie them to an EEM script that prints a syslog message any time the underlay (internet) or overlay (tunnel) has an issue. This is a quick and easy way to narrow down whether the problem is more of an underlay (ISP) or overlay (VPN) issue:

https://www.cisco.com/c/en/us/support/docs/ios-nx-os-software/ios-embedded-event-manager-eem/113696-eem-tshoot-igp-00.html
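
A minimal sketch of what such a probe-plus-EEM setup could look like on the spoke follows. It is an illustration only, not taken from the linked document. The SLA/track numbers, applet names, probe frequency, the `<...>` placeholders and the `GigabitEthernet0/0/0` source interface are all assumptions; `Tunnel2` and the `INTERNET` VRF are taken from the original post.

```
! Underlay probe: clear-text ICMP to the hub's public address, in the front-door VRF
ip sla 10
 icmp-echo <hub public IP> source-interface GigabitEthernet0/0/0
 vrf INTERNET
 frequency 10
ip sla schedule 10 life forever start-time now
!
! Overlay probe: ICMP to the hub's tunnel address, sent through the tunnel
ip sla 20
 icmp-echo <hub tunnel IP> source-interface Tunnel2
 frequency 10
ip sla schedule 20 life forever start-time now
!
! Track objects follow probe reachability
track 10 ip sla 10 reachability
track 20 ip sla 20 reachability
!
! EEM applets log a syslog message when a probe goes down
event manager applet UNDERLAY-DOWN
 event track 10 state down
 action 1.0 syslog msg "UNDERLAY probe to hub public IP failed"
event manager applet OVERLAY-DOWN
 event track 20 state down
 action 1.0 syslog msg "OVERLAY probe to hub tunnel IP failed"
```

If only the overlay message appears at the time of a bounce, that points towards the VPN rather than the ISP path; if both appear together, the underlay is the more likely suspect.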


9 Replies

As far as I know, only the spoke needs keepalives; the hub does not. So remove the keepalive from the hub and run it only on the spoke.

MHM

"As far as I know, only the spoke needs keepalives; the hub does not. So remove the keepalive from the hub and run it only on the spoke."

The above is completely unrelated to the issue described.

The debug is inconclusive. From it we can see that connectivity between the peers is lost at some point: the BGP session is torn down first, then DPD detects the tunnel failure. I'd collect conditional debugs on both sides simultaneously and also configure IP SLA on both sides to send ICMP probes over the tunnel, as well as ICMP probes in clear text between the tunnel endpoints. The whole picture should become clearer then.

 

 

Thanks all for your suggestions. I ran the debug on the hub router (attached output: Hub debug.txt).
I noticed a few invalid SPIs, which could point to the SAs becoming out of sync. So I issued the "clear crypto isakmp" and "clear crypto sa" commands, but the behavior is still the same.

I also noticed in the line below that the input interface is not the expected exit/entry point for this traffic. Tunnel300 is meant for the VPN link built over the 4G/LTE underlay. Could this be some sort of asymmetric routing issue?

Jul 10 08:43:33.866 AEST: %CRYPTO-4-RECVD_PKT_INV_SPI: decaps: rec'd IPSEC packet has invalid spi for destaddr=203.y.y.100, prot=50, spi=0x55F79068(1442287720), srcaddr=203.x.x.6, input interface=Tunnel300

Hoping for your inputs/suggestions. Thank you All.

At 9:39:21, we see an invalid SPI message:

  Jul 10 09:39:21.274 AEST: %CRYPTO-4-RECVD_PKT_INV_SPI: decaps: rec'd IPSEC packet has invalid spi for destaddr=203.y.y.100, prot=50, spi=0x41E4DA50(1105517136), srcaddr=203.x.x.6, input interface=Tunnel300

About 53 minutes later, we can see the hub flagging a new SPI with this peer...

  Jul 10 10:36:23.528 AEST: %CRYPTO-4-RECVD_PKT_INV_SPI: decaps: rec'd IPSEC packet has invalid spi for destaddr=203.y.y.100, prot=50, spi=0xD6D55B01(3604306689), srcaddr=203.x.x.6, input interface=Tunnel300

...implying that there was another tunnel drop and rebuild roughly 53 minutes earlier. But there are no debugs illustrating this, which would have happened shortly after 9:39:21. So I'm guessing that the debug hadn't been turned on yet.

However, a minute later, we do see the following:

  Jul 10 10:37:31.466 AEST: ISAKMP-PAK: (0):received packet from 203.x.x.6 dport 500 sport 500 INTERNET (N) NEW SA

And this is followed up with a set of debugs showing the tunnel building from scratch.

What this suggests is that the remote peer (203.x.x.6) terminated both its phase 1 and phase 2 SAs right before sending this offer. But I don't see where the tunnel gets torn down in the logs prior to this message/event. Each time the SPI error transitions to a new value, we should see debugs telling us that the hub is tearing down its SAs with the peer and building a tunnel from scratch (or processing a phase 2 rekey).

Given the logs we saw on the spoke, I suspect that the tunnel is being torn down because of missed DPD messages sent from the spoke to the hub. Unfortunately, I don't see any DPD messages from the spoke show up in the debugs taken from the hub. This may be because the debugs weren't running before the tunnel was torn down, and therefore didn't catch it in the act.

Nonetheless, I would recommend trying to disable DPDs from being sent from the spoke to the hub to see if that helps stabilize things. Can you try the following on the spoke?

  no crypto isakmp keepalive

The tunnel would need to be renegotiated before this change takes effect. A regular rekey should suffice if you are unable to bounce the tunnel from the spoke end.
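
If bouncing the tunnel from the spoke is an option, a rough sketch of forcing the renegotiation and checking the result might look like the following. The `<hub public IP>` placeholder is an assumption, and because the crypto peers sit in the `INTERNET` front-door VRF, the clear command may need additional VRF keywords on your release.

```
! On the spoke, after "no crypto isakmp keepalive" has been applied
clear crypto session remote <hub public IP>
!
! Once the tunnel is back up, inspect the new SAs
show crypto isakmp sa detail
show crypto session detail
```

In the `show crypto isakmp sa detail` output, the capabilities column typically uses `D` to flag Dead Peer Detection, which gives a quick check of whether DPD was still negotiated on the new SA.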

 

Thanks 

MHM

The issue has been resolved. It turned out the firewall was dropping UDP 500 packets initiated by the hub. Spoke-to-hub communication was fine because there is a firewall policy allowing it, but that policy was unidirectional. We created a bidirectional policy and that resolved the issue. Thank you all for your input.
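
The thread does not say which firewall platform was involved, so the following is only an illustration of the bidirectional rule, written as an IOS extended ACL. The ACL name and the `<...>` placeholders are hypothetical. Only UDP 500 is confirmed by the thread; whether UDP 4500 (NAT-T) or ESP also needed matching rules is not stated.

```
ip access-list extended DMVPN-IKE-BIDIR
 remark Spoke-initiated IKE (already allowed by the original policy)
 permit udp host <spoke public IP> host <hub public IP> eq isakmp
 remark Hub-initiated IKE (the direction that was being dropped)
 permit udp host <hub public IP> host <spoke public IP> eq isakmp
```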

But the hub does not need to initiate traffic.

Only the spoke does that, and the firewall will allow the return traffic.

Did you try disabling DPD (keepalive) on the hub and enabling it on the spokes only?

MHM