We have a two-spoke L2L VPN setup in which an ASA sits at our main site and 1921 routers at two remote sites act as endpoints. All traffic is directed back to the ASA; there is no direct communication between the remote sites.
This all works as expected, but one of the sites periodically loses its tunnel and fails to reestablish it. Clearing crypto/isakmp does not reestablish the SA; we have to actually reload the router at the site to get the tunnel active again.
The last time it happened, I was able to grab some information before it was reloaded. Unfortunately, this happened during a heavy production period so I had limited time to grab what I could since getting it alive again was the immediate priority. I'm hoping for a failure during off hours where I can sit down and run some debugging, but that hasn't happened recently.
Site1-1921#show crypto session detail
Crypto session current status

Code: C - IKE Configuration mode, D - Dead Peer Detection
K - Keepalives, N - NAT-traversal, T - cTCP encapsulation
X - IKE Extended Authentication, F - IKE Fragmentation

Interface: Serial0/0/0
Session status: DOWN-NEGOTIATING
Peer: 216.x.x.x port 500 fvrf: (none) ivrf: (none)
      Desc: (none)
      Phase1_id: (none)
  IKEv1 SA: local 207.x.x.x/500 remote 216.x.x.x/500 Inactive
          Capabilities:(none) connid:0 lifetime:0
  IPSEC FLOW: permit ip 192.168.4.64/255.255.255.224 10.0.0.0/255.0.0.0
        Active SAs: 0, origin: crypto map
        Inbound:  #pkts dec'ed 192287 drop 0 life (KB/Sec) 0/0
        Outbound: #pkts enc'ed 219073 drop 169 life (KB/Sec) 0/0
  IPSEC FLOW: permit ip 192.168.4.64/255.255.255.224 192.168.4.0/255.255.2
        Active SAs: 0, origin: crypto map
        Inbound:  #pkts dec'ed 783827 drop 0 life (KB/Sec) 0/0
        Outbound: #pkts enc'ed 805580 drop 35 life (KB/Sec) 0/0
  IPSEC FLOW: permit ip 192.168.4.64/255.255.255.224 192.168.10.0/255.255.
        Active SAs: 0, origin: crypto map
        Inbound:  #pkts dec'ed 0 drop 0 life (KB/Sec) 0/0
        Outbound: #pkts enc'ed 661 drop 0 life (KB/Sec) 0/0
  IPSEC FLOW: permit ip 192.168.4.64/255.255.255.224 192.168.25.0/255.255.255.0
        Active SAs: 0, origin: crypto map
        Inbound:  #pkts dec'ed 0 drop 0 life (KB/Sec) 0/0
        Outbound: #pkts enc'ed 0 drop 0 life (KB/Sec) 0/0
  IPSEC FLOW: permit ip 192.168.4.64/255.255.255.224 172.16.4.0/255.255.255.0
        Active SAs: 0, origin: crypto map
        Inbound:  #pkts dec'ed 260258 drop 0 life (KB/Sec) 0/0
        Outbound: #pkts enc'ed 280760 drop 4 life (KB/Sec) 0/0

Site1-1921#show crypto isakmp sa
IPv4 Crypto ISAKMP SA
dst             src             state          conn-id status
216.x.x.x       207.x.x.x       MM_NO_STATE          0 ACTIVE
216.x.x.x       207.x.x.x       MM_NO_STATE          0 ACTIVE (deleted)

IPv6 Crypto ISAKMP SA

ASA# show crypto isakmp sa

   Active SA: 2
    Rekey SA: 0 (A tunnel will report 1 Active and 1 Rekey SA during rekey)
Total IKE SA: 2

1   IKE Peer: 66.x.x.x
    Type    : L2L             Role    : initiator
    Rekey   : no              State   : MM_ACTIVE
2   IKE Peer: 207.x.x.x
    Type    : user            Role    : initiator
    Rekey   : no              State   : MM_WAIT_MSG2
207.x.x.x is the remote peer with problems. 66.x.x.x is the stable remote peer. 216.x.x.x is the ASA.
In the show crypto isakmp sa results, I've also seen the down peer stuck in MM_WAIT_MSG3 during these incidents; it's not always MSG2.
The router has access to the public internet. That's how I'm getting into it, and I'm also able to ping out. Also, like I said, reloading brings everything up with no problem. Sometimes the router sits like this for hours, so it's not like the T1s are just coming back up completely during the reload. As far as I can tell, the routes through the public internet between the peers are all good and nothing is blocking communication. There doesn't seem to be any particular pattern to the failures. Sometimes they happen late at night on weekdays, sometimes during the workday, and sometimes on weekends. I can't even tell whether the SA is being torn down legitimately because of IPsec lifetime limits or whether something else, like an upstream outage, is forcing the tunnel to be rebuilt.
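One way to tell, after the fact, when and why the tunnel went down is to have the router log crypto session transitions with timestamps. The following is only a sketch for the 1921, not part of the configuration posted here; the buffer size and severity level are arbitrary choices.

```
! Hypothetical addition to the 1921; values are illustrative only.
configure terminal
 ! Timestamp log entries so tunnel events can be lined up with outages
 service timestamps log datetime msec
 ! Keep a local log buffer (size and level are assumptions)
 logging buffered 64000 informational
 ! Log crypto session up/down transitions
 crypto logging session
end
! After the next failure, before reloading:
show logging | include CRYPTO
```

The local buffer is lost on a reload, so the log would have to be copied off, or sent to a syslog server, before the router is restarted.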
Here's the crypto config on the ASA:
crypto ipsec transform-set CSM_TS_1 esp-3des esp-md5-hmac
crypto ipsec transform-set ESP-AES-256-SHA esp-aes-256 esp-sha-hmac
crypto ipsec security-association lifetime seconds 28800
crypto ipsec security-association lifetime kilobytes 4608000
crypto dynamic-map CSM_outside_map_dynamic 2 set transform-set CSM_TS_1
crypto dynamic-map CSM_outside_map_dynamic 2 set reverse-route
crypto map CSM_outside_map 10 match address SITE1
crypto map CSM_outside_map 10 set peer 207.x.x.x
crypto map CSM_outside_map 10 set transform-set ESP-AES-256-SHA
crypto map CSM_outside_map 15 match address SITE2
crypto map CSM_outside_map 15 set peer 66.x.x.x
crypto map CSM_outside_map 15 set transform-set ESP-AES-256-SHA
crypto map CSM_outside_map 30001 ipsec-isakmp dynamic CSM_outside_map_dynamic
crypto map CSM_outside_map interface outside
crypto isakmp enable outside
crypto isakmp policy 10
 authentication pre-share
 encryption aes-256
 hash sha
 group 2
 lifetime 43200
telnet timeout 5
tunnel-group 207.x.x.x type ipsec-l2l
tunnel-group 207.x.x.x ipsec-attributes
 pre-shared-key *****
tunnel-group 66.x.x.x type ipsec-l2l
tunnel-group 66.x.x.x ipsec-attributes
 pre-shared-key *****
The affected site:
crypto isakmp policy 10
 encr aes 256
 authentication pre-share
 group 2
 lifetime 43200
crypto isakmp key ***** address 216.x.x.x no-xauth
crypto isakmp keepalive 20 5
!
!
crypto ipsec transform-set ESP-AES-256-SHA esp-aes 256 esp-sha-hmac
 mode tunnel
!
!
!
crypto map VPN 10 ipsec-isakmp
 description Tunnel to 216.x.x.x
 set peer 216.x.x.x
 set transform-set ESP-AES-256-SHA
 match address VPN_TUNNEL
Thanks for any help or suggestions you can provide.
I do not see any problem with your configuration, but can you try removing the DPD configuration at both ends?
On the router end:
no crypto isakmp keepalive 20 5
On the ASA end:
tunnel-group x.x.x.x type ipsec-l2l
tunnel-group x.x.x.x ipsec-attributes
 isakmp keepalive disable
So we finally got another failure this morning. Here's some debug output I was able to grab:
PAChamberBusiness417WalnutASA# debug crypto isakmp 5
PAChamberBusiness417WalnutASA#
Aug 22 09:01:32 [IKEv1]: IP = 207.x.x.x, Duplicate Phase 1 packet detected. Retransmitting last packet.
Aug 22 09:01:32 [IKEv1]: IP = 207.x.x.x, P1 Retransmit msg dispatched to MM FSM
Aug 22 09:01:42 [IKEv1]: IP = 207.x.x.x, Duplicate Phase 1 packet detected. Retransmitting last packet.
Aug 22 09:01:42 [IKEv1]: IP = 207.x.x.x, P1 Retransmit msg dispatched to MM FSM
Aug 22 09:01:42 [IKEv1 DEBUG]: IP = 207.x.x.x, IKE MM Responder FSM error history (struct &0xca8fa388) <state>, <event>:
  MM_DONE, EV_ERROR-->
  MM_WAIT_MSG3, EV_RESEND_MSG-->
  MM_WAIT_MSG3, NullEvent-->
  MM_SND_MSG2, EV_SND_MSG-->
  MM_SND_MSG2, EV_START_TMR-->
  MM_SND_MSG2, EV_RESEND_MSG-->
  MM_WAIT_MSG3, EV_TIMEOUT-->
  MM_WAIT_MSG3, NullEvent
Aug 22 09:01:52 [IKEv1 DEBUG]: IP = 207.x.x.x, Oakley proposal is acceptable
Aug 22 09:01:52 [IKEv1 DEBUG]: IP = 207.x.x.x, IKE SA Proposal # 1, Transform # 1 acceptable Matches global IKE entry # 1
Aug 22 09:02:02 [IKEv1]: IP = 207.x.x.x, Duplicate Phase 1 packet detected. Retransmitting last packet.
Aug 22 09:02:02 [IKEv1]: IP = 207.x.x.x, P1 Retransmit msg dispatched to MM FSM
Sorry about the debug level. I meant to do 254, but the ssh session was lagging as I typed and I wound up butchering it.
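For the next failure, the capture that was intended here would look roughly like the sketch below. It assumes an ASA release that still uses the `crypto isakmp` keyword, as the configuration above does (newer releases use `ikev1` in its place), and it adds the matching debugs on the 1921 so both sides of the exchange can be compared.

```
! ASA: the level that was intended above
debug crypto isakmp 254
debug crypto ipsec 254
! ... let a negotiation attempt run, then stop the output:
undebug all

! 1921 router, from an SSH/telnet session
terminal monitor
debug crypto isakmp
debug crypto ipsec
! ... capture the output, then:
undebug all
```

Debug output at this level is verbose, so on a lagging session it may be safer to send it to the log buffer than to the terminal.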
The circuit the troubled site is on is a basic T1 that was converted back in June from MPLS to a simple DIA circuit (which is what started this since we had to then provide our own L2L with the loss of the MPLS mesh).
Here's another fun fact: there were some storms in the area last night, so outage of provider equipment is a possibility.
Is it possible that this sort of error might be caused by NAT or PAT by the ISP that I'm not aware of? I'm trying to get hold of an engineer at the ISP to see if that's happening, but I'm curious if it could be the issue, since it seems like the ASA and the router stop talking to each other when it comes to reopening a dead tunnel.
This kind of error typically shows up when the pre-shared keys do not match or the pre-shared key negotiation fails. Can you re-enter the pre-shared key at both ends, save the configuration, and see whether it happens again?
Also, if NAT/PAT somewhere in the path failed during the outage, that could have caused the problem as well.
I reentered the PSK at both endpoints, so now we play the waiting game again.
Just to clarify, are you agreeing that PAT being done by the ISP is a potential cause worth investigating? If that were happening, would there be anything I could even do on my equipment to work around it?
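If there does turn out to be NAT or PAT in the path, the usual workaround on the endpoints is NAT traversal (NAT-T), which moves IKE and ESP onto UDP port 4500 once a NAT device is detected. The sketch below is an assumption about what that would look like on this equipment, not something taken from the posted configurations; the 20-second keepalive values are arbitrary.

```
! ASA: enable NAT traversal with a 20-second NAT keepalive (value is an assumption)
crypto isakmp nat-traversal 20

! 1921: NAT-T is on by default in IOS; this re-enables it if it was turned off
crypto ipsec nat-transparency udp-encapsulation
! Optional NAT keepalive, in seconds (value is an assumption)
crypto isakmp nat keepalive 20
```

UDP 4500 would also need to be permitted end to end. When the tunnel is up, the N code in the show crypto session detail legend indicates whether NAT-T is actually in use. Note that NAT detection happens in main mode messages 3 and 4, while the debug above shows the ASA waiting in MM_WAIT_MSG3, so NAT-T by itself may not help if the ASA's replies are not reaching the router in the first place.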