09-21-2018 09:11 AM - edited 02-21-2020 09:28 PM
This is sadly not a simple thing to describe. Short version: I can't get FLEXVPN to shift from one tunnel to another dynamically even though each works separately.
Longer version: I'll first describe the conceptual problem, and then give some actual detailed configs in the first reply.
We have two 800 series routers, a C819G-LTE-NMA-K9 running 15.4(3r)M1 (called "Portable1") which is used as a portable emergency connection, and a C892FSP-K9 which is used for tunnel termination inside our headquarters, located behind an ASA which passes the tunnels through for decryption on the C892FSP.
The goal is to make Portable1 a flexible replacement carrier. We have a couple dozen locations with fairly redundant fiber and microwave links between them, but this is a public safety network, so as a last-ditch failure mode we have the portable to physically carry to a location and plug into that network (which might be an isolated island of several locations depending on the failure mode, or might be a mobile command vehicle). The goal is to let EIGRP figure out what is visible in that island, connect to the main network, and rejoin, all with zero configuration changes. That works great; we have tested it quite a bit.
Further if the problem location happens to have a working ISP (many do) we want to plug into that instead of using cellular, since it is likely faster and lower latency. Again, the goal is no router config changes - it needs to figure it out and do the right thing. And that works, and has been pretty thoroughly tested with both cellular and ISP connections.
Finally (and herein lies the issue), if we are running on cellular and someone plugs in an ISP, it should shift to it. If the ISP later fails, it should shift back to cellular. That process is failing, but not in the way I would expect: it shifts the tunnel's source interface, but the tunnel will not come up properly, even though the exact same tunnel comes up fine if it is the first tunnel chosen after a reboot. Further, each tunnel works fine if interrupted and re-established in place (as opposed to shifting to the other tunnel). It is the change from one tunnel source to the other that fails.
The mechanics of those changes work perfectly, meaning the event manager code and FlexVPN path selection works; the tunnel source changes, the interfaces are shut down (or not) as appropriate, and the routing changes properly - I can log in and test and see all that (from a separate connection). However, when the tunnel source interface changes (and associated IP), the IKEV2 tunnels do not come up after the switch. The same tunnels that work fine if they are the first ones brought up.
The portable creates more and more IKEv2 SAs, all "IN-NEG", whereas the headquarters termination of the tunnel shows one that is "READY" and keeps creating new Virtual-AccessN interfaces over and over as one goes down and another comes up. I have seen some variation on this theme, but that is the general scenario.
The termination router (static address) gives errors (debug crypto ikev2 error) like these:
Sep 21 15:15:50.383: IKEv2:: Packet is a retransmission
Sep 21 15:15:54.287: IKEv2:: Packet is a retransmission
Sep 21 15:15:59.075: IKEv2:Couldn't find matching SA: Detected an invalid IKE SPI
Sep 21 15:15:59.075: IKEv2:: A supplied parameter is incorrect
Sep 21 15:16:00.643: IKEv2:Couldn't find matching SA: Detected an invalid IKE SPI
Sep 21 15:16:00.643: IKEv2:: A supplied parameter is incorrect
Sep 21 15:16:02.267: IKEv2:Couldn't find matching SA: Could not find neg context
Sep 21 15:16:02.267: IKEv2:: A supplied parameter is incorrect
Sep 21 15:16:15.487: IKEv2:Couldn't find matching SA: Detected an invalid IKE SPI
Sep 21 15:16:15.487: IKEv2:: A supplied parameter is incorrect
Sep 21 15:16:17.423: IKEv2:Couldn't find matching SA: Could not find neg context
Sep 21 15:16:17.423: IKEv2:: A supplied parameter is incorrect
Sep 21 15:16:18.395: IKEv2:Failed to retrieve Certificate Issuer list
Sep 21 15:16:18.399: IKEv2:Failed to retrieve Certificate Issuer list
Sep 21 10:16:18.415 cdt: %LINEPROTO-5-UPDOWN: Line protocol on Interface Virtual-Access2, changed state to down
Sep 21 10:16:18.419 cdt: %LINEPROTO-5-UPDOWN: Line protocol on Interface Virtual-Access1, changed state to down
Sep 21 10:16:18.419 cdt: %LINK-3-UPDOWN: Interface Virtual-Access1, changed state to down
Sep 21 15:16:18.427: IKEv2:Error constructing config reply
Sep 21 10:16:18.431 cdt: %LINEPROTO-5-UPDOWN: Line protocol on Interface Virtual-Access2, changed state to up
The Portable1 router for the same debug gives:
*Sep 21 10:18:06.219 cdt: %FLEXVPN-6-FLEXVPN_CONNECTION_DOWN: FlexVPN(FLEXVPN_IKEV2_CLIENT) Client_public_addr = 172.16.1.100 Server_public_addr = x.x.x.x (correct address)
*Sep 21 15:19:01.703: IKEv2-ERROR:(SESSION ID = 1,SA ID = 3):: Maximum number of retransmissions reached
*Sep 21 15:19:01.703: IKEv2-ERROR:(SESSION ID = 1,SA ID = 3):: Auth exchange failed
The authentication is PSK and is correct -- it works fine if it's the first tunnel up (in either case). It is as though there is some mismatch during the transition, as though it's using information from the past tunnel in some fashion and not connecting properly at each end (or maybe just at one end). SHOW CRYPTO IKEV2 SA and SHOW CRYPTO IPSEC SA both have all the right IPs and seem to line up, but the IN-NEG never completes.
Again, I'll post a follow-up with the actual configs, redacted a bit; here's the core of the tunnel code that chooses the destination:
crypto ikev2 keyring IKEV2_KEYRING
 peer TUNNEL_PEERS
  address x.x.x.x
  pre-shared-key redacted
!
crypto ikev2 profile IKEV2_PROFILE
 match identity remote any
 authentication local pre-share
 authentication remote pre-share
 keyring local IKEV2_KEYRING
!
crypto ikev2 client flexvpn FLEXVPN_IKEV2_CLIENT
  peer 1 x.x.x.x
  source 1 GigabitEthernet0 track 7
  source 2 Cellular0 track 6
  client connect Tunnel31
The issue is during a changeover in the tracks. If 6 is up and 7 down on boot - works. If 7 is up and 6 is down on boot - works. But start one way, and switch -- fails. This would seem to eliminate almost everything - NAT, routing, PSK, etc., since each works in one state. It's something about the changeover.
Is there some issue with using FLEXVPN and changing the source dynamically with TRACK as above?
Thanks for any insight.
Linwood
09-22-2018 07:08 AM
OK, I think I understand this and have it fixed. It's a bit weird (or seems so to me). First what I see with a lot of detail shown in debug.
With the Gi0 interface up at 172.16.1.100, packets are going out without NAT, e.g.:
Sep 22 13:03:18.507: IP: s=172.16.1.100 (local), d=X.X.X.X (GigabitEthernet0), len 128, output feature
Sep 22 13:03:18.507: UDP src=4500, dst=4500, Post-routing NAT Outside(26), rtype 1, forus FALSE, sendself FALSE, mtu 0, fwdchk FALSE
I then remove the IP address from Gi0 to force a failure. Before the cellular link is brought up I get this, as expected:
Sep 22 13:03:23.552: FIBipv4-packet-proc: route packet from (local) src 172.16.1.100 dst X.X.X.X
Sep 22 13:03:23.552: FIBfwd-proc: Default:0.0.0.0/0 process level forwarding
Sep 22 13:03:23.552: FIBfwd-proc: depth 0 first_idx 0 paths 1 long 0(0)
Sep 22 13:03:23.552: FIBfwd-proc: try path 0 (of 1) v4-sp first short ext 0(-1)
Sep 22 13:03:23.552: FIBfwd-proc: v4-sp valid
Sep 22 13:03:23.552: FIBfwd-proc: no nh type 8 - deag
Sep 22 13:03:23.552: FIBfwd-proc: ip_pak_table 0 ip_nh_table 65535 if none nh none deag 1 chg_if 0 via fib 0 path type special prefix
Sep 22 13:03:23.552: FIBfwd-proc: Default:0.0.0.0/0 not enough info to forward via fib (none none)
Sep 22 13:03:23.552: FIBipv4-packet-proc: packet routing failed
This attempt continues, however, originating (I assume) from the now-dying SA. The cellular comes up (10.96.68.104) and the track activates from its continual pings, which installs a default route via the cellular. All correct. But the old SA is still sending:
Sep 22 13:05:21.260: IP: s=172.16.1.100 (local), d=X.X.X.X, len 108, local feature
Sep 22 13:05:21.260: UDP src=4500, dst=4500, Policy Routing(3), rtype 0, forus FALSE, sendself FALSE, mtu 0, fwdchk FALSE
Sep 22 13:05:21.260: FIBipv4-packet-proc: route packet from (local) src 172.16.1.100 dst X.X.X.X
Sep 22 13:05:21.260: FIBfwd-proc: packet routed by adj to Cellular0 0.0.0.0
Sep 22 13:05:21.260: FIBipv4-packet-proc: packet routing succeeded
Sep 22 13:05:21.260: IP: s=172.16.1.100 (local), d=66.158.37.6 (Cellular0), len 108, sending
Sep 22 13:05:21.260: UDP src=4500, dst=4500
Sep 22 13:05:21.260: NAT: s=172.16.1.100->10.96.68.104, d=66.158.37.6 [17162]
Now it sends through the cellular, and in doing so caches a translation that ties the cellular address to the now defunct ISP address.
#show ip nat trans
Pro Inside global      Inside local       Outside local   Outside global
udp 10.96.68.104:4500  172.16.1.100:4500  X.X.X.X:4500    X.X.X.X:4500
This causes inbound packets that SHOULD be delivered to the new tunnel's Cellular address to be translated inappropriately back to the defunct ISP address, since they use the same port.
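To make that concrete, here is how the stale entry can be confirmed and flushed by hand, using the addresses from this example. The wholesale clear definitely works but is disruptive to other NAT users; the single-entry form is a sketch from memory:

```
! Confirm the stale UDP/4500 entry tying the defunct ISP address to the cellular address
show ip nat translations | include 4500

! Flush everything (disruptive -- drops legitimate translations too)
clear ip nat translation *

! Or (sketch) clear just the one entry: inside global address/port, then inside local address/port
clear ip nat translation udp inside 10.96.68.104 4500 172.16.1.100 4500
```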
Sep 22 13:05:26.738: IP: s=X.X.X.X (Cellular0), d=10.96.68.104, len 144, input feature
Sep 22 13:05:26.738: UDP src=4500, dst=4500, Common Flow Table(5), rtype 0, forus FALSE, sendself FALSE, mtu 0, fwdchk FALSE
Sep 22 13:05:26.738: IP: s=X.X.X.X (Cellular0), d=10.96.68.104, len 144, input feature
Sep 22 13:05:26.738: UDP src=4500, dst=4500, Stateful Inspection(8), rtype 0, forus FALSE, sendself FALSE, mtu 0, fwdchk FALSE
Sep 22 13:05:26.738: IP: s=X.X.X.X (Cellular0), d=10.96.68.104, len 144, input feature
Sep 22 13:05:26.738: UDP src=4500, dst=4500, Dialer i/f override(25), rtype 0, forus FALSE, sendself FALSE, mtu 0, fwdchk FALSE
Sep 22 13:05:26.738: IP: s=X.X.X.X (Cellular0), d=10.96.68.104, len 144, input feature
Sep 22 13:05:26.742: UDP src=4500, dst=4500, Virtual Fragment Reassembly(39), rtype 0, forus FALSE, sendself FALSE, mtu 0, fwdchk FALSE
Sep 22 13:05:26.742: IP: s=X.X.X.X (Cellular0), d=10.96.68.104, len 144, input feature
Sep 22 13:05:26.742: UDP src=4500, dst=4500, Virtual Fragment Reassembly After IPSec Decryption(57), rtype 0, forus FALSE, sendself FALSE, mtu 0, fwdchk FALSE
Sep 22 13:05:26.742: NAT: s=X.X.X.X, d=10.96.68.104->172.16.1.100 [56943]
Sep 22 13:05:26.742: IP: s=X.X.X.X (Cellular0), d=172.16.1.100, len 144, input feature
Sep 22 13:05:26.742: UDP src=4500, dst=4500, NAT Outside(92), rtype 0, forus FALSE, sendself FALSE, mtu 0, fwdchk FALSE
Sep 22 13:05:26.742: IP: s=X.X.X.X (Cellular0), d=172.16.1.100, len 144, input feature
Sep 22 13:05:26.742: UDP src=4500, dst=4500, MCI Check(109), rtype 0, forus FALSE, sendself FALSE, mtu 0, fwdchk FALSE
Sep 22 13:05:26.742: FIBipv4-packet-proc: route packet from Cellular0 src X.X.X.X dst 172.16.1.100
Sep 22 13:05:26.742: FIBfwd-proc: packet routed by adj to Cellular0 0.0.0.0
Sep 22 13:05:26.742: FIBipv4-packet-proc: packet routing succeeded
I'm not strong enough in NAT and routing to know if this is what should happen, if I have a bad configuration somewhere that's causing it, or if it's a bug or a feature. Clearing NAT translations fixes it, but the timing is tough -- the traffic originating at the defunct IP address continues for a long time, and you have to keep clearing it (or wait a long time, like minutes). DPD is either not happening, not happening fast enough, or happening at the wrong end... not sure.
Since both WAN interfaces get negotiated addresses, it's not possible to use ACLs in the NAT statements to exclude them. I'm now using a route map with an access list that does NOT match internal-to-internal traffic but does match internal-to-external. Unfortunately both the cellular and ISP interfaces get private addresses (out of our control).
What seems to fix it is to extend the access list that is in the route map so that it explicitly denies any match on the destination tunnel peer (e.g. host x.x.x.x any and any host x.x.x.x). This causes the route map for the NAT statement to fail, and any traffic to the destination peer is not translated. This means to ping it I cannot ping from an internal address, but that's a minor detail (I guess I could explicitly allow on-router private addresses but that's more complexity).
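For concreteness, the added denies look something like this (x.x.x.x is the tunnel peer; note they have to land ahead of the existing permit lines, so on a live box you'd add them with sequence numbers below the permits):

```
ip access-list extended INSIDE-TO-INSIDE-DENY-ACL
 ! Never translate traffic to or from the tunnel peer, so IKE/IPsec
 ! packets are left alone by both WAN overload statements
 deny ip host x.x.x.x any
 deny ip any host x.x.x.x
```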
Anyway, so far with that added, no stale NAT cache entry occurs, and the transition happens properly. I've done it twice now; after some breakfast I'll test more extensively, but I think this is going to be the fix. BTW I had a NAT statement on the Tunnel interface which I removed as well, though I'm not sure that had any impact (I removed it long before this fix and it still failed the same way).
One other quirk -- changing the access list did not work at first, it required a reboot. Not sure why, but I THINK something about the NAT or route map evaluation had cached the access list and did not recognize a changed version. After a reboot it seemed to work properly -- proof that Cisco was really out to mess with my head on this one.
It's also worth mentioning that none of this becomes a real issue if you have external addresses not in your NAT list for the WAN ports, so static ports with public addresses would not have any issues nor would private addresses already excluded from your NAT by access list. However, in this case we know we don't control either the ISP addresses nor cellular addresses, so we left them implicitly in. This also means we might get a case with an actual conflict of a necessary address (e.g. a local ISP with the same subnet as an internal site we need to route to). There's not a lot we can do about that, I think.
Anyway, I'll update this later if I find more, but I think that's the answer. If anyone knows whether the NAT caching shown in the list above is correct behavior, I'd love to know. I would have hoped that removing the IP from the interface would somehow stop the SA from using it, the routing engine from routing it, or NAT from translating it. But all happily kept using it.
Thanks for all the help.
Linwood
09-21-2018 09:14 AM
vtp mode transparent
!
crypto ikev2 proposal IKEV2_PROP
 encryption aes-cbc-256
 prf sha512
 integrity sha1
 group 2
!
crypto ikev2 policy IKEV2_POLICY
 proposal IKEV2_PROP
!
crypto ikev2 keyring IKEV2_KEYRING
 peer TUNNEL_PEERS
  address 0.0.0.0 0.0.0.0
  pre-shared-key REDACTED
!
!
crypto ikev2 profile IKEV2_PROFILE
 match identity remote any
 authentication remote pre-share
 authentication local pre-share
 keyring local IKEV2_KEYRING
 virtual-template 1
!
vlan 2101
!
vlan 2137
 name toASA
lldp run
!
crypto isakmp policy 100
 encr aes 256
 authentication pre-share
 group 2
!
crypto ipsec transform-set VTISET esp-aes 256 esp-sha-hmac
 mode tunnel
!
crypto ipsec profile IKEV2_IPSEC_PROFILE
 set security-association lifetime kilobytes 200000
 set security-association lifetime seconds 1800
 set transform-set VTISET
 set pfs group2
 set ikev2-profile IKEV2_PROFILE
!
interface Loopback39
 description Interface used to form tunnel
 ip address 172.25.39.5 255.255.255.0
!
interface GigabitEthernet7
 description Internet DMZ connection to ASA firewall specifically for tunnels
 switchport access vlan 2137
 no ip address
!
interface Virtual-Template1 type tunnel
 ip unnumbered Loopback39
 tunnel source Vlan2137
 tunnel mode ipsec ipv4
 tunnel destination dynamic
 tunnel protection ipsec profile IKEV2_IPSEC_PROFILE
!
interface Vlan2137
 description Bogus VLAN because this router won't let me have a non-switch port
 ip address 172.26.38.5 255.255.255.0
!
router eigrp 10
 distribute-list 98 in
 distribute-list 98 out
 network 10.0.0.0
 network 172.16.0.0 0.15.255.255
 network 192.168.0.0 0.0.63.255
!
ip route 0.0.0.0 0.0.0.0 172.26.38.10
access-list 98 deny   0.0.0.0
access-list 98 deny   128.0.0.0
access-list 98 permit any
ethernet lmi ce
ip dhcp excluded-address 172.25.36.31
!
ip dhcp pool PORTABLE_POOL
 network 172.25.36.0 255.255.255.0
 lease 0 0 5
ip cef
!
multilink bundle-name authenticated
!
chat-script lte "" "AT!CALL" TIMEOUT 20 "OK"
!
license udi pid C819G-LTE-MNA-K9 sn REDACTED
!
no spanning-tree vlan 1922
no spanning-tree vlan 2139
vtp mode transparent
!
crypto ikev2 proposal IKEV2_PROP
 encryption aes-cbc-256
 prf sha512
 integrity sha1
 group 2
!
crypto ikev2 policy IKEV2_POLICY
 proposal IKEV2_PROP
!
crypto ikev2 keyring IKEV2_KEYRING
 peer TUNNEL_PEERS
  address x.x.x.x    <<<< public address of ASA's interface that NATs to tunnel router above
  pre-shared-key REDACTED
!
!
crypto ikev2 profile IKEV2_PROFILE
 match identity remote any
 authentication local pre-share
 authentication remote pre-share
 keyring local IKEV2_KEYRING
!
crypto ikev2 client flexvpn FLEXVPN_IKEV2_CLIENT
  peer 1 x.x.x.x    <<< Same as x.x.x.x above
  source 1 GigabitEthernet0 track 7
  source 2 Cellular0 track 6
  client connect Tunnel31
!
controller Cellular 0
 lte sim data-profile 14 attach-profile 14 slot 0
 lte sim data-profile 14 attach-profile 14 slot 1
 lte modem link-recovery rssi onset-threshold -110
 lte modem link-recovery monitor-timer 20
 lte modem link-recovery wait-timer 10
 lte modem link-recovery debounce-count 6
!
track 6 ip sla 6 reachability
 delay down 30 up 30
!
track 7 ip sla 7 reachability
 delay down 30 up 30
!
crypto isakmp invalid-spi-recovery
!
crypto ipsec transform-set VTISET esp-aes 256 esp-sha-hmac
 mode tunnel
!
crypto ipsec profile IKEV2_IPSEC_PROFILE
 set security-association lifetime kilobytes 200000
 set security-association lifetime seconds 1800
 set transform-set VTISET
 set pfs group2
 set ikev2-profile IKEV2_PROFILE
!
interface Loopback39
 description Interface used to form tunnel
 ip address 172.25.39.1 255.255.255.0
!
interface Tunnel31
 ip unnumbered Loopback39
 ip nat inside
 ip virtual-reassembly in
 tunnel source dynamic
 tunnel mode ipsec ipv4
 tunnel destination dynamic
 tunnel protection ipsec profile IKEV2_IPSEC_PROFILE
!
interface Cellular0
 description Connection to GSM modem (secondary internet if primary down)
 ip dhcp client lease 0 0 5
 ip address negotiated
 ip nat outside
 ip virtual-reassembly in
 encapsulation slip
 dialer in-band
 dialer string lte
 dialer-group 1
 async mode interactive
!
interface GigabitEthernet0
 description Connection to local internet (if any)
 ip dhcp client default-router distance 100
 ip dhcp client lease 0 0 5
 ip address dhcp
 ip nat outside
 ip virtual-reassembly in
 shutdown
 duplex full
 speed auto
!
interface Serial0
 no ip address
 shutdown
 clock rate 2000000
!
router eigrp 10
 distribute-list 98 in
 distribute-list 98 out
 network 10.0.0.0
 network 172.16.0.0 0.15.255.255
 network 192.168.0.0 0.0.63.255
 passive-interface GigabitEthernet0
 passive-interface Cellular0
 passive-interface Cellular1
!
ip local policy route-map CHOOSE-ISP
ip forward-protocol nd
ip nat inside source route-map ISP6 interface Cellular0 overload
ip nat inside source route-map ISP7 interface GigabitEthernet0 overload
ip route 0.0.0.0 0.0.0.0 Cellular0 100 track 6
ip access-list extended INSIDE-TO-INSIDE-DENY-ACL
 deny   ip 10.0.0.0 0.255.255.255 10.0.0.0 0.255.255.255
 deny   ip 10.0.0.0 0.255.255.255 192.168.0.0 0.0.63.255
 deny   ip 10.0.0.0 0.255.255.255 172.16.0.0 0.15.255.255
 deny   ip 192.168.0.0 0.0.63.255 10.0.0.0 0.255.255.255
 deny   ip 192.168.0.0 0.0.63.255 192.168.0.0 0.0.63.255
 deny   ip 192.168.0.0 0.0.63.255 172.16.0.0 0.15.255.255
 deny   ip 172.16.0.0 0.15.255.255 10.0.0.0 0.255.255.255
 deny   ip 172.16.0.0 0.15.255.255 192.168.0.0 0.0.63.255
 deny   ip 172.16.0.0 0.15.255.255 172.16.0.0 0.15.255.255
 permit ip 10.0.0.0 0.255.255.255 any
 permit ip 192.168.0.0 0.0.63.255 any
 permit ip 172.16.0.0 0.15.255.255 any
 permit ip any 172.0.0.0 0.0.255.255
!
ip sla 6
 icmp-echo 8.8.8.8 source-interface Cellular0    <<<< No particular reason to use different track addresses other than easier to notice in debug
 tag Cellular Connection up test
 frequency 10
ip sla schedule 6 life forever start-time now
ip sla 7
 icmp-echo 75.75.75.75 source-interface GigabitEthernet0
 tag Local ISP connection up test
 frequency 10
ip sla schedule 7 life forever start-time now
dialer-list 1 protocol ip permit
!
route-map CHOOSE-ISP permit 10
 match ip address 106
 set interface Cellular0
!
route-map CHOOSE-ISP permit 20
 match ip address 107
 set interface GigabitEthernet0
!
route-map ISP6 permit 10
 match ip address INSIDE-TO-INSIDE-DENY-ACL
 match interface Cellular0
!
route-map ISP7 permit 10
 match ip address INSIDE-TO-INSIDE-DENY-ACL
 match interface GigabitEthernet0
!
access-list 6 permit 8.8.8.8
access-list 7 permit 75.75.75.75
access-list 98 deny   0.0.0.0
access-list 98 deny   128.0.0.0    <<<<< Long irrelevant story but this is distributed as a static route in EIGRP but shouldn't be on this router
access-list 98 permit any
access-list 106 permit ip any host 8.8.8.8
access-list 107 permit ip any host 75.75.75.75
!
control-plane
!
line 3    <<< Not sure why line 3 goes with Cellular 0 but it does.
 script dialer lte
 modem InOut
 no exec
event manager applet TURN-GSM-OFF-IF-INTERNET-UP
 description Force cellular internet off anytime the local ISP is up and tracking
 event track 7 state up
 action 1.0 cli command "enable"
 action 2.0 cli command "config t"
 action 3.0 cli command "interface Cell 0"
 action 4.0 cli command "shutdown"
 action 5.0 syslog msg "Local ISP (track 7) up, turned Interface Cellular 0 off to reduce cost (also clears routes)"
 action 6.0 cli command "end"
event manager applet TURN-GSM-ON-IF-INTERNET-DOWN
 description Force cellular internet on anytime the local ISP is down
 event track 7 state down
 action 1.0 cli command "enable"
 action 2.0 cli command "config t"
 action 3.0 cli command "interface Cell 0"
 action 4.0 cli command "no shutdown"
 action 5.0 syslog msg "Local ISP (track 7) down, turned Interface Cellular 0 on to enable backup"
 action 6.0 cli command "end"
event manager applet CLEAR-LOCAL-ISP-BRIEFLY-ON-FAILURE
 description Force local internet off for a while (to clear routes) if it fails, falling back on cellular
 event track 7 state down maxrun 2000
 action 1.0 cli command "enable"
 action 2.0 cli command "config t"
 action 3.0 cli command "interface GigabitEthernet 0"
 action 4.0 cli command "shutdown"
 action 4.5 syslog msg "Cycled Interface GigabitEthernet 0 down to clear routes after track failed"
 action 5.0 wait 900
 action 6 cli command "no shutdown"
 action 6.5 cli command "end"
 action 7.0 syslog msg "Cycled interface GigabitEthernet 0 back on to try again"
event manager applet TICKLE-GSM-PERIODICALLY
 description The SLA won't cause interesting traffic so we need to hit it manually occasionally (this does nothing when down)
 event timer watchdog time 30
 action 0.7 cli command "enable"
 action 1.0 cli command "ping 8.8.8.8"
!
end
09-21-2018 09:36 AM
@Rob Ingram wrote:
Hi, My first thought would be that you required DPD in order to clear the old SAs.
Hmmm... The issue with that is the SA's look like new ones (they have the new address) but it's easy enough to try. It will take me a bit - I'm in the middle of a different (but similar) experiment, I'm turning off the tunnel on each transition for 60 seconds, to see if that forces things to clear out.
I have also manually done a clear crypto ikev2 sa fast (or whatever that syntax was) on each router, once it gets 'stuck' in this mode, and it has no beneficial impact - the ikev2 sa's go away as expected, but come back with this same weird mode. I realize ikev2 sa and dpd operate on a slightly different level (though not sure how related it is to clearing them).
I'll give it a try in an hour or so and update.
09-21-2018 12:01 PM
I added this to both routers, is there more that would be needed?
crypto isakmp keepalive 30 10 periodic
It had no effect that I can see.
I also added some event manager code so each time there is a change, it shuts the tunnel interface down for one minute to give things time to clear (not sure if that's long enough). That had no effect either.
09-21-2018 12:08 PM - edited 09-21-2018 12:42 PM
Try "dpd 10 2 on-demand" either global or under the IKEv2 profile. You can determine it's working by running a "debug crypto ikev2" and look for R U There messages.
EDIT - global command is "crypto ikev2 dpd 10 2 on-demand"
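Under the profile, placement would look like this (a sketch using the profile name from the configs in this thread):

```
crypto ikev2 profile IKEV2_PROFILE
 dpd 10 2 on-demand
```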
09-21-2018 01:03 PM
Not sure that changed anything; I am not seeing the R U There messages, but then again on-demand only fires when there's no data, and I'm already sending a lot of pings to make sure things are up.
With debug on for IKEv2 I see a lot of retransmitting from both sides for payload or ENCR. Eventually it hits max retransmissions, negotiates again (during which it appears to receive responses), then repeats.
I'm starting to wonder if I have a NAT issue of some sort. Everything here is NAT'd twice: the termination router is inside an ASA that NATs it, the portable router is doing NAT itself, and the cellular network is providing a NAT'd address as well. I'm starting to wonder if somewhere in that a port is being reused and misdirecting packets to a prior translation. I'm digging there a bit, trying to find why packets from one end aren't making it to the other (if that's really what's happening). That might explain why a reboot -- and the ensuing lack of traffic for a period -- might clear a PAT entry.
09-21-2018 01:39 PM
OK, that was pointless; I set it up ages ago. It's a static 1:1 NAT so no PAT involved.
I'm worried I need a sniffer somewhere to see where the packets are (not) going.
09-21-2018 01:54 PM
cco@leferguson.com wrote:
I'm worried I need a sniffer somewhere to see where the packets are (not) going.
In your first post you confirmed the hub was creating VA interfaces over and over, so I'd conclude the packets are reaching the hub.
I've found previously when tracking an interface with FlexVPN to have a down and up delay of say 60 seconds, combined with dpd (my notes indicated I used periodic rather than on-demand). I could flip between shutting down the primary interface, automatically establishing a tunnel on the secondary and then no shut the primary interface and re-establish a tunnel without issue.
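As a sketch of that combination, using the track numbers from this thread (the 60-second delays and periodic DPD are the values described above; tune to taste):

```
! Slow down failover/failback so the old tunnel has time to die
track 7 ip sla 7 reachability
 delay down 60 up 60
!
! Proactively detect and tear down dead peers
crypto ikev2 dpd 10 2 periodic
```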
09-21-2018 03:52 PM
cco@leferguson.com wrote:
I'm worried I need a sniffer somewhere to see where the packets are (not) going.
OK, once I remembered to turn off the route cache, I could see the packets, there is some kind of addressing problem. Here's an example:
Sep 21 21:55:38.941: IP: s=X.X.X.X (Cellular0), d=10.192.67.219, len 128, input feature
Sep 21 21:55:38.941: UDP src=4500, dst=4500, Common Flow Table(5), rtype 0, forus FALSE, sendself FALSE, mtu 0, fwdchk FALSE
Sep 21 21:55:38.941: IP: s=X.X.X.X (Cellular0), d=10.192.67.219, len 128, input feature
Sep 21 21:55:38.941: UDP src=4500, dst=4500, Stateful Inspection(8), rtype 0, forus FALSE, sendself FALSE, mtu 0, fwdchk FALSE
Sep 21 21:55:38.941: IP: s=X.X.X.X (Cellular0), d=10.192.67.219, len 128, input feature
Sep 21 21:55:38.941: UDP src=4500, dst=4500, Dialer i/f override(25), rtype 0, forus FALSE, sendself FALSE, mtu 0, fwdchk FALSE
Sep 21 21:55:38.941: IP: s=X.X.X.X (Cellular0), d=10.192.67.219, len 128, input feature
Sep 21 21:55:38.941: UDP src=4500, dst=4500, Virtual Fragment Reassembly(39), rtype 0, forus FALSE, sendself FALSE, mtu 0, fwdchk FALSE
Sep 21 21:55:38.941: IP: s=X.X.X.X (Cellular0), d=10.192.67.219, len 128, input feature
Sep 21 21:55:38.941: UDP src=4500, dst=4500, Virtual Fragment Reassembly After IPSec Decryption(57), rtype 0, forus FALSE, sendself FALSE, mtu 0, fwdchk FALSE
Sep 21 21:55:38.941: IP: s=X.X.X.X (Cellular0), d=172.16.1.100, len 128, input feature
Sep 21 21:55:38.941: UDP src=4500, dst=4500, NAT Outside(92), rtype 0, forus FALSE, sendself FALSE, mtu 0, fwdchk FALSE
Sep 21 21:55:38.941: IP: s=X.X.X.X (Cellular0), d=172.16.1.100, len 128, input feature
Sep 21 21:55:38.941: UDP src=4500, dst=4500, MCI Check(109), rtype 0, forus FALSE, sendself FALSE, mtu 0, fwdchk FALSE
Sep 21 21:55:38.941: FIBipv4-packet-proc: route packet from Cellular0 src X.X.X.X dst 172.16.1.100
Sep 21 21:55:38.941: FIBfwd-proc: Default:0.0.0.0/0 process level forwarding
Sep 21 21:55:38.941: FIBfwd-proc: depth 0 first_idx 0 paths 1 long 0(0)
Sep 21 21:55:38.941: FIBfwd-proc: try path 0 (of 1) v4-sp first short ext 0(-1)
Sep 21 21:55:38.941: FIBfwd-proc: v4-sp valid
Sep 21 21:55:38.941: FIBfwd-proc: no nh type 8 - deag
Sep 21 21:55:38.941: FIBfwd-proc: ip_pak_table 0 ip_nh_table 65535 if none nh none deag 1 chg_if 0 via fib 0 path type special prefix
Sep 21 21:55:38.941: FIBfwd-proc: Default:0.0.0.0/0 not enough info to forward via fib (none none)
Sep 21 21:55:38.941: FIBipv4-packet-proc: packet routing failed
Now here's the thing: the X.X.X.X is the headquarters outside IP, so that's right. The 10.192.67.219 is the cellular address of the moment.
Partway through, notice the destination changes from 10.192.67.219 to 172.16.1.100. That latter WAS the IP address of the ethernet (ISP) interface, which at this moment is not active. The real destination should be, I think, the tunnel IP? That would be 172.25.39.1, which is on the tunnel interface as the Loopback (via ip unnumbered).
At this point I can see:
Portable1#show ip nat trans
Pro Inside global       Inside local       Outside local   Outside global
udp 10.192.67.219:4500  172.16.1.100:4500  X.X.X.X:4500    X.X.X.X:4500
I've got two NAT statements:
ip nat inside source route-map ISP6 interface Cellular0 overload
ip nat inside source route-map ISP7 interface GigabitEthernet0 overload
the route maps above are:
route-map ISP6 permit 10
 match ip address INSIDE-TO-INSIDE-DENY-ACL
 match interface Cellular0
!
route-map ISP7 permit 10
 match ip address INSIDE-TO-INSIDE-DENY-ACL
 match interface GigabitEthernet0
Ignoring the IP address for the moment, as I understand it ISP6 should have been chosen to translate traffic leaving Cellular0, but it is using the translation from the second NAT statement -- more to the point, it's using a translation to an address that no longer even exists. Worse squared: at no time did 10.192.67.219 (the cellular interface) ever communicate with 172.16.1.100 (the ethernet interface to the ISP). Both have ip nat outside on them.
So I'm at a real loss as to what I am doing wrong, but the issue appears to be these two NAT statements and the route-map sources for them. Which do work once -- but not after a change.
But I feel like I'm getting closer now.
09-21-2018 07:21 PM
Yes, the issue is definitely that NAT is caching a bad value. During the transition, for reasons I cannot explain, I get an inside local address of the interface that was just admin-downed, and an inside global address of the newly up WAN interface. E.g.:
#show ip nat trans
Pro Inside global      Inside local       Outside local   Outside global
udp 172.16.1.100:4500  10.96.68.104:4500  X.X.X.X:4500    X.X.X.X:4500
That first address is the Gig 0 interface that just came up with an ISP's DHCP address, and the 10.96.68.104 address is the no-longer-active address of the Cellular 0 interface that was just shut down. And traffic just keeps on keeping it alive. If I clear ip nat trans * then the tunnel immediately comes up and stays up.
I've been partially successful with a kludge of doing the clear in event manager on the transition, but not completely; the translation sticks around a long time. And I don't want to keep clearing NAT translations long after the switch, as that will interfere with unrelated internet access through NAT.
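For what it's worth, the kludge looks roughly like this (applet name invented for illustration; it clears all translations, which is exactly the collateral-damage problem with this approach):

```
event manager applet CLEAR-NAT-ON-ISP-CHANGE
 description Kludge: flush NAT translations whenever the local ISP track changes state
 event track 7 state any
 action 1.0 cli command "enable"
 action 2.0 cli command "clear ip nat translation *"
 action 3.0 syslog msg "Cleared NAT translations after track 7 state change"
```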
What I'd really like to know is how I'm getting a nat translation so crossed up as this -- from one WAN (outside) port to the other WAN (outside) port. And only while the interfaces are in transition (I'm not quite sure if it's when the new one goes up, or when the old comes down).
I can probably bang away on the event manager kludge of clearing translations and get this to work, but...
Any ideas why it is happening?
09-22-2018 07:08 AM
OK, I think I understand this and have it fixed. It's a bit weird (or seems so to me). First what I see with a lot of detail shown in debug.
With the Gi0 interface up at 172.16.1.10, packets are going out without NAT, e.g.
Sep 22 13:03:18.507: IP: s=172.16.1.100 (local), d=X.X.X.X (GigabitEthernet0), len 128, output feature Sep 22 13:03:18.507: UDP src=4500, dst=4500, Post-routing NAT Outside(26), rtype 1, forus FALSE, sendself FALSE, mtu 0, fwdchk FALSE
I then remove the IP address from Gi0 to force a failure. Before the cellular link is brought up I get this, as expected:
Sep 22 13:03:23.552: FIBipv4-packet-proc: route packet from (local) src 172.16.1.100 dst X.X.X.X
Sep 22 13:03:23.552: FIBfwd-proc: Default:0.0.0.0/0 process level forwarding
Sep 22 13:03:23.552: FIBfwd-proc: depth 0 first_idx 0 paths 1 long 0(0)
Sep 22 13:03:23.552: FIBfwd-proc: try path 0 (of 1) v4-sp first short ext 0(-1)
Sep 22 13:03:23.552: FIBfwd-proc: v4-sp valid
Sep 22 13:03:23.552: FIBfwd-proc: no nh type 8 - deag
Sep 22 13:03:23.552: FIBfwd-proc: ip_pak_table 0 ip_nh_table 65535 if none nh none deag 1 chg_if 0 via fib 0 path type special prefix
Sep 22 13:03:23.552: FIBfwd-proc: Default:0.0.0.0/0 not enough info to forward via fib (none none)
Sep 22 13:03:23.552: FIBipv4-packet-proc: packet routing failed
This attempt continues, however -- I assume originating from the now-dying SA. The cellular comes up (10.96.68.104) and the track activates from its continual pings, which installs a default route via the cellular. All correct. But this SA is still sending:
Sep 22 13:05:21.260: IP: s=172.16.1.100 (local), d=X.X.X.X, len 108, local feature
Sep 22 13:05:21.260: UDP src=4500, dst=4500, Policy Routing(3), rtype 0, forus FALSE, sendself FALSE, mtu 0, fwdchk FALSE
Sep 22 13:05:21.260: FIBipv4-packet-proc: route packet from (local) src 172.16.1.100 dst X.X.X.X
Sep 22 13:05:21.260: FIBfwd-proc: packet routed by adj to Cellular0 0.0.0.0
Sep 22 13:05:21.260: FIBipv4-packet-proc: packet routing succeeded
Sep 22 13:05:21.260: IP: s=172.16.1.100 (local), d=66.158.37.6 (Cellular0), len 108, sending
Sep 22 13:05:21.260: UDP src=4500, dst=4500
Sep 22 13:05:21.260: NAT: s=172.16.1.100->10.96.68.104, d=66.158.37.6 [17162]
Now it sends through the cellular, and in doing so caches a translation that ties the cellular address to the now defunct ISP address.
#show ip nat trans
Pro Inside global       Inside local        Outside local   Outside global
udp 10.96.68.104:4500   172.16.1.100:4500   X.X.X.X:4500    X.X.X.X:4500
This causes inbound packets that SHOULD be delivered to the new tunnel's Cellular address to be translated inappropriately back to the defunct ISP address, since they use the same port.
Sep 22 13:05:26.738: IP: s=X.X.X.X (Cellular0), d=10.96.68.104, len 144, input feature
Sep 22 13:05:26.738: UDP src=4500, dst=4500, Common Flow Table(5), rtype 0, forus FALSE, sendself FALSE, mtu 0, fwdchk FALSE
Sep 22 13:05:26.738: IP: s=X.X.X.X (Cellular0), d=10.96.68.104, len 144, input feature
Sep 22 13:05:26.738: UDP src=4500, dst=4500, Stateful Inspection(8), rtype 0, forus FALSE, sendself FALSE, mtu 0, fwdchk FALSE
Sep 22 13:05:26.738: IP: s=X.X.X.X (Cellular0), d=10.96.68.104, len 144, input feature
Sep 22 13:05:26.738: UDP src=4500, dst=4500, Dialer i/f override(25), rtype 0, forus FALSE, sendself FALSE, mtu 0, fwdchk FALSE
Sep 22 13:05:26.738: IP: s=X.X.X.X (Cellular0), d=10.96.68.104, len 144, input feature
Sep 22 13:05:26.742: UDP src=4500, dst=4500, Virtual Fragment Reassembly(39), rtype 0, forus FALSE, sendself FALSE, mtu 0, fwdchk FALSE
Sep 22 13:05:26.742: IP: s=X.X.X.X (Cellular0), d=10.96.68.104, len 144, input feature
Sep 22 13:05:26.742: UDP src=4500, dst=4500, Virtual Fragment Reassembly After IPSec Decryption(57), rtype 0, forus FALSE, sendself FALSE, mtu 0, fwdchk FALSE
Sep 22 13:05:26.742: NAT: s=X.X.X.X, d=10.96.68.104->172.16.1.100 [56943]
Sep 22 13:05:26.742: IP: s=X.X.X.X (Cellular0), d=172.16.1.100, len 144, input feature
Sep 22 13:05:26.742: UDP src=4500, dst=4500, NAT Outside(92), rtype 0, forus FALSE, sendself FALSE, mtu 0, fwdchk FALSE
Sep 22 13:05:26.742: IP: s=X.X.X.X (Cellular0), d=172.16.1.100, len 144, input feature
Sep 22 13:05:26.742: UDP src=4500, dst=4500, MCI Check(109), rtype 0, forus FALSE, sendself FALSE, mtu 0, fwdchk FALSE
Sep 22 13:05:26.742: FIBipv4-packet-proc: route packet from Cellular0 src X.X.X.X dst 172.16.1.100
Sep 22 13:05:26.742: FIBfwd-proc: packet routed by adj to Cellular0 0.0.0.0
Sep 22 13:05:26.742: FIBipv4-packet-proc: packet routing succeeded
I'm not strong enough in NAT and routing to know whether this is what should happen, whether a bad configuration somewhere is causing it, or whether it's a bug or a feature. Clearing NAT translations fixes it, but the timing is tough -- traffic originating from the defunct IP address continues for a long time, and you have to keep clearing it (or wait a long time, like minutes). DPD is either not happening, not happening fast enough, or happening at the wrong end... not sure.
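On the DPD point, IKEv2 dead peer detection can be enabled per profile so a dead SA is torn down sooner. A minimal sketch, assuming an IKEv2 profile (the profile name here is a placeholder, and the 10-second interval / 2-second retry values are just examples to tune):

```
! Sketch: on-demand DPD in the IKEv2 profile, so the stale SA on the
! dead WAN interface is detected and torn down instead of lingering.
crypto ikev2 profile FLEX-PROFILE
 dpd 10 2 on-demand
```

Even with aggressive DPD, though, the stale NAT translation can outlive the SA, so this alone would not have fixed the transition.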
Since both WAN addresses are negotiated, it's not possible to use static ACL entries in the NAT to exclude them. I'm using a route map now with an access list that does NOT match internal-to-internal addresses but does match internal-to-external. Unfortunately both the cellular and ISP interfaces get private addresses (out of our control).
What seems to fix it is to extend the access list in the route map so that it explicitly denies any match on the tunnel peer's address (e.g. host x.x.x.x any and any host x.x.x.x). This makes the route map for the NAT statement fail, so any traffic to the destination peer is never translated. It means I cannot ping the peer from an internal address, but that's a minor detail (I guess I could explicitly permit the router's own private addresses, but that's more complexity).
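To make that concrete, here is a sketch of the shape of the fix. All names are placeholders, x.x.x.x stands for the hub's static public address, and the 10.0.0.0/8 internal range is an assumption -- this is not my literal config:

```
! Sketch only: exclude the tunnel peer from NAT so no translation can be cached.
ip access-list extended NAT-ELIGIBLE
 deny   ip any host x.x.x.x                                ! never translate traffic to the tunnel peer
 deny   ip host x.x.x.x any                                ! nor traffic from it
 deny   ip 10.0.0.0 0.255.255.255 10.0.0.0 0.255.255.255   ! internal-to-internal stays untranslated
 permit ip 10.0.0.0 0.255.255.255 any                      ! internal-to-Internet gets translated
!
route-map NAT-MAP permit 10
 match ip address NAT-ELIGIBLE
!
! One such statement per WAN interface, per the two NAT statements above.
ip nat inside source route-map NAT-MAP interface GigabitEthernet0 overload
```

With the peer denied in the ACL, the route map never matches IKE/IPsec traffic to the hub, so a stale translation keyed to the old WAN address cannot be created during the transition.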
Anyway, so far with that added, no stale NAT entry is cached and the transition occurs properly. I've done it twice now; after some breakfast I'll test more extensively, but I think this is going to be the fix. BTW, I also had a NAT statement on the Tunnel interface which I removed, though I'm not sure that had any impact (I removed it long before this fix and it still failed the same way).
One other quirk -- changing the access list did not work at first; it required a reboot. Not sure why, but I THINK something in the NAT or route-map evaluation had cached the access list and did not recognize the changed version. After a reboot it worked properly -- proof that Cisco was really out to mess with my head on this one.
It's also worth mentioning that none of this becomes a real issue if your WAN ports have external addresses that are not in your NAT list -- static public addresses would not have any issue, nor would private addresses already excluded from your NAT by access list. In our case, though, we don't control either the ISP or cellular addresses, so we left them implicitly in. That also means we could hit an actual address conflict (e.g. a local ISP using the same subnet as an internal site we need to route to). There's not a lot we can do about that, I think.
Anyway, I'll update this later if I find more, but I think that's the answer. If anyone knows whether the NAT translation caching shown above is correct behavior, I'd love to know. I would have hoped that removing the IP from the interface would stop the SA from using it, the routing engine from routing it, or NAT from translating it. But all three happily kept using it.
Thanks for all the help.
Linwood
09-25-2018 07:47 AM
I've been testing this now for a few days and it seems to work, so I think this is the issue -- a need to explicitly exclude the (static) tunnel peer address from NAT, since the cellular and/or ISP dynamic addresses are likely to be private and otherwise included.