09-21-2018 09:11 AM - edited 02-21-2020 09:28 PM
This is sadly not a simple thing to describe. Short version: I can't get FLEXVPN to shift from one tunnel to another dynamically even though each works separately.
Longer version: I'll first describe the conceptual problem, and then give some actual detailed configs in the first reply.
We have two 800 series routers, a C819G-LTE-NMA-K9 running 15.4(3r)M1 (called "Portable1") which is used as a portable emergency connection, and a C892FSP-K9 which is used for tunnel termination inside our headquarters, located behind an ASA which passes the tunnels through for decryption on the C892FSP.
The goal is to make Portable1 a flexible replacement carrier. We have a couple dozen locations with fairly redundant fiber and microwave links between them, but this is a public safety network, so as a last-ditch failure mode we have the portable to physically carry to a location and plug into that network (which might be an isolated island of several locations depending on the failure mode, or might be a mobile command vehicle). The goal is to let EIGRP figure out what is visible in that island, connect to the main network, and rejoin, all with zero configuration changes. That works great; we have tested it quite a bit.
Further if the problem location happens to have a working ISP (many do) we want to plug into that instead of using cellular, since it is likely faster and lower latency. Again, the goal is no router config changes - it needs to figure it out and do the right thing. And that works, and has been pretty thoroughly tested with both cellular and ISP connections.
Finally (and herein lies the issue), if we are running on cellular and someone plugs in an ISP, it should shift to it. If the ISP later fails, it should shift back to cellular. That process is failing, but not in the way I would expect: it shifts the tunnel's source interface, but the tunnel will not come up properly, even though the exact same tunnel comes up fine if it is the first tunnel chosen after a reboot. Further, each tunnel works fine if interrupted and re-established in place (as opposed to shifting to the other tunnel). It is the change from one tunnel source to the other that fails.
The mechanics of those changes work perfectly, meaning the event manager code and FlexVPN path selection works; the tunnel source changes, the interfaces are shut down (or not) as appropriate, and the routing changes properly - I can log in and test and see all that (from a separate connection). However, when the tunnel source interface changes (and associated IP), the IKEV2 tunnels do not come up after the switch. The same tunnels that work fine if they are the first ones brought up.
The portable creates more and more IKEv2 SAs, all "IN-NEG", whereas the headquarters termination of the tunnel shows one that is "READY" and keeps creating new Virtual-AccessN interfaces over and over as one goes down and another comes up. I have seen some variation on this theme, but that is the general scenario.
The termination router (static address) gives errors (debug crypto ikev2 error) like these:
Sep 21 15:15:50.383: IKEv2:: Packet is a retransmission
Sep 21 15:15:54.287: IKEv2:: Packet is a retransmission
Sep 21 15:15:59.075: IKEv2:Couldn't find matching SA: Detected an invalid IKE SPI
Sep 21 15:15:59.075: IKEv2:: A supplied parameter is incorrect
Sep 21 15:16:00.643: IKEv2:Couldn't find matching SA: Detected an invalid IKE SPI
Sep 21 15:16:00.643: IKEv2:: A supplied parameter is incorrect
Sep 21 15:16:02.267: IKEv2:Couldn't find matching SA: Could not find neg context
Sep 21 15:16:02.267: IKEv2:: A supplied parameter is incorrect
Sep 21 15:16:15.487: IKEv2:Couldn't find matching SA: Detected an invalid IKE SPI
Sep 21 15:16:15.487: IKEv2:: A supplied parameter is incorrect
Sep 21 15:16:17.423: IKEv2:Couldn't find matching SA: Could not find neg context
Sep 21 15:16:17.423: IKEv2:: A supplied parameter is incorrect
Sep 21 15:16:18.395: IKEv2:Failed to retrieve Certificate Issuer list
Sep 21 15:16:18.399: IKEv2:Failed to retrieve Certificate Issuer list
Sep 21 10:16:18.415 cdt: %LINEPROTO-5-UPDOWN: Line protocol on Interface Virtual-Access2, changed state to down
Sep 21 10:16:18.419 cdt: %LINEPROTO-5-UPDOWN: Line protocol on Interface Virtual-Access1, changed state to down
Sep 21 10:16:18.419 cdt: %LINK-3-UPDOWN: Interface Virtual-Access1, changed state to down
Sep 21 15:16:18.427: IKEv2:Error constructing config reply
Sep 21 10:16:18.431 cdt: %LINEPROTO-5-UPDOWN: Line protocol on Interface Virtual-Access2, changed state to up
The Portable1 router for the same debug gives:
*Sep 21 10:18:06.219 cdt: %FLEXVPN-6-FLEXVPN_CONNECTION_DOWN: FlexVPN(FLEXVPN_IKEV2_CLIENT) Client_public_addr = 172.16.1.100 Server_public_addr = x.x.x.x (correct address)
*Sep 21 15:19:01.703: IKEv2-ERROR:(SESSION ID = 1,SA ID = 3):: Maximum number of retransmissions reached
*Sep 21 15:19:01.703: IKEv2-ERROR:(SESSION ID = 1,SA ID = 3):: Auth exchange failed
The authentication is PSK and is correct -- it works fine if it's the first tunnel up (in either case). It is as though there is some mismatch during the transition, as though it's using information from the past tunnel in some fashion and not connecting properly at each end (or maybe just at one end). SHOW CRYPTO IKEV2 SA and SHOW CRYPTO IPSEC SA both have all the right IPs and seem to line up, but the IN-NEG never completes.
Again, I'll post a follow-up with the actual configs, redacted a bit; here's the core of the tunnel code that chooses the destination:
crypto ikev2 keyring IKEV2_KEYRING
 peer TUNNEL_PEERS
  address x.x.x.x
  pre-shared-key redacted
!
crypto ikev2 profile IKEV2_PROFILE
 match identity remote any
 authentication local pre-share
 authentication remote pre-share
 keyring local IKEV2_KEYRING
!
crypto ikev2 client flexvpn FLEXVPN_IKEV2_CLIENT
  peer 1 x.x.x.x
  source 1 GigabitEthernet0 track 7
  source 2 Cellular0 track 6
  client connect Tunnel31
The issue is during a changeover in the tracks. If 6 is up and 7 down on boot - works. If 7 is up and 6 is down on boot - works. But start one way, and switch -- fails. This would seem to eliminate almost everything - NAT, routing, PSK, etc., since each works in one state. It's something about the changeover.
Is there some issue with using FLEXVPN and changing the source dynamically with TRACK as above?
Thanks for any insight.
Linwood
09-22-2018 07:08 AM
OK, I think I understand this and have it fixed. It's a bit weird (or seems so to me). First what I see with a lot of detail shown in debug.
With the Gi0 interface up at 172.16.1.100, packets are going out without NAT, e.g.:
Sep 22 13:03:18.507: IP: s=172.16.1.100 (local), d=X.X.X.X (GigabitEthernet0), len 128, output feature
Sep 22 13:03:18.507: UDP src=4500, dst=4500, Post-routing NAT Outside(26), rtype 1, forus FALSE, sendself FALSE, mtu 0, fwdchk FALSE
I then remove the IP address from Gi0 to force a failure. Before the cellular link is brought up I get this, as expected:
Sep 22 13:03:23.552: FIBipv4-packet-proc: route packet from (local) src 172.16.1.100 dst X.X.X.X
Sep 22 13:03:23.552: FIBfwd-proc: Default:0.0.0.0/0 process level forwarding
Sep 22 13:03:23.552: FIBfwd-proc: depth 0 first_idx 0 paths 1 long 0(0)
Sep 22 13:03:23.552: FIBfwd-proc: try path 0 (of 1) v4-sp first short ext 0(-1)
Sep 22 13:03:23.552: FIBfwd-proc: v4-sp valid
Sep 22 13:03:23.552: FIBfwd-proc: no nh type 8 - deag
Sep 22 13:03:23.552: FIBfwd-proc: ip_pak_table 0 ip_nh_table 65535 if none nh none deag 1 chg_if 0 via fib 0 path type special prefix
Sep 22 13:03:23.552: FIBfwd-proc: Default:0.0.0.0/0 not enough info to forward via fib (none none)
Sep 22 13:03:23.552: FIBipv4-packet-proc: packet routing failed
This attempt continues, however, originating (I assume) from the now-dying SA. The cellular comes up (10.96.68.104) and the track activates from its continual pings, which installs a default route via the cellular. All correct. But the old SA is still sending:
Sep 22 13:05:21.260: IP: s=172.16.1.100 (local), d=X.X.X.X, len 108, local feature
Sep 22 13:05:21.260: UDP src=4500, dst=4500, Policy Routing(3), rtype 0, forus FALSE, sendself FALSE, mtu 0, fwdchk FALSE
Sep 22 13:05:21.260: FIBipv4-packet-proc: route packet from (local) src 172.16.1.100 dst X.X.X.X
Sep 22 13:05:21.260: FIBfwd-proc: packet routed by adj to Cellular0 0.0.0.0
Sep 22 13:05:21.260: FIBipv4-packet-proc: packet routing succeeded
Sep 22 13:05:21.260: IP: s=172.16.1.100 (local), d=66.158.37.6 (Cellular0), len 108, sending
Sep 22 13:05:21.260: UDP src=4500, dst=4500
Sep 22 13:05:21.260: NAT: s=172.16.1.100->10.96.68.104, d=66.158.37.6 [17162]
Now it sends through the cellular, and in doing so caches a translation that ties the cellular address to the now defunct ISP address.
#show ip nat trans
Pro Inside global      Inside local       Outside local   Outside global
udp 10.96.68.104:4500  172.16.1.100:4500  X.X.X.X:4500    X.X.X.X:4500
This causes inbound packets that SHOULD be delivered to the new tunnel's Cellular address to be translated inappropriately back to the defunct ISP address, since they use the same port.
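To make that concrete, here is how the stale entry can be confirmed and flushed by hand, using the addresses from this example. The wholesale clear definitely works but is disruptive to other NAT users; the single-entry form is a sketch from memory:

```
! Confirm the stale UDP/4500 entry tying the defunct ISP address to the cellular address
show ip nat translations | include 4500

! Flush everything (disruptive -- drops legitimate translations too)
clear ip nat translation *

! Or (sketch) clear just the one entry: inside global address/port, then inside local address/port
clear ip nat translation udp inside 10.96.68.104 4500 172.16.1.100 4500
```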
Sep 22 13:05:26.738: IP: s=X.X.X.X (Cellular0), d=10.96.68.104, len 144, input feature
Sep 22 13:05:26.738: UDP src=4500, dst=4500, Common Flow Table(5), rtype 0, forus FALSE, sendself FALSE, mtu 0, fwdchk FALSE
Sep 22 13:05:26.738: IP: s=X.X.X.X (Cellular0), d=10.96.68.104, len 144, input feature
Sep 22 13:05:26.738: UDP src=4500, dst=4500, Stateful Inspection(8), rtype 0, forus FALSE, sendself FALSE, mtu 0, fwdchk FALSE
Sep 22 13:05:26.738: IP: s=X.X.X.X (Cellular0), d=10.96.68.104, len 144, input feature
Sep 22 13:05:26.738: UDP src=4500, dst=4500, Dialer i/f override(25), rtype 0, forus FALSE, sendself FALSE, mtu 0, fwdchk FALSE
Sep 22 13:05:26.738: IP: s=X.X.X.X (Cellular0), d=10.96.68.104, len 144, input feature
Sep 22 13:05:26.742: UDP src=4500, dst=4500, Virtual Fragment Reassembly(39), rtype 0, forus FALSE, sendself FALSE, mtu 0, fwdchk FALSE
Sep 22 13:05:26.742: IP: s=X.X.X.X (Cellular0), d=10.96.68.104, len 144, input feature
Sep 22 13:05:26.742: UDP src=4500, dst=4500, Virtual Fragment Reassembly After IPSec Decryption(57), rtype 0, forus FALSE, sendself FALSE, mtu 0, fwdchk FALSE
Sep 22 13:05:26.742: NAT: s=X.X.X.X, d=10.96.68.104->172.16.1.100 [56943]
Sep 22 13:05:26.742: IP: s=X.X.X.X (Cellular0), d=172.16.1.100, len 144, input feature
Sep 22 13:05:26.742: UDP src=4500, dst=4500, NAT Outside(92), rtype 0, forus FALSE, sendself FALSE, mtu 0, fwdchk FALSE
Sep 22 13:05:26.742: IP: s=X.X.X.X (Cellular0), d=172.16.1.100, len 144, input feature
Sep 22 13:05:26.742: UDP src=4500, dst=4500, MCI Check(109), rtype 0, forus FALSE, sendself FALSE, mtu 0, fwdchk FALSE
Sep 22 13:05:26.742: FIBipv4-packet-proc: route packet from Cellular0 src X.X.X.X dst 172.16.1.100
Sep 22 13:05:26.742: FIBfwd-proc: packet routed by adj to Cellular0 0.0.0.0
Sep 22 13:05:26.742: FIBipv4-packet-proc: packet routing succeeded
I'm not strong enough in NAT and routing to know if this is what should happen, if I have a bad configuration somewhere that's causing it, or if it's a bug or a feature. Clearing NAT translations fixes it, but the timing is tough -- the traffic originating at the defunct IP address continues for a long time, and you have to keep clearing it (or wait a long time, like minutes). DPD is either not happening, not happening fast enough, or happening at the wrong end... not sure.
Since both WAN interfaces get negotiated addresses, it's not possible to use ACLs in the NAT statements to exclude them. I'm now using a route map with an access list that does NOT match internal-to-internal traffic but does match internal-to-external. Unfortunately both the cellular and ISP interfaces get private addresses (out of our control).
What seems to fix it is to extend the access list that is in the route map so that it explicitly denies any match on the destination tunnel peer (e.g. host x.x.x.x any and any host x.x.x.x). This causes the route map for the NAT statement to fail, and any traffic to the destination peer is not translated. This means to ping it I cannot ping from an internal address, but that's a minor detail (I guess I could explicitly allow on-router private addresses but that's more complexity).
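For concreteness, the added denies look something like this (x.x.x.x is the tunnel peer; note they have to land ahead of the existing permit lines, so on a live box you'd add them with sequence numbers below the permits):

```
ip access-list extended INSIDE-TO-INSIDE-DENY-ACL
 ! Never translate traffic to or from the tunnel peer, so IKE/IPsec
 ! packets are left alone by both WAN overload statements
 deny ip host x.x.x.x any
 deny ip any host x.x.x.x
```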
Anyway, so far with that added, no stale NAT cache entry occurs, and the transition happens properly. I've done it twice now; after some breakfast I'll test more extensively, but I think this is going to be the fix. BTW I had a NAT statement on the Tunnel interface which I removed as well, though I'm not sure that had any impact (I removed it long before this fix and it still failed the same way).
One other quirk -- changing the access list did not work at first, it required a reboot. Not sure why, but I THINK something about the NAT or route map evaluation had cached the access list and did not recognize a changed version. After a reboot it seemed to work properly -- proof that Cisco was really out to mess with my head on this one.
It's also worth mentioning that none of this becomes a real issue if you have external addresses not in your NAT list for the WAN ports, so static ports with public addresses would not have any issues nor would private addresses already excluded from your NAT by access list. However, in this case we know we don't control either the ISP addresses nor cellular addresses, so we left them implicitly in. This also means we might get a case with an actual conflict of a necessary address (e.g. a local ISP with the same subnet as an internal site we need to route to). There's not a lot we can do about that, I think.
Anyway, I'll update this later if I find more, but I think that's the answer. If anyone knows whether the NAT caching shown in the list above is correct behavior, I'd love to know. I would have hoped that removing the IP from the interface would somehow stop the SA from using it, the routing engine from routing it, or NAT from translating it. But all happily kept using it.
Thanks for all the help.
Linwood
09-21-2018 09:14 AM
vtp mode transparent
!
crypto ikev2 proposal IKEV2_PROP
 encryption aes-cbc-256
 prf sha512
 integrity sha1
 group 2
!
crypto ikev2 policy IKEV2_POLICY
 proposal IKEV2_PROP
!
crypto ikev2 keyring IKEV2_KEYRING
 peer TUNNEL_PEERS
  address 0.0.0.0 0.0.0.0
  pre-shared-key REDACTED
!
!
crypto ikev2 profile IKEV2_PROFILE
 match identity remote any
 authentication remote pre-share
 authentication local pre-share
 keyring local IKEV2_KEYRING
 virtual-template 1
!
vlan 2101
!
vlan 2137
 name toASA
lldp run
!
crypto isakmp policy 100
 encr aes 256
 authentication pre-share
 group 2
!
crypto ipsec transform-set VTISET esp-aes 256 esp-sha-hmac
 mode tunnel
!
crypto ipsec profile IKEV2_IPSEC_PROFILE
 set security-association lifetime kilobytes 200000
 set security-association lifetime seconds 1800
 set transform-set VTISET
 set pfs group2
 set ikev2-profile IKEV2_PROFILE
!
interface Loopback39
 description Interface used to form tunnel
 ip address 172.25.39.5 255.255.255.0
!
interface GigabitEthernet7
 description Internet DMZ connection to ASA firewall specifically for tunnels
 switchport access vlan 2137
 no ip address
!
interface Virtual-Template1 type tunnel
 ip unnumbered Loopback39
 tunnel source Vlan2137
 tunnel mode ipsec ipv4
 tunnel destination dynamic
 tunnel protection ipsec profile IKEV2_IPSEC_PROFILE
!
interface Vlan2137
 description Bogus VLAN because this router won't let me have a non-switch port
 ip address 172.26.38.5 255.255.255.0
!
router eigrp 10
 distribute-list 98 in
 distribute-list 98 out
 network 10.0.0.0
 network 172.16.0.0 0.15.255.255
 network 192.168.0.0 0.0.63.255
!
ip route 0.0.0.0 0.0.0.0 172.26.38.10
access-list 98 deny   0.0.0.0
access-list 98 deny   128.0.0.0
access-list 98 permit any
ethernet lmi ce
ip dhcp excluded-address 172.25.36.31
!
ip dhcp pool PORTABLE_POOL
 network 172.25.36.0 255.255.255.0
 lease 0 0 5
ip cef
!
multilink bundle-name authenticated
!
chat-script lte "" "AT!CALL" TIMEOUT 20 "OK"
!
license udi pid C819G-LTE-MNA-K9 sn REDACTED
!
no spanning-tree vlan 1922
no spanning-tree vlan 2139
vtp mode transparent
!
crypto ikev2 proposal IKEV2_PROP
 encryption aes-cbc-256
 prf sha512
 integrity sha1
 group 2
!
crypto ikev2 policy IKEV2_POLICY
 proposal IKEV2_PROP
!
crypto ikev2 keyring IKEV2_KEYRING
 peer TUNNEL_PEERS
  address x.x.x.x    <<<< public address of ASA's interface that NATs to tunnel router above
  pre-shared-key REDACTED
!
!
crypto ikev2 profile IKEV2_PROFILE
 match identity remote any
 authentication local pre-share
 authentication remote pre-share
 keyring local IKEV2_KEYRING
!
crypto ikev2 client flexvpn FLEXVPN_IKEV2_CLIENT
  peer 1 x.x.x.x    <<< Same as x.x.x.x above
  source 1 GigabitEthernet0 track 7
  source 2 Cellular0 track 6
  client connect Tunnel31
!
controller Cellular 0
 lte sim data-profile 14 attach-profile 14 slot 0
 lte sim data-profile 14 attach-profile 14 slot 1
 lte modem link-recovery rssi onset-threshold -110
 lte modem link-recovery monitor-timer 20
 lte modem link-recovery wait-timer 10
 lte modem link-recovery debounce-count 6
!
track 6 ip sla 6 reachability
 delay down 30 up 30
!
track 7 ip sla 7 reachability
 delay down 30 up 30
!
crypto isakmp invalid-spi-recovery
!
crypto ipsec transform-set VTISET esp-aes 256 esp-sha-hmac
 mode tunnel
!
crypto ipsec profile IKEV2_IPSEC_PROFILE
 set security-association lifetime kilobytes 200000
 set security-association lifetime seconds 1800
 set transform-set VTISET
 set pfs group2
 set ikev2-profile IKEV2_PROFILE
!
interface Loopback39
 description Interface used to form tunnel
 ip address 172.25.39.1 255.255.255.0
!
interface Tunnel31
 ip unnumbered Loopback39
 ip nat inside
 ip virtual-reassembly in
 tunnel source dynamic
 tunnel mode ipsec ipv4
 tunnel destination dynamic
 tunnel protection ipsec profile IKEV2_IPSEC_PROFILE
!
interface Cellular0
 description Connection to GSM modem (secondary internet if primary down)
 ip dhcp client lease 0 0 5
 ip address negotiated
 ip nat outside
 ip virtual-reassembly in
 encapsulation slip
 dialer in-band
 dialer string lte
 dialer-group 1
 async mode interactive
!
interface GigabitEthernet0
 description Connection to local internet (if any)
 ip dhcp client default-router distance 100
 ip dhcp client lease 0 0 5
 ip address dhcp
 ip nat outside
 ip virtual-reassembly in
 shutdown
 duplex full
 speed auto
!
interface Serial0
 no ip address
 shutdown
 clock rate 2000000
!
router eigrp 10
 distribute-list 98 in
 distribute-list 98 out
 network 10.0.0.0
 network 172.16.0.0 0.15.255.255
 network 192.168.0.0 0.0.63.255
 passive-interface GigabitEthernet0
 passive-interface Cellular0
 passive-interface Cellular1
!
ip local policy route-map CHOOSE-ISP
ip forward-protocol nd
ip nat inside source route-map ISP6 interface Cellular0 overload
ip nat inside source route-map ISP7 interface GigabitEthernet0 overload
ip route 0.0.0.0 0.0.0.0 Cellular0 100 track 6
ip access-list extended INSIDE-TO-INSIDE-DENY-ACL
 deny   ip 10.0.0.0 0.255.255.255 10.0.0.0 0.255.255.255
 deny   ip 10.0.0.0 0.255.255.255 192.168.0.0 0.0.63.255
 deny   ip 10.0.0.0 0.255.255.255 172.16.0.0 0.15.255.255
 deny   ip 192.168.0.0 0.0.63.255 10.0.0.0 0.255.255.255
 deny   ip 192.168.0.0 0.0.63.255 192.168.0.0 0.0.63.255
 deny   ip 192.168.0.0 0.0.63.255 172.16.0.0 0.15.255.255
 deny   ip 172.16.0.0 0.15.255.255 10.0.0.0 0.255.255.255
 deny   ip 172.16.0.0 0.15.255.255 192.168.0.0 0.0.63.255
 deny   ip 172.16.0.0 0.15.255.255 172.16.0.0 0.15.255.255
 permit ip 10.0.0.0 0.255.255.255 any
 permit ip 192.168.0.0 0.0.63.255 any
 permit ip 172.16.0.0 0.15.255.255 any
 permit ip any 172.0.0.0 0.0.255.255
!
ip sla 6
 icmp-echo 8.8.8.8 source-interface Cellular0    <<<< No particular reason to use different track addresses other than easier to notice in debug
 tag Cellular Connection up test
 frequency 10
ip sla schedule 6 life forever start-time now
ip sla 7
 icmp-echo 75.75.75.75 source-interface GigabitEthernet0
 tag Local ISP connection up test
 frequency 10
ip sla schedule 7 life forever start-time now
dialer-list 1 protocol ip permit
!
route-map CHOOSE-ISP permit 10
 match ip address 106
 set interface Cellular0
!
route-map CHOOSE-ISP permit 20
 match ip address 107
 set interface GigabitEthernet0
!
route-map ISP6 permit 10
 match ip address INSIDE-TO-INSIDE-DENY-ACL
 match interface Cellular0
!
route-map ISP7 permit 10
 match ip address INSIDE-TO-INSIDE-DENY-ACL
 match interface GigabitEthernet0
!
access-list 6 permit 8.8.8.8
access-list 7 permit 75.75.75.75
access-list 98 deny   0.0.0.0
access-list 98 deny   128.0.0.0    <<<<< Long irrelevant story but this is distributed as a static route in EIGRP but shouldn't be on this router
access-list 98 permit any
access-list 106 permit ip any host 8.8.8.8
access-list 107 permit ip any host 75.75.75.75
!
control-plane
!
line 3    <<< Not sure why line 3 goes with Cellular 0 but it does.
 script dialer lte
 modem InOut
 no exec
event manager applet TURN-GSM-OFF-IF-INTERNET-UP
 description Force cellular internet off anytime the local ISP is up and tracking
 event track 7 state up
 action 1.0 cli command "enable"
 action 2.0 cli command "config t"
 action 3.0 cli command "interface Cell 0"
 action 4.0 cli command "shutdown"
 action 5.0 syslog msg "Local ISP (track 7) up, turned Interface Cellular 0 off to reduce cost (also clears routes)"
 action 6.0 cli command "end"
event manager applet TURN-GSM-ON-IF-INTERNET-DOWN
 description Force cellular internet on anytime the local ISP is down
 event track 7 state down
 action 1.0 cli command "enable"
 action 2.0 cli command "config t"
 action 3.0 cli command "interface Cell 0"
 action 4.0 cli command "no shutdown"
 action 5.0 syslog msg "Local ISP (track 7) down, turned Interface Cellular 0 on to enable backup"
 action 6.0 cli command "end"
event manager applet CLEAR-LOCAL-ISP-BRIEFLY-ON-FAILURE
 description Force local internet off for a while (to clear routes) if it fails, falling back on cellular
 event track 7 state down maxrun 2000
 action 1.0 cli command "enable"
 action 2.0 cli command "config t"
 action 3.0 cli command "interface GigabitEthernet 0"
 action 4.0 cli command "shutdown"
 action 4.5 syslog msg "Cycled Interface GigabitEthernet 0 down to clear routes after track failed"
 action 5.0 wait 900
 action 6 cli command "no shutdown"
 action 6.5 cli command "end"
 action 7.0 syslog msg "Cycled interface GigabitEthernet 0 back on to try again"
event manager applet TICKLE-GSM-PERIODICALLY
 description The SLA won't cause interesting traffic so we need to hit it manually occasionally (this does nothing when down)
 event timer watchdog time 30
 action 0.7 cli command "enable"
 action 1.0 cli command "ping 8.8.8.8"
!
end
09-21-2018 09:36 AM
@Rob Ingram wrote:
Hi, My first thought would be that you required DPD in order to clear the old SAs.
Hmmm... The issue with that is the SA's look like new ones (they have the new address) but it's easy enough to try. It will take me a bit - I'm in the middle of a different (but similar) experiment, I'm turning off the tunnel on each transition for 60 seconds, to see if that forces things to clear out.
I have also manually done a clear crypto ikev2 sa fast (or whatever that syntax was) on each router, once it gets 'stuck' in this mode, and it has no beneficial impact - the ikev2 sa's go away as expected, but come back with this same weird mode. I realize ikev2 sa and dpd operate on a slightly different level (though not sure how related it is to clearing them).
I'll give it a try in an hour or so and update.
09-21-2018 12:01 PM
I added this to both routers, is there more that would be needed?
crypto isakmp keepalive 30 10 periodic
It had no effect that I can see.
I also added some event manager code so each time there is a change, it shuts the tunnel interface down for one minute to give things time to clear (not sure if that's long enough). That had no effect either.
09-21-2018 12:08 PM - edited 09-21-2018 12:42 PM
Try "dpd 10 2 on-demand" either global or under the IKEv2 profile. You can determine it's working by running a "debug crypto ikev2" and look for R U There messages.
EDIT - global command is "crypto ikev2 dpd 10 2 on-demand"
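Under the profile, placement would look like this (a sketch using the profile name from the configs in this thread):

```
crypto ikev2 profile IKEV2_PROFILE
 dpd 10 2 on-demand
```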
09-21-2018 01:03 PM
Not sure that changed anything; I am not seeing the R U There messages, but then again on-demand only fires when there's no data, and I'm already sending a lot of pings to make sure things are up.
With debug on for IKEv2 I see a lot of retransmitting from both sides for payload or ENCR. Eventually it hits max retransmissions, negotiates again (during which it appears to receive responses), then repeats.
I'm starting to wonder if I have a NAT issue of some sort. Everything here is NAT'd twice: the termination router is inside an ASA that NATs it, the portable router is doing NAT itself, and the cellular network is providing a NAT'd address as well. I'm starting to wonder if somewhere in that a port is being reused and misdirecting packets to a prior translation. I'm digging there a bit, trying to find why packets from one end aren't making it to the other (if that's really what's happening). That might explain why a reboot -- and the ensuing lack of traffic for a period -- might clear a PAT entry.
09-21-2018 01:39 PM
OK, that was pointless; I set it up ages ago. It's a static 1:1 NAT so no PAT involved.
I'm worried I need a sniffer somewhere to see where the packets are (not) going.
09-21-2018 01:54 PM
cco@leferguson.com wrote:
I'm worried I need a sniffer somewhere to see where the packets are (not) going.
In your first post you confirmed the hub was creating VA interfaces over and over, so I'd conclude the packets are reaching the hub.
I've found previously when tracking an interface with FlexVPN to have a down and up delay of say 60 seconds, combined with dpd (my notes indicated I used periodic rather than on-demand). I could flip between shutting down the primary interface, automatically establishing a tunnel on the secondary and then no shut the primary interface and re-establish a tunnel without issue.
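As a sketch of that combination, using the track numbers from this thread (the 60-second delays and periodic DPD are the values described above; tune to taste):

```
! Slow down failover/failback so the old tunnel has time to die
track 7 ip sla 7 reachability
 delay down 60 up 60
!
! Proactively detect and tear down dead peers
crypto ikev2 dpd 10 2 periodic
```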
09-21-2018 03:52 PM
cco@leferguson.com wrote:
I'm worried I need a sniffer somewhere to see where the packets are (not) going.
OK, once I remembered to turn off the route cache, I could see the packets, there is some kind of addressing problem. Here's an example:
Sep 21 21:55:38.941: IP: s=X.X.X.X (Cellular0), d=10.192.67.219, len 128, input feature
Sep 21 21:55:38.941: UDP src=4500, dst=4500, Common Flow Table(5), rtype 0, forus FALSE, sendself FALSE, mtu 0, fwdchk FALSE
Sep 21 21:55:38.941: IP: s=X.X.X.X (Cellular0), d=10.192.67.219, len 128, input feature
Sep 21 21:55:38.941: UDP src=4500, dst=4500, Stateful Inspection(8), rtype 0, forus FALSE, sendself FALSE, mtu 0, fwdchk FALSE
Sep 21 21:55:38.941: IP: s=X.X.X.X (Cellular0), d=10.192.67.219, len 128, input feature
Sep 21 21:55:38.941: UDP src=4500, dst=4500, Dialer i/f override(25), rtype 0, forus FALSE, sendself FALSE, mtu 0, fwdchk FALSE
Sep 21 21:55:38.941: IP: s=X.X.X.X (Cellular0), d=10.192.67.219, len 128, input feature
Sep 21 21:55:38.941: UDP src=4500, dst=4500, Virtual Fragment Reassembly(39), rtype 0, forus FALSE, sendself FALSE, mtu 0, fwdchk FALSE
Sep 21 21:55:38.941: IP: s=X.X.X.X (Cellular0), d=10.192.67.219, len 128, input feature
Sep 21 21:55:38.941: UDP src=4500, dst=4500, Virtual Fragment Reassembly After IPSec Decryption(57), rtype 0, forus FALSE, sendself FALSE, mtu 0, fwdchk FALSE
Sep 21 21:55:38.941: IP: s=X.X.X.X (Cellular0), d=172.16.1.100, len 128, input feature
Sep 21 21:55:38.941: UDP src=4500, dst=4500, NAT Outside(92), rtype 0, forus FALSE, sendself FALSE, mtu 0, fwdchk FALSE
Sep 21 21:55:38.941: IP: s=X.X.X.X (Cellular0), d=172.16.1.100, len 128, input feature
Sep 21 21:55:38.941: UDP src=4500, dst=4500, MCI Check(109), rtype 0, forus FALSE, sendself FALSE, mtu 0, fwdchk FALSE
Sep 21 21:55:38.941: FIBipv4-packet-proc: route packet from Cellular0 src X.X.X.X dst 172.16.1.100
Sep 21 21:55:38.941: FIBfwd-proc: Default:0.0.0.0/0 process level forwarding
Sep 21 21:55:38.941: FIBfwd-proc: depth 0 first_idx 0 paths 1 long 0(0)
Sep 21 21:55:38.941: FIBfwd-proc: try path 0 (of 1) v4-sp first short ext 0(-1)
Sep 21 21:55:38.941: FIBfwd-proc: v4-sp valid
Sep 21 21:55:38.941: FIBfwd-proc: no nh type 8 - deag
Sep 21 21:55:38.941: FIBfwd-proc: ip_pak_table 0 ip_nh_table 65535 if none nh none deag 1 chg_if 0 via fib 0 path type special prefix
Sep 21 21:55:38.941: FIBfwd-proc: Default:0.0.0.0/0 not enough info to forward via fib (none none)
Sep 21 21:55:38.941: FIBipv4-packet-proc: packet routing failed
Now here's the thing: the X.X.X.X is the headquarters outside IP, so that's right. The 10.192.67.219 is the cellular address of the moment.
Partway through, notice the destination changes from 10.192.67.219 to 172.16.1.100. That latter WAS the IP address of the ethernet (ISP) interface, which at this moment is not active. The real destination should be, I think, the tunnel IP? That would be 172.25.39.1, which is on the tunnel interface as the Loopback (via ip unnumbered).
At this point I can see:
Portable1#show ip nat trans
Pro Inside global       Inside local       Outside local   Outside global
udp 10.192.67.219:4500  172.16.1.100:4500  X.X.X.X:4500    X.X.X.X:4500
I've got two NAT statements:
ip nat inside source route-map ISP6 interface Cellular0 overload
ip nat inside source route-map ISP7 interface GigabitEthernet0 overload
the route maps above are:
route-map ISP6 permit 10
 match ip address INSIDE-TO-INSIDE-DENY-ACL
 match interface Cellular0
!
route-map ISP7 permit 10
 match ip address INSIDE-TO-INSIDE-DENY-ACL
 match interface GigabitEthernet0
Ignoring the IP address for the moment, as I understand it ISP6 should have been chosen to translate traffic leaving Cellular0, but it is using the translation from the second NAT statement -- more to the point, it's using a translation to an address that no longer even exists. Worse squared: at no time did 10.192.67.219 (the cellular interface) ever communicate with 172.16.1.100 (the ethernet interface to the ISP). Both have ip nat outside on them.
So I'm at a real loss as to what I am doing wrong, but the issue appears to be these two NAT statements and the route-map sources for them. Which do work once -- but not after a change.
But I feel like I'm getting closer now.
09-21-2018 07:21 PM
Yes, the issue is definitely that NAT is caching a bad value. During the transition, for reasons I cannot explain, I get an inside local address of the interface that was just admin-downed, and an inside global address of the newly up WAN interface. E.g.:
#show ip nat trans
Pro Inside global      Inside local       Outside local   Outside global
udp 172.16.1.100:4500  10.96.68.104:4500  X.X.X.X:4500    X.X.X.X:4500
That first address is the Gig 0 interface that just came up with an ISP's DHCP address, and the 10.96.68.104 address is the no-longer-active address of the Cellular 0 interface that was just shut down. And traffic just keeps on keeping it alive. If I clear ip nat trans * then the tunnel immediately comes up and stays up.
I've been partially successful with a kludge of doing the clear in event manager on the transition, but not completely; the translation sticks around a long time. And I don't want to keep clearing NAT translations long after the switch, as that will interfere with unrelated internet access through NAT.
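For what it's worth, the kludge looks roughly like this (applet name invented for illustration; it clears all translations, which is exactly the collateral-damage problem with this approach):

```
event manager applet CLEAR-NAT-ON-ISP-CHANGE
 description Kludge: flush NAT translations whenever the local ISP track changes state
 event track 7 state any
 action 1.0 cli command "enable"
 action 2.0 cli command "clear ip nat translation *"
 action 3.0 syslog msg "Cleared NAT translations after track 7 state change"
```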
What I'd really like to know is how I'm getting a nat translation so crossed up as this -- from one WAN (outside) port to the other WAN (outside) port. And only while the interfaces are in transition (I'm not quite sure if it's when the new one goes up, or when the old comes down).
I can probably bang away on the event manager kludge of clearing translations and get this to work, but...
Any ideas why it is happening?
09-22-2018 07:08 AM
OK, I think I understand this and have it fixed. It's a bit weird (or seems so to me). First what I see with a lot of detail shown in debug.
With the Gi0 interface up at 172.16.1.10, packets are going out without NAT, e.g.
Sep 22 13:03:18.507: IP: s=172.16.1.100 (local), d=X.X.X.X (GigabitEthernet0), len 128, output feature Sep 22 13:03:18.507: UDP src=4500, dst=4500, Post-routing NAT Outside(26), rtype 1, forus FALSE, sendself FALSE, mtu 0, fwdchk FALSE
I then remove the IP address from Gi0 to force a failure. Before the cellular link is brought up I get this, as expected:
Sep 22 13:03:23.552: FIBipv4-packet-proc: route packet from (local) src 172.16.1.100 dst X.X.X.X
Sep 22 13:03:23.552: FIBfwd-proc: Default:0.0.0.0/0 process level forwarding
Sep 22 13:03:23.552: FIBfwd-proc: depth 0 first_idx 0 paths 1 long 0(0)
Sep 22 13:03:23.552: FIBfwd-proc: try path 0 (of 1) v4-sp first short ext 0(-1)
Sep 22 13:03:23.552: FIBfwd-proc: v4-sp valid
Sep 22 13:03:23.552: FIBfwd-proc: no nh type 8 - deag
Sep 22 13:03:23.552: FIBfwd-proc: ip_pak_table 0 ip_nh_table 65535 if none nh none deag 1 chg_if 0 via fib 0 path type special prefix
Sep 22 13:03:23.552: FIBfwd-proc: Default:0.0.0.0/0 not enough info to forward via fib (none none)
Sep 22 13:03:23.552: FIBipv4-packet-proc: packet routing failed
This attempt continues, however -- I assume originating from the now-dying SA. The cellular comes up (10.96.68.104) and the track activates from its continual pings, which installs a default route via the cellular. All correct. But this SA is still sending:
Sep 22 13:05:21.260: IP: s=172.16.1.100 (local), d=X.X.X.X, len 108, local feature
Sep 22 13:05:21.260: UDP src=4500, dst=4500, Policy Routing(3), rtype 0, forus FALSE, sendself FALSE, mtu 0, fwdchk FALSE
Sep 22 13:05:21.260: FIBipv4-packet-proc: route packet from (local) src 172.16.1.100 dst X.X.X.X
Sep 22 13:05:21.260: FIBfwd-proc: packet routed by adj to Cellular0 0.0.0.0
Sep 22 13:05:21.260: FIBipv4-packet-proc: packet routing succeeded
Sep 22 13:05:21.260: IP: s=172.16.1.100 (local), d=66.158.37.6 (Cellular0), len 108, sending
Sep 22 13:05:21.260: UDP src=4500, dst=4500
Sep 22 13:05:21.260: NAT: s=172.16.1.100->10.96.68.104, d=66.158.37.6 [17162]
Now it sends through the cellular, and in doing so caches a translation that ties the cellular address to the now defunct ISP address.
#show ip nat trans
Pro Inside global       Inside local        Outside local   Outside global
udp 10.96.68.104:4500   172.16.1.100:4500   X.X.X.X:4500    X.X.X.X:4500
This causes inbound packets that SHOULD be delivered to the new tunnel's Cellular address to be translated inappropriately back to the defunct ISP address, since they use the same port.
Sep 22 13:05:26.738: IP: s=X.X.X.X (Cellular0), d=10.96.68.104, len 144, input feature
Sep 22 13:05:26.738: UDP src=4500, dst=4500, Common Flow Table(5), rtype 0, forus FALSE, sendself FALSE, mtu 0, fwdchk FALSE
Sep 22 13:05:26.738: IP: s=X.X.X.X (Cellular0), d=10.96.68.104, len 144, input feature
Sep 22 13:05:26.738: UDP src=4500, dst=4500, Stateful Inspection(8), rtype 0, forus FALSE, sendself FALSE, mtu 0, fwdchk FALSE
Sep 22 13:05:26.738: IP: s=X.X.X.X (Cellular0), d=10.96.68.104, len 144, input feature
Sep 22 13:05:26.738: UDP src=4500, dst=4500, Dialer i/f override(25), rtype 0, forus FALSE, sendself FALSE, mtu 0, fwdchk FALSE
Sep 22 13:05:26.738: IP: s=X.X.X.X (Cellular0), d=10.96.68.104, len 144, input feature
Sep 22 13:05:26.742: UDP src=4500, dst=4500, Virtual Fragment Reassembly(39), rtype 0, forus FALSE, sendself FALSE, mtu 0, fwdchk FALSE
Sep 22 13:05:26.742: IP: s=X.X.X.X (Cellular0), d=10.96.68.104, len 144, input feature
Sep 22 13:05:26.742: UDP src=4500, dst=4500, Virtual Fragment Reassembly After IPSec Decryption(57), rtype 0, forus FALSE, sendself FALSE, mtu 0, fwdchk FALSE
Sep 22 13:05:26.742: NAT: s=X.X.X.X, d=10.96.68.104->172.16.1.100 [56943]
Sep 22 13:05:26.742: IP: s=X.X.X.X (Cellular0), d=172.16.1.100, len 144, input feature
Sep 22 13:05:26.742: UDP src=4500, dst=4500, NAT Outside(92), rtype 0, forus FALSE, sendself FALSE, mtu 0, fwdchk FALSE
Sep 22 13:05:26.742: IP: s=X.X.X.X (Cellular0), d=172.16.1.100, len 144, input feature
Sep 22 13:05:26.742: UDP src=4500, dst=4500, MCI Check(109), rtype 0, forus FALSE, sendself FALSE, mtu 0, fwdchk FALSE
Sep 22 13:05:26.742: FIBipv4-packet-proc: route packet from Cellular0 src X.X.X.X dst 172.16.1.100
Sep 22 13:05:26.742: FIBfwd-proc: packet routed by adj to Cellular0 0.0.0.0
Sep 22 13:05:26.742: FIBipv4-packet-proc: packet routing succeeded
I'm not strong enough in NAT and routing to know whether this is what should happen, whether a bad configuration somewhere is causing it, or whether it's a bug or a feature. Clearing NAT translations fixes it, but the timing is tough -- traffic originating from the defunct IP address continues for a long time, and you have to keep clearing it (or wait a long time, like minutes). DPD is either not happening, not happening fast enough, or happening at the wrong end... not sure.
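On the DPD point, IKEv2 dead peer detection can be enabled per profile so a dead SA is torn down sooner. A minimal sketch, assuming an IKEv2 profile (the profile name here is a placeholder, and the 10-second interval / 2-second retry values are just examples to tune):

```
! Sketch: on-demand DPD in the IKEv2 profile, so the stale SA on the
! dead WAN interface is detected and torn down instead of lingering.
crypto ikev2 profile FLEX-PROFILE
 dpd 10 2 on-demand
```

Even with aggressive DPD, though, the stale NAT translation can outlive the SA, so this alone would not have fixed the transition.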
Since both WAN addresses are negotiated, it's not possible to use static ACL entries in the NAT to exclude them. I'm using a route map now with an access list that does NOT match internal-to-internal addresses but does match internal-to-external. Unfortunately both the cellular and ISP interfaces get private addresses (out of our control).
What seems to fix it is to extend the access list in the route map so that it explicitly denies any match on the tunnel peer's address (e.g. host x.x.x.x any and any host x.x.x.x). This makes the route map for the NAT statement fail, so any traffic to the destination peer is never translated. It means I cannot ping the peer from an internal address, but that's a minor detail (I guess I could explicitly permit the router's own private addresses, but that's more complexity).
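To make that concrete, here is a sketch of the shape of the fix. All names are placeholders, x.x.x.x stands for the hub's static public address, and the 10.0.0.0/8 internal range is an assumption -- this is not my literal config:

```
! Sketch only: exclude the tunnel peer from NAT so no translation can be cached.
ip access-list extended NAT-ELIGIBLE
 deny   ip any host x.x.x.x                                ! never translate traffic to the tunnel peer
 deny   ip host x.x.x.x any                                ! nor traffic from it
 deny   ip 10.0.0.0 0.255.255.255 10.0.0.0 0.255.255.255   ! internal-to-internal stays untranslated
 permit ip 10.0.0.0 0.255.255.255 any                      ! internal-to-Internet gets translated
!
route-map NAT-MAP permit 10
 match ip address NAT-ELIGIBLE
!
! One such statement per WAN interface, per the two NAT statements above.
ip nat inside source route-map NAT-MAP interface GigabitEthernet0 overload
```

With the peer denied in the ACL, the route map never matches IKE/IPsec traffic to the hub, so a stale translation keyed to the old WAN address cannot be created during the transition.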
Anyway, so far with that added, no stale NAT entry is cached and the transition occurs properly. I've done it twice now; after some breakfast I'll test more extensively, but I think this is going to be the fix. BTW, I also had a NAT statement on the Tunnel interface which I removed, though I'm not sure that had any impact (I removed it long before this fix and it still failed the same way).
One other quirk -- changing the access list did not work at first; it required a reboot. Not sure why, but I THINK something in the NAT or route-map evaluation had cached the access list and did not recognize the changed version. After a reboot it worked properly -- proof that Cisco was really out to mess with my head on this one.
It's also worth mentioning that none of this becomes a real issue if your WAN ports have external addresses that are not in your NAT list -- static public addresses would not have any issue, nor would private addresses already excluded from your NAT by access list. In our case, though, we don't control either the ISP or cellular addresses, so we left them implicitly in. That also means we could hit an actual address conflict (e.g. a local ISP using the same subnet as an internal site we need to route to). There's not a lot we can do about that, I think.
Anyway, I'll update this later if I find more, but I think that's the answer. If anyone knows whether the NAT translation caching shown above is correct behavior, I'd love to know. I would have hoped that removing the IP from the interface would stop the SA from using it, the routing engine from routing it, or NAT from translating it. But all three happily kept using it.
Thanks for all the help.
Linwood
09-25-2018 07:47 AM
I've been testing this now for a few days and it seems to work, so I think this is the issue -- a need to explicitly exclude the (static) tunnel peer address from NAT, since the cellular and/or ISP dynamic addresses are likely to be private and otherwise included.