dual wan failover config: failback does not always work as expected for existing LAN traffic flows

Steve Dixon · 12-11-2014

I have an 881 router configured with two DHCP WAN connections. I am trying to configure failure detection for the primary connection (I do not really care about the secondary at this time).

I have an IP SLA probe and track object configured to monitor the primary WAN connection. If the primary stops passing traffic, the tracked route is removed and all traffic goes out the second WAN connection. When the primary is restored, the route should be reinstated and everything should pass through the primary again. This works for all my tests except one.

If I start a ping stream from a client ("ping 8.8.8.8 -t") and disconnect the primary connection, it loses a few packets and then switches to the secondary connection in about 15 seconds. After I restore the primary connection, all new traffic uses the primary, but the ping stream stops working (it fails over, but not back). If I stop the ping stream for a while (I am not sure how long is required, but my test was over a minute), it then uses the primary connection like all other new traffic. A stop of a few seconds is not enough, and opening a second command prompt to ping the same target does not work either (pinging new targets works as desired).

It is as if something is caching the route/session/whatever and needs a window of no traffic before it expires and the route is relearned. This means any sustained traffic to the original target will not work until it is stopped long enough to let "something" age out.
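To narrow down what the "something" is, commands along these lines could be run on the 881 while the ping stream is stuck after fail-back. This is only a sketch: that the NAT translation table is worth checking is an assumption, since the NAT statements are not part of the edited config.

```
! Is track 100 back up, and are the probes succeeding again?
show track 100
show ip sla statistics 10
show ip sla statistics 20

! Assumption: a translation for the ping target may still exist from
! the time the flow was using the backup link
show ip nat translations | include 8.8.8.8
```

If an entry for 8.8.8.8 is still listed while the stream is failing and disappears once the stream has been stopped for a while, that would match the aging behavior described above.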

I need to know if there is a way to "flush the cache" (or whatever) during fail-back to force the primary route to be used after fail-back, or something else that will have the same effect. My suspicion is that the second route gets "preferred" because the first is removed by the SLA, and when the SLA returns the route to the list, the existing traffic flow is not aware of the change and keeps using the last known good route (which now does not pass traffic). The issue is that it takes longer than I would like for the now-bad route to be flushed.
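One possible shape for such a flush is an EEM applet tied to track 100. The sketch below is untested and rests on an assumption: that the state that has to age out is the NAT translation table. The applet name FLUSH-NAT-ON-FAILBACK is made up for illustration.

```
! Hypothetical applet name. Assumes stale NAT translations are what
! keeps the existing flow from working after fail-back.
event manager applet FLUSH-NAT-ON-FAILBACK
 event track 100 state up
 action 1.0 cli command "enable"
 action 2.0 cli command "clear ip nat translation *"
 action 3.0 syslog msg "Track 100 up: cleared NAT translations"
```

Clearing every translation would also interrupt sessions that are working over the backup link at that moment, so a narrower clear may be preferable if the affected entries can be identified.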

config (edited):

interface FastEthernet3
 description Backup ISP
 switchport access vlan 800
 no ip address

interface FastEthernet4
 description Primary ISP
 ip dhcp client route track 100
 ip address dhcp
 ip nat outside
 ip virtual-reassembly in
 duplex auto
 speed auto
 crypto ipsec client ezvpn EZVPN-to-1941

interface Vlan800
 description Backup ISP
 ip address dhcp
 ip nat outside
 ip virtual-reassembly in

track 100 list boolean or
 object 101
 object 102
track 101 ip sla 10 reachability
track 102 ip sla 20 reachability

ip sla 10
 icmp-echo 4.2.2.2 source-interface FastEthernet4
 threshold 1000
 timeout 1500
 frequency 5
ip sla schedule 10 life forever start-time now

ip sla 20
 icmp-echo 208.67.222.222 source-interface FastEthernet4
 threshold 1000
 timeout 1500
 frequency 5
ip sla schedule 20 life forever start-time now

ip route 4.2.2.2 255.255.255.255 FastEthernet4 permanent
ip route 10.1.2.0 255.255.255.0 <1941 wan ip removed>
ip route <1941 wan ip removed> 255.255.255.255 FastEthernet4 permanent
ip route 208.67.222.222 255.255.255.255 FastEthernet4 permanent
ip route 0.0.0.0 0.0.0.0 Vlan800 dhcp 254
ip route 0.0.0.0 0.0.0.0 FastEthernet4 dhcp

Observation: the last two routes appear in the order shown above. Even though the Vlan800 route has a higher administrative distance (254), it is listed ahead of the Fa4 route. Could this be contributing to the issue? Is there a way to ensure the Fa4 route is always listed before Vlan800?
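For reference, route selection is driven by administrative distance rather than by the order of lines in the configuration, so the listing order by itself should not decide which default route is used. A quick check of what is actually installed, as a sketch:

```
! Which default route is in the routing table right now, and via which interface?
show ip route 0.0.0.0

! The static route statements as they appear in the configuration
show running-config | include ^ip route
```

With the primary up and track 100 up, the expectation would be that the installed default route points out FastEthernet4, and that the Vlan800 route (distance 254) is installed only while the tracked route is withdrawn.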