Re: EIGRP Neighbor down - routes not being removed immediately

s.p.r.mario · ‎02-25-2021

We are using a Cisco ASR router in a FlexVPN setup for providing connectivity to our remote clients. Each remote client will build two VPN tunnels to our central ASR router. As a routing protocol EIGRP has been selected on the VPN tunnels.

The problem is that when our ASR router loses its EIGRP neighborship on the primary VPN tunnel towards a remote client, it takes about 10 seconds before the ASR effectively removes the related routes from its route table. As you would expect this is causing a significant impact during a failover scenario.

To monitor this behavior more effectively I've added an EEM applet on the ASR router (triggered by "event routing network 10.0.0.0/24") which sends a syslog message whenever one of these routes is added or removed.

At 18:20:05 the ASR sees the EIGRP neighborship going down using interface Virtual-Access159:

Feb 24 18:20:05.892 CET: %DUAL-5-NBRCHANGE: EIGRP-IPv4 5: Neighbor 172.16.0.231 (Virtual-Access159) is down: holding time expired

However routes using Virtual-Access159 are being removed only about 10 seconds later:

Feb 24 18:20:16 CET: %HA_EM-6-LOG: RouteMonitor: Type: remove; Network: 10.0.0.96; Mask: 255.255.255.240; Protocol: EIGRP; GW: 172.16.0.231; Interface: Virtual-Access159;
Feb 24 18:20:16 CET: %HA_EM-6-LOG: RouteMonitor: Type: remove; Network: 10.0.0.192; Mask: 255.255.255.240; Protocol: EIGRP; GW: 172.16.0.231; Interface: Virtual-Access159;
Feb 24 18:20:16 CET: %HA_EM-6-LOG: RouteMonitor: Type: remove; Network: 10.0.0.128; Mask: 255.255.255.240; Protocol: EIGRP; GW: 172.16.0.231; Interface: Virtual-Access159;
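
For anyone who wants to quantify the gap from logs like these, here is a minimal Python sketch (not something that runs on the router). It assumes the two message formats shown above, assumes the year 2021 because the syslog stamps carry no year, and the names `parse_time` and `removal_delays` are made up for this example. The EEM lines only have one-second resolution, so the result is accurate to roughly a second.

```python
import re
from datetime import datetime, timedelta

# Log lines copied from the post above.
LOG = """\
Feb 24 18:20:05.892 CET: %DUAL-5-NBRCHANGE: EIGRP-IPv4 5: Neighbor 172.16.0.231 (Virtual-Access159) is down: holding time expired
Feb 24 18:20:16 CET: %HA_EM-6-LOG: RouteMonitor: Type: remove; Network: 10.0.0.96; Mask: 255.255.255.240; Protocol: EIGRP; GW: 172.16.0.231; Interface: Virtual-Access159;
Feb 24 18:20:16 CET: %HA_EM-6-LOG: RouteMonitor: Type: remove; Network: 10.0.0.192; Mask: 255.255.255.240; Protocol: EIGRP; GW: 172.16.0.231; Interface: Virtual-Access159;
Feb 24 18:20:16 CET: %HA_EM-6-LOG: RouteMonitor: Type: remove; Network: 10.0.0.128; Mask: 255.255.255.240; Protocol: EIGRP; GW: 172.16.0.231; Interface: Virtual-Access159;
"""

# Timestamp with optional fractional seconds, e.g. "Feb 24 18:20:05.892".
STAMP = re.compile(r"^(\w{3}\s+\d+ \d{2}:\d{2}:\d{2})(\.\d+)?\s")
NBR_DOWN = re.compile(r"%DUAL-5-NBRCHANGE: .*\((\S+)\) is down")
ROUTE_REMOVE = re.compile(r"Type: remove; Network: (\S+);.*Interface: (\S+);")


def parse_time(line, year=2021):
    """Return the line's timestamp; the year is an assumption (syslog omits it)."""
    m = STAMP.match(line)
    base = datetime.strptime(f"{year} {m.group(1)}", "%Y %b %d %H:%M:%S")
    return base + timedelta(seconds=float(m.group(2) or 0))


def removal_delays(log):
    """Return (network, seconds) pairs: time from neighbor-down to route removal."""
    down_at = {}  # interface name -> time the neighbor on it went down
    delays = []
    for line in log.splitlines():
        if (m := NBR_DOWN.search(line)):
            down_at[m.group(1)] = parse_time(line)
        elif (m := ROUTE_REMOVE.search(line)) and m.group(2) in down_at:
            delta = parse_time(line) - down_at[m.group(2)]
            delays.append((m.group(1), delta.total_seconds()))
    return delays


delays = removal_delays(LOG)
for network, seconds in delays:
    print(f"{network}: removed {seconds:.1f} s after the neighbor went down")
```

Fed with the four lines above, all three /28 removals come out at roughly 10.1 seconds after the neighbor-down message.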

We see this happening every time the EIGRP session stops on our ASR router. Only on this side does it take anywhere between 7 and 12 seconds before the old routes are removed. When we check the remote client side, the related routes are removed instantly once the EIGRP neighborship goes down, which is how you would expect it to work.

CPU levels on our ASR router never exceed 5%, and we also have sufficient memory available.

Does anyone have a clue what could be causing this delay on our ASR when routes should be removed after the EIGRP neighbor goes down? Or any tips on how to troubleshoot this properly? Thanks!

Georg Pauwen · ‎02-25-2021

Hello,

Can you post the running config of your ASR? Maybe we can spot something in there...

What platform are the remote clients running?

s.p.r.mario · ‎02-26-2021

Hi Georg,

Attached you'll find the output of all relevant running-config.

Our remote clients are using IR829 routers, building 2 FlexVPN tunnels towards our central ASR router.

Thanks for your efforts! Let me know if you need additional info.

Georg Pauwen · ‎02-26-2021

Hello,

You posted a partial config. Please post the full running config (sh run). I don't see a summary route advertised in EIGRP. Is it not configured, or did you just not post it?

s.p.r.mario · ‎02-26-2021

Hi Georg,

I've included only the relevant parts, as our internal security policy unfortunately does not allow us to share full configs, not even with all passwords and keys completely removed.

We do not have a summary route configured on the ASR. As you can see in the attached config, we only redistribute one single static route into EIGRP (10.128.0.0/16) to advertise to the remote client on both VPN tunnels. This seems to work properly, because the remote client removes the "primary route" and adds the "secondary route" immediately after the EIGRP hold timer expires. You can see the output of this in one of my other posts.

MHM Cisco World · ‎02-25-2021

As I understand it:
For the remote site there is one tunnel, so when it goes down the EIGRP routes are removed immediately.
For the hub (HQ) site there are many tunnels, so removing the routes depends on the EIGRP timeout.

s.p.r.mario · ‎02-26-2021

If we look at what happens at the remote site and compare it to the central ASR, they both see the EIGRP session going down at the same time (18:20:05). What happens afterwards is different.

REMOTE

At 18:20:05 the hold time expires, which brings the EIGRP neighbor down. The remote client immediately removes the route on the primary Tu102 and adds a similar route via the secondary Tu101. This works completely as expected.

Feb 24 18:20:05.415 CET: %DUAL-5-NBRCHANGE: EIGRP-IPv4 5: Neighbor 10.127.255.15 (Tunnel102) is down: holding time expired
Feb 24 18:20:05 CET: %HA_EM-6-LOG: RouteMonitor: Type: remove; Network: 10.128.0.0; Mask: 255.255.0.0; Protocol: EIGRP; GW: 10.127.255.15; Interface: Tu102;
Feb 24 18:20:05 CET: %HA_EM-6-LOG: RouteMonitor: Type: add; Network: 10.128.0.0; Mask: 255.255.0.0; Protocol: EIGRP; GW: 10.127.255.14; Interface: Tu101;

CENTRAL

At 18:20:05 the hold time also expires on the ASR, so the EIGRP neighbor goes down at the same moment as on the remote client. However, it takes about 10 seconds for the ASR to remove these specific /28 routes after the neighborship goes down.

Feb 24 18:20:05.892 CET: %DUAL-5-NBRCHANGE: EIGRP-IPv4 5: Neighbor 172.16.0.231 (Virtual-Access159) is down: holding time expired
Feb 24 18:20:16 CET: %HA_EM-6-LOG: RouteMonitor: Type: remove; Network: 10.0.0.96; Mask: 255.255.255.240; Protocol: EIGRP; GW: 172.16.0.231; Interface: Virtual-Access159;
Feb 24 18:20:16 CET: %HA_EM-6-LOG: RouteMonitor: Type: remove; Network: 10.0.0.192; Mask: 255.255.255.240; Protocol: EIGRP; GW: 172.16.0.231; Interface: Virtual-Access159;
Feb 24 18:20:16 CET: %HA_EM-6-LOG: RouteMonitor: Type: remove; Network: 10.0.0.128; Mask: 255.255.255.240; Protocol: EIGRP; GW: 172.16.0.231; Interface: Virtual-Access159;

The ASR does not have to add a route to its routing table in this case as it already has a /24 active route using the secondary Tu101. All it has to do is remove those /28 routes immediately if the related neighborship goes down.

This is our ASR routing table when both tunnels are active to the remote client:

ASR#show ip route 
D 10.0.0.0/24
 [90/26880256] via 172.16.0.130, 02:16:36, Virtual-Access40 
D 10.0.0.96/28
 [90/26880256] via 172.16.0.128, 00:05:54, Virtual-Access159 
D 10.0.0.128/28
 [90/26880256] via 172.16.0.128, 00:05:54, Virtual-Access159
D 10.0.0.192/28
 [90/26880256] via 172.16.0.128, 00:05:54, Virtual-Access159
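
To make the fallback explicit, here is a small Python sketch of longest-prefix matching over the four routes shown above. It is only an illustration of the lookup logic, not of how the ASR implements forwarding, and the host 10.0.0.100 is a made-up destination inside 10.0.0.96/28.

```python
import ipaddress

# Routes from the "show ip route" output above: prefix -> outgoing interface.
ROUTES = {
    "10.0.0.0/24": "Virtual-Access40",
    "10.0.0.96/28": "Virtual-Access159",
    "10.0.0.128/28": "Virtual-Access159",
    "10.0.0.192/28": "Virtual-Access159",
}


def lookup(routes, destination):
    """Return (prefix, interface) of the longest matching prefix, or None."""
    address = ipaddress.ip_address(destination)
    matches = [
        (ipaddress.ip_network(prefix), interface)
        for prefix, interface in routes.items()
        if address in ipaddress.ip_network(prefix)
    ]
    if not matches:
        return None
    network, interface = max(matches, key=lambda item: item[0].prefixlen)
    return str(network), interface


# 10.0.0.100 is a hypothetical host inside 10.0.0.96/28.
before = lookup(ROUTES, "10.0.0.100")

# Simulate the ASR withdrawing everything learned via Virtual-Access159.
remaining = {p: i for p, i in ROUTES.items() if i != "Virtual-Access159"}
after = lookup(remaining, "10.0.0.100")

print(before)  # ('10.0.0.96/28', 'Virtual-Access159')
print(after)   # ('10.0.0.0/24', 'Virtual-Access40')
```

As long as the stale /28 entries stay in the table they win the lookup, so traffic keeps being sent towards the dead tunnel even though the /24 is available.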

Thanks for your efforts! Let me know if you have any questions.

MHM Cisco World · ‎02-26-2021

dpd 10 2 on-demand

Have you configured DPD under the IKEv2 profile?

MHM Cisco World · ‎02-25-2021

I recommend using BFD with EIGRP so that the neighbor going down is detected and the routes are removed.
Configure this at HQ.

paul driver · ‎02-26-2021

Hello
As stated previously, EIGRP stuck-in-active (SIA) could be causing the delay, so this needs to be clarified.
If SIA is what is occurring, I believe the best way to deal with it is to understand why it is occurring, not simply to enable BFD.

The following commands should help identify routers that have SIA query issues:

sh ip eigrp topology <x.x.x.x>   (check for feasible successors)
sh ip eigrp topology active

Can the OP confirm the above, and also whether it is possible for the spoke routers to become EIGRP stub routers? This would stop the hub router from querying them for successors and could decrease the delay you are experiencing.

Please rate and mark as an accepted solution if you have found any of the information provided useful.
This then could assist others on these forums to find a valuable answer and broadens the community’s global network.

Kind Regards
Paul

paul driver · ‎02-25-2021

Hello
Do you have feasible successors for each network prefix? If you don't, convergence will be slower than expected.

sh ip eigrp topology <x.x.x.x>   (check for feasible successors)

When you run the failover again post the output from:
sh ip eigrp topology active

Kind Regards
Paul

s.p.r.mario · ‎03-01-2021

Hey Paul,

Thanks for your feedback, I think you could be onto something here! First of all our remote clients are not configured as stub unfortunately, while they clearly should have been. If my understanding is correct this means that the DUAL process on our hub router will query all other active neighbors whenever the primary VPN tunnel on just one of our remote clients goes down.

The remote client will advertise its networks by default in /28 ranges on the primary VPN tunnel, and as a single summary /24 on the secondary VPN tunnel. So when the primary tunnel goes down there are no real feasible successors for these /28 routes and the DUAL process kicks in querying all other remote clients for these specific /28 prefixes.

#show ip eigrp topology 
P 10.0.0.0/28, 1 successors, FD is 26880256
  via 172.16.0.208 (26880256/2816), Virtual-Access56 
P 10.0.0.96/28, 1 successors, FD is 26880256
  via 172.16.0.208 (26880256/2816), Virtual-Access56 
P 10.0.0.128/28, 1 successors, FD is 26880256
  via 172.16.0.208 (26880256/2816), Virtual-Access56 
P 10.0.0.192/28, 1 successors, FD is 26880256
  via 172.16.0.208 (26880256/2816), Virtual-Access56
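
As a rough illustration of why there is nothing to fall back on, here is a Python sketch of the EIGRP feasibility condition (a neighbor qualifies as feasible successor only if its reported distance is lower than the feasible distance). It assumes the route is passive and that FD equals the best current metric, as in the output above. The class and function names are invented for this example, and the second path in the "what if" case is entirely hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Path:
    next_hop: str
    computed_distance: int  # metric through this neighbor
    reported_distance: int  # metric the neighbor itself advertises


def split_paths(paths):
    """Return (fd, successors, feasible_successors) for one prefix."""
    fd = min(p.computed_distance for p in paths)
    successors = [p for p in paths if p.computed_distance == fd]
    feasible = [
        p for p in paths
        if p.computed_distance > fd and p.reported_distance < fd
    ]
    return fd, successors, feasible


# The only path listed for 10.0.0.96/28 in the topology table above.
actual = [Path("172.16.0.208", 26880256, 2816)]

# Hypothetical: the same /28 also learned over a second, slower tunnel.
what_if = actual + [Path("198.51.100.1", 28160256, 2816)]

print(split_paths(actual)[2])   # [] -> losing the successor makes the route go active
print(split_paths(what_if)[2])  # one feasible successor -> no queries needed
```

Because the secondary tunnel only carries the /24 summary, the /28 prefixes look like the first case: one successor and an empty list of feasible successors.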

And indeed, although it is hard to catch in the act, after the primary tunnel goes down we can briefly see the routes becoming active while the router waits for query replies from some other remote clients:

#show ip eigrp topology active
A 10.0.0.96/28, 0 successors, FD is 26880256
  4 replies, active never, query-origin: Local origin
    Remaining replies:
      via 172.16.0.97, r, Virtual-Access597
      via 172.16.0.232, r, Virtual-Access259
      via 172.16.0.8, r, Virtual-Access379
      via 172.16.0.93, r, Virtual-Access397

Do you know if the DUAL process will effectively have to wait for each reply -before- it removes those specific /28 routes from the active routing table? If this is really the case it explains the "dynamic delay" when removing these routes from our route table before falling back to the existing /24 summary route to the remote client.

Is there any short-term workaround we could implement on the hub router side? I've tried "timers active-time disable", but it seems to have no impact.

I believe the right way to solve this permanently is to configure all our remote clients as stub, as it should have been from the beginning. However this will be a resource intensive exercise as we have numerous remote clients, so this will be more of a mid to long-term solution unfortunately.

Thanks again for your efforts!

Giuseppe Larosa · ‎03-02-2021

Hello @s.p.r.mario ,

>> Do you know if the DUAL process will effectively have to wait for each reply -before- it removes those specific /28 routes from the active routing table?

Yes, that is the case. As you and Paul have noted, the root cause of your problem is that the remote routers are not EIGRP stub routers.
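
A toy Python model of that wait, just to show why the delay varies between 7 and 12 seconds: the route stays active until the last queried neighbor has replied, so the slowest reply sets the removal time. The neighbor addresses are the ones from the "show ip eigrp topology active" output earlier in the thread, but the reply times are invented for illustration, and `removal_delay` is a hypothetical helper, not a model of the full DUAL state machine.

```python
def removal_delay(reply_times):
    """Seconds until an active route can be removed in this toy model.

    reply_times maps each queried neighbor to the time (in seconds) at
    which its reply arrives. The route cannot leave the active state
    until every reply is in, so the slowest neighbor decides.
    """
    return max(reply_times.values(), default=0.0)


# Neighbors from the thread; the reply times are made-up values.
non_stub_spokes = {
    "172.16.0.97": 0.2,
    "172.16.0.232": 0.4,
    "172.16.0.8": 7.5,
    "172.16.0.93": 10.1,
}

# A hub does not send queries to stub neighbors, so nothing is outstanding.
stub_spokes = {}

print(removal_delay(non_stub_spokes))  # 10.1
print(removal_delay(stub_spokes))      # 0.0
```

With every spoke configured as a stub there is nobody left to wait for, which is why the stub configuration removes the delay rather than just shortening it.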

Hope to help

Giuseppe

MHM Cisco World · ‎02-26-2021

Again, friend, you need to get the idea here.
On the spoke there is a tunnel interface, which has a source and a destination.

On the hub (HQ) there is no tunnel interface. There is a virtual-template, so no physical interface is involved.

On the spoke, when the tunnel source goes down it is physically down, and this takes down the EIGRP next hop.

On the hub, the virtual-access interface is built by IKEv2 ("FlexVPN", as Cisco calls it).
This virtual-access interface does not go down, because no physical port goes down here.
So what happens?
DPD is responsible for monitoring the tunnel (the virtual-access interface), because IKEv2 is what builds this tunnel.

When the tunnel goes down, DPD sends its messages, and once it is sure that the tunnel is down it removes the next hop and hence the EIGRP route.

How do we solve this? Reduce the DPD timeout or use BFD.

If you debug IKEv2 with DPD in use, you will see DPD declare the peer unreachable immediately before the EIGRP route is removed from the HQ routing table.