Hello,

nsantos77 · ‎02-02-2016

Hello,

I believe this discussion is not new. I tried to follow the documentation and other posts, but I still have some doubts. This is our scenario:

2 x Cisco 7210 routers. Each of these routers is connected to a separate carrier. Let's call them Router A and Router B.

Since I'm running a managed service BGP, I can't share the config here. However, this is what I would like to understand.

By default my traffic goes to Provider A, while Provider B exists as a backup in case of failure.

When I physically cut the link to Provider A, BGP takes around 2m30s to fail over to Provider B.

As I read, there's a BGP keepalive timer and a hold timer (by default 60s for the keepalive and 3x60s for the hold time). However, I saw on a Cisco thread that if there's a physical cut (Layer 1) on the link, the BGP timers have no effect and the failover to the backup link is immediate.

Is this true?

Basically I'm trying to reduce the failover time, which is now around 2m30s when the link goes down...

PS: I have very limited knowledge of BGP, that's why my terms are so rookie.

Joseph W. Doherty · ‎02-02-2016

Disclaimer

The Author of this posting offers the information contained within this posting without consideration and with the reader's understanding that there's no implied or expressed suitability or fitness for any purpose. Information provided is for informational purposes only and should not be construed as rendering professional advice of any kind. Usage of this posting's information is solely at reader's own risk.

Liability Disclaimer

In no event shall Author be liable for any damages whatsoever (including, without limitation, damages for loss of use, data or profit) arising out of the use or inability to use the posting's information even if Author has been advised of the possibility of such damage.

Posting

Generally, if a physical p2p link fails, the hardware usually detects that very rapidly and a "signal" is sent to the routing processes that the link has failed. With that information, routing protocols can start a re-convergence (and usually do), also very rapidly.

If a logical link fails, the routing process often needs some kind of "hello/keepalive" mechanism to notice the path loss. By default this can take some time, and how fast the loss is detected varies between protocols. BGP tends to be slower than IGPs, but even the latter can take much longer than desired (without some timer tuning). For example, Cisco OSPF, on a logical p2p link, can take 40 seconds (the default dead interval) before it "declares" a neighbor loss. With OSPF hello interval tuning, and especially with BFD if it is supported, subsecond detection of a logical link failure is possible.
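To put rough numbers on the paragraph above, here is a minimal Python sketch of the detection-time arithmetic. It assumes the usual OSPF relationship of dead interval = 4 x hello interval (10 s hello by default on p2p links) and the simplified BFD rule of detection time = interval x multiplier. The function names and the 50 ms x 3 example values are illustrative assumptions, not taken from any configuration in this thread.

```python
def ospf_dead_interval_s(hello_s: float = 10.0, multiplier: int = 4) -> float:
    """Time before OSPF declares a silent neighbor down (dead interval)."""
    return hello_s * multiplier


def bfd_detection_time_s(interval_ms: float, multiplier: int) -> float:
    """Simplified BFD detection time: interval times detect multiplier."""
    return interval_ms * multiplier / 1000.0


if __name__ == "__main__":
    # Default OSPF timers on a p2p link: 10 s hello, 40 s dead interval.
    print("OSPF default:", ospf_dead_interval_s(), "s")
    # Tuned hello of 1 s still needs 4 s to detect the loss.
    print("OSPF hello 1 s:", ospf_dead_interval_s(hello_s=1.0), "s")
    # Hypothetical BFD session at 50 ms with a multiplier of 3.
    print("BFD 50 ms x 3:", bfd_detection_time_s(50, 3), "s")
```

The point is only the order of magnitude: tens of seconds with default hellos, well under a second with BFD.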

Jose Jara · ‎02-02-2016

Hello,

Be aware that there is more to take into account than the failure detection time. Once a protocol detects or is notified of the failure, it has to notify the other routers in the network, and those routers in turn have to propagate the change (after running BGP Best Path Selection), and so on.

You are right about the keepalive/hold timers: they are 60/180 seconds by default. However, the hold time is negotiated during BGP session establishment to the lower of the two peers' values. You can verify it with show ip bgp neighbors.
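As a small illustration of that negotiation, the following Python sketch picks the lower of the two hold times, as the BGP specification describes. The keepalive rule used here (the configured keepalive, capped at one third of the negotiated hold time) is an assumption about typical behaviour, and the function name is hypothetical.

```python
from typing import Tuple


def negotiated_timers(local_hold_s: int, peer_hold_s: int,
                      local_keepalive_s: int = 60) -> Tuple[int, int]:
    """Return (hold_time, keepalive) in seconds for one BGP session."""
    # The session uses the smaller of the two advertised hold times.
    hold = min(local_hold_s, peer_hold_s)
    if hold == 0:
        # A hold time of 0 means keepalives are not sent at all.
        return 0, 0
    # Assumption: keepalive is capped at one third of the hold time.
    keepalive = min(local_keepalive_s, hold // 3)
    return hold, keepalive


if __name__ == "__main__":
    # Both sides at the 60/180 defaults.
    print(negotiated_timers(180, 180))
    # The provider advertises a 30 s hold time: the session drops to 30/10.
    print(negotiated_timers(180, 30))
```

So lowering the timers on only one side is enough to shorten the hold time on both, as long as the peer accepts the lower value.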

In the particular case you are talking about, a physical cut, you are also right: if the interface goes down, the BGP session is brought down automatically by a mechanism called BGP Fast External Fallover, which is enabled by default for eBGP. Now, let's suppose that this happens on your end, but you do not know whether the same occurs on the provider end. It depends on whether the other end sees the loss of carrier or not. If there is a direct link between both ends, it will. However, if there is any Layer 2 network between your router and the provider router, the SP device will have to wait until the hold timer expires to detect the failure.
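When one end has to wait for the hold timer, the wait is not always the full hold time, because the timer was last reset by the most recent keepalive received. A minimal Python sketch of that window, assuming only keepalives reset the timer and ignoring keepalive jitter (the function name is hypothetical):

```python
from typing import Tuple


def hold_expiry_window_s(hold_s: int = 180,
                         keepalive_s: int = 60) -> Tuple[int, int]:
    """Return (shortest, longest) delay between failure and hold expiry."""
    # Worst case for speed: a keepalive arrived just before the failure,
    # so the full hold time still has to run.
    longest = hold_s
    # Best case: the next keepalive was just about to arrive, so one
    # keepalive interval of the hold time has already elapsed.
    shortest = hold_s - keepalive_s
    return shortest, longest


if __name__ == "__main__":
    low, high = hold_expiry_window_s()
    print(f"Detection by hold timer takes between {low} s and {high} s")
    # The 2m30s reported in the first post is 150 s.
    print("150 s inside window:", low <= 150 <= high)
```

With the 60/180 defaults the window is 120 to 180 seconds, so the 2m30s reported in the first post is consistent with a hold timer expiry somewhere on the path rather than an immediate link-down detection. That is a plausible reading, not a diagnosis.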

The second part is failure propagation. There is a timer called the Minimum Route Advertisement Interval (MRAI), which is 30 seconds by default for eBGP. You can also see it with show ip bgp neighbors. It controls BGP Update propagation. So, if there are several ASes between your AS and the destination you are testing against, convergence may take longer, depending on the convergence time and configuration (including the MRAI timer) of each of the N upstream ISPs.
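A very rough way to see how detection and propagation add up is the Python sketch below. It assumes a deliberately pessimistic model in which every eBGP hop holds the update for a full MRAI before passing it on. Real networks usually do better, and the function name and example hop counts are illustrative assumptions.

```python
def worst_case_convergence_s(detection_s: float, as_hops: int,
                             mrai_s: float = 30.0) -> float:
    """Pessimistic bound: detection time plus one full MRAI per AS hop."""
    if as_hops < 0:
        raise ValueError("as_hops must be non-negative")
    return detection_s + as_hops * mrai_s


if __name__ == "__main__":
    # Immediate detection (link down seen locally), 3 AS hops to the tester.
    print(worst_case_convergence_s(detection_s=0, as_hops=3), "s")
    # Detection by a 180 s hold timer, then 2 AS hops.
    print(worst_case_convergence_s(detection_s=180, as_hops=2), "s")
```

Even with instant failure detection, a few AS hops at the default MRAI can account for a noticeable share of the outage seen from a remote test point.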

I have omitted other details of convergence time, but the most relevant parts/timers are the failure detection and failure propagation.

Best Regards,

Jose.