BGP - No Keepalives and two second updates.

bradleyordner · ‎03-19-2020

We recently peered with a new ISP and I am seeing something very weird. In theory this shouldn’t be working, so I wanted to check whether anyone else has seen this.

We receive a table of about 300,000 prefixes and have cut it down to about 32k plus a default route.

This BGP neighbourship is stable, but I have only seen 5 keepalive messages, and the default route update from the ISP is sent every 2 seconds. Our router accepts the update and installs it over and over, so the default route is never more than 2 seconds old.

I have taken a packet capture and the update is indeed sent every 2 seconds. How can this be?

I thought updates were only sent when prefixes change, and I also don’t understand why the BGP session doesn’t drop without keepalives.

Cristian Matei · ‎03-19-2020

Hi,

It doesn't sound like the ISP router speaks BGP as it should (maybe a bug, maybe something misconfigured), as you should not see the default route flapping like that. Can you post the BGP packet capture?

As for the session staying up, it depends on the configured and negotiated timers. Can you post the output of "show bgp ipv4 unicast neighbors x.x.x.x"?

Regards,

Cristian Matei.

paul driver · ‎03-19-2020

Hello

Can you confirm the bgp capability for the rtrs and its peers and post the router stanza for bgp?
sh ip bgp neighbors | s Neighbor capabilities
sh run | sec router bgp


Kind Regards
Paul

bradleyordner · ‎03-19-2020

Thanks for replies so far. Some research I have done suggests that the PE may have keepalive set to 0 which means they rely on the BGP Updates to keep the session up.
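
One thing worth noting: as far as I can tell from RFC 4271, the receiver restarts its hold timer on every KEEPALIVE or UPDATE, and the sender restarts its keepalive timer whenever it sends either one. So a peer that sends an UPDATE every 2 seconds would hardly ever need to send a keepalive, even with non-zero timers. Below is a rough simulation of that behaviour. The function name and the one-second tick are my own invention, and the 10 s keepalive / 30 s hold values are assumed for illustration.

```python
def simulate_sender(update_interval, keepalive_interval, hold_time, duration):
    """Simulate one direction of an established BGP session in 1-second ticks.

    An interval of 0 disables that timer. Returns (keepalives_sent, survived).
    """
    keepalives_sent = 0
    since_last_sent = 0      # sender's keepalive timer
    since_last_received = 0  # receiver's hold timer
    for t in range(1, duration + 1):
        since_last_sent += 1
        since_last_received += 1
        if update_interval and t % update_interval == 0:
            # An UPDATE restarts the sender's keepalive timer
            # and the receiver's hold timer.
            since_last_sent = 0
            since_last_received = 0
        elif keepalive_interval and since_last_sent >= keepalive_interval:
            # Nothing was sent for a full keepalive interval: send a KEEPALIVE.
            keepalives_sent += 1
            since_last_sent = 0
            since_last_received = 0
        if hold_time and since_last_received >= hold_time:
            return keepalives_sent, False  # hold timer expired, session drops
    return keepalives_sent, True


# UPDATE every 2 s, keepalive 10 s, hold 30 s, observed for 10 minutes.
print(simulate_sender(2, 10, 30, 600))   # (0, True)
# No UPDATEs at all: keepalives carry the session.
print(simulate_sender(0, 10, 30, 600))   # (60, True)
```

In this model the 2-second updates alone keep the session up and no keepalive is ever due, which would match seeing only a handful of keepalives in the capture.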

Regarding the default route, here are some configs.

au-sydwwwrtr101#sh ip bgp neighbors x.x.x.x | s Neighbor capabilities
Neighbor capabilities:
Route refresh: advertised and received(new)
Four-octets ASN Capability: advertised and received
Address family IPv4 Unicast: advertised and received
Enhanced Refresh Capability: advertised
Multisession Capability:
Stateful switchover support enabled: NO for session 1

neighbor x.x.x.x remote-as xxxx
neighbor x.x.x.x description **TID-TLS-GLBL-TPN**
neighbor x.x.x.x timers 10 30
neighbor x.x.x.x soft-reconfiguration inbound
neighbor x.x.x.x prefix-list PL-SLASH21-SPECIFICS-ONLY out
neighbor x.x.x.x route-map RM-BEST-LP-IN in
neighbor x.x.x.x route-map RM-PREPEND out

Giuseppe Larosa · ‎03-20-2020

Hello @bradleyordner ,

It is not a common configuration to have the keepalive interval set to 0. It might be a reasonable configuration only if BFD were in use between you and the other ISP.

Regarding your show command and your configuration, please notice that the neighbor output shows Enhanced Refresh Capability: advertised, so the command neighbor x.x.x.x soft-reconfiguration inbound is not needed anymore, as explained by @Peter Paluch in some older threads.

The fact that the peer is sending the same BGP update every two seconds can be a sign of a poor implementation (the BGP update replaces the disabled keepalive), or the BGP peer does not understand the answer of your router.

Your router should acknowledge the receipt of the BGP update. On Cisco routers this is done by sending a BGP keepalive with an additional field (TCP ACK piggybacking, nothing new here).

So either your router is not allowed to ack the BGP update and the other side keeps sending, or the other side does not process the BGP keepalive with the ack in the payload and as a result keeps sending the BGP route every two seconds.

In short, you have interoperability issues with the other ISP's device. I would have a talk with them and I would use a non-zero keepalive. This should stabilize the BGP prefix and avoid unnecessary re-sending of the same update, which is exactly the opposite of what BGP normally does and of what makes BGP scalable.

Hope to help

Giuseppe

bradleyordner · ‎03-20-2020

Hi Giuseppe,

Good to hear from you again. Thanks for the detailed reply. I have asked the ISP to confirm the keepalive configuration as I have assumed it is zero at this stage.

Last night the ISP decided to change the Advertisement Interval to 60 seconds. Now the default route resets after 60 seconds instead of 2 seconds.

All other routes are stable, so for some reason it only seems to be the default route affected.

I am going to try the following soon -

1. Reboot my router

2. Change our config to only receive the default and block all other prefixes

3. Ask ISP to change keepalive config if needed

They inform me they use this configuration on multiple customers, so I am very curious why only we are affected. They are running a 9K and we are running an ASR1001-X.

Giuseppe Larosa · ‎03-21-2020

Hello Bradley,

Nice to see you again. You are a Cisco Champion, and now I know what that is (after CLEUR in Barcelona).

>> They inform me they use this configuration on multiple customers so I am very curious why only we are affected. They are running a 9K and we are running a ASR1001-X

OK be aware of the following:

The Cisco ASR 9000 is a powerful distributed carrier-grade router running IOS XR, with the capability to delegate BFD handling to individual linecards.

So I suspect they use keepalive 0 + BFD so that they do not load the main processor of the ASR 9000.

>> Last night the ISP decided to change the Advertisement Interval to 60 seconds. Now the default route resets after 60 seconds instead of 2 seconds.

You have an ASR 1001-X running IOS XE, which is not distributed and cannot delegate BFD to linecards (it has no linecards as far as I know; you can install some SPAs or modules, but they are not linecards).

It is strange that the only affected prefix is 0.0.0.0/0. Either your device does not ack the reception of 0.0.0.0/0, or the answer for this specific prefix is misunderstood by the ISP's ASR 9000 (likely the second).

>>I am going to try the following soon -

>>1. Reboot my router

>>2. Change our config to only receive the default and block all other prefixes

>>3. Ask ISP to change keepalive config if needed

I would go directly to point 3. Reloading your router should be left as the last option. Option 2 is also not recommended, as you would no longer be able to show the ISP that the other prefixes are stable and only the default route is affected by the advertisement interval.

If you have a free GE port on the ASR 1001-X, you could use SPAN with the interface to the ISP as the source. On the SPAN destination port you connect a laptop running Wireshark, and you can check both ends of the communication.

(I apologize if you have already done this; it is not clear how you performed the packet capture.)

Also be aware of MSS in BGP: you might be able to solve this by lowering the MSS (TCP level) to 1440 bytes. BGP might have a default of 4,000 bytes.

So unless you are using a modified MTU on the link, the ASR 9000 may be sending its prefixes in 4,000-byte TCP segments, each of which is then fragmented.
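
To put rough numbers on that, here is a small sketch that counts how many IPv4 packets one TCP segment would need on a given L3 MTU. It assumes 20-byte IPv4 and TCP headers without options, ignores path MTU discovery and the DF bit, and takes the 4,000-byte figure above at face value. The function names are made up for illustration.

```python
import math

IP_HEADER = 20   # bytes, IPv4 without options
TCP_HEADER = 20  # bytes, TCP without options


def max_mss(l3_mtu):
    """Largest TCP payload that fits in one unfragmented IPv4 packet."""
    return l3_mtu - IP_HEADER - TCP_HEADER


def fragment_count(tcp_payload, l3_mtu):
    """Number of IPv4 packets needed to carry one TCP segment."""
    ip_payload = tcp_payload + TCP_HEADER
    if ip_payload <= l3_mtu - IP_HEADER:
        return 1  # fits without fragmentation
    # Every fragment except the last carries a multiple of 8 bytes.
    per_fragment = (l3_mtu - IP_HEADER) // 8 * 8
    return math.ceil(ip_payload / per_fragment)


print(max_mss(1500))               # 1460
print(fragment_count(4000, 1500))  # 3
print(fragment_count(1440, 1500))  # 1
print(fragment_count(4000, 9000))  # 1
```

Under these assumptions a 4,000-byte segment becomes three packets on a 1500-byte MTU, while either a 1440-byte MSS or a 9000-byte MTU avoids fragmentation.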

If you want to raise the MTU, remember the following:

In IOS XR the MTU is an L2 concept, so a 1500-byte MTU in IOS XE corresponds to:

1514 bytes in IOS XR main interface

1518 bytes in IOS XR on an 802.1Q tagged subinterface

So to use an L3 MTU of 9000 bytes you need:

ASR 1001-X: mtu 9000

ASR 9000: mtu 9014 (and, if using a subinterface, mtu 9018 on the subinterface)

Doing this will remove any fragmentation during the initial BGP route exchange.
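
The arithmetic behind those values is just the Ethernet header (14 bytes) plus 4 bytes per 802.1Q tag added to the L3 MTU. A minimal sketch, with a helper name of my own choosing:

```python
ETHERNET_HEADER = 14  # destination MAC + source MAC + EtherType
DOT1Q_TAG = 4         # one 802.1Q VLAN tag


def ios_xr_mtu(l3_mtu, vlan_tags=0):
    """L2-style MTU to configure on IOS XR for a given L3 (IOS XE style) MTU."""
    return l3_mtu + ETHERNET_HEADER + DOT1Q_TAG * vlan_tags


print(ios_xr_mtu(1500))               # 1514
print(ios_xr_mtu(1500, vlan_tags=1))  # 1518
print(ios_xr_mtu(9000))               # 9014
print(ios_xr_mtu(9000, vlan_tags=1))  # 9018
```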

Hope to help

Giuseppe

bradleyordner · ‎03-24-2020

Great info and very helpful.

I found out the config on their side with keepalives is -

Configured hold time: 180, keepalive: 60, min acceptable hold time: 3
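
As a sanity check on what these timers should negotiate to against our "timers 10 30": per RFC 4271 the session uses the smaller of the two hold times, and a hold time of 0 disables keepalives. The keepalive rule below (smaller of the configured keepalive and one third of the negotiated hold time) is my understanding of the usual Cisco behaviour, so treat it as an assumption. The function name is made up.

```python
def negotiate(local_keepalive, local_hold, peer_hold):
    """Return (keepalive, hold) in seconds as seen by the local router."""
    hold = min(local_hold, peer_hold)
    if hold == 0:
        return 0, 0  # hold time 0: no keepalives, session never times out
    if hold < 3:
        raise ValueError("hold time must be 0 or at least 3 seconds")
    # Assumed rule: never send keepalives less often than hold / 3.
    keepalive = min(local_keepalive, hold // 3)
    return keepalive, hold


# Our side: timers 10 30; ISP side: keepalive 60, hold 180.
print(negotiate(10, 30, 180))  # (10, 30) on our router
print(negotiate(60, 180, 30))  # (10, 30) on the ISP router
```

If that holds, both sides should be running a 30-second hold time with keepalives due every 10 seconds whenever nothing else has been sent.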

Troubleshooting to continue :-)

bradleyordner · ‎04-30-2020

Just thought I would update this thread. The carrier ended up labbing this for us and found it did not occur in the lab. The only differences were the IOS version on the CE router and that the lab CE received only a default route, not a partial table plus a default.

I still couldn't believe it, so we updated our router to 16.09.05 and the default is stable.

We were originally on asr1001x-universalk9.16.06.04.SPA.bin.

I will take a packet capture (on the router) again and check if the default update is coming every sixty seconds.
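
For counting message types in that capture, a sketch along these lines might help. It assumes the TCP payload from one direction of the port 179 session has already been reassembled and exported as raw bytes (how you export it is up to your capture tool), and it relies only on the fixed 19-byte BGP header: a 16-byte all-ones marker, a 2-byte length, and a 1-byte type. The helper names are my own.

```python
import struct
from collections import Counter

BGP_TYPES = {1: "OPEN", 2: "UPDATE", 3: "NOTIFICATION",
             4: "KEEPALIVE", 5: "ROUTE-REFRESH"}
MARKER = b"\xff" * 16
HEADER_LEN = 19


def bgp_message(msg_type, body=b""):
    """Build a BGP message (used here only to create sample data)."""
    return MARKER + struct.pack("!HB", HEADER_LEN + len(body), msg_type) + body


def count_bgp_messages(stream):
    """Walk a reassembled BGP byte stream and count messages by type."""
    counts = Counter()
    offset = 0
    while offset + HEADER_LEN <= len(stream):
        marker, length, msg_type = struct.unpack_from("!16sHB", stream, offset)
        if marker != MARKER or length < HEADER_LEN:
            raise ValueError(f"not a BGP header at offset {offset}")
        counts[BGP_TYPES.get(msg_type, f"type {msg_type}")] += 1
        offset += length  # jump to the next message
    return counts


# Sample data: three empty UPDATEs followed by one KEEPALIVE.
empty_update = bgp_message(2, b"\x00\x00\x00\x00")
sample = empty_update * 3 + bgp_message(4)
print(count_bgp_messages(sample))  # Counter({'UPDATE': 3, 'KEEPALIVE': 1})
```

Comparing the UPDATE and KEEPALIVE counts over a known capture window gives a quick check on whether the default is still being re-sent on a timer.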