
ASR9000 BGP issues

IBEngTeam
Level 1

Hi All.

We have two ASR9Ks running version 6.4.2, configured with BGP and BFD.

Sometimes the BGP session drops with the following errors:

Router 1:

RP/0/RSP0/CPU0:2022 Aug 1 13:55:42.742 IDT: tcp[467]: %IP-TCP_NSR-5-DISABLED : 10.1.1.60:179 <-> 10.255.255.73:64767:: NSR disabled for TCP connection because Retransmission threshold exceeded
RP/0/RSP0/CPU0:2022 Aug 1 13:55:42.742 IDT: bgp[1067]: %ROUTING-BGP-3-NBR_NSR_DISABLED : NSR disabled on neighbor 10.1.1.73 due to 'ip-tcp' detected the 'warning' condition 'NSR is down because the retransmission threshold exceeded (probably because downstream RP is not healthy)'
RP/0/RSP1/CPU0:2022 Aug 1 13:55:42.742 IDT: bgp[1067]: %ROUTING-BGP-5-NBR_NSR_DISABLED_STANDBY : NSR disabled on neighbor 10.1.1.73 on standby RP due to Peer closing down the session (VRF: default)
RP/0/RSP0/CPU0:2022 Aug 1 13:57:23.690 IDT: bgp[1067]: %ROUTING-BGP-5-ADJCHANGE_DETAIL : neighbor 10.1.1.73 Down - BGP Notification received, hold time expired (VRF: default; AFI/SAFI: 1/1, 1/4, 1/128, 2/128, 25/65) (AS: 1234)
RP/0/RSP0/CPU0:2022 Aug 1 13:58:15.058 IDT: bgp[1067]: %ROUTING-BGP-5-ADJCHANGE_DETAIL : neighbor 10.1.1.73 Up (VRF: default; AFI/SAFI: 1/1, 1/4, 1/128, 2/128, 25/65) (AS: 1234)

Router 2:

RP/0/RSP0/CPU0:2022 Aug 1 13:57:23.662 IDT: bgp[1067]: %ROUTING-BGP-5-ADJCHANGE_DETAIL : neighbor 10.1.1.60 Down - BGP Notification sent, hold time expired (VRF: default; AFI/SAFI: 1/1, 1/4, 1/128, 2/128, 25/65) (AS: 1234)
RP/0/RSP1/CPU0:2022 Aug 1 13:57:23.661 IDT: bgp[1067]: %ROUTING-BGP-5-NBR_NSR_DISABLED_STANDBY : NSR disabled on neighbor 10.1.1.60 on standby RP due to BGP Notification sent (VRF: default)
RP/0/RSP0/CPU0:2022 Aug 1 13:58:15.085 IDT: bgp[1067]: %ROUTING-BGP-5-ADJCHANGE_DETAIL : neighbor 10.1.1.60 Up (VRF: default; AFI/SAFI: 1/1, 1/4, 1/128, 2/128, 25/65) (AS: 1234)
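For anyone reading these logs, a quick way to see the timing is to compute the gaps between the Router 1 events. This is only a minimal sketch: the timestamps are copied from the log lines above, while the event labels and the helper names `parse` and `gaps` are made up for illustration.

```python
from datetime import datetime

# Timestamps copied from the Router 1 log lines above (same day, IDT).
# The labels are informal descriptions, not official message names.
EVENTS = [
    ("TCP NSR disabled", "13:55:42.742"),
    ("BGP neighbor down (hold time expired)", "13:57:23.690"),
    ("BGP neighbor up", "13:58:15.058"),
]

def parse(ts: str) -> datetime:
    # Only the time of day matters here; all events share the same date.
    return datetime.strptime(ts, "%H:%M:%S.%f")

def gaps(events):
    """Return (from_event, to_event, seconds) for each consecutive pair."""
    out = []
    for (name_a, ts_a), (name_b, ts_b) in zip(events, events[1:]):
        delta = (parse(ts_b) - parse(ts_a)).total_seconds()
        out.append((name_a, name_b, round(delta, 3)))
    return out

for a, b, seconds in gaps(EVENTS):
    print(f"{a} -> {b}: {seconds} s")
```

The first gap (NSR disabled to session down) is the one worth comparing against the hold time configured on the neighbor.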

While BGP goes down, BFD remains up:

sh bfd session destination 10.1.1.73 detail
Location: 0/0/CPU0
Dest: 10.1.1.73
Src: 10.1.1.60
State: UP for 13d:4h:50m:9s, number of times UP: 4
Session type: SW/V4/MH
Received parameters:
Version: 1, desired tx interval: 200 ms, required rx interval: 200 ms

Also, we have IP SLA between the routers and there is no packet loss or jitter. The routers are connected with single-mode 10 Gig fiber, and the RX/TX power levels are fine.

What could be the issue? How can we debug this?

Regards,

Adi.

9 Replies

philclemens1835
Level 1

One thing I've seen is that when the link gets saturated, BFD packets can drop and bring down BGP. This can be solved with a service policy that prioritizes BFD and BGP, or simply any communication between the two interface IPs involved on this link. Judging by your IP SLA comment, I'm guessing this isn't the case here, but it is one thing that can cause this issue on a link where layer 1 is solid.

Hi,

Thanks for the response.

This is not the issue, as the BFD session remains up.

Adi.

smilstea
Cisco Employee

You are using NSR, so the punt/inject path is different from the one BFD uses. Depending on which BFD mode you are using, and whether echo is enabled, BFD packets are consumed by the LC CPU or by the active RSP.

https://community.cisco.com/t5/service-providers-knowledge-base/bfd-support-on-cisco-asr9000/ta-p/3153191

Here is a good overview of NSR: https://community.cisco.com/t5/xr-os-and-platforms/bgp-flaps-asr9000/td-p/2913021

What are your BGP timers? Normally, if there is a hiccup in NSR forwarding and retransmissions, we can recover gracefully by letting the active RSP forward the BGP packet directly instead of relying on the standby until it is healthy again. I often see problems with BGP timers of 10s or 5s, which simply aren't supported or needed. You should never need to adjust a routing protocol's hello timers; that is why BFD and NSR exist, to handle these situations. On the flip side, if you are seeing NSR messages like these constantly, in my experience it can indicate a serious issue such as a software bug or a hardware error that keeps packets from the standby from going out or from being delivered to the active RSP (either case). I recommend checking the timers, show pfm loc all, and show alarms (on 64-bit). In the worst case you might need to open a TAC case to collect some logs and isolate whether this is hardware or software.

Sam

Hi Sam,

Thanks for the reply.

We have not changed the BGP timers; we use BFD for that, as you suggested.

One more thing to mention: we see the same issue with LDP peers:

RP/0/RSP0/CPU0:2022 Aug 4 06:36:33.600 IDT: tcp[467]: %IP-TCP_NSR-5-DISABLED : 10.1.1.60:646 <-> 10.1.1.73:36070:: NSR disabled for TCP connection because Retransmission threshold exceeded
RP/0/RSP0/CPU0:2022 Aug 4 06:36:33.600 IDT: mpls_ldp[1225]: %ROUTING-LDP-5-NSR_PEER_SYNC_LOST : VRF 'default' (0x60000000), Peer 10.1.1.73:0 synchronization lost
RP/0/RSP0/CPU0:2022 Aug 4 06:36:35.705 IDT: mpls_ldp[1225]: %ROUTING-LDP-5-NSR_SYNC_START : Initial synchronization started for 1 peers
RP/0/RSP0/CPU0:2022 Aug 4 06:36:37.765 IDT: mpls_ldp[1225]: %ROUTING-LDP-5-NSR_SYNC_PEER_DONE : VRF 'default' (0x60000000), Peer 10.1.1.73:0 initial synchronization done
RP/0/RSP0/CPU0:2022 Aug 4 06:36:37.765 IDT: mpls_ldp[1225]: %ROUTING-LDP-5-NSR_SYNC_DONE : Initial synchronization successfully done for 1 peers

The thing is that we have multiple devices with the same hardware, the same software version, and the same relevant configuration, and this happens on only one router!

So this does not seem to be a software bug; otherwise we would have seen the same thing on other devices, right? Maybe a hardware issue? Do you have any more ideas or troubleshooting advice?

Adi.

So maybe it is a bug or

This tells me the issue is with the standby RSP or something in NSR, most likely a hardware issue if it's impacting both BGP NSR and LDP NSR. I would open a TAC case and attach show tech tcp nsr, show tech routing bgp, show tech mpls ldp, show platform, show install active summary, show pfm loc all, and show logging.
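To illustrate the reasoning above (two different TCP-based protocols hitting the same NSR error), here is a rough sketch that scans saved syslog lines for `%IP-TCP_NSR-5-DISABLED` and maps the well-known port to a protocol. The two sample lines are copied from this thread with the timestamp prefix dropped; the regular expression and the function name `affected_protocols` are illustrative assumptions, not part of any Cisco tool.

```python
import re

# Sample lines copied from this thread (timestamp prefix removed).
LOGS = [
    "tcp[467]: %IP-TCP_NSR-5-DISABLED : 10.1.1.60:179 <-> 10.255.255.73:64767:: "
    "NSR disabled for TCP connection because Retransmission threshold exceeded",
    "tcp[467]: %IP-TCP_NSR-5-DISABLED : 10.1.1.60:646 <-> 10.1.1.73:36070:: "
    "NSR disabled for TCP connection because Retransmission threshold exceeded",
]

# Well-known TCP ports: 179 is BGP, 646 is LDP.
PORT_TO_PROTOCOL = {179: "BGP", 646: "LDP"}

# Assumed layout of the message: "<local ip>:<port> <-> <remote ip>:<port>::"
PATTERN = re.compile(
    r"%IP-TCP_NSR-5-DISABLED : "
    r"(?P<laddr>[\d.]+):(?P<lport>\d+) <-> (?P<raddr>[\d.]+):(?P<rport>\d+)::"
)

def affected_protocols(lines):
    """Return the sorted protocol names whose TCP sessions lost NSR."""
    found = set()
    for line in lines:
        m = PATTERN.search(line)
        if not m:
            continue
        for port in (int(m["lport"]), int(m["rport"])):
            if port in PORT_TO_PROTOCOL:
                found.add(PORT_TO_PROTOCOL[port])
    return sorted(found)

print(affected_protocols(LOGS))
```

If more than one protocol shows up, the common factor is more likely the shared TCP NSR layer than either protocol on its own.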

Agree with Sam,

With NSR, a BGP/LDP packet needs to go from the active RSP to the standby and only then to the egress LC. BFD goes either from the active RSP directly to the LC, or even from the LC straight to the wire (hardware offloaded), as Sam mentioned above.

One option to try is either a switchover or temporarily isolating the standby RSP to see whether the flaps stop. If they do, something is not healthy in the communication between the RSPs.

HTH,
Niko

RP/0/RSP0/CPU0: this runs BFD

RP/0/RSP1/CPU0: this runs BGP

Since BFD is processed on the link and not in the CPU, BFD can stay up on RSP0 even while BGP is down on RSP1.
Just change the interface you use to connect to the neighbor to one that is processed by RSP1, not RSP0.

IBEngTeam
Level 1

Hi all,

Thanks for the responses.

We will replace the standby CPU and see if this solves the issue.

We will update.

Adi.

I was facing the same issue and solved it by setting the same MTU size on both sides.
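The reply does not say which MTUs were mismatched, so the following is only a sketch of one way an MTU problem could produce this symptom: small packets (BGP keepalives, BFD control packets) fit, while full-size TCP segments carrying BGP updates do not. Because TCP delivers in order, keepalives queued behind a segment that keeps being retransmitted are not delivered either, which could lead to a hold-time expiry while BFD stays up. The MTU values and the function names are hypothetical; the header and message sizes assume IPv4 and TCP without options.

```python
IP_HEADER = 20          # IPv4 header without options
TCP_HEADER = 20         # TCP header without options
BGP_KEEPALIVE = 19      # a BGP KEEPALIVE is header-only (RFC 4271)
BGP_MAX_MESSAGE = 4096  # maximum BGP message size (RFC 4271)

def negotiated_mss(mtu_a: int, mtu_b: int) -> int:
    """Each side advertises MSS = its MTU - 40; the smaller value is used."""
    return min(mtu_a, mtu_b) - IP_HEADER - TCP_HEADER

def fits(payload: int, mss: int, path_mtu: int) -> bool:
    """True if the largest IP packet needed for this payload fits the path."""
    segment = min(payload, mss)
    return segment + IP_HEADER + TCP_HEADER <= path_mtu

# Hypothetical case: both ends believe 9000, but the path only carries 1500.
mss = negotiated_mss(9000, 9000)
print("keepalive fits:", fits(BGP_KEEPALIVE, mss, 1500))
print("full update fits:", fits(BGP_MAX_MESSAGE, mss, 1500))

# After aligning both ends with the path, full-size segments fit again.
mss = negotiated_mss(1500, 1500)
print("full update fits after fix:", fits(BGP_MAX_MESSAGE, mss, 1500))
```

In the mismatched case only the large update fails to fit, which matches the pattern of small-packet probes (BFD, IP SLA) looking clean while the BGP session still times out.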
