08-02-2022 08:17 PM
Hi All.
We have two ASR9Ks running version 6.4.2, configured with BGP and BFD.
Sometimes the BGP session drops with the following errors:
Router 1: RP/0/RSP0/CPU0:2022 Aug 1 13:55:42.742 IDT: tcp[467]: %IP-TCP_NSR-5-DISABLED : 10.1.1.60:179 <-> 10.255.255.73:64767:: NSR disabled for TCP connection because Retransmission threshold exceeded
Router 2: RP/0/RSP0/CPU0:2022 Aug 1 13:57:23.662 IDT: bgp[1067]: %ROUTING-BGP-5-ADJCHANGE_DETAIL : neighbor 10.1.1.60 Down - BGP Notification sent, hold time expired (VRF: default; AFI/SAFI: 1/1, 1/4, 1/128, 2/128, 25/65) (AS: 1234)
While BGP goes down, the BFD session remains up:
sh bfd session destination 10.1.1.73 detail
Location: 0/0/CPU0
Dest: 10.1.1.73
Src: 10.1.1.60
State: UP for 13d:4h:50m:9s, number of times UP: 4
Session type: SW/V4/MH
Received parameters:
Version: 1, desired tx interval: 200 ms, required rx interval: 200 ms
Also, we have IP SLA running between the routers and there is no packet loss or jitter. The routers are connected with single-mode 10 Gig, and RX/TX power levels are fine.
What could be the issue? How can we debug this?
Regards,
Adi.
08-03-2022 06:52 AM
One thing I've seen is that when the link gets saturated, BFD packets can drop and bring down BGP. This can be solved with a service policy that prioritizes BFD and BGP, or simply any communication between the two interface IPs involved on this link. Judging by your IP SLA comment, I'm guessing this isn't the case, but it is one thing that can cause this issue on a link where layer 1 is solid.
08-03-2022 07:16 AM
Hi,
Thanks for the response.
This is not the issue, as the BFD session remains up.
Adi.
08-03-2022 01:34 PM
You are using NSR, so the punt/inject path is different from the one BFD uses. Depending on which BFD mode you are using, and whether echo is enabled, BFD packets are consumed by the LC CPU or by the active RSP.
Here is a good overview of NSR: https://community.cisco.com/t5/xr-os-and-platforms/bgp-flaps-asr9000/td-p/2913021
What are your BGP timers? Normally, if there is a hiccup in NSR forwarding and retransmissions, we can recover gracefully by letting the active RSP forward the BGP packet directly instead of relying on the standby until it gets healthy again. I often see problems with BGP timers of around 10s or 5s that simply aren't supported or needed. You should never need to adjust a routing protocol's hello timers; that is why BFD and NSR exist, to handle these situations.
On the flip side, if you are seeing NSR messages like these constantly, in my experience it can indicate a serious issue such as a software bug or a hardware error that keeps packets from the standby from going out or from being delivered to the active RSP (either case). I recommend checking the timers, show pfm loc all, and show alarms (on 64-bit). In the worst case you might need to open a TAC case to collect some logs and isolate whether this is hardware or software.
Sam
08-03-2022 11:40 PM
Hi Sam,
Thanks for the reply.
We have not changed the BGP timers; we rely on BFD for that, as you suggested.
One more thing to mention: we see the same issue with LDP peers:
RP/0/RSP0/CPU0:2022 Aug 4 06:36:33.600 IDT: tcp[467]: %IP-TCP_NSR-5-DISABLED : 10.1.1.60:646 <-> 10.1.1.73:36070:: NSR disabled for TCP connection because Retransmission threshold exceeded
RP/0/RSP0/CPU0:2022 Aug 4 06:36:33.600 IDT: mpls_ldp[1225]: %ROUTING-LDP-5-NSR_PEER_SYNC_LOST : VRF 'default' (0x60000000), Peer 10.1.1.73:0 synchronization lost
RP/0/RSP0/CPU0:2022 Aug 4 06:36:35.705 IDT: mpls_ldp[1225]: %ROUTING-LDP-5-NSR_SYNC_START : Initial synchronization started for 1 peers
RP/0/RSP0/CPU0:2022 Aug 4 06:36:37.765 IDT: mpls_ldp[1225]: %ROUTING-LDP-5-NSR_SYNC_PEER_DONE : VRF 'default' (0x60000000), Peer 10.1.1.73:0 initial synchronization done
RP/0/RSP0/CPU0:2022 Aug 4 06:36:37.765 IDT: mpls_ldp[1225]: %ROUTING-LDP-5-NSR_SYNC_DONE : Initial synchronization successfully done for 1 peers
The thing is that we have multiple devices with the same hardware, the same software version, and the same relevant configuration, and this happens on only one router!
So this does not seem to be a software bug; otherwise we would have seen the same on other devices, right? Maybe a hardware issue? Do you have any more ideas or troubleshooting advice?
Adi.
So maybe it is a bug or
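For anyone wanting to see how often this happens and to which peers, the NSR-disabled messages quoted in this thread can be tallied from a saved syslog with a short script. This is only a rough sketch: the regular expression is written against the two sample lines posted here (other releases may format the line differently), and the function name `nsr_disabled_events` and the port-to-protocol map are my own, not anything from IOS XR.

```python
import re
from collections import Counter

# Pattern written against the log lines quoted in this thread (an assumption:
# other software versions may print the timestamp or endpoints differently).
NSR_DISABLED = re.compile(
    r"(?P<ts>\d{4} \w{3} +\d{1,2} \d{2}:\d{2}:\d{2}\.\d{3}) \w+: "
    r"tcp\[\d+\]: %IP-TCP_NSR-5-DISABLED : "
    r"(?P<local_ip>[\d.]+):(?P<local_port>\d+) <-> "
    r"(?P<peer_ip>[\d.]+):(?P<peer_port>\d+)::"
)

# Well-known TCP ports for the two protocols discussed here.
WELL_KNOWN = {179: "BGP", 646: "LDP"}


def nsr_disabled_events(log_text):
    """Return one dict per NSR-disabled message found in log_text."""
    events = []
    for m in NSR_DISABLED.finditer(log_text):
        ports = (int(m["local_port"]), int(m["peer_port"]))
        # Classify the session by whichever end uses a well-known port.
        proto = next((WELL_KNOWN[p] for p in ports if p in WELL_KNOWN), "other")
        events.append({"time": m["ts"], "peer": m["peer_ip"], "protocol": proto})
    return events


SAMPLE = (
    "RP/0/RSP0/CPU0:2022 Aug 1 13:55:42.742 IDT: tcp[467]: %IP-TCP_NSR-5-DISABLED : "
    "10.1.1.60:179 <-> 10.255.255.73:64767:: NSR disabled for TCP connection "
    "because Retransmission threshold exceeded\n"
    "RP/0/RSP0/CPU0:2022 Aug 4 06:36:33.600 IDT: tcp[467]: %IP-TCP_NSR-5-DISABLED : "
    "10.1.1.60:646 <-> 10.1.1.73:36070:: NSR disabled for TCP connection "
    "because Retransmission threshold exceeded\n"
)

if __name__ == "__main__":
    events = nsr_disabled_events(SAMPLE)
    for event in events:
        print(event["time"], event["protocol"], event["peer"])
    print(Counter(event["protocol"] for event in events))
```

If the counts show every TCP-based protocol on the box being hit, and not just one peer, that would point toward something common such as the RSP pair rather than a single neighbor.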
08-05-2022 02:23 PM
This tells me the issue is with the standby RSP or something in NSR, most likely a hardware issue if it's impacting both BGP NSR and LDP NSR. I would open a TAC case and attach show tech tcp nsr, show tech routing bgp, show tech mpls ldp, show platform, show install active summary, show pfm loc all, and show logging.
08-05-2022 02:44 PM
Agree with Sam,
With NSR, a BGP/LDP packet needs to go from the active RSP to the standby and only then to the egress LC. BFD goes either active RSP -> LC directly, or even LC -> wire (hardware offloaded), as Sam mentioned above.
One option to try is either a switchover or temporarily isolating the standby RSP, to see if the flaps stop. If they do, that can mean something is not healthy in the communication between the RSPs.
Niko
08-05-2022 03:28 PM
RP/0/RSP0/CPU0: this runs BFD
RP/0/RSP1/CPU0: this runs BGP
Since BFD is processed on the link, not in the CPU, BFD can stay up on RSP0 even while BGP is down on RSP1.
Just change the interface you use to connect to the neighbor to one that is processed by RSP1 rather than RSP0.
08-05-2022 11:16 PM
Hi all,
Thanks for the responses.
We will replace the standby CPU and see if this solves the issue.
We will update.
Adi.
08-02-2023 07:36 AM
I was facing the same issue and solved it by making the MTU size the same on both sides.
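One possible reason an MTU mismatch would fit the symptoms in this thread (BFD up, TCP retransmissions) is simple packet-size arithmetic: a BFD control packet is tiny, while a full-size BGP or LDP segment built for the larger MTU does not fit the smaller one. The sketch below only illustrates that arithmetic. The MTU values 9000 and 1500 are hypothetical, the function names are mine, and it does not model MSS negotiation or path MTU discovery, which decide whether TCP actually sends packets that large.

```python
IPV4_HEADER = 20   # bytes, IPv4 header without options
UDP_HEADER = 8     # bytes
BFD_CONTROL = 24   # bytes, BFD control packet without authentication (RFC 5880)


def packet_fits(ip_packet_bytes, mtu_bytes):
    """True if an IP packet of this size fits within the given IP MTU."""
    return ip_packet_bytes <= mtu_bytes


def compare_sides(sender_mtu, receiver_mtu):
    """Compare a BFD control packet and a full-size packet from the sender
    against the receiver-side MTU. Both MTUs are hypothetical inputs."""
    bfd_packet = IPV4_HEADER + UDP_HEADER + BFD_CONTROL  # 52 bytes
    full_size_packet = sender_mtu  # largest packet the sender would build
    return {
        "bfd_control_fits": packet_fits(bfd_packet, receiver_mtu),
        "full_size_packet_fits": packet_fits(full_size_packet, receiver_mtu),
    }


if __name__ == "__main__":
    print(compare_sides(9000, 1500))  # mismatched sides
    print(compare_sides(1500, 1500))  # matching sides
```

With mismatched values the small BFD packet still fits while the full-size packet does not, which is why it is worth comparing the MTU on both ends before replacing hardware.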