There are syslog messages that we occasionally get on our XR router about BGP NSR going down on the standby RP. When we check the BGP neighborship or the BGP NSR-related commands, we're not seeing any flaps or anything unusual. Is there a particular command we should use to troubleshoot this, and what causes this log message to show up?
RP/0/RSP0/CPU0:May 12 05:43:09.408 UTC: tcp: %IP-TCP_NSR-5-DISABLED : x.x.x.x:x <-> x.x.x.x.x:x:: NSR disabled for TCP connection because Retransmission threshold exceeded
RP/0/RSP1/CPU0:May 12 05:43:09.409 UTC: bgp: %ROUTING-BGP-5-NBR_NSR_DISABLED_STANDBY : NSR disabled on neighbor x.x.x.x on standby RP due to Peer closing down the session (VRF: default)
RP/0/RSP0/CPU0:May 12 05:43:09.409 UTC: bgp: %ROUTING-BGP-3-NBR_NSR_DISABLED : NSR disabled on neighbor x.x.x.x due to 'ip-tcp' detected the 'warning' condition 'NSR is down because the retransmission threshold exceeded (probably because downstream RP is not healthy)'
NSR troubleshooting involves a lot of components. The first thing to check on an ASR9k is "show pfm location all", looking for any failure in the punt/inject path from the LC to the standby/active RSP. Then check whether the issue occurs for many different peers or just one, and whether it happens repeatedly or only once. Check that the BGP timers are not aggressive. Check the TCP dump-files and traces, socket traces, BGP traces, NSR traces, etc. I would recommend opening a TAC case to get to the bottom of it, as it involves a lot of processes talking to each other.
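To answer the "one peer or many, once or repeatedly" question from collected syslog, something like the following rough sketch may help. It assumes the messages keep the same layout as the lines quoted above; the function name count_nsr_disabled and the regular expression are my own illustration, not a Cisco tool.

```python
import re
import sys
from collections import Counter

# Matches both the active-RP and standby-RP variants of the BGP message, e.g.
#   %ROUTING-BGP-3-NBR_NSR_DISABLED : NSR disabled on neighbor <addr> ...
#   %ROUTING-BGP-5-NBR_NSR_DISABLED_STANDBY : NSR disabled on neighbor <addr> ...
# The layout is assumed from the log lines quoted in this thread.
NSR_DISABLED = re.compile(
    r"%ROUTING-BGP-\d-NBR_NSR_DISABLED(?:_STANDBY)?\s*:\s*"
    r"NSR disabled on neighbor (\S+)"
)


def count_nsr_disabled(lines):
    """Return a Counter mapping neighbor address -> number of NSR-disabled messages."""
    counts = Counter()
    for line in lines:
        match = NSR_DISABLED.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


if __name__ == "__main__":
    # Usage (hypothetical file name): python nsr_count.py < router_syslog.txt
    for neighbor, hits in count_nsr_disabled(sys.stdin).most_common():
        print(neighbor, hits)
```

A single neighbor with many hits points toward that peer or its path; many neighbors at the same timestamps points more toward the RSP or the punt/inject path.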
In short, NSR is disabled on a peer when no hello message has been seen after 60% of the session timeout has elapsed. At that point the sending and receiving of hello messages is switched from the standby to the active RSP.
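To put a number on the 60% point, here is a small arithmetic sketch. It assumes that "session timeout" means the negotiated BGP hold time (180 seconds is a common default), which is my reading and not something stated above; the function name is hypothetical.

```python
def nsr_fallback_point(session_timeout_s, percent=60):
    """Seconds without a hello after which NSR would be disabled,
    taking the 60% figure from the explanation above.

    Assumption: session_timeout_s is the negotiated BGP hold time.
    """
    if session_timeout_s <= 0:
        raise ValueError("session timeout must be positive")
    return session_timeout_s * percent / 100


if __name__ == "__main__":
    for hold_time in (180, 90, 30):
        print(hold_time, nsr_fallback_point(hold_time))
```

With a 180 second hold time this works out to 108 seconds, while a 30 second hold time leaves only 18 seconds, which is one reason aggressive timers make this message more likely.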