Solved: One more point I want to add

Ali Muazzam · ‎06-27-2016

Hi All,

I have been seeing lots of BGP peer flap alerts on my ASR9010, and this syslog message is often observed:

%ROUTING-BGP-3-NBR_NSR_DISABLED : NSR disabled on neighbor x.x.x.x due to 'ip-tcp' detected the 'warning' condition 'NSR is down because the retransmission threshold exceeded (probably because downstream RP is not healthy)'

There seem to be no packet drops between the peers that could lead to the above situation. We also recently upgraded to 5.3.3, but the problem still exists. The far end cannot be the problem because it is not limited to just one or two peers.

Can anybody suggest what else to look for?

Cheers!!

xthuijs · ‎06-30-2016

hi ali,

NSR doesn't require any support from the peer. NSR basically runs locally on the system by having the active RSP hand its tcp packets for bgp to the standby, and the standby transmits and receives the packets. this way the active and standby are always in sync.

if the system fails to get an ack, the active rp pulls back the control of the session and retransmits it

and then it seems like things are fine, but that leaves nsr broken since the standby is not tcp sync anymore.

various reasons can exist for this breakage, long xmit delays are possible especially with aggressive timers. we need to do more investigation here and a show tech tcp nsr would be a good starting point to figure out what is going wrong where.

the cause is possibly inside the system, or some timing issue, who knows. best to file a tac case for this one with the full logging, configuration and show tech tcp nsr.

cheers!

xander

Ali Muazzam · ‎07-11-2016

Hi Xander,

Thanks for the explanation. I will definitely open an SR with TAC. Meanwhile, I would just like to know: if both RPs are not in sync with each other for some BGP peer, it won't affect the actual BGP session?

What I understand is that NSR makes both RPs (active + standby) handle protocol packets so that they stay in sync. But if there is any problem with the sync, the actual protocol session shouldn't be disturbed, right?

I am asking this because I have more or less always seen a BGP notification together with the previously mentioned syslog, which makes me think that when the sync breaks, it tears down the session too.

xthuijs · ‎07-11-2016

correct, if nsr drops, the bgp session can still be ok. when nsr breaks the active rp will deal with the session directly. if there was a failover at that time, it wouldnt be stateful and bgp would likely reset.

so if you only see the nsr down message, you just lose redundancy.

if it comes with other bgp messages, it might mean that the remote peer is maybe not 100% functioning.

cheers

xander

T J · ‎08-04-2016

Hey

I have the same problem. After I upgraded the ASR to 5.3.3, I started getting "recursion loop looking up prefix" messages as well as the "NSR disabled ... probably because downstream RP is not healthy" syslog. There is also a major alarm on the active RSP. I have opened a TAC case.

xthuijs · ‎08-04-2016

hi tj,

make sure you include the show tech tcp nsr in the case.

which will contain some good info for this.

and the show pfm loc all would help for the alarm details.

cheers!

xander

(pfm is platform fault manager)

T J · ‎08-04-2016

We upgraded from 4.3.2 to 5.3.3, and I found that NSR is enabled by default in the 5.3.x versions. Previously we had to enable NSR manually, and since I copied the config from the old ASR 9006 to the new ASR 9010, it is still configured explicitly. Will it cause a problem if it is enabled manually on 5.3.x even though it is on by default?

xthuijs · ‎08-04-2016

hi tj,

yeah it doesn't do harm if you like to configure something :), but there is no need to do it specifically.

for BGP we enabled it by default in 5.3, others like ospf will follow.

see a note on it here.

cheers!

xander

T J · ‎08-04-2016

tcp[452]: %IP-TCP_NSR-5-DISABLED : X <-> X:: NSR disabled for TCP connection because Retransmission threshold exceeded

RP/0/RSP1/CPU0:Jul 31 11:40:19.711 : bgp[1058]: %ROUTING-BGP-5-NBR_NSR_DISABLED_STANDBY : NSR disabled on neighbor X on standby RP due to Peer closing down the session (VRF: X)

RP/0/RSP0/CPU0:Jul 31 11:40:19.716 : bgp[1058]: %ROUTING-BGP-3-NBR_NSR_DISABLED : NSR disabled on neighbor X due to 'ip-tcp' detected the 'warning' condition 'NSR is down because the retransmission threshold exceeded (probably because downstream RP is not healthy)'

Any idea?

xthuijs · ‎08-04-2016

yeah with NSR the tcp messages are transmitted by the standby RSP,

if the standby doesn't get a notification of a response, the primary RSP takes over, breaks NSR and sends the packet itself to maintain the tcp session.

since the primary sends its request to the tcp stack of the secondary over the fabric between the RSPs, the problem here can be in the communication between the RSPs or with a peer not responding in time (due to load etc).

one of the suspects here is always the inter-RSP communication (hence the fabric path).

cheers

xander

T J · ‎08-04-2016

But as I understand it, if the problem were in the communication between the RSPs, then the flapping should occur with all neighbors. I see the problem with only 2 neighbors, sometimes both at the same time and sometimes at different times, and no flaps with the rest of the neighbors. Why is that?

If it is an RSP communication issue, how do I solve it? I see a major alarm on the active RSP; is it related?

Thanks

xthuijs · ‎08-04-2016

show pfm location all would be important for that to see which alarms are raised.

if it is just with one or 2 peers and others are working fine, it points to a possible issue with the peer, like temporary spikes in cpu or mem utilization delaying its ack.

fabric errors could be looked at with:

show controller fabric fia location 0/rspX/cpu0

this will look at the fia that is connected on the RSP to the fabric and see if there are in/out errors/drops etc.

sometimes with a high number of peers some punt policers may creep up, but then you see random peers experiencing the issue, if it is truly confined to 1-2 peers only then I'd be looking at them first.

if it is random spread between numerous peers, but not on all at the same time, it may be a punt issue caused by LPTS. this or random crc error hits on the fab link between RSP's.

xander
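To see whether the NSR drops are confined to one or two peers or spread randomly across many, it can help to tally the syslog per neighbor. The following is only a minimal sketch in Python, not a Cisco tool: it assumes the log lines follow the %ROUTING-BGP-3-NBR_NSR_DISABLED format quoted earlier in this thread, and the neighbor addresses in the sample are made-up documentation addresses.

```python
import re
from collections import Counter

# Pattern based on the NBR_NSR_DISABLED message quoted in this thread.
# The trailing " :" keeps the NBR_NSR_DISABLED_STANDBY variant out of the count.
NSR_DISABLED = re.compile(
    r"%ROUTING-BGP-3-NBR_NSR_DISABLED : NSR disabled on neighbor (\S+)"
)


def nsr_events_per_neighbor(lines):
    """Count NBR_NSR_DISABLED events per neighbor address."""
    counts = Counter()
    for line in lines:
        match = NSR_DISABLED.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


# Hypothetical sample lines; the addresses are illustrative, not from the thread.
sample = [
    "RP/0/RSP0/CPU0:Jul 31 11:40:19.716 : bgp[1058]: %ROUTING-BGP-3-NBR_NSR_DISABLED : "
    "NSR disabled on neighbor 192.0.2.1 due to 'ip-tcp' detected the 'warning' condition",
    "RP/0/RSP1/CPU0:Jul 31 11:40:19.711 : bgp[1058]: %ROUTING-BGP-5-NBR_NSR_DISABLED_STANDBY : "
    "NSR disabled on neighbor 192.0.2.1 on standby RP due to Peer closing down the session",
    "RP/0/RSP0/CPU0:Jul 31 12:02:44.102 : bgp[1058]: %ROUTING-BGP-3-NBR_NSR_DISABLED : "
    "NSR disabled on neighbor 192.0.2.7 due to 'ip-tcp' detected the 'warning' condition",
    "RP/0/RSP0/CPU0:Jul 31 13:15:03.390 : bgp[1058]: %ROUTING-BGP-3-NBR_NSR_DISABLED : "
    "NSR disabled on neighbor 192.0.2.1 due to 'ip-tcp' detected the 'warning' condition",
]

counts = nsr_events_per_neighbor(sample)
for neighbor, hits in counts.most_common():
    print(neighbor, hits)
```

If the same one or two addresses dominate the tally, the peers themselves are the first thing to look at; a flat spread over many neighbors points more toward punt/LPTS or the fabric path, as described above.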

T J · ‎08-04-2016

Hi Xander

Thank you for your reply.

node: node0_2_CPU0
---------------------
CURRENT TIME: Aug 4 16:28:47 2016
PFM TOTAL: 8 EMERGENCY/ALERT(E/A): 0 CRITICAL(CR): 0 ERROR(ER): 8
-------------------------------------------------------------------------------
Raised Time |S#|Fault Name |Sev|Proc_ID|Dev/Path Name |Handle
--------------------+--+-----------------+---+-------+--------------+----------
Jul 30 01:27:35 2016|2 |DEV_SFP_PID_NOT_S|ER |483367 |SFP |0x1029000
Jul 30 01:27:36 2016|2 |DEV_SFP_SUPPORTED|ER |483367 |SFP |0x1029001
Jul 30 01:27:36 2016|2 |DEV_SFP_PID_NOT_S|ER |483367 |SFP |0x1029001
Jul 30 01:27:39 2016|2 |DEV_SFP_SUPPORTED|ER |483367 |SFP |0x1029003
Jul 30 01:27:39 2016|2 |DEV_SFP_PID_NOT_S|ER |483367 |SFP |0x1029003
Jul 30 01:27:42 2016|2 |DEV_SFP_SUPPORTED|ER |483367 |SFP |0x1029007
Jul 30 01:27:42 2016|2 |DEV_SFP_PID_NOT_S|ER |483367 |SFP |0x1029007
Jul 30 01:56:48 2016|2 |DEV_SFP_PID_NOT_S|ER |483367 |SFP |0x1029006

But all these errors date from the migration we did.

The other nodes show "NONE" as the fault name.

in_error-0
To Xbar Uc Crc-0 0
To Xbar Uc Crc-1 0
To Xbar Uc Crc-2 0
To Xbar Uc Crc-3 0
To Xbar Mc Crc-0 0
To Xbar Mc Crc-1 0
To Xbar Mc Crc-2 0
To Xbar Mc Crc-3 0
nb pa read data err 0
pa header err 0
pa crc16 err 0
pa crc32 err 0
pa to tf err 0
ab overflow req lost 0
ni bad crc32 0
ni crc32 corrupt 0

eg_error-0
To Spaui Error-0 0
To Spaui Error-1 0
RL over/under flow cnt 0

I have checked the particular interfaces: no errors.

Thanks
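When the counter output is long, a quick filter for non-zero values saves reading every line. This is a minimal sketch in Python under the assumption that each counter is printed as a name followed by a single integer, as in the output pasted above; the non-zero value in the sample is invented purely to show the filter working.

```python
import re

# A counter line is assumed to be "<name> <integer>"; section headers such as
# "in_error-0" have no separate value column and are skipped.
COUNTER_LINE = re.compile(r"^(?P<name>.+?)\s+(?P<value>\d+)$")


def nonzero_counters(text):
    """Return {counter name: value} for every counter whose value is not 0."""
    found = {}
    for raw in text.splitlines():
        match = COUNTER_LINE.match(raw.strip())
        if match and int(match.group("value")) != 0:
            found[match.group("name")] = int(match.group("value"))
    return found


# Sample modelled on the output above; the value 3 is hypothetical.
sample_output = """\
in_error-0
To Xbar Uc Crc-0 0
To Xbar Uc Crc-1 3
pa crc32 err 0

eg_error-0
To Spaui Error-0 0
RL over/under flow cnt 0
"""

flagged = nonzero_counters(sample_output)
print(flagged)
```

Run against the real output pasted above, this would return an empty dict, which matches the observation that the fabric counters are clean.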

T J · ‎08-04-2016

One more point I want to add: on the interface I see 0 bits/sec input and output over the 5-minute interval. This is a backup path for us, but there should still be some traffic on it.

Thanks

xthuijs · ‎08-04-2016

yeah those alarms are not suspicious per se, they just mean that you have an optic that the system doesn't recognize, so it uses a standard driver. this rarely causes frame errors. it is something we need to check though, considering that you are getting zero bits in, which is suspicious.

check the show controller of the interface, check the show controller np count for the attached npu and see if there is any relation to the loss of packets there. or maybe we are dropping due to bit nibbing or something...

if it is easy enough, try replacing the optic to eliminate that, especially if the two peers you have an issue with use the same kind of optic.

cheers!

xander
