Re: BGP Unstable

Shahid Ishaq · ‎07-09-2024

Hello,

I keep seeing the messages below in the router logs.

I have removed the BFD configuration and the BGP timers, but the BGP session keeps flapping. Any idea what else I can check?

Jul 9 08:46:11 router.id rpd[8119]: %DAEMON-4-RPD_BGP_NEIGHBOR_STATE_CHANGED: BGP peer xx.xx.xx.xx (External AS XXXXX) changed state from Established to Idle (event HoldTime) (instance CIRCUIT ID)
Jul 9 08:46:21 router.id rpd[8119]: %DAEMON-4-RPD_BGP_NEIGHBOR_STATE_CHANGED: BGP peer xx.xx.xx.xx (External AS XXXXX) changed state from EstabSync to Established (event RsyncAck) (instance CIRCUIT ID)
Jul 9 08:47:51 router.id rpd[8119]: %DAEMON-4-BGP_IO_ERROR_CLOSE_SESSION: BGP peer xx.xx.xx.xx (External AS XXXXX): Error event Operation timed out(60) for I/O session - closing it
Jul 9 08:47:51 router.id rpd[8119]: %DAEMON-4-RPD_BGP_NEIGHBOR_STATE_CHANGED: BGP peer xx.xx.xx.xx (External AS XXXXX) changed state from Established to Idle (event HoldTime) (instance CIRCUIT ID)
Jul 9 08:47:51 router.id rpd[8119]: %DAEMON-4: bgp_io_mgmt_cb:1964: NOTIFICATION sent to xx.xx.xx.xx (External AS XXXXX): code 4 (Hold Timer Expired Error), Reason: holdtime expired for xx.xx.xx.xx (External AS XXXXX), socket buffer sndcc: 4992 rcvcc: 0 TCP state: 4, snd_una: 798711463 snd_nxt: 798713023 snd_wnd: 16321 rcv_nxt: 3177571163 rcv_adv: 3177587547, hold timer 90s, hold timer remain 0s, last sent 9s, TCP port (local 179, remote 30072), JSR handle (primary 94506082400, secondary 90194313216)
Jul 9 08:47:58 router.id rpd[8119]: %DAEMON-4-RPD_BGP_NEIGHBOR_STATE_CHANGED: BGP peer xx.xx.xx.xx (External AS XXXXX) changed state from EstabSync to Established (event RsyncAck) (instance CIRCUIT ID)

Peer          AS      InPkt  OutPkt  OutQ  Flaps  Last Up/Dwn  State|#Active/Received/Accepted/Damped...
XX.XX.XX.XX   XXXXX       2      42     0    860         1:08  Establ
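The TCP counters in the NOTIFICATION line can be turned into a few more readable figures. The following is a minimal Python sketch, not a Junos tool: the function names are made up for illustration, and the reading of the counters (snd_nxt minus snd_una as unacknowledged bytes, rcv_adv minus rcv_nxt as the advertised receive window) assumes the usual BSD-style meaning of those names.

```python
import re

# Excerpt of the TCP counters from the NOTIFICATION log line above.
LOG_EXCERPT = (
    "socket buffer sndcc: 4992 rcvcc: 0 TCP state: 4, "
    "snd_una: 798711463 snd_nxt: 798713023 snd_wnd: 16321 "
    "rcv_nxt: 3177571163 rcv_adv: 3177587547"
)

SEQ_MOD = 2 ** 32  # TCP sequence numbers wrap at 32 bits


def parse_counters(text):
    """Return every 'name: integer' pair found in the text as a dict."""
    return {name: int(value) for name, value in re.findall(r"(\w+): (\d+)", text)}


def summarize(counters):
    """Derive a few figures from the raw counters (BSD-style naming assumed)."""
    return {
        # Bytes sent but not yet acknowledged by the peer.
        "unacked_bytes": (counters["snd_nxt"] - counters["snd_una"]) % SEQ_MOD,
        # Bytes held in the local send socket buffer.
        "send_buffer_bytes": counters["sndcc"],
        # Receive window currently advertised to the peer.
        "advertised_window": (counters["rcv_adv"] - counters["rcv_nxt"]) % SEQ_MOD,
    }


summary = summarize(parse_counters(LOG_EXCERPT))
print(summary)
```

With the values from the log this gives 1560 unacknowledged bytes, 4992 bytes in the send buffer and a 16384-byte advertised window. Data waiting on the local side with nothing pending on receive (rcvcc: 0) would be consistent with the peer not acknowledging what the router sends, but that is an interpretation and not something the log states.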

MHM Cisco World · ‎07-09-2024

Most of the time this kind of issue is caused by MTU.

Check the MTU.

MHM

Shahid Ishaq · ‎07-09-2024

I have checked that already, but I would expect to see an MTU mismatch in the logs, which I don't.

ROUTER.ID> ping XX.XX.XX.XX interface aeXX.XXXX size 1540
PING XX.XX.XX.XX (XX.XX.XX.XX): 1540 data bytes
1548 bytes from XX.XX.XX.XX: icmp_seq=0 ttl=255 time=3.915 ms
1548 bytes from XX.XX.XX.XX: icmp_seq=1 ttl=255 time=2.762 ms
1548 bytes from XX.XX.XX.XX: icmp_seq=2 ttl=255 time=2.538 ms

MHM Cisco World · ‎07-09-2024

I still think this is MTU. Check the times; the latency is high.

Run the ping again, but this time with the DF bit set. That will show us the exact issue here.

MHM

Shahid Ishaq · ‎07-09-2024

ROUTER.ID> ping XX.XX.XX.XX interface aeXX.XXXX size 1540 do-not-fragment
PING XX.XX.XX.XX (XX.XX.XX.XX): 1540 data bytes
1548 bytes from XX.XX.XX.XX: icmp_seq=0 ttl=255 time=2.054 ms
1548 bytes from XX.XX.XX.XX: icmp_seq=1 ttl=255 time=1.918 ms
1548 bytes from XX.XX.XX.XX: icmp_seq=2 ttl=255 time=2.042 ms
1548 bytes from XX.XX.XX.XX: icmp_seq=3 ttl=255 time=1.991 ms
1548 bytes from XX.XX.XX.XX: icmp_seq=4 ttl=255 time=1.908 ms
1548 bytes from XX.XX.XX.XX: icmp_seq=5 ttl=255 time=1.980 ms
1548 bytes from XX.XX.XX.XX: icmp_seq=6 ttl=255 time=2.975 ms
1548 bytes from XX.XX.XX.XX: icmp_seq=7 ttl=255 time=1.920 ms
1548 bytes from XX.XX.XX.XX: icmp_seq=8 ttl=255 time=2.032 ms
1548 bytes from XX.XX.XX.XX: icmp_seq=9 ttl=255 time=2.020 ms
1548 bytes from XX.XX.XX.XX: icmp_seq=10 ttl=255 time=1.937 ms
1548 bytes from XX.XX.XX.XX: icmp_seq=11 ttl=255 time=2.001 ms
1548 bytes from XX.XX.XX.XX: icmp_seq=12 ttl=255 time=2.008 ms
1548 bytes from XX.XX.XX.XX: icmp_seq=13 ttl=255 time=1.954 ms
1548 bytes from XX.XX.XX.XX: icmp_seq=14 ttl=255 time=2.089 ms
1548 bytes from XX.XX.XX.XX: icmp_seq=15 ttl=255 time=2.051 ms
1548 bytes from XX.XX.XX.XX: icmp_seq=16 ttl=255 time=1.940 ms
1548 bytes from XX.XX.XX.XX: icmp_seq=17 ttl=255 time=2.081 ms
1548 bytes from XX.XX.XX.XX: icmp_seq=18 ttl=255 time=2.029 ms
1548 bytes from XX.XX.XX.XX: icmp_seq=19 ttl=255 time=1.905 ms
1548 bytes from XX.XX.XX.XX: icmp_seq=20 ttl=255 time=1.995 ms
^Z1548 bytes from XX.XX.XX.XX: icmp_seq=21 ttl=255 time=2.059 ms
^C
--- XX.XX.XX.XX ping statistics ---
22 packets transmitted, 22 packets received, 0% packet loss
round-trip min/avg/max/stddev = 1.905/2.040/2.975/0.211 ms
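For reference, the sizes in the ping output can be related to the IP packet size on the wire. This is a small Python sketch of the arithmetic only. It assumes IPv4 without options and that the size argument counts ICMP payload bytes, which matches the "1540 data bytes" and "1548 bytes from" lines above; the helper names are illustrative.

```python
ICMP_HEADER = 8   # ICMP echo header, bytes
IPV4_HEADER = 20  # IPv4 header without options, bytes


def ip_packet_size(ping_size):
    """IP packet size produced by a ping with the given payload size."""
    return ping_size + ICMP_HEADER + IPV4_HEADER


def largest_ping_size(mtu):
    """Largest ping payload that fits in one unfragmented packet for an IP MTU."""
    return mtu - ICMP_HEADER - IPV4_HEADER


print(ip_packet_size(1540))     # IP packet size for the test above
print(largest_ping_size(1500))  # payload limit for a 1500-byte IP MTU
```

A size of 1540 therefore corresponds to a 1568-byte IP packet, so the do-not-fragment test shows that packets of that size cross this link. It does not show which MSS the BGP TCP session actually negotiated.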

MHM Cisco World · ‎07-09-2024

Check the CPU on both routers. There is high latency, and BGP packets can be dropped before they are acknowledged.

MHM

Shahid Ishaq · ‎07-09-2024

Hello @MHM Cisco World

This is a Juniper MX104

ROUTER.ID> show chassis routing-engine
Routing Engine status:
Slot 0:
Current state Backup
Election priority Master (default)
Temperature 42 degrees C / 107 degrees F
CPU temperature 49 degrees C / 120 degrees F
DRAM 3968 MB (4096 MB installed)
Memory utilization 22 percent
5 sec CPU utilization:
User 4 percent
Background 0 percent
Kernel 2 percent
Interrupt 0 percent
Idle 94 percent
Model RE-MX-104
Serial ID CAEY6684
Start time 2021-12-03 01:25:59 UTC
Uptime 949 days, 9 hours, 2 minutes, 23 seconds
Last reboot reason 0x200:normal shutdown
Load averages: 1 minute 5 minute 15 minute
0.20 0.12 0.07
Routing Engine status:
Slot 1:
Current state Master
Election priority Backup (default)
Temperature 41 degrees C / 105 degrees F
CPU temperature 44 degrees C / 111 degrees F
DRAM 3968 MB (4096 MB installed)
Memory utilization 30 percent
5 sec CPU utilization:
User 6 percent
Background 0 percent
Kernel 9 percent
Interrupt 0 percent
Idle 85 percent
1 min CPU utilization:
User 8 percent
Background 0 percent
Kernel 9 percent
Interrupt 0 percent
Idle 83 percent
5 min CPU utilization:
User 20 percent
Background 0 percent
Kernel 20 percent
Interrupt 1 percent
Idle 59 percent
15 min CPU utilization:
User 19 percent
Background 0 percent
Kernel 20 percent
Interrupt 1 percent
Idle 61 percent

Harold Ritter · ‎07-09-2024

Hi @Shahid Ishaq ,

As @MHM Cisco World mentioned, these BGP instability issues are most often related to the maximum segment size (MSS).
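As background, here is a rough Python sketch of how the MSS relates to the MTU. It assumes IPv4 and TCP headers without options (20 bytes each) and that each side derives its advertised MSS from its own interface MTU; the MTU values are hypothetical and are not taken from either router in this thread.

```python
IPV4_HEADER = 20  # IPv4 header without options, bytes
TCP_HEADER = 20   # TCP header without options, bytes


def advertised_mss(mtu):
    """MSS a host would typically advertise for a given IP MTU."""
    return mtu - IPV4_HEADER - TCP_HEADER


def effective_mss(local_mtu, remote_mtu):
    """Largest segment both ends agree on: the smaller of the two MSS values."""
    return min(advertised_mss(local_mtu), advertised_mss(remote_mtu))


# Hypothetical example: one side at 1500, the other at 9000.
print(advertised_mss(1500))
print(effective_mss(1500, 9000))
```

If a device between the two peers has a smaller MTU than either end, full-size segments such as large UPDATE messages can be dropped while small keepalives still get through, which is one way a session can come up and then expire on the hold timer.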

Can you provide the output of the "show tcp detail pcb" command from the IOS-XR side?

First run "show tcp brief" and identify the PCB using the source and destination addresses.

Then run "show tcp detail pcb <pcb identified in the previous command> | i MSS".

Regards,

Harold Ritter
Sr Technical Leader
CCIE 4168 (R&S, SP)