Solved: Strange BGP issue

CliveG · ‎03-23-2023

I have two data centres. One of them connects upstream and receives the full internet routing table, which is then forwarded via iBGP to the other data centre. (Don't worry about whether this is good practice or not; it is a configuration I have inherited and can do nothing about for now.)

With no configuration changes and no network changes, the hold timers are suddenly expiring, and this connection is constantly going up and down because of the peer resets.

Weirdly, we are able to ping devices connected to this DC but cannot pass any other traffic. Here is what I am now seeing in the error log; I was hoping someone could point me in the right direction:

Mar 23 12:31:46.216: %LDP-SW1-5-SP: 192.168.1.1:0: session hold up initiated
Mar 23 12:32:42.903: %MSDP-SW1-5-PEER_UPDOWN: Session to peer 192.168.1.1 going down
Mar 23 12:33:03.944: %LDP-SW1-5-SP: 192.168.1.1:0: session recovery succeeded
Mar 23 12:33:28.085: %MSDP-SW1-5-PEER_UPDOWN: Session to peer 192.168.1.1 going up
Mar 23 12:39:44.601: %LDP-SW1-5-SP: 192.168.1.1:0: session hold up initiated
Mar 23 12:40:31.532: %MSDP-SW1-5-PEER_UPDOWN: Session to peer 192.168.1.1 going down
Mar 23 12:40:55.721: %LDP-SW1-5-SP: 192.168.1.1:0: session recovery succeeded
Mar 23 12:41:28.127: %MSDP-SW1-5-PEER_UPDOWN: Session to peer 192.168.1.1 going up

I have confirmed the X-Connect is good and have also replaced the SFPs. I am planning on changing out the core switches, as it could be a hardware or an IOS issue, but I am hoping I do not have to.

Thanks

MHM Cisco World · ‎03-26-2023

The IGP is affecting iBGP: the IGP changes when the link goes down, and it then selects a path through DC2.
As I mentioned before, check the IGP. The traceroute you shared and your previous comment confirm my theory.
The issue is whether the packets pass through DC1 or DC2.
The issue here is that CoPP accepts only a specific packet size and rate,
so CoPP is dropping the BGP packets, and hence BGP is constantly flapping.

The solution: you must check the IGP. If you fix the IGP, BGP will fix itself automatically.

Harold Ritter · ‎03-23-2023

Hi @CliveG ,

This might be due to a different maximum segment size being used on each ibgp neighbor.

Can you please provide the "sh bgp ipv4 unicast nei <ibgp neighbor address> | incl max data segment" output from both iBGP neighbors?

Regards,
Harold Ritter, CCIE #4168 (EI, SP)

CliveG · ‎03-23-2023

Hi Harold,

The two routers have currently gone into an idle state, and this is also part of the problem. So I cannot get that information currently as it is blank. With all other connections I am getting, for example:

Mar 23 13:43:47.160: %BGP-SW1-3-NOTIFICATION: sent to neighbor 192.168.1.2 4/0 (hold time expired) 0 bytes
Mar 23 13:43:47.160: %BGP-SW1-5-NBR_RESET: Neighbor 192.168.1.2 reset (BGP Notification sent)
Mar 23 13:43:47.160: %BGP-SW1-5-ADJCHANGE: neighbor 192.168.1.2 Down BGP Notification sent
Mar 23 13:43:47.160: %BGP_SESSION-SW1-5-ADJCHANGE: neighbor 192.168.1.2 L2VPN Vpls topology base removed from session BGP Notification sent
Mar 23 13:43:47.160: %BGP_SESSION-SW1-5-ADJCHANGE: neighbor 192.168.1.2 VPNv4 Unicast topology base removed from session BGP Notification sent
Mar 23 13:43:47.160: %BGP_SESSION-SW1-5-ADJCHANGE: neighbor 192.168.1.2 IPv4 Unicast topology base removed from session BGP Notification sent
Mar 23 13:43:53.416: %BGP-SW1-5-ADJCHANGE: neighbor 192.168.1.2 Up
Mar 23 13:46:53.877: %BGP-SW1-5-NBR_RESET: Neighbor 192.168.1.2 reset (Peer closed the session)
Mar 23 13:46:53.877: %BGP-SW1-5-ADJCHANGE: neighbor 192.168.1.2 Down Peer closed the session
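
As a rough check on the log above, the time between the "Up" event and the next "Down" event can be computed. The following is a minimal Python sketch that uses only the two timestamps shown; the helper name `seconds_between` is made up for illustration. The result is close to 180 seconds, which is the default BGP hold time on Cisco IOS. Whether the default is actually configured here is an assumption, as the thread does not say.

```python
from datetime import datetime

TIME_FORMAT = "%H:%M:%S.%f"  # time-of-day part of the syslog timestamp


def seconds_between(start: str, end: str) -> float:
    """Return the seconds elapsed between two same-day time-of-day strings."""
    started = datetime.strptime(start, TIME_FORMAT)
    ended = datetime.strptime(end, TIME_FORMAT)
    return (ended - started).total_seconds()


# Timestamps copied from the "Up" line and the following "Down" line.
session_uptime = seconds_between("13:43:53.416", "13:46:53.877")
print(f"session stayed up for {session_uptime:.3f} s")  # 180.461 s
```

A session that comes up and then drops almost exactly one hold time later is consistent with the peer never receiving anything that resets its hold timer after the session is established.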

On this session the segment size is the same at both ends:

Datagrams (max data segment is 7936 bytes):

It is a weird one that I do not think I will get an answer to until I do some changes tomorrow.

Thank you

Harold Ritter · ‎03-23-2023

Hi @CliveG ,

Thanks for the additional information. It would be helpful to collect the "sh bgp ipv4 uni nei <ibgp neighbor address> | i path-mtu" output as well.

Regards,
Harold Ritter, CCIE #4168 (EI, SP)

CliveG · ‎03-23-2023

Hi Harold,

This is enabled on both sides as per below:

Transport(tcp) path-mtu-discovery is enabled

Harold Ritter · ‎03-23-2023

Hi @CliveG ,

Since the max data segment is 7936 bytes, can you make sure you can ping from one iBGP peer to the other with a packet size of 7976 (full packet size = 7936 bytes + 20 bytes (TCP header) + 20 bytes (IP header))?

ping <ibgp peer address> size 7976 df-bit (perform from both ibgp peers)

Also make sure you specify the source address if the ibgp session is configured using the loopback interface.
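
The packet-size arithmetic above can be sketched in a few lines of Python. This is only an illustration: the function and constant names are made up, and it assumes IPv4 and TCP headers without options (20 bytes each), as in the calculation above.

```python
TCP_HEADER_BYTES = 20  # TCP header without options (assumption)
IP_HEADER_BYTES = 20   # IPv4 header without options (assumption)


def full_packet_size(max_data_segment: int) -> int:
    """Return the IP packet size needed to carry one full TCP segment."""
    return max_data_segment + TCP_HEADER_BYTES + IP_HEADER_BYTES


# 7936 is the "max data segment" value reported for the BGP session.
print(full_packet_size(7936))  # 7976
```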

Regards,
Harold Ritter, CCIE #4168 (EI, SP)

CliveG · ‎03-24-2023

This fails from both sides. No response with that size packet.

If I ping normally then it works fine.

Harold Ritter · ‎03-24-2023

Hi @CliveG ,

This is the issue then. BGP thinks it can send 7936 bytes as the TCP payload, and that does not seem to be supported by the underlay. Something might have changed in the underlay. Normally this should be detected by path MTU discovery, but it looks like it isn't. Something might be blocking the ICMP "packet too big" messages.

Can you try to find the largest packet you can send with the df-bit set?
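
One way to narrow down the largest size without trying every value is a binary search over the packet size. The following is a minimal Python sketch, not a tool from this thread: `probe` is a hypothetical callable that returns True when a df-bit ping of the given size gets a reply, and `make_fake_probe` merely simulates a path with a made-up limit so the example runs on its own. It assumes every size up to the limit passes and every size above it fails.

```python
from typing import Callable


def largest_passing_size(probe: Callable[[int], bool], low: int, high: int) -> int:
    """Return the largest size in [low, high] for which probe(size) is True.

    Returns low - 1 if even the smallest size fails.
    """
    best = low - 1
    while low <= high:
        mid = (low + high) // 2
        if probe(mid):
            best = mid        # mid passes, so look for something larger
            low = mid + 1
        else:
            high = mid - 1    # mid fails, so the limit is below it
    return best


def make_fake_probe(limit: int) -> Callable[[int], bool]:
    """Simulate a path that drops every packet larger than `limit` bytes."""
    return lambda size: size <= limit


# Search between an arbitrary small size and the full 7976-byte packet.
print(largest_passing_size(make_fake_probe(1500), 64, 7976))  # 1500
```

In practice `probe` would wrap the df-bit ping, run from each peer in turn, and the search needs only about 13 probes to cover the range from 64 to 7976.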

Regards,
Harold Ritter, CCIE #4168 (EI, SP)

CliveG · ‎03-24-2023

Hi Harold,

Okay, so the maximum I can send is 707. Anything above that and it fails. That seems very low.

Thanks

Harold Ritter · ‎03-24-2023

Hi @CliveG ,

This is definitely wrong. From your diagram, it looks like the two devices are directly connected from a layer 3 perspective. The underlay has to be the issue then.

Regards,
Harold Ritter, CCIE #4168 (EI, SP)

CliveG · ‎03-24-2023

Sorry. I am a bit lost there. So this could be the upstream provider reconnection then? Incorrectly switched back on maybe?

Harold Ritter · ‎03-24-2023

@CliveG , the big question is why you can only ping with a packet size of up to 707 if the interface MTU is 7976. There has to be something wrong in the underlay.

Regards,
Harold Ritter, CCIE #4168 (EI, SP)

CliveG · ‎03-24-2023

The reason I ask this is because, again, there has been no change in configurations or IOS since it was functioning perfectly. So how can this occur?

Harold Ritter · ‎03-24-2023

@CliveG , something might have changed in the underlay (service provider network).

Regards,
Harold Ritter, CCIE #4168 (EI, SP)

CliveG · ‎03-24-2023

This is exactly what I am thinking. We only started seeing this issue after the connections were accidentally dropped and then reconnected.