08-18-2024 07:02 AM - edited 08-18-2024 08:41 AM
hello everyone,
we have a weird situation with BGP between two SDWAN routers (ASR1001X) and a Distribution Core (C6824-X-LE-40G).
bear in mind that this iBGP had been up and running for about a year before we did an IOS code upgrade on the SDWAN routers. the same code upgrade was done on 6 routers in total; the other 4 are working fine - BGP is fine - only the 2 in question are not. we also have the same equipment in our Asia DC, and BGP works fine there.
(on the SDWAN side the code is 17.09.05 and on the 6K it's 15.5(1)SY7)
now the weird part: even though BGP is flapping every ~45 seconds, the 6K side does not learn any routes from SDWAN (around 300 routes advertised), while on the SDWAN side we do learn the ~1.4K routes that Distribution advertises towards SDWAN. so within that short window routes/packets are exchanged, but they are learned only one way.
you would be inclined to say "check your filters and route-maps" - we did, they are identical across all 3 DCs, and we even cleared and re-applied them, with no change in stability or route learning.
you will also say to look at the MTU; in the BGP neighbor details we see the maximum datagram was negotiated to 1468 bytes, and since routes are learned on the SDWAN side, we don't expect an MTU issue.
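(for anyone who wants to rule MTU in or out more directly, a DF-bit ping sweep across the peering plus a look at the negotiated segment size is a quick check - this is only an illustrative sketch, the addresses are the peering IPs that appear later in this thread:)

ping 10.4.2.10 size 1500 df-bit repeat 3
ping 10.4.2.10 size 1508 df-bit repeat 3
show ip bgp neighbors 10.4.2.10 | include segment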
we did captures on the SDWAN side and can clearly see the BGP data being exchanged properly; we did captures on the Distribution side as well and see the BGP TCP traffic, but it is not decoded as BGP - you'll see it in the screenshots. maybe the 6K packet capture simply works differently from the SDWAN packet capture.
(can someone clarify why the traffic is presented differently? could it be that on the 6K side the capture was not bidirectional, even though we set it to capture both directions?)
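(for context on how the SDWAN-side capture was taken: an embedded capture on IOS-XE is set up roughly as below - the capture name and interface are illustrative, not our exact commands; the 6K runs classic IOS with a different capture mechanism, which may explain some of the presentation differences)

monitor capture CAP interface Port-channel10.2 both
monitor capture CAP match any
monitor capture CAP start
monitor capture CAP stop
monitor capture CAP export bootflash:CAP.pcap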
so, has anyone encountered anything similar and has ideas? please share, as we have tried almost everything except reloading the 6K Distribution - we shut/no shut ports, reloaded the ASRs, re-applied the respective node configuration, and nothing worked.
thank you,
PS: packet captures are available here; if anyone sees anything, please share, as I'm learning every day:
(https://file.io/tsHRr3kt4WaE - not working anymore)
https://uploadnow.io/f/rwZnB0Y
08-19-2024 01:59 AM
we've fixed the issue - it was caused by an MTU misalignment.
from the response I posted on reddit....
first of all, thank you everyone for the ideas.
indeed, everything pointed to an MTU issue - I never denied that, I just had a hard time seeing it.
what you have to keep in mind is that before Saturday, around 19:15 CET, everything was working between SDWAN and the 6K; BGP had been up and running for almost a year, with terabytes of traffic passing over it daily (not that it matters).
on the SDWAN side we have an MTU of 1508 on the port-channel, while subinterface 2 (Po10.2) has an MTU of 1500
on the 6K side we had (it has since been changed) an MTU of 1500 on the port-channel interface
looking at all the interfaces on the 6K and the routers, we did not see any errors or anything else that would point to a physical issue, as others suggested/asked about
anyway, what we think caused the whole mess was the Vlan2 MTU on the 6K, where we had JUMBO frames configured (due to some ACI requirements); we suppose that because of this, the BGP session sourced from the Vlan2 IP (10.4.2.10) would try to send frames towards the other side close to the jumbo MTU that was set.
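to make the layout easier to follow, the relevant MTU settings looked roughly like this before the fix (interface names, the mask and the exact jumbo value are illustrative - only the MTU values and the Vlan2 jumbo setting come from the description above):

! SDWAN (ASR1001X) side
interface Port-channel10
 mtu 1508
interface Port-channel10.2
 mtu 1500
!
! 6K (C6824-X-LE-40G) side, before the change
interface Port-channel20
 mtu 1500
interface Vlan2
 mtu 9216
 ip address 10.4.2.10 255.255.255.0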
still, I can't explain why BGP shows a datagram of 1468 bytes... I'll have to read up on that and see exactly how datagram sizes are used in BGP sessions....
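(one thing that does seem to line up, although I'm not certain it's the whole story: the "max data segment" in the neighbor output is simply the negotiated TCP MSS, and adding back the 20-byte IP and 20-byte TCP headers gives exactly the 1508-byte port-channel MTU on the SDWAN side)

1468 (MSS) + 20 (TCP header) + 20 (IP header) = 1508 (SDWAN port-channel MTU)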
BGP activity 126690/124284 prefixes, 1567755/1563786 paths, scan interval 30 secs
Neighbor V AS MsgRcvd MsgSent TblVer InQ OutQ Up/Down State/PfxRcd
10.4.2.11 4 65004 1522 1418 2577379 0 0 16:57:00 393
10.4.2.12 4 65004 1515 1427 2577379 0 0 16:56:36 399
bottom line: it was indeed an MTU issue, but the mismatch was only "visible" outside of the directly involved interfaces, i.e. the port-channel uplinks on the 6K.
the packet captures taken on the 6K show the same thing (see the picture ending in 48, which was captured on the 6K side) - the 6K packets are SMALL (visible in the PCAP as well), under 60 bytes.
new screenshots/pictures: https://uploadnow.io/files/DLJrnnf
a new pcap from the 6K side, taken after the MTU was changed, is attached above for anyone interested
my/our expectation was to see failed negotiations, retransmissions and so on, not everything green.
if you look at the capture from the SDWAN side, everything looks perfect - or can someone point me to something I'm missing there? that is why you do captures on both sides, so you are not misled.
so, thank you all for bearing with all the questions and the dumb logic. because this SDWAN-to-6K connection (BGP and the rest of the DC-to-sites traffic) had worked fine for almost a year, we were not convinced it was an MTU issue, until we went back through the logic of who originates packets from where and determined that since the 6K's Vlan2 uses JUMBO frames, it will try to send larger packets. we set the 6K port-channels towards SDWAN to JUMBO frames as well, and for the last ~17 hours we have been OK.
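for completeness, the change on the 6K side was conceptually along these lines (the interface name and the exact jumbo value are illustrative - the post only says the port-channels towards SDWAN were set to jumbo frames), followed by a check of the negotiated segment size:

! 6K side - raise the MTU on the port-channel uplinks towards the SDWAN routers
interface Port-channel20
 mtu 9216
!
! verify the BGP session afterwards
show ip bgp neighbors 10.4.2.11 | include segment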
08-18-2024 07:14 AM
Additional information.....
for debugs on the 6K side see: https://pastebin.com/s2gHDXHB
"show ip bgp neighbor X" for both sides:
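(if you're skimming those outputs, the line most relevant to the later MTU discussion is the negotiated TCP segment size, which in the IOS "show ip bgp neighbors" output looks roughly like this:)

Datagrams (max data segment is 1468 bytes):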
08-19-2024 01:23 AM
Hello
Sounds like you may have some recursive routing, which is why the iBGP sessions are dropping. How are you connecting the iBGP peers - via loopbacks or directly connected interfaces?
Do you have a full-mesh iBGP, or are you using RRs/confederations etc.?
What IGP is in use?
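(for reference, a quick way to check whether a peer address is being resolved recursively - i.e. reached via a route that BGP itself learned - is to look at how it is routed; the addresses below are simply the peering IPs from the outputs earlier in the thread:)

show ip route 10.4.2.11
show ip cef 10.4.2.11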
08-19-2024 02:01 AM
hey paul,
no recursive routing, as you suggest
we're using direct fiber connections - same rack and everything - and the traffic is initiated from loopbacks....
08-19-2024 06:10 AM
hello
thanks for the update - MTU was definitely a possibility, but as you had stated it was all checked and verified as okay, I was thinking about additional root causes. Good to know you've found the root cause to be MTU and shared it with the community.