08-18-2024 07:02 AM - edited 08-18-2024 08:41 AM
hello everyone,
we have a weird situation with BGP between two SDWAN routers (ASR1001X) and a Distribution Core (C6824-X-LE-40G).
bear in mind that this iBGP had been up and running for about a year before we did an IOS code upgrade on the SDWAN routers. the same code upgrade was done on 6 routers in total; the other 4 are working fine - BGP is fine - only the 2 in question are not. we also have the same equipment in our Asia DC, and BGP works fine there.
(on the SDWAN side the code is 17.09.05 and on the 6K it's 15.5(1)SY7)
now the weird part: even though BGP is flapping every ~45 seconds, the 6K side does not learn any routes from SDWAN (around 300 routes advertised), while on the SDWAN side we do learn the ~1.4K routes that Distribution advertises towards SDWAN. so within that short window routes/packets are exchanged, but they are learned only one way.
you would be inclined to say "check your filters and route-maps" - we did, they are identical across all 3 DCs, and we even cleared and re-applied them, with no change in stability or route learning.
you will also say to look at the MTU; in the BGP neighbor details we see the maximum datagram was negotiated to 1468 bytes, and since routes are learned on the SDWAN side, we don't expect an MTU issue.
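(for anyone who wants to rule MTU in or out more directly, a DF-bit ping sweep across the peering plus a look at the negotiated segment size is a quick check - this is only an illustrative sketch, the addresses are the peering IPs that appear later in this thread:)

ping 10.4.2.10 size 1500 df-bit repeat 3
ping 10.4.2.10 size 1508 df-bit repeat 3
show ip bgp neighbors 10.4.2.10 | include segment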
we did captures on the SDWAN side and can clearly see the BGP data being exchanged properly; we did captures on the Distribution side as well and see the BGP TCP traffic, but it is not decoded as BGP - you'll see it in the screenshots. maybe the 6K packet capture simply works differently from the SDWAN packet capture.
(can someone clarify why the traffic is presented differently? could it be that on the 6K side the capture was not bidirectional, even though we set it to capture both directions?)
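(for context on how the SDWAN-side capture was taken: an embedded capture on IOS-XE is set up roughly as below - the capture name and interface are illustrative, not our exact commands; the 6K runs classic IOS with a different capture mechanism, which may explain some of the presentation differences)

monitor capture CAP interface Port-channel10.2 both
monitor capture CAP match any
monitor capture CAP start
monitor capture CAP stop
monitor capture CAP export bootflash:CAP.pcap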
so, has anyone encountered anything similar and has ideas? please share, as we have tried almost everything except reloading the 6K Distribution - we shut/no shut ports, reloaded the ASRs, re-applied the respective node configuration, and nothing worked.
thank you,
PS: packet captures are available here; if anyone sees anything, please share, as I'm learning every day:
(https://file.io/tsHRr3kt4WaE - not working anymore)
https://uploadnow.io/f/rwZnB0Y
08-19-2024 01:59 AM
we've fixed the issue - it was caused by an MTU misalignment.
from the response I posted on reddit....
first of all, thank you everyone for the ideas.
indeed, everything pointed to an MTU issue - I never denied that, I just had a hard time seeing it.
what you have to keep in mind is that before Saturday, around 19:15 CET, everything was working between SDWAN and the 6K; BGP had been up and running for almost a year, with terabytes of traffic passing over it daily (not that it matters).
on the SDWAN side we have an MTU of 1508 on the port-channel, while subinterface 2 (Po10.2) has an MTU of 1500
on the 6K side we had (it has since been changed) an MTU of 1500 on the port-channel interface
looking at all the interfaces on the 6K and the routers, we did not see any errors or anything else that would point to a physical issue, as others suggested/asked about
anyway, what we think caused the whole mess was the Vlan2 MTU on the 6K, where we had JUMBO frames configured (due to some ACI requirements); we suppose that because of this, the BGP session sourced from the Vlan2 IP (10.4.2.10) would try to send frames towards the other side close to the jumbo MTU that was set.
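to make the layout easier to follow, the relevant MTU settings looked roughly like this before the fix (interface names, the mask and the exact jumbo value are illustrative - only the MTU values and the Vlan2 jumbo setting come from the description above):

! SDWAN (ASR1001X) side
interface Port-channel10
 mtu 1508
interface Port-channel10.2
 mtu 1500
!
! 6K (C6824-X-LE-40G) side, before the change
interface Port-channel20
 mtu 1500
interface Vlan2
 mtu 9216
 ip address 10.4.2.10 255.255.255.0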
still, I can't explain why BGP shows a datagram of 1468 bytes... I'll have to read up on that and see exactly how datagram sizes are used in BGP sessions....
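(one thing that does seem to line up, although I'm not certain it's the whole story: the "max data segment" in the neighbor output is simply the negotiated TCP MSS, and adding back the 20-byte IP and 20-byte TCP headers gives exactly the 1508-byte port-channel MTU on the SDWAN side)

1468 (MSS) + 20 (TCP header) + 20 (IP header) = 1508 (SDWAN port-channel MTU)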
BGP activity 126690/124284 prefixes, 1567755/1563786 paths, scan interval 30 secs
Neighbor V AS MsgRcvd MsgSent TblVer InQ OutQ Up/Down State/PfxRcd
10.4.2.11 4 65004 1522 1418 2577379 0 0 16:57:00 393
10.4.2.12 4 65004 1515 1427 2577379 0 0 16:56:36 399
bottom line: it was indeed an MTU issue, but the mismatch was only "visible" outside of the directly involved interfaces, i.e. the port-channel uplinks on the 6K.
the packet captures taken on the 6K show the same thing (see the picture ending in 48, which was captured on the 6K side) - the 6K packets are SMALL (visible in the PCAP as well), under 60 bytes.
new screenshots/pictures: https://uploadnow.io/files/DLJrnnf
a new pcap from the 6K side, taken after the MTU was changed, is attached above for anyone interested
my/our expectation was to see failed negotiations, retransmissions and so on, not everything green.
if you look at the capture from the SDWAN side, everything looks perfect - or can someone point me to something I'm missing there? that is why you do captures on both sides, so you are not misled.
so, thank you all for bearing with all the questions and the dumb logic. because this SDWAN-to-6K connection (BGP and the rest of the DC-to-sites traffic) had worked fine for almost a year, we were not convinced it was an MTU issue, until we went back through the logic of who originates packets from where and determined that since the 6K's Vlan2 uses JUMBO frames, it will try to send larger packets. we set the 6K port-channels towards SDWAN to JUMBO frames as well, and for the last ~17 hours we have been OK.
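for completeness, the change on the 6K side was conceptually along these lines (the interface name and the exact jumbo value are illustrative - the post only says the port-channels towards SDWAN were set to jumbo frames), followed by a check of the negotiated segment size:

! 6K side - raise the MTU on the port-channel uplinks towards the SDWAN routers
interface Port-channel20
 mtu 9216
!
! verify the BGP session afterwards
show ip bgp neighbors 10.4.2.11 | include segment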
08-18-2024 07:14 AM
Additional information.....
for debugs on the 6K side see: https://pastebin.com/s2gHDXHB
"show ip bgp neighbor X" for both sides:
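(if you're skimming those outputs, the line most relevant to the later MTU discussion is the negotiated TCP segment size, which in the IOS "show ip bgp neighbors" output looks roughly like this:)

Datagrams (max data segment is 1468 bytes):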
08-19-2024 01:23 AM
Hello
Sounds like you may have some recursive routing, which is why the iBGP sessions are dropping. How are you connecting the iBGP peers - via loopbacks or directly connected interfaces?
Do you have a full-mesh iBGP, or are you using RRs/confederations etc.?
What IGP is in use?
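(for reference, a quick way to check whether a peer address is being resolved recursively - i.e. reached via a route that BGP itself learned - is to look at how it is routed; the addresses below are simply the peering IPs from the outputs earlier in the thread:)

show ip route 10.4.2.11
show ip cef 10.4.2.11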
08-19-2024 02:01 AM
hey paul,
no recursive routing, as you suggest
we're using direct fiber connections - same rack and everything - and the traffic is initiated from loopbacks....
08-19-2024 06:10 AM
hello
thanks for the update - MTU was definitely a possibility, but as you had stated it was all checked and verified as okay, I was thinking about additional root causes. Good to know you've found the root cause to be MTU and shared it with the community.