11-10-2022 11:50 AM
Hello,
For some time now we have been plagued by what seem to be random occurrences of full data plane loss on our cEdge devices.
All of these devices have MPLS connections as well as Broadband connections as primary transports with LTE configured as last resort.
We often get calls from remote sites stating they cannot reach anything. After my first-level team tinkers for a bit and unplugs the wired WAN transports, the site eventually comes up on LTE and I can check the logs. What I have been seeing is that, out of nowhere, there are messages that ALL BFD SESSIONS ARE LOST and that the device is trying to revert to last resort via LTE. Often the LTE does not pick up either until both wired transports are disconnected.
At first I thought it might be related to excessive packet loss, as I also see a lot of anti-replay errors in the logs, typically over MPLS. However, I see no other indication of carrier issues. I can typically ping the PE with no loss, vManage is not recording any loss, and the transports run over different media most of the time (fiber and coax). We primarily steer RTP/VoIP traffic over MPLS and MS Office traffic over broadband, and, as far as I understand our policies, everything else is load balanced.
Looking at control connection history I can clearly see where the drops happen and the local reason states there is a vSmart timeout detected and no remote error. So does this mean the cEdge device is responsible for the drops?
Any idea what other items I can look for?
11-10-2022 01:46 PM
11-14-2022 05:53 AM
Thank you. I looked over the document, and while the commands are different or not available on Cisco, it gives me another set of things to look at.
I had an open case with Cisco and was waiting for them to get on the phone and troubleshoot with me. After a few hours of review, we could not reach a conclusion about what was causing the issue.
The current device I was using for troubleshooting is on 17.05.01a. I did not see any bug reports related to this issue.
One thing I have noticed in our SD-WAN environment, which may or may not be related, is that we see the control connection/BFD session issues and the devices going offline when there are problems with the public internet circuits. Instead of keeping traffic flowing through the MPLS circuits, things go haywire and all data plane traffic seems to stop.
I have also noticed a trend on these devices of quite a few IPsec anti-replay errors. I am not sure whether this is related to packet loss or to QoS reordering the packets.
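To make the anti-replay question concrete, here is a minimal sketch of a generic sliding-window anti-replay check in the style described for ESP in RFC 4303. It is not Cisco's implementation, and the class name, function name, and window sizes are assumptions chosen for illustration. It shows that a packet delayed behind newer ones (for example, by queuing) is rejected once it falls outside the window, even though nothing was lost on the wire.

```python
class AntiReplayWindow:
    """Generic sliding-window replay check (illustrative, not vendor code)."""

    def __init__(self, size=64):
        self.size = size      # hypothetical window size, in packets
        self.highest = 0      # highest sequence number accepted so far
        self.bitmap = 0       # bit i set => sequence (highest - i) was seen

    def accept(self, seq):
        if seq > self.highest:
            # Newer packet: slide the window forward and mark it as seen.
            shift = seq - self.highest
            self.bitmap = ((self.bitmap << shift) | 1) & ((1 << self.size) - 1)
            self.highest = seq
            return True
        offset = self.highest - seq
        if offset >= self.size:
            return False      # too old: fell out of the window
        if self.bitmap & (1 << offset):
            return False      # duplicate: already seen
        self.bitmap |= 1 << offset
        return True


def count_drops(arrivals, size):
    """Count packets rejected by the window for a given arrival order."""
    window = AntiReplayWindow(size)
    return sum(0 if window.accept(seq) else 1 for seq in arrivals)


# Packet 2 is delayed behind packets 3..7 (reordering, no loss).
reordered = [1, 3, 4, 5, 6, 7, 2]
print(count_drops(reordered, size=4))   # small window: late packet rejected
print(count_drops(reordered, size=64))  # larger window: late packet accepted
```

Under this model, reordering alone is enough to generate anti-replay drops when the window is small relative to how far packets get delayed, which would be consistent with seeing the errors without any measured loss.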
11-16-2022 07:24 AM
I did some additional recon, and it appears that at the sites where I am seeing this phenomenon, both carriers happen to be using the same last-mile fiber. This explains why both wired transport connections have issues at the same time and the BFD sessions just drop.
We have LTE configured as the last resort if no BFD sessions are available for, I believe, 7 seconds. However, I think the primary transports are actively trying to re-create BFD sessions, which is keeping the LTE from coming up. My team and I will be looking into an alternative way to keep BFD/control sessions up on LTE and just filter traffic from using LTE unless some other thresholds are met. From what my co-managed partner has advised, it seems Cisco may not have fully released even an alpha version of SLA tracking over TLS.
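The hypothesis above can be sketched as a small simulation. This is only a model of my assumption about the behavior, not Cisco's actual last-resort logic: the function name, the once-per-second sampling, and the rule that any BFD session coming up resets the timer are all hypothetical. The 7-second hold value is the one I believe we have configured.

```python
def last_resort_activation(bfd_up_per_second, hold_seconds=7):
    """Return the second at which LTE would activate, or None if it never does.

    bfd_up_per_second: number of BFD sessions up on the wired transports,
    sampled once per second (hypothetical input format).
    Assumption: LTE activates only after hold_seconds consecutive seconds
    with zero BFD sessions, and any session coming up resets the count.
    """
    down_streak = 0
    for second, sessions_up in enumerate(bfd_up_per_second):
        if sessions_up > 0:
            down_streak = 0          # a session came up: timer resets
        else:
            down_streak += 1
            if down_streak >= hold_seconds:
                return second        # hold time satisfied: LTE activates
    return None


# Clean outage: both transports go hard down and stay down.
clean_outage = [2, 2] + [0] * 10
print(last_resort_activation(clean_outage))

# Degraded last mile: a session briefly re-forms every sixth second.
flapping = [0, 0, 0, 0, 0, 1] * 5
print(last_resort_activation(flapping))
```

In this model the clean outage triggers LTE once the hold time elapses, while the flapping case never does, which matches what we see: LTE only comes up after someone physically unplugs both wired transports.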
11-16-2022 12:21 PM
11-16-2022 12:31 PM
Agreed, it seems to be related to transport reliability. In the grand scheme of things, it is not a large number of sites with this issue. Our infrastructure is probably around 1,200 sites, and I think we have migrated close to 400 to SD-WAN so far.
I see a couple of these per week though, which is an annoyance because it is giving a lot of the involved parties the wrong impression of the value of SD-WAN. Prior to this, we used SLA tracking to trigger dynamic failover to LTE, and to be honest it seemed like less of a headache.
We also have a larger number of LTE-only sites than I care to admit. I hate leaning on LTE, but it does a decent job.
The intent of the design was to rely on LTE ONLY when it hits the fan and both wired transports are down. Had we known how common it is for carriers to share the last-mile fiber, and asked about it while vetting broadband carriers, we would probably not see this as much, but at our scale it is difficult to gauge.
That being said, is the ACL you are referring to meant as a reactive approach?
11-16-2022 06:22 PM
11-16-2022 06:25 PM
11-21-2022 01:02 PM - edited 11-21-2022 01:12 PM
If I understand the logic correctly, I think this solution would work great for using the LTE as out-of-band management only, but not so much as an actual SD-WAN transport when the primary links go down like we have been seeing.
So if my logic is correct, I might be able to manage the device fine if the primary transports went down, but the users would still not be able to connect to corporate resources because no BFD sessions would be created, correct? It would be treated as a public-only connection out to external resources, correct? Or am I wrong?
BFD sessions are a prerequisite for the IPsec tunnels to form, correct?