
Issues with all BFD sessions dropping on cEdge devices

Hello, 

We have been plagued for some time now by what seem like random occurrences of full dataplane loss on our cEdge devices.

All of these devices have MPLS and broadband connections as their primary transports, with LTE configured as last resort.

We often get calls from remote sites stating they cannot reach anything. After my first-level team tinkers for a bit and unplugs the wired WAN transports, the site eventually comes up on LTE and I can check logs. What I keep seeing is that, out of nowhere, there are messages that ALL BFD SESSIONS ARE LOST and the device is trying to revert to last resort via LTE. Often the LTE does not come up either until both wired transports are disconnected.

I thought at first it might be related to excessive packet loss, as I also see a lot of anti-replay errors in the logs, typically over MPLS. However, I see no other indication of carrier issues: I can typically ping the PE with no loss, vManage is not recording any loss, and the transports are usually over different mediums (fiber and coax). We primarily steer RTP/VoIP traffic over MPLS and MS Office traffic over broadband, and everything else load-balances, to my understanding of our policies.

Looking at the control connection history, I can clearly see where the drops happen; the local reason states a vSmart timeout was detected, with no remote error. Does this mean the cEdge device is responsible for the drops?

Any idea what other items I can look for?  
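For reference, these are the outputs I have mainly been working from so far (IOS-XE SD-WAN command names; exact availability may vary by release):

```
show sdwan bfd sessions
show sdwan bfd history
show sdwan control connections
show sdwan control connection-history
```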


svemulap@cisco.com
Cisco Employee
Couple of things could be happening.
Take a look at: https://www.cisco.com/c/en/us/support/docs/routers/sd-wan/214510-troubleshoot-bidirectional-forwarding-de.html
Even though it is specific to vEdge, it will provide useful information for cEdge too.

Also, one other note: I didn't catch which release the devices are on. Cross-check for any known bugs on that release.
You can also open a case with TAC to get further assistance.

HTH.

Thank you. I looked over the document, and while the commands are different or not available on the Cisco side, it gives me another set of things to look at.

I had an open case with Cisco and was waiting for them to get on the phone and troubleshoot with me. After a few hours of review, we were unable to conclude what was causing the issue.

The current device I was using for troubleshooting is on 17.05.01a. I did not see any bug reports related to this issue. 

One thing I have noticed in our SD-WAN environment that may or may not be related: it is when there are problems with the public Internet circuits that we see control connection/BFD session issues and the devices go offline. Instead of keeping traffic flowing through the MPLS circuits, things just go haywire and all dataplane traffic seems to stop.

I have also noticed a trend on these devices of quite a few IPsec anti-replay errors. I am not sure whether this is related to packet loss or to QoS reordering the packets.
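If the anti-replay errors do turn out to correlate with QoS reordering rather than genuine loss, one knob that exists on the SD-WAN side is the IPsec replay window. A sketch of where it lives on a cEdge (the value shown is just an example; supported values and defaults depend on platform and release, so verify before changing it):

```
config-transaction
 security
  ipsec
   replay-window 1024
```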

I performed some additional recon, and at the sites where I am seeing this phenomenon, both carriers happen to be using the same last-mile fiber. This explains why both wired transport connections have issues at the same time and we see all BFD sessions drop.

We have LTE configured as the last resort if no BFD sessions are available for, I believe, 7 seconds. However, I think the primary transports are actively trying to create BFD sessions, which is keeping the LTE from coming up. My team and I will be looking into an alternate way to keep BFD/control sessions up on LTE and simply filter traffic from using LTE unless certain other conditions are met. It seems Cisco may not have fully released an alpha version of SLA tracking over TLS, from what I have been advised by our co-managed partner.
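For what it's worth, the knob driving this behavior on our side is the last-resort flag on the cellular tunnel interface. Roughly like this on a cEdge (the interface name is site-specific and shown only as an example):

```
sdwan
 interface Cellular0/2/0
  tunnel-interface
   last-resort-circuit
```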

It looks like the fundamental problem is the underlying transport, which is not reliable.
LTE has its own challenges. I am not sure how much bandwidth you have on these circuits and whether they can handle production data traffic on a daily basis,
not just in failover scenarios.

If this is not an issue for LTE, then until the transport issues are resolved, you can configure an ACL to block the BFD data connection(s) from coming up.
Just sharing some thoughts.

Agreed it seems to be transport reliability related. In the grand scheme of things it is not a large amount of sites with this issue. Our infrastructure is probably 1200ish sites. I think we are close to 400 migrated to SDWAN so far. 

I see a couple of these per week, though, which is an annoyance because it is giving the wrong impression of the value of SD-WAN to a lot of involved parties. We used SLA tracking to trigger dynamic failover to LTE prior to this, and it seemed like less of a headache, to be honest.

We also have a larger number of LTE-only sites than I care to admit. I hate leaning on LTE, but it does a decent job.

The intent of the design was to rely on LTE ONLY when it hits the fan and both wired transports are down. Had we known how common it is for carriers to share last-mile fiber, and asked those questions while vetting broadband carriers, we would probably not see this as much; but we are on such a large scale that it is difficult to gauge.

That being said, is the ACL you are referring to a reactive approach?

The ACL is configured on the (local) node and prevents the BFD data path from getting established.
The ACL would match the remote public IP address and deny it. It could be more granular too,
for example, based on the port, etc.

The key is that you want the control connection to be UP/UP on this interface/TLOC, but you don't want a BFD session to be established.
I am not sure if you have a hub-and-spoke design or a regional mesh.

A (pseudo) sample config would be:

policy
 access-list deny-BFD-IN
  sequence 10
   match
    source-ip <remote-public-IP>/32
   action drop
  default-action accept

It is important to have default-action accept at the end. It allows all traffic other than what is matched in sequence 10 in the example above.

Once configured, apply the ACL inbound on the local device's WAN transport where you do not want a BFD session, due to the underlay issues.
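The apply step, in the same pseudo style as the sample above (interface names are placeholders, and exact syntax differs between vEdge and cEdge, so verify on your platform):

```
sdwan
 interface GigabitEthernet0/0/0
  access-list deny-BFD-IN in
```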

If you want to implement, please make sure to verify it on a lab / test node.

HTH.


I think this solution would work great for treating the LTE like an out-of-band management link, if I understand the logic correctly, but not so much for using it as an actual SD-WAN transport when the primary links go down, like we have been seeing.

So if my logic is correct, I might be able to manage the device fine if the primary transports went down, but users would still not be able to reach corporate resources, since no BFD sessions would be created, correct? This would effectively be treated as a public-only connection out to external resources, correct? Or am I wrong?

BFD sessions are a prerequisite for the IPsec tunnels to form, correct?