Solved: Re: GRE Tunnel reliability - Web of lies!

corycandia · ‎12-12-2020

Community,

I've read through about 10 posts talking about GRE and packet loss, and everyone talks about MTU/MSS. Nothing I've seen or understood addresses the underlying packet loss and how to get the tunnel to reflect it. Can one of you router ninjas enlighten me?

Given: (US HQ) ISR 2921/15.4 and (German Branch) 819/15.4. (Yes, they're old)

German branch internet sucks half of every day, about 10%-15% packet loss. Other parts of day, it's fine. Example:

Sending 50, 1200-byte ICMP Echos to 199.27.XXX.XXX, timeout is 2 seconds:
!!.!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!.!!!!..!!!!!!!
Success rate is 92 percent (46/50), round-trip min/avg/max = 120/120/128 ms

Seen above, there is a bunch of packet loss, and transfers across the GRE definitely reflect that as they slow to a crawl. I've seen as much as 18% loss.
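A rough way to see why transfers slow to a crawl at this loss rate is the Mathis approximation for the steady-state throughput of a single TCP flow. The sketch below is only an estimate: it assumes an MSS of 1360 bytes (the tunnel's ip tcp adjust-mss value), the 120 ms RTT from the ping above, and 10% loss, and the model itself becomes less accurate at loss rates this high.

```python
import math

def mathis_throughput_bps(mss_bytes: int, rtt_s: float, loss: float) -> float:
    """Approximate steady-state throughput of one TCP flow (Mathis model)."""
    if not 0 < loss < 1:
        raise ValueError("loss must be strictly between 0 and 1")
    c = math.sqrt(3 / 2)  # constant of the simple periodic-loss model
    return (mss_bytes * 8 / rtt_s) * c / math.sqrt(loss)

# Assumed inputs: MSS 1360 bytes, RTT 120 ms, 10% packet loss.
per_flow_mbps = mathis_throughput_bps(1360, 0.120, 0.10) / 1e6
print(f"{per_flow_mbps:.2f} Mbps per flow")
```

Under these assumptions a single flow lands at a few hundred kbps, far below the 30 Mbps the tunnel is shaped to.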

Now look at the tunnel info:

Tunnel0 is up, line protocol is up
Hardware is Tunnel
Description: DMVPN
Internet address is 172.20.254.3/24
MTU 17912 bytes, BW 30000 Kbit/sec, DLY 50000 usec,
reliability 255/255, txload 2/255, rxload 19/255

The GRE interface is being presented as perfect, with low traffic load, so this route stays in the routing table, even though it sucks hard at the time, and there's an 18mbps tunnel available with no packet loss.

Question 1: Is there any configuration that should be done to show the tunnel's actual condition? Are there some important commands I don't know about to get the tunnel to show it has 15% packet loss?

There's a bunch of ways to manipulate the routing table, I just don't know what's best to get routes to change based on condition since GRE hides the actual transport condition.

If I remove bandwidth, both tunnels (30mbps and 18mbps throughput tested) are equal composite.

I really don't want to hard code route selection because during no packet loss, 30 > 15.

During times of packet loss, 18 > 1 (the 30mbps tunnel turns into 1mbps)

I can't seem to figure out how to use any of the other EIGRP metrics to detect when the primary DMVPN is good or terrible.
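For reference, here is a minimal sketch of the classic EIGRP composite metric with default K values (K1 = K3 = 1, K2 = K4 = K5 = 0), where only bandwidth and delay count and load and reliability are ignored. It looks at the tunnel hop alone; a real path metric uses the lowest bandwidth and the summed delay along the whole path. The 50000 usec delay for the 18 Mbps tunnel is an assumption, since only Tunnel0's values were posted.

```python
def eigrp_classic_metric(min_bw_kbps: int, total_delay_usec: int) -> int:
    """Classic EIGRP composite metric with default K values (K1=K3=1, others 0)."""
    bw_term = 10**7 // min_bw_kbps       # inverse of the slowest bandwidth, integer math
    delay_term = total_delay_usec // 10  # delay expressed in tens of microseconds
    return 256 * (bw_term + delay_term)

primary = eigrp_classic_metric(30000, 50000)    # BW and DLY from the Tunnel0 output
secondary = eigrp_classic_metric(18000, 50000)  # 18 Mbps tunnel, delay assumed equal
print(primary, secondary)
```

Neither input changes when the transport starts dropping packets, so the computed metric is the same at 0% loss and at 15% loss.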

Question 2: How should you configure EIGRP to deal with one of the paths being sometimes fastest/best, and sometimes worst depending on time of day and provider network conditions? Is there a configuration of traffic sharing that can be done so it maxes both?

(Ignore your first thought: "Talk to your provider/SLA." That's not an option; only route changes are an option, unfortunately.)

I attached some pieces of the branch config to snack on for this topic.

Thanks,

MHM Cisco World · ‎12-12-2020

https://www.cisco.com/c/en/us/support/docs/ios-nx-os-software/ios-embedded-event-manager-eem/113696-eem-tshoot-igp-00.html

This is my suggestion, friend: EEM. I had an example script, but I searched and couldn't find it.

If I find it, I will send it to you as well.

Richard Burts · ‎12-12-2020

This presents quite a challenge. GRE/DMVPN is very difficult to evaluate for performance of the underlying transport. I wonder if some combination of IP SLA to detect response time problems/packet loss problems linked to EEM scripts to switch to the other tunnel might work for you?

HTH

Rick

MHM Cisco World · ‎12-12-2020

Cisco recommends an IP MTU of 1400 for DMVPN tunnels.

This prevents the router from fragmenting and dropping packets.
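The MTU and MSS values in the configs posted later in the thread (ip mtu 1400 with adjust-mss 1360 on Tunnel0, mtu 1492 with adjust-mss 1452 on Dialer1) follow from simple header arithmetic. A small sketch, assuming plain IPv4 and TCP headers with no options:

```python
IPV4_HEADER_BYTES = 20  # IPv4 header without options
TCP_HEADER_BYTES = 20   # TCP header without options

def tcp_mss_for_mtu(ip_mtu: int) -> int:
    """Largest TCP payload that fits in one packet of the given IP MTU."""
    return ip_mtu - IPV4_HEADER_BYTES - TCP_HEADER_BYTES

for name, mtu in (("Tunnel0", 1400), ("Dialer1", 1492)):
    print(name, mtu, tcp_mss_for_mtu(mtu))
```

The 1492 on the dialer is the usual 1500-byte Ethernet MTU minus 8 bytes of PPPoE overhead.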

corycandia · ‎12-12-2020

Did you glance at the config attached? Unless I'm doing something wrong, ip mtu 1400 is already set.

The packet drop isn't just the tunnel, it's ALL traffic between the two sites, hence the reason I need the route to change internally.

MHM Cisco World · ‎12-12-2020

...

Leo Laohoo · ‎12-12-2020

@corycandia wrote:

German branch internet sucks half of every day

The VPN traffic turns to mush during business hours?

The router is an 819 but what is the WAN speed?
The 819 is a "dual" WAN: Wired or 3G/4G. What are you using?

Your attached output shows the dialer interface's counter was cleared >2w ago. Were there any "total output drops" before the counters were cleared?

corycandia · ‎12-12-2020

"The VPN traffic turns to mush during business hours? "

Yes, but it appears to only be the traffic headed across the public internet to the US. The other route I have in the 819 is a different tunnel to Microsoft Azure West Germany. Traffic between the branch and Azure experiences no loss at all, so that tunnel passes traffic very well. That's the tunnel I would like to take over during times of loss over the primary.

The router is an 819 but what is the WAN speed?
The 819 is a "dual" WAN: Wired or 3G/4G. What are you using?

The 819 is connected wired to PPPoE over fibre @ 100mbps. The router can get 100mbps headed straight to internet, but I shape the DMVPN down to 30mbps as the encryption taps the CPU out above that.

Your attached output shows the dialer interface's counter was cleared >2w ago. Were there any "total output drops" before the counters were cleared?

I saw no significant errors or drops on that interface. That, combined with traffic toward one destination being fine while traffic toward another was not, led me to believe the issue is in the provider's network or where traffic leaves it.

I don't just use Azure as the primary route because the data moving through the peering between US and Germany costs money. The public internet DMVPN tunnel does not, but it sucks half the day.

Leo Laohoo · ‎12-12-2020

@corycandia wrote:

The 819 is connected wired to PPPoE over fibre @ 100mbps. The router can get 100mbps headed straight to internet, but I shape the DMVPN down to 30mbps as the encryption taps the CPU out above that.

Traffic shape the entire link down to 20 Mbps and see if things improve.

corycandia · ‎12-13-2020

Thanks guys, on to learn a whole new component and scripting. Wishing this was built into the routing protocol.

Richard Burts · ‎12-13-2020

You wish for something built into the dynamic routing protocols. Our dynamic routing protocols are designed to choose a best path based on static characteristics of the path (bandwidth etc) and are not influenced by changing performance of the links. It occurs to me that the Cisco implementation of performance routing was designed to do just that. This link might help you get started looking into that feature, which might or might not be applicable for your situation.

https://www.cisco.com/c/en/us/products/ios-nx-os-software/performance-routing-pfr/index.html

HTH

Rick

Joseph W. Doherty · ‎12-14-2020

When using the Internet as a "virtual link" between sites, besides using optimal MTU/MSS, it's generally very important not to oversubscribe the available bandwidth.

First, if a physical interface offers more bandwidth than some logical CIR, shape for the CIR.

Second, if sites have different interface connection bandwidths, you should shape such that the "faster" side does not overrun the "slower" side.

Third, if you logically have multi-point traffic, you need to also shape so that multiple sites, combined, do not over run the destination site.

Fourth, if an Internet connection is being used for both tunnel traffic and "raw" Internet traffic, the latter negates managing bandwidth with shapers. In these situations, it's best to use different Internet connections for tunnel traffic and "raw" Internet traffic.

The third point, above, and/or "issues" within the Internet path, cannot be easily or fully addressed just using shaping. As others have mentioned, you can do much with EEM or perhaps using (as also mentioned by Rick) PfR (or whatever it's still called). I've used OER/PfR, and configured correctly, it's "magic". It can also dynamically load balance, answering your second question. But it can do more too.

At a fairly large international company, when I enabled OER/PfR, the "problem" it created was we could no longer "see" WAN performance issues. It would see the "problem" before our monitoring tools, and then route around it. We addressed this "problem" by both monitoring OER/PfR actions and by having test streams "bypassing" OER/PfR.

If using DMVPN, some of the later variants offer an Adaptive QoS feature, see https://www.cisco.com/c/en/us/td/docs/ios-xml/ios/qos_plcshp/configuration/xe-16/qos-plcshp-xe-16-book/qos-plcshp-adaptive-qos-dmvpn.html

In theory, this would better allow usage of paths where available bandwidth keeps changing. I.e. it should, in theory, address the third and fourth points above.

corycandia · ‎12-15-2020

First, if a physical interface offers more bandwidth than some logical CIR, shape for the CIR.

I believe you're referring to something like this as an example:

policy-map POLICYMAP_ISP-SUB-LINE-RATE
 class class-default
  shape average 100000000
  service-policy POLICYMAP_WAN-EGRESS-QUEUING

interface Dialer1
 description PPPoE configuration
 mtu 1492
 bandwidth 100000
 bandwidth inherit
 ip address negotiated
 ip nat outside
 ip virtual-reassembly in
 zone-member security INTERNET
 encapsulation ppp
 ip tcp adjust-mss 1452
 delay 100
 dialer pool 1
 dialer-group 1
 ppp authentication chap pap callin
 ppp chap hostname XXX
 ppp chap password XXX
 ppp pap sent-username XXX
 ppp ipcp route default
 service-policy output POLICYMAP_ISP-SUB-LINE-RATE

Second, if sites have different interface connections bandwidths, you should shape such that the "faster" side does not over run the "slower" side.

I believe this implies something like per-tunnel DMVPN QoS, where we can prevent the sender from tapping out the receiver's inbound link?

interface Tunnel0
 description DMVPN
 bandwidth qos-reference 30000
 ip address 172.20.254.3 255.255.255.0
 no ip redirects
 ip mtu 1400
 ip pim sparse-dense-mode
 ip nhrp authentication CNDMNTCS
 ip nhrp network-id 1
 ip nhrp holdtime 300
 ip nhrp nhs dynamic nbma gateway.c3.candiamantics.com multicast
 zone-member security LAN
 ip tcp adjust-mss 1360
 nhrp group DMVPN-QOS_GERMANY
 nhrp map group DMVPN-QOS_100MBPS service-policy output POLICYMAP_DMVPN-100MBPS-LINE-RATE
 nhrp map group DMVPN-QOS_60MBPS service-policy output POLICYMAP_DMVPN-60MBPS-LINE-RATE
 qos pre-classify
 tunnel source Dialer1
 tunnel mode gre multipoint
 tunnel key 1
 tunnel protection ipsec profile IPSEC-PROFILE_DMVPN

Third, if you logically have multi-point traffic, you need to also shape so that multiple sites, combined, do not over run the destination site.

I think you were referring to the same as above.

This is what I ended up doing with EEM after a few people suggested EEM and then others corrected my script error:

event manager applet TUNNELBAND
 event timer watchdog time 300 maxrun 120
 action 010 cli command "enable"
 action 020 cli command "ping gateway.c3.candiamantics.com source dialer1 repeat 50 size 1000"
 action 040 regexp "Success rate is ([0-9]+) percent" "$_cli_result" match percent
 action 045 puts "$match"
 action 050 cli command "conf t"
 action 060 cli command "router eigrp 1"
 action 070 if $percent ge "98"
 action 075  cli command "no offset-list 0 out 1000 Tunnel0"
 action 076  cli command "no offset-list 0 in 1000 Tunnel0"
 action 080  cli command "offset-list 0 out 1000 Tunnel50"
 action 082  cli command "offset-list 0 in 1000 Tunnel50"
 action 090 else
 action 095  cli command "no offset-list 0 out 1000 Tunnel50"
 action 096  cli command "no offset-list 0 in 1000 Tunnel50"
 action 100  cli command "offset-list 0 out 1000 Tunnel0"
 action 105  cli command "offset-list 0 in 1000 Tunnel0"
 action 110 end

The offsets propagate to the hub, which previously kept seeing the branch as having two equal paths. Now it sees one good path depending on the condition of the line. Both paths use the same ISP, which is the cause of the issues at certain times of the day, as described above.
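For anyone who wants to test the applet's decision logic off-box, here is a small Python model of it. It uses the same regular expression and the same 98 percent threshold as the applet. The function names are made up for illustration, and treating a ping with no parsable result as a bad primary path is an assumption, since the applet does not handle that case.

```python
import re

SUCCESS_RE = re.compile(r"Success rate is ([0-9]+) percent")

def parse_success_rate(ping_output: str):
    """Return the success percentage from IOS ping output, or None if absent."""
    match = SUCCESS_RE.search(ping_output)
    return int(match.group(1)) if match else None

def tunnel_to_penalize(success_percent, threshold: int = 98) -> str:
    """Name the tunnel that should carry the EIGRP offset-list."""
    if success_percent is None:
        return "Tunnel0"  # assumption: an unparsable result counts as a bad primary
    # Clean primary: penalize the backup. Lossy primary: penalize the primary.
    return "Tunnel50" if success_percent >= threshold else "Tunnel0"

sample = "Success rate is 92 percent (46/50), round-trip min/avg/max = 120/120/128 ms"
print(tunnel_to_penalize(parse_success_rate(sample)))
```

With the sample line from the first post (92 percent), the offset lands on Tunnel0, so traffic shifts to Tunnel50.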

The third point, above, and/or "issues" within the Internet path, cannot be easily or fully addressed just using shaping. As others have mentioned, you can do much with EEM or perhaps using (as also mentioned by Rick) PfR (or whatever it's still called). I've used OER/PfR, and configured correctly, it's "magic". It can also dynamically load balance, answering your second question. But it can do more too.
At a fairly large international company, when I enabled OER/PfR, the "problem" it created was we could no longer "see" WAN performance issues. It would see the "problem" before our monitoring tools, and then route around it. We addressed this "problem" by both monitoring OER/PfR actions and by having test streams "bypassing" OER/PfR.
If using DMVPN, some of the later variants offer an Adaptive QoS feature, see https://www.cisco.com/c/en/us/td/docs/ios-xml/ios/qos_plcshp/configuration/xe-16/qos-plcshp-xe-16-book/qos-plcshp-adaptive-qos-dmvpn.html
In theory, this would better allow usage of paths where available bandwidth keeps changing. I.e. it should, in theory, address the third and fourth points above.

That's super interesting about OER/PfR, thanks for mentioning that experience, makes sense. New things to research.

Thanks for the insight.

Joseph W. Doherty · ‎12-16-2020

You've understood, I believe, most of what I wrote, correctly. Except perhaps for the third point.

To recap though, an example of point 1 would be:

USA T-1 <>cloud<> Europe E-1.

In this example, Europe can send "faster" than USA can receive, so we should shape Europe to USA's capacity. (BTW, why we shape would be so we can easily determine there's congestion going to USA, and if there is, we then have the option to "manage" it.)

An example of point 2.

USA 10Mbps Ethernet with a CIR of 5 Mbps <>cloud<>Europe E-1.

In this example USA can send "faster" than USA's CIR and Europe's E-1. Assuming there are other sites, USA should be shaped for both the CIR and each spoke that does not support USA's CIR rate. (BTW, not all platforms support multi-layer shaping [or at least they didn't].)
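The two-level shaping in this example can be sketched as a parent rate for the site's CIR plus one child rate per spoke, capped at whatever that spoke can receive. The function name and the "OtherSite" spoke below are hypothetical; the E-1 rate of 2048 kbps is the standard line rate.

```python
E1_KBPS = 2048  # standard E-1 line rate

def hub_shaping_plan(hub_cir_kbps: int, spoke_capacity_kbps: dict):
    """Return (parent shaper rate, per-spoke child shaper rates) in kbps."""
    parent = hub_cir_kbps
    # A child shaper never needs to exceed the parent or the spoke's own capacity.
    children = {name: min(capacity, hub_cir_kbps)
                for name, capacity in spoke_capacity_kbps.items()}
    return parent, children

# USA: 10 Mbps Ethernet with a 5 Mbps CIR. "OtherSite" is a made-up second spoke.
parent, children = hub_shaping_plan(5000, {"Europe": E1_KBPS, "OtherSite": 10000})
print(parent, children)
```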

An example of point 3.

Site1 10Mbps Ethernet <>cloud
Site2 10Mbps Ethernet <>cloud
.
.
Site# 10Mbps Ethernet <>cloud

The combination of two or more sites sending to the same destination site can overrun the destination site's capacity. In this situation, we could statically shape to ensure that cannot happen. For example, if we just had 10 sites, each with 10 Mbps, we could shape each to 1 Mbps, but, of course, we "lose" 9 Mbps at each site. This is where something like Adaptive QoS can be helpful.
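The arithmetic behind the static approach, as a minimal sketch (function names are illustrative only):

```python
def static_fair_share_kbps(dest_capacity_kbps: int, sender_count: int) -> int:
    """Per-sender shaper rate that can never overrun the destination."""
    return dest_capacity_kbps // sender_count

def stranded_kbps(site_capacity_kbps: int, shaped_kbps: int) -> int:
    """Bandwidth a site gives up when it is shaped below its own capacity."""
    return site_capacity_kbps - shaped_kbps

share = static_fair_share_kbps(10_000, 10)  # 10 senders, 10 Mbps destination
print(share, stranded_kbps(10_000, share))
```

The static rate is safe in the worst case, but it stays in force even when only one site is sending, which is the gap an adaptive shaper is meant to close.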

PfR won't help unless the destination site has more than one link. Then it could, in theory, load balance across those links, but once both links are filled, the problem remains.

Also BTW, with PfR, it can tie into QoS. So, for example, in your case where you have the two tunnels, if you keep one as the primary, I believe it can shift "important" traffic, alone, to a link not having drop issues (i.e. secondary), while leaving "unimportant" traffic on the "bad" (primary) path. (This assumes that the problem only occurs on individual tunnels when they are congested.)

Lastly, another advantage of PfR vs. using your own EEM script: PfR tries to avoid making problems for itself by moving traffic flows around so rapidly that it just as rapidly moves the problem. (I recall reading this is one reason EIGRP's additional routing metrics are generally not used, as EIGRP handles this poorly.)
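A home-grown script can borrow the same idea with hysteresis and a hold-down count, so that one bad or one good probe does not flip the route. The sketch below is a generic dampening technique and not a description of how PfR works internally. The thresholds (fail below 95, recover at 98) and the three-sample hold are made-up values for illustration.

```python
class PathSelector:
    """Prefer the primary path, switching only after several consistent probes."""

    def __init__(self, fail_below: int = 95, recover_at: int = 98, hold_samples: int = 3):
        self.fail_below = fail_below      # leave primary when success drops below this
        self.recover_at = recover_at      # return to primary at or above this
        self.hold_samples = hold_samples  # consecutive probes required to switch
        self.using_primary = True
        self._streak = 0

    def update(self, success_percent: int) -> bool:
        """Feed one probe result; return True while the primary is preferred."""
        if self.using_primary:
            wants_switch = success_percent < self.fail_below
        else:
            wants_switch = success_percent >= self.recover_at
        self._streak = self._streak + 1 if wants_switch else 0
        if self._streak >= self.hold_samples:
            self.using_primary = not self.using_primary
            self._streak = 0
        return self.using_primary

selector = PathSelector()
history = [selector.update(p) for p in (92, 99, 90, 88, 85, 99, 99, 96, 99, 99, 99)]
print(history)
```

The gap between the two thresholds keeps a path hovering around 96 to 97 percent from bouncing back and forth, and the hold count filters out a single unlucky probe.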