Solved: Re: Change OSPF Hello/Dead Intervals on Existing DMVPN Enterprise - Page 2

WMA Hell · ‎07-11-2024

Hello

I know you are busy working real problems, but I want to bounce this off others to see whether anyone has successfully changed the OSPF hello/dead intervals on existing DMVPN/NHRP/mGRE tunnels, and to get feedback on the issues they ran into when they did it.

Though I have seen many enterprise networks eat hello packets, prompting me to increase the hello/dead intervals to NBMA levels to prevent routing flaps, I have landed at an enterprise that configured them LOWER (hello 2/dead ...) than the OSPF defaults, 10/40. Normally, I would just change the timers/intervals, because with static GRE tunnels the tunnels stay up on static routes and are not reliant on a routing protocol. I can't do that here, as we have DMVPN/mGRE. If I change the head-end intervals, all the other tunnels as well as the OSPF neighbors will go down. This is uncharted territory for me. ip ospf network point-to-multipoint is configured on the head-end tunnel.

The SLA with our ISP allows 8 seconds of outage before they refund us. Our timers are set to exactly that interval, so we could also have a situation where a hello times out on every WAN blip that is still acceptable under the SLA.

How the hell do I change the hello/dead intervals and not take down all NHRP/Dynamic mGRE tunnels?

Also, standard GRE tunnels don't have keepalives out of the box so a tunnel can seem UP/UP but be down. We fix this by adding Keepalive 10 3. I don't see this on the DMVPN/mGRE/NHRP tunnel. Do I need to add it? Can I add it? Will it flap NHRP?

MHM Cisco World · ‎07-15-2024

Router# show ip ospf neighbor tunnel detail

Router# show ip ospf retransmission-list tunnel

Router# show ip ospf traffic

Please share the output of all three. In the show ip ospf traffic output, check whether the Nbr Ignored counter increases each time a neighbor flaps.

MHM

WMA Hell · ‎07-16-2024

sh ip ospf traffic

Nbr Ignored is 541 on Tunnel x on branch. Unknown neighbor is 52

Nbr Ignored is 0 on Tunnel x on hub. Unknown neighbor is 8.

Show ip ospf retransmission-list tunnel x

It just shows all the neighbors, tunnel interface and address on head end and branch.

Show ip ospf neighbor tunnel x detail

Hub

Neighbor x.x.x.x, interface address x.x.x.x
In the area 0 via interface Tunnel100
Neighbor priority is 0, State is FULL, 6 state changes
DR is 0.0.0.0 BDR is 0.0.0.0
Options is 0x12 in Hello (E-bit, L-bit)
Options is 0x52 in DBD (E-bit, L-bit, O-bit)
LLS Options is 0x1 (LR)
Dead timer due in 00:00:07
Neighbor is up for 1w0d
Index 2/4/4, retransmission queue length 0, number of retransmission 5
First 0x0(0)/0x0(0)/0x0(0) Next 0x0(0)/0x0(0)/0x0(0)
Last retransmission scan length is 1, maximum is 1
Last retransmission scan time is 0 msec, maximum is 0 msec
Neighbor x.x.x.x, interface address x.x.x.x
In the area 0 via interface Tunnel100
Neighbor priority is 0, State is FULL, 6 state changes
DR is 0.0.0.0 BDR is 0.0.0.0
Options is 0x12 in Hello (E-bit, L-bit)
Options is 0x52 in DBD (E-bit, L-bit, O-bit)
LLS Options is 0x1 (LR)
Dead timer due in 00:00:07
Neighbor is up for 10w2d
Index 1/3/3, retransmission queue length 0, number of retransmission 37
First 0x0(0)/0x0(0)/0x0(0) Next 0x0(0)/0x0(0)/0x0(0)
Last retransmission scan length is 0, maximum is 1
Last retransmission scan time is 0 msec, maximum is 1 msec

Branch

Neighbor x.x.x.x, interface address x.x.x.x
In the area 0 via interface Tunnel100
Neighbor priority is 0, State is FULL, 6 state changes
DR is 0.0.0.0 BDR is 0.0.0.0
Options is 0x12 in Hello (E-bit, L-bit)
Options is 0x52 in DBD (E-bit, L-bit, O-bit)
LLS Options is 0x1 (LR)
Dead timer due in 00:00:07
Neighbor is up for 10w2d
Index 1/2/2, retransmission queue length 0, number of retransmission 122
First 0x0(0)/0x0(0)/0x0(0) Next 0x0(0)/0x0(0)/0x0(0)
Last retransmission scan length is 1, maximum is 1
Last retransmission scan time is 0 msec, maximum is 1 msec

MHM Cisco World · ‎07-16-2024

Finally, I have something to share:
OSPF header errors
Length 0, Instance ID 0, Checksum 0, Auth Type 0,
Version 0, Bad Source 0, No Virtual Link 0,
Area Mismatch 0, No Sham Link 0, Self Originated 0,
Duplicate ID 0, Hello 0, MTU Mismatch 0,
Nbr Ignored 0, LLS 0, Unknown Neighbor 8,
Authentication 0, TTL Check Fail 0, Adjacency Throttle 0,
BFD 0, Test discard 0

This is from the show ip ospf traffic output you shared. (There is also the Nbr Ignored counter; so far I have not found any documentation about it.)

Core Issue

The received error is a transient, or self-correcting, error message. The cause consists of flapping links, a change in the router ID of the neighboring router, or missed Database (DB) packets. This means that the router received a DB packet from a neighbor that was considered dead for one of the same reasons (flapping links, a change in the router ID of the neighboring router, or missed DB packets).

To find out the cause of the error, issue the log-neighbor-changes command under Open Shortest Path First (OSPF). If the error message occurs on an infrequent basis (every few months), the cause is usually link congestion, or a link that went down.

The CPU utilization increased due to the shortest path first (SPF) algorithm being run again.

Resolution

Although it is unlikely that you will know when you missed a packet, or when your link flaps, the log-neighbor-changes command can help you know when this occurs. Once this is accomplished, you can compare it with the times of the error messages, and figure out the problem.

Configure the log-neighbor-changes command under OSPF. This helps you understand what is taking place between the neighbors.

If this is occurring every few months, it is probably due to link congestion, or a link that no longer connects. Check the underlying Layer 2 topology. If that does not help, collect data from the technical support, and open a TAC Service Request with the Cisco Technical Assistance Center (TAC).
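A minimal sketch of the logging step described above, assuming OSPF process ID 1 (a placeholder, not from the thread). On IOS the OSPF router-mode command is spelled log-adjacency-changes (log-neighbor-changes is the BGP/EIGRP form), so verify the exact keyword on your release.

```
! Process ID 1 is a placeholder.
router ospf 1
! "detail" logs every adjacency state transition, not only up/down.
 log-adjacency-changes detail
```

The resulting syslog timestamps can then be lined up against the times the counters in show ip ospf traffic increase.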

So there are two main causes:
1- Link flapping
You can check this by running IP SLA with EEM to send a syslog message whenever the link flaps, then comparing those timestamps with the OSPF neighbor state changes.
2- Link congestion
Cisco does not recommend setting the tunnel bandwidth arbitrarily; the sum of the tunnel bandwidths should equal the real bandwidth of the tunnel source interface.
@Joseph W. Doherty can help us here to check whether packets are being dropped in the queue.
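A rough sketch of the IP SLA plus EEM check for link flapping. The probe target 198.51.100.1, the source interface, the object numbers, and the applet names are all placeholders, not values from the thread.

```
! Probe the far-end WAN address every 10 seconds.
ip sla 10
 icmp-echo 198.51.100.1 source-interface GigabitEthernet0/0
 frequency 10
ip sla schedule 10 life forever start-time now
!
! Track object follows the reachability result of the probe.
track 10 ip sla 10 reachability
!
! Log a syslog message on each state change of the track object.
event manager applet WAN-PROBE-DOWN
 event track 10 state down
 action 1.0 syslog msg "WAN probe to 198.51.100.1 failed"
event manager applet WAN-PROBE-UP
 event track 10 state up
 action 1.0 syslog msg "WAN probe to 198.51.100.1 recovered"
```

If the probe messages line up with the OSPF adjacency changes in the log, the WAN path is the likely cause; if OSPF flaps without a probe failure, look at the router itself.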

thanks

MHM

Joseph W. Doherty · ‎07-16-2024

Rereading your OP, what you're asking for is a way to change the OSPF timers, across HQ and spokes, w/o any service interruption. If correct, that may actually be possible.

You've provided a possible way to accomplish that, which would be to use (floating) static routes to enable routing across the DMVPN tunnels while you reset the OSPF interfaces' hello settings.

Depending on your topology, doing this might be a manual nightmare.
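A minimal sketch of such a floating static route on the hub, assuming a hypothetical spoke LAN of 192.0.2.0/24 behind a spoke tunnel address of 10.255.0.2 (neither is from the thread). The administrative distance of 250 is higher than OSPF's 110, so the route stays out of the routing table until the OSPF route is withdrawn.

```
! Hub: backup route to one spoke LAN via that spoke's tunnel address.
! Installed only while the OSPF-learned route (AD 110) is absent.
ip route 192.0.2.0 255.255.255.0 10.255.0.2 250
```

Each spoke would need the mirror image toward the hub-side prefixes, which is where the manual effort grows with the number of sites.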

As an alternative, if you're willing to accept a service interruption of seconds (?), and assuming your routers are using NTP to synchronize their clocks and all the routers support EEM, I believe it may be possible to schedule an EEM script to run on all the routers at the same time to reconfigure the OSPF interface hello settings. The scheduled time might be set for when network activity is minimal or during some scheduled network maintenance.

Since you mention an outage SLA of 8 seconds, and since that only seems to be "known" by a legitimate OSPF peer drop, suggest you consider using a hello interval of either 1 or 2 seconds and a dead timer of 8 seconds. (The prior timers suggestion is also based on my understanding that you have no need for a quicker OSPF neighbor drop or recovery speed.)
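The scheduled EEM approach could be sketched roughly as follows. The applet name and the cron schedule (02:00 on 21 July) are placeholders, Tunnel100 is taken from the show output earlier in the thread, and the 2/8 values are the ones suggested above. Some older EEM versions may expect a name keyword before cron-entry, so check the syntax on your release.

```
! Runs once at the scheduled time on every router that has this applet,
! relying on NTP-synchronized clocks so all routers change together.
event manager applet OSPF-TIMER-CHANGE
 event timer cron cron-entry "0 2 21 7 *"
 action 1.0 cli command "enable"
 action 2.0 cli command "configure terminal"
 action 3.0 cli command "interface Tunnel100"
 action 4.0 cli command "ip ospf hello-interval 2"
 action 5.0 cli command "ip ospf dead-interval 8"
 action 6.0 cli command "end"
 action 7.0 syslog msg "OSPF timers on Tunnel100 changed by EEM"
```

Remove the applet after the change window, otherwise it will fire again on the same date next year.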

If you actually want the tunnels to go down too, or rely on them going down to also take down OSPF, as I referenced in an earlier reply, that might be done with, if supported, IPsec Dead Peer Detection. I don't recall if I've ever used that feature, and it too may have a similar issue as changing OSPF timers, i.e. service interruption, and length of service interruption. Again, scheduled EEM scripts might be the best way to minimize the service interruption.
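For reference, a hedged sketch of how IPsec Dead Peer Detection is typically enabled. The thread does not say whether the tunnels use IKEv1 or IKEv2 (or IPsec at all), so both forms are shown; the 10-second interval, 3-second retry, and the profile name are placeholders.

```
! IKEv1: send DPD probes every 10 s, retrying every 3 s on no response.
crypto isakmp keepalive 10 3 periodic
!
! IKEv2 equivalent, under a hypothetical profile name.
crypto ikev2 profile DMVPN-IKEV2-PROFILE
 dpd 10 3 periodic
```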

Since both OSPF hello and/or IPsec Dead Peer Detection messages rely on being received to validate end-to-end connectivity, inadvertently losing such packets will have, as you've correctly described, a needless and nasty impact.

Unfortunately, when running across another network, about the best you can do is ensure you're within your CIR-like bandwidth allowances, and also ensure critical traffic, like OSPF hellos or IPsec Dead Peer Detection messages, is neither dropped nor delayed when being sent into the transit network.

Oh, BTW, one painful aspect of DMVPN Phases 2 or 3, which allow spoke-to-spoke traffic, is that they also permit multiple points to send to another point, causing congestion coming out of the transit network. (It's also a possible issue with DMVPN Phase 1, but for that, you can, at the hub, shape for each spoke, and at each spoke, shape so that the spokes' aggregate will not overrun the hub.)

Lastly, another hint, often service providers provide "wire" equivalent bandwidth, but many Cisco shapers (or policers) don't allow for non-L3 overhead. I've found, shaping about 15% slower than CIR often stays under CIR (but not always). Also, some of the latest shaper implementations allow you to assign a fixed amount of overhead to each packet for shaping bandwidth consumption.
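As a worked example of the 15% rule of thumb, assume a hypothetical 10 Mbps CIR: 10,000,000 bps x 0.85 = 8,500,000 bps. A minimal shaper sketch follows; the policy name and interface are placeholders.

```
! Shape all egress traffic to 85% of an assumed 10 Mbps CIR.
policy-map SHAPE-UNDER-CIR
 class class-default
  shape average 8500000
!
interface GigabitEthernet0/0
 service-policy output SHAPE-UNDER-CIR
```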

WMA Hell · ‎07-18-2024

You are right about using static routes to keep the tunnels up during the OSPF change. Static GRE tunnels have the benefit of allowing you to do that.

I like the idea of EEM as I rarely use that and would love to play with it. Is there a template EEM script for something like this?

Also, I was not aware of IPsec Dead Peer Detection.

Historically, I have always set the timers for NBMA 120/180; that way you may not drop due to a false positive because the cloud ate your hello packets, and you won't inhibit notification of your connection, as it will drop immediately because of the NHRP native connection "keepalives".

Since NTP synchronizes the clocks (stratum 2), I like the idea of pushing an EEM script for Tunnel x on all NHRP hub and spoke devices, which total 5 devices, to configure IPsec Dead Peer Detection and hello/dead timers.

I asked management if I could read the SLA for our ISP. We shall see if I get to do that.

Joseph W. Doherty · ‎07-18-2024

@WMA Hell wrote:

I like the idea of EEM as I rarely use that and would love to play with it. Is there a template EEM script for something like this?

Unfortunately, I don't have one, but these forums have lots of engineers who appear to like working with EEM scripts. So, if you post a (new) question on that, good chance you'll get lots of help.

@WMA Hell wrote:

Also, I was not aware of IPsec Dead Peer Detection.

I wrote earlier that I didn't recall whether I've used IPsec Dead Peer Detection, but reviewing how to enable it, I do recall having used it, although not the actual cases. (It was probably to take down an interface, which SNMP monitoring often alarms on, more so than on losing a routing peer.)

@WMA Hell wrote:

. . . all NHRP hub and spoke devices, which total 5 devices, to configure IPsec Dead Peer Detection and hello/dead timers.

Personally, for only 5 devices, I would likely make the changes manually. (Although it's worthwhile learning to use EEM.)

Again, if you have floating static routes, you shouldn't have any loss of connectivity, but if you want to avoid the work/hassle (if much?) of using floating static routes, you often run into the problem that, while changing the remote side's config, you lose remote access. There are a couple of ways to mitigate that.

Firstly, since you should only be breaking OSPF, if your remote access is from the hub router to the spoke's tunnel IP, then, as both routers have (logically) a shared directly connected network, remote access should continue to work. (Likely, you're already aware of this.)

Secondly, if you're doing a config change of multiple statements that, during application, will break the connection, I've found you can place the config changes into a file, copy it onto the remote device's flash, and then, on the remote device, copy that file to the running config. (Your remote session may break while the file's configuration statements are being applied, but they will all be applied.)
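A sketch of that file-based approach, run from the remote router's exec prompt. The file name, server address, and user are hypothetical; the file is assumed to contain the interface Tunnel100 timer commands followed by end.

```
! Stage the change file on the remote router's flash first.
copy scp://admin@198.51.100.10/ospf-timers.cfg flash:ospf-timers.cfg
! Then merge the whole file into the running configuration in one step.
copy flash:ospf-timers.cfg running-config
```

Because the file is read locally from flash, the remaining statements are still applied even if the session drops partway through.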

For "disaster" recovery backup, in the past, I've scheduled a reload in 5 to 15 minutes, right before I start messing with the remote device. On newer IOSs, believe they have (optional?) configuration rollback options, but don't recall if those can also be scheduled.
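Both safety nets could look roughly like this. The 10-minute timers and the archive path are placeholders, and the rollback variant assumes a release that supports the configuration archive and timed rollback features.

```
! Option 1: reload in 10 minutes unless cancelled. Do not save the config
! until the change is verified, or the reload will not undo it.
reload in 10
! ... apply the changes, confirm access still works, then:
reload cancel
!
! Option 2: timed configuration rollback. The archive must be configured first.
archive
 path flash:rollback-
!
! From exec mode: changes revert automatically after 10 minutes unless confirmed.
configure terminal revert timer 10
! ... apply the changes, type end, confirm access still works, then:
configure confirm
```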

@WMA Hell wrote:

Historically, I have always set the timers for NBMA 120/180; that way you may not drop due to a false positive because the cloud ate your hello packets, and you won't inhibit notification of your connection, as it will drop immediately because of the NHRP native connection "keepalives".

Again, fully appreciate the desire to avoid false loss of connectivity due to lost control packets. Your approach does preclude the problem being directly caused by routing hellos, but if counting on any kind of packet-based keepalives, loss of those packets is still a problem.

Having used WAN provider frame-relay, ATM and MPLS clouds, one of the most effective ways (usually) is to ensure you're in conformance with your provider's bandwidths, both logical, like a CIR, and physical interface cloud connections.

Simple (pure) hub-spoke example, hub has FE handoff, with 50 Mbps CIR, and spokes have Ethernet hand-offs, with 10 Mbps CIR.

At the hub, you should ensure you don't exceed 50 Mbps on the interface, and don't exceed 10 Mbps to any one spoke.

Spokes, together, should not exceed either 50 Mbps (if CIR also applies to cloud=>hub traffic) or 100 Mbps. Of course, in this example, 5 or fewer 10 Mbps spokes (logically) cannot overrun the hub, but what if there are more than 5 (or 10) such spokes? (The latter is its own can of worms, many things you might do, but I don't want to digress even further.)
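One way the per-spoke limit at the hub could be expressed is with per-tunnel QoS keyed on NHRP groups. This is a sketch assuming that feature is available on your platform and release; the group and policy names are placeholders, and the 10 Mbps rate comes from the example above.

```
! Hub: shape traffic to each spoke that registers with group SPOKE-10M.
policy-map SHAPE-SPOKE-10M
 class class-default
  shape average 10000000
!
interface Tunnel100
 ip nhrp map group SPOKE-10M service-policy output SHAPE-SPOKE-10M
!
! Spoke: announce the group name to the hub during NHRP registration.
interface Tunnel100
 ip nhrp group SPOKE-10M
```

Keeping the hub's aggregate under its own 50 Mbps CIR is a separate concern and would need its own limit on the physical interface, subject to platform restrictions on combining the two.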

If you're working within your WAN bandwidth limits, which your traffic might congest, you also want to ensure those packets are not dropped by you (remember, ideally, if within the cloud's bandwidth capabilities, none of the WAN packets should be dropped), which might be implicitly done by the earlier mentioned pak_priority. Again, pak_priority ensures such packets are egress queued, but doesn't guarantee their timely forwarding (although, unless you're using very tight dead timers, it's unlikely they will be delayed so long that they are considered lost).

Again, what's really important, is not hitting a WAN cloud provider's bandwidth restriction where they just "randomly" discard your packets.

MHM Cisco World · ‎07-18-2024

WMA Hell · ‎07-18-2024

I see there is an EEM script in the works. I like the idea of playing with EEM as I have never used it. I wonder if I could just leverage SolarWinds for this?

MHM Cisco World · ‎07-18-2024

But it will not solve the issue.

It is only there to detect whether OSPF goes down because the WAN is flapping.

You would use EEM together with IP SLA probes toward the neighbor, with IP SLA sending at a period shorter than the OSPF hello interval. And even then, what can EEM do?

So the EEM I use is for detecting that the WAN is down/flapping, so you can contact the SP and inform them about the issue.

I also mentioned two commands:

show interface

show policy-map interface

These show whether the drops come from queue congestion.

I.e., we need to know whether it is a router issue or a WAN SP issue.

MHM