Change OSPF Hello/Dead Intervals on Existing DMVPN Enterprise

WMA Hell
Level 1

Hello

I know you are busy working real problems, but I want to bounce this off others: has anyone successfully changed the OSPF hello/dead intervals on existing DMVPN/NHRP/mGRE tunnels, and what issues did you run into when you did?

Though I have seen many enterprises miss hellos and expire dead timers, prompting me to increase those intervals to NBMA levels to prevent routing flaps, I have landed at an enterprise that configured them LOWER (hello 2 / dead 8) than the OSPF defaults (10/40). Normally I would just change the timers/intervals, since the tunnels stay up thanks to static routes rather than a routing protocol, but I can't do that here because we have DMVPN/mGRE. If I change the head-end intervals, all the other tunnels as well as the OSPF neighbors will go down. This is uncharted territory for me. ip ospf network point-to-multipoint is configured on the head-end tunnel.

The SLA with the ISP is 8 seconds before they refund us for an outage. Our dead timer is set to exactly that interval, so we could also have a situation where hellos time out on every otherwise-acceptable WAN blip.

How the hell do I change the hello/dead intervals and not take down all NHRP/Dynamic mGRE tunnels?

Also, standard GRE tunnels don't have keepalives out of the box, so a tunnel can appear UP/UP but actually be down. We fix this by adding keepalive 10 3. I don't see this on the DMVPN/mGRE/NHRP tunnel. Do I need to add it? Can I add it? Will it flap NHRP?
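For reference, this is what we use today on a point-to-point GRE tunnel (the interface name here is just an example):

interface Tunnel0
 keepalive 10 3
! probe the far end every 10 seconds; declare the tunnel down after 3 missed replies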

2 Accepted Solutions

Finally, I have something here to share:
OSPF header errors
Length 0, Instance ID 0, Checksum 0, Auth Type 0,
Version 0, Bad Source 0, No Virtual Link 0,
Area Mismatch 0, No Sham Link 0, Self Originated 0,
Duplicate ID 0, Hello 0, MTU Mismatch 0,
Nbr Ignored 0, LLS 0, Unknown Neighbor 8,
Authentication 0, TTL Check Fail 0, Adjacency Throttle 0,
BFD 0, Test discard 0


This is from the show ip ospf traffic output you shared (there is also the Nbr Ignored counter; so far I haven't found documentation on that one).

##Core Issue##

The received error is a transient, or self-correcting, error message. The cause is flapping links, a change in the router ID of the neighboring router, or missed Database (DB) packets. This means that the router received a DB packet from a neighbor that was considered dead, for one of those same reasons.

To find out the cause of the error, issue the log-neighbor-changes command under Open Shortest Path First (OSPF). If the error message occurs on an infrequent basis (every few months), the cause is usually link congestion, or a link that went down.

The CPU utilization increased due to the shortest path first (SPF) algorithm being run again.

Resolution

Although it is unlikely that you will know when you missed a packet, or when your link flaps, the log-neighbor-changes command can help you know when this occurs. Once this is accomplished, you can compare it with the times of the error messages, and figure out the problem.

Configure the log-neighbor-changes command under OSPF. This helps you understand what is taking place between the neighbors.
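For reference, a minimal sketch of enabling that logging (process ID 10 is taken from the show output later in this thread; note that under router ospf on IOS the command is spelled log-adjacency-changes):

router ospf 10
 log-adjacency-changes detail
! log every neighbor state transition, with the reason, so it can be correlated with the error timestamps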

 

If this is occurring every few months, it is probably due to link congestion, or a link that no longer connects. Check the underlying Layer 2 topology. If that does not help, collect data from the technical support, and open a TAC Service Request with the Cisco Technical Assistance Center (TAC).

 

So there are two main causes:
1- Link flapping
This can be checked by running IP SLA and EEM to send a syslog message when the link flaps, and comparing that with the OSPF neighbor status changes (a sketch follows below).
2- Link congestion
Cisco does not recommend setting the tunnel bandwidth arbitrarily; the sum of the tunnel bandwidths should equal the real bandwidth of the tunnel source interface.
@Joseph W. Doherty can help us here to check whether there are packet drops in the queue or not.
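A minimal sketch of the IP SLA + EEM idea (the probe target, tunnel name, and numbers are assumptions, not taken from this network):

ip sla 10
 icmp-echo 192.0.2.1 source-interface Tunnel100
 frequency 5
ip sla schedule 10 life forever start-time now
!
track 10 ip sla 10 reachability
!
event manager applet TUNNEL-FLAP
 event track 10 state any
 action 1.0 syslog msg "Tunnel100 peer reachability changed - compare with OSPF neighbor change logs"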

thanks 

MHM 

 


Joseph W. Doherty
Hall of Fame

Rereading your OP, what you're asking for is a way to change OSPF timers, across HQ and spokes, without any service interruption. If that's correct, it may actually be possible.

You've provided a possible way to accomplish that, which would be to use (floating) static routes to keep routing working across the DMVPN tunnels while you reset the OSPF interfaces' hello settings.
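As a sketch of that idea (the prefix, next hop, and administrative distance here are hypothetical, not from your network):

ip route 10.0.0.0 255.0.0.0 Tunnel100 192.0.2.1 250
! floating static toward the hub's tunnel address, AD 250, so it only carries traffic while the OSPF adjacency is being reset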

Depending on your topology, doing this might be a manual nightmare.

As an alternative, if you're willing to accept a service interruption of seconds (?): assuming your routers are using NTP to synchronize their clocks, and also assuming all the routers support EEM, I believe it may be possible to schedule an EEM script to run on all the routers at the same time to reconfigure the OSPF interface hello settings. The scheduled time could be when network activity is minimal, or during a scheduled network maintenance window.
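A minimal sketch of such an applet (the cron schedule is made up; the tunnel name comes from elsewhere in this thread, the timer values match the suggestion in the next paragraph, and the same applet would be scheduled on every router):

event manager applet OSPF-TIMER-CHANGE
 event timer cron cron-entry "0 2 * * 6"
 action 1.0 cli command "enable"
 action 2.0 cli command "configure terminal"
 action 3.0 cli command "interface Tunnel100"
 action 4.0 cli command "ip ospf hello-interval 2"
 action 5.0 cli command "ip ospf dead-interval 8"
 action 6.0 cli command "end"
 action 7.0 cli command "write memory"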

Since you mention an outage SLA of 8 seconds, and since that only seems to be "known" by a legitimate OSPF peer drop, I suggest you consider a hello interval of 1 or 2 seconds and a dead timer of 8 seconds. (That timer suggestion is also based on my understanding that you have no need for a quicker OSPF neighbor drop, or quicker recovery.)

If you actually want the tunnels to go down too, or rely on them going down to also take down OSPF, then, as I referenced in an earlier reply, that might be done with IPsec Dead Peer Detection, if supported. I don't recall if I've ever used that feature, and it may have a similar issue as changing the OSPF timers, i.e. a service interruption, and a question of how long that interruption lasts. Again, scheduled EEM scripts might be the best way to minimize it.
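For reference, a minimal sketch of periodic DPD on an IKEv1/ISAKMP-protected tunnel (the values are placeholders; IKEv2 uses a different syntax):

crypto isakmp keepalive 10 3 periodic
! send a DPD probe every 10 seconds (retrying every 3 seconds) rather than only on demand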

Since both OSPF hellos and IPsec Dead Peer Detection messages rely on being received to validate end-to-end connectivity, inadvertently losing such packets will have, as you've correctly described, a needless and nasty impact.

Unfortunately, when running across another network, about the best you can do is ensure you stay within your CIR-like bandwidth allowances, and also ensure critical traffic, like OSPF hellos or IPsec Dead Peer Detection messages, is neither dropped nor delayed being sent into the transit network.

Oh, BTW, one painful aspect of DMVPN Phases 2 and 3, which allow spoke-to-spoke traffic, is that multiple points can send to the same point at once, causing congestion coming out of the transit network. (It's also a possible issue with DMVPN Phase 1, but there you can, at the hub, shape for each spoke, and at each spoke, shape so that the spokes' aggregate will not overrun the hub.)

Lastly, another hint: service providers often provide "wire"-equivalent bandwidth, but many Cisco shapers (or policers) don't allow for non-L3 overhead. I've found shaping about 15% slower than CIR often stays under the CIR (but not always). Also, some of the latest shaper implementations allow you to assign a fixed amount of overhead to each packet for shaping bandwidth consumption.
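As a sketch (the policy names are made up, and the 40 Mbps CIR is only borrowed from the figures mentioned later in this thread):

policy-map SHAPE-TO-CLOUD
 class class-default
  shape average 34000000
  service-policy CHILD-QOS
! roughly 15% under a 40 Mbps CIR; on platforms that support overhead accounting,
! something like "shape average 40000000 account user-defined 24" charges a fixed
! per-packet overhead instead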


23 Replies

You are mixing many topics in one post.

1- DMVPN uses NHRP to check the health of the tunnel:

if-state nhrp

uses NHRP messages to detect the up/down state of the tunnel.

Keepalives are not supported on multipoint GRE (DMVPN). (See the sketch after this list.)

2- Many cases of DMVPN flapping I have seen are due to traffic congestion in a queue, so control traffic such as OSPF hellos gets dropped.

So, running QoS and giving priority to control traffic will certainly solve some flapping issues (a sketch follows below).
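Two minimal sketches of those points (the tunnel name Tunnel100 is taken from the show output later in the thread; the class and policy names are made up, and matching DSCP CS6 is just one common way to catch routing-protocol traffic):

interface Tunnel100
 if-state nhrp
! tie the tunnel's line-protocol state to NHRP registration, since GRE keepalives are unavailable here
!
class-map match-any CONTROL-TRAFFIC
 match dscp cs6
!
policy-map CHILD-QOS
 class CONTROL-TRAFFIC
  bandwidth percent 5
! reserve a slice of the shaped tunnel bandwidth for OSPF/NHRP control packets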

MHM


@MHM Cisco World wrote:

2-many case I see about dmvpn flapping is due to traffic is congestion in queue and hence control traffic like opsf hello is drop 

So sure run QoS and give priority to control traffic will solve some flapping issue


Yup, protecting OSPF traffic, using QoS, from congestion issues can go a long way in precluding flaps.

Unfortunately, it's tricky enough to setup optimal QoS on original DMVPN (Phase 1), and a QoS nightmare in the later Phases (2 and 3) of DMVPN, because they support spoke-to-spoke traffic.

(What makes DMVPN QoS tricky: you can have congestion going into the "cloud", which isn't usually much of a QoS configuration issue, but you can also have congestion exiting the "cloud". For the latter, you don't usually have any direct "cloud" egress QoS management, and since the "cloud" egress traffic can be coming from multiple "cloud" ingress points, there's no easy way to coordinate multiple "cloud" ingress points sending, concurrently, to the same "cloud" egress.)

Update!

Working on another posting about DMVPN and QoS, I've "discovered" that DMVPN Phases 2 and 3 may now be able to use spoke-to-spoke QoS (using Adaptive QoS, a DMVPN feature I was aware of, but didn't know could be used for spoke-to-spoke).

See the "Per-Tunnel QoS for Spoke to Spoke Connections" section, immediately above where the link takes you.

Adaptive QoS monitors the available bandwidth on the remote side of a tunnel and shapes the near side's outbound traffic to that available bandwidth. This, in theory, allows prioritization of traffic.

I've not used Adaptive QoS, but two issues I can easily see arising in a multipoint topology: the rate adjustment might not react quickly enough for very sensitive traffic, and/or there may be insufficient bandwidth to support your QoS prioritization needs.

For the latter, I don't see it being sufficiently deterministic to guarantee priority traffic the bandwidth it may need.  That's not saying it won't work, just not the guarantee I believe should be provided by QoS.

Hi MHM

I am surprised to hear you say routing control traffic like OSPF is being dropped. I was always under the impression that Cisco routers have internal protection for control traffic. I never see the tunnel drop, but I do see OSPF neighbors drop due to dead timer expiration.

We do have QoS on our DMVPN tunnels.

At the Hub:

policy-map XYZ
 class class-default
  Average Rate Traffic Shaping
  cir 40000000 (40 Mb/s)
  service-policy child_xyz

(the child policy covers voice and signaling, which isn't a problem)

At the Data Center:

policy-map XYZ
 class class-default
  Average Rate Traffic Shaping
  cir 1000000000 (1 Gb/s)
  service-policy XYZ

 

 

 


@WMA Hell wrote:

I am surprised to hear you say routing control traffic like OSPF is being dropped. I was always under the influence that Cisco routers have an internal protection of control traffic. I never see the tunnel drop but I see OSPF neighbors drop due to dead timer expiration. 

I presume you have pak_priority in mind. If so, reading the reference I just provided, it's supposed to guarantee such packets won't be dropped from the egress queue, but it doesn't guarantee they won't be delayed. (Further, if you're crossing a provider's "cloud", pak_priority doesn't exist there, and such a packet might be dropped on an internal "cloud" interface, especially if you oversubscribe the provider's CIR.)

Further, I could not find any mention of pak_priority handling when you're dealing with tunnels.

I also found this old document Configure a Queuing Strategy for Routing Packets, which, at least on that (old) platform, does mention control (pak_priority marked?) packets might be dropped.

I presume, but cannot say for a fact, a properly tailored explicit QoS policy, giving special treatment to OSPF hello packets, might be a better approach vs. relying on pak_priority.
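A minimal sketch of what I mean by an explicit policy (names are made up; it differs from the earlier DSCP-based sketch by classifying OSPF itself):

ip access-list extended MATCH-OSPF
 permit ospf any any
!
class-map match-all OSPF-PACKETS
 match access-group name MATCH-OSPF
!
policy-map CHILD-QOS
 class OSPF-PACKETS
  bandwidth percent 5
! give OSPF (including hellos) its own guaranteed queue rather than relying on pak_priority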

(Personally, I suspect pak_priority was a Cisco approach to ensure selected control packets got their own special treatment without any manual configuration requirement - especially useful with a single, default, egress queue. I.e. one of Cisco's little tweaks to increase network stability without violating any RFC, I believe. Also the kind of thing which is why Cisco devices often just seemed to work better than Brand X.)

Thanks for the response. I just looked at my Netflow Sources in Solarwinds as well as the CBQoS Policy Details and QoS thresholds haven't been reached in months.

 

So, the daily OSPF hellos are being dropped some unknown place.  I think increasing the timers will fix the issue but I never did that on a DMVPN tunnel. Above someone said NHRP checks the state of the tunnel but why are hellos and intervals configured on the tunnel?

 

So, what is the plan to increase them on the hub and spokes?


@WMA Hell wrote:

Thanks for the response. I just looked at my Netflow Sources in Solarwinds as well as the CBQoS Policy Details and QoS thresholds haven't been reached in months.

 

So, the daily OSPF hellos are being dropped some unknown place.  I think increasing the timers will fix the issue but I never did that on a DMVPN tunnel. Above someone said NHRP checks the state of the tunnel but why are hellos and intervals configured on the tunnel?

 

So, what is the plan to increase them on the hub and spokes?


Well, not knowing what your actual QoS policy is, or much about the underlying infrastructure, I cannot say what value, if any, your Solarwinds stats have.

Yup, it's certainly possible you're losing OSPF hellos across/within your tunnel as it transits whatever it's running across.

I cannot comment much on increasing hello times, since my OSPF hello timer changes have usually been to get the neighbor to drop ASAP with reduced times, ideally sub-second. (NB: this is because I usually supported networks with redundant paths and "critical" traffic, like VoIP. I didn't want such traffic being black-holed if there was an alternative good path.)

Well, as to why there are keepalive timers for both a tunnel and a routing protocol: because neither really knows what other parts of the network might be doing. Personally, when dealing with both, we normally counted on the routing protocol being the important element for whether a path is up or not, but lots of basic monitoring software will alarm on an interface "down" and not on a lost routing-protocol neighbor. So, we tried to ensure both (routing and tunnel) correctly reflected the down path, but, again, for different reasons. (For routing, we wanted to switch to an alternative path, ideally quickly enough that no network apps dropped their connectivity; for the tunnel itself, we just wanted it to alarm our SNMP monitoring, with no urgency on that being super fast.)

So, if you don't have an alternative path, and traffic is going to be black-holed anyway, you should be able to "safely" increase the timers and/or the lost-hello count. If the tunnel drops first, it should notify routing anyway.

That said, doing the foregoing might come back to bite you if you do put in an alternate path, when you then do want to minimize the time traffic is directed into a black hole. So, I would suggest trying to find exactly why OSPF is dropping the neighbor.

Possibly, much of what should be done to use DMVPN well isn't being done. And/or your "cloud" has its own issues, of which your "cloud" provider is unaware. (NB: if there are "cloud" issues, be prepared to prove to them they have a problem. [My all-time favorite story: I had one "cloud" connection which I didn't think was working quite right, but it was hard to prove, as the delta between how I expected it to work and how it was actually working was very small. The provider believed I just didn't understand the impact of things like microbursting and shaping parameters. I thought otherwise, and complained about the one link for months. So much so, the provider actually complained to my management to get me to stop complaining. My management said, if Joe thinks there's an issue, there may be. Well, again after months, the provider found the problem. It was caused by a bug in a port's firmware, which the hardware vendor had fixed some time back, but the updated firmware hadn't been applied to the provider's particular hardware. It turned out that when consistent, ongoing port utilization started to hit 100%, the port would, incorrectly, drop about 1% of transit traffic. The provider updated the firmware, and I then saw the throughput I expected.])

Thank you for following up. 

Nobody in the US needs sub second awareness of their path even with RTP like VOICE and VIDEO. OSPF doesn't provide sub second but BFD does. When the tunnel goes down OSPF goes down immediately anyway. And when the tunnel comes back up OSPF converges immediately, particularly on small networks with plenty of processor power. 

 

Now, the OSPF neighbor went down due to Dead Timer Expired on July 9 at 11:24 EST. The tunnel did not go down. I don't see a keepalive configured on the tunnel, so the tunnel may have gone down but the router may not be recording its true state without a keepalive. I don't know how that works with NHRP. I don't have a lab. I just started here, and the people with the institutional knowledge are busy.

 

The QoS is just shaping outbound on the tunnel with a default-class CIR of 40 Mb/s, with RTP given guaranteed bandwidth in the event the interface is saturated.

 

I don't see the interfaces hitting 100% in any view so QoS isn't kicking in. 

 

The CBQoS Netflow Interface Details view in Solarwinds shows the ingress and egress variables for the last 24 hours. On the head end we have a default-class CIR of 40 Mb/s, and the far-end destination having issues here has a default-class CIR of 1 Gb/s with a threshold of 70%.

The Tunnel Transmit (TX) throughput at the headend for the last 12 months averaged TX 26Mb/s with a spike up to 295Mb/s on April 26, 2024.

The Tunnel Transmit (TX) throughput at the farend for the last 12 months averaged TX 58Mb/s with a spike to 112Mb/s on Jan 15. 

 

No input/output errors on either tunnel.    

 

The tunnels are not transmitting enough to saturate the tunnel and engage QoS, so this isn't a QoS problem.

 

show ip route shows the distant IP address of the tunnel is learned by OSPF via NHRP, not static routes, so changing the timers on the distant end will drop the tunnel, taking it offline.

 

The OSPF timers need to be changed back to the defaults, at least, to see whether the hellos are still dropped, but I can't think of a slick way to do this. I am asking management for the ISP SLA to see how much delay is allowed in the contract before reimbursement happens. I think it may be around 8 s, which is exactly our dead timer interval.

 

I read an old Cisco forum thread that said Keepalives aren't supported with ISAKMP configured GRE tunnels. IPSEC sends its own link state messages. 

 

Any ideas?


@WMA Hell wrote:

Thank you for following up. 

Nobody in the US needs sub second awareness of their path even with RTP like VOICE and VIDEO. OSPF doesn't provide sub second but BFD does. When the tunnel goes down OSPF goes down immediately anyway. And when the tunnel comes back up OSPF converges immediately, particularly on small networks with plenty of processor power. 

 

Now the OSPF neighbor went down due to Dead Timer Expired on July 9 at 11:24 EST. The Tunnel did not go down. I don't see Keepalive configured on the tunnel so the tunnel may of went down but the router may not be recording its true state w/o the Keepalive. I don't know how that works with NHRP. I don't have a lab. I just started and the institutional knowledge are busy. 

 

The QoS is just shaping out the tunnel with a default class CIR of 40Mb/s with RTP with guaranteed bandwidth in the event the interface is saturated. 

 

I don't see the interfaces hitting 100% in any view so QoS isn't kicking in. 

 

The CBQoS Netflow Interface Details view in Solarwinds shows the ingress and egress variables for the last 24 hours. On the head end end we have the default class Cir for 40Mb/s and the end destination having issues here has a default class CIR of 1 Gb/s with a threshold of 70%. 

The Tunnel Transmit (TX) throughput at the headend for the last 12 months averaged TX 26Mb/s with a spike up to 295Mb/s on April 26, 2024.

The Tunnel Transmit (TX) throughput at the farend for the last 12 months averaged TX 58Mb/s with a spike to 112Mb/s on Jan 15. 

 

No input/output errors on either tunnel.    

 

Tunnels are not Tx enough to saturate the Tunnel and initiate QoS so this isn't' a QoS problem. 

 

Sh ip route shows the distant IP address of the Tunnel to be found by OSPF via NHRP not static routes so changing the timers on the distant end will drop the tunnel taking it offline. 

 

The OSPF timers need to be changed back to the defaults, at least, to see if the hellos aren't dropped but I can' think of a slick way to do this. I am asking mgmt for the ISP SLA to see what allowed ms delay is allowed in the contract before reimbursement happens. I think it may be around 8s which is exactly our dead timer interval. 

 

I read an old Cisco forum thread that said Keepalives aren't supported with ISAKMP configured GRE tunnels. IPSEC sends its own link state messages. 

 

Any ideas?


Hmm, much to comment on . . .

"Nobody in the US needs sub second awareness of their path even with RTP like VOICE and VIDEO."

Nobody, really?  My user base not only didn't want a VoIP call to drop, but ideally they did not want to lose even one syllable during a VoIP call, or for video conferencing to have screen freezes, pixelation, etc.  Personally, I consider SONET to have, more or less, set the "gold" standard of detecting an outage within 50 ms.

"OSPF doesn't provide sub second but BFD does."

Depends whether OSPF supports the Fast Hello feature.  The reference begins with: "The OSPF Support for Fast Hello Packets feature provides a way to configure the sending of hello packets in intervals less than 1 second. Such a configuration results in faster convergence in an Open Shortest Path First (OSPF) network."

Of course, that's immediately followed by: "Note: It is recommended to use Bidirectional Forwarding Detection (BFD) instead of Fast Hello Packets."


Which I agree with, if BFD is supported in OSPF.

(BTW, I recall [?] that OSPF Fast Hellos were initially supported either earlier, or on more platforms, than BFD.  I've used both.)
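For readers, minimal sketches of the two forms (the interface and values are placeholders, not a recommendation for this DMVPN):

interface Tunnel100
 ip ospf dead-interval minimal hello-multiplier 4
! Fast Hellos: dead interval fixed at 1 second, 4 hellos sent per second

interface Tunnel100
 bfd interval 250 min_rx 250 multiplier 3
 ip ospf bfd
! BFD: 250 ms control packets, neighbor declared down after 3 misses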

"And when the tunnel comes back up OSPF converges immediately, particularly on small networks with plenty of processor power."

Well, I guess much depends on how you define "immediate" and "small".  (BTW, the closest I know of OSPF doing "immediate" is when a router has multiple equal-cost egress paths to the destination, unless you're also thinking of LFA [the background of that reference argues the need for fastest convergence for modern apps] and/or FRR [Fast Reroute].)  Much also depends on whether you've "tuned" OSPF for faster convergence (a nifty way to change OSPF options without getting down into the weeds), and what OSPF options you've enabled (if available, or whether they are now the default in later Cisco OSPF implementations, like Cisco's iSPF).

Remember, a drop/break in an OSPF path is just the trigger to begin convergence, and some of Cisco's proprietary OSPF "enhancements", at their defaults, can very much slow such convergence (they were designed to improve OSPF stability; OSPF meltdowns are not pretty, which I've seen on non-Cisco OSPF devices).  (Oh, also keep in mind, some of Cisco's OSPF stability enhancements run on timers, so a more powerful router doesn't help.)

"I don't see the interfaces hitting 100% in any view so QoS isn't kicking in."

If I only had a dollar for every time I've heard that.  (Laugh, actually I worked many years as a contracted network engineer, so I actually did better than getting a dollar to improve network performance issues, at least as seen by users, but which was "invisible" to network engineering staff based on typical network monitoring.)

Insufficient information to say whether QoS is, or isn't kicking in, for your setup, but very often you cannot tell by just examination of typical network monitoring stats.

Again, though, you have the issue of OSPF dropping its neighbor, across a tunnel.  Lots and lots of things can cause that, including things on the underlying transport you have no direct control over.  But, very likely, if everything was working as it should, or as expected, you shouldn't be seeing this as a problem, unless it's something as simple as the underlying path is dropping all your traffic for the hello dead interval.

Assuming the latter, then is your concern you want the tunnel to go down at the same time as OSPF "sees" the break?  This so you can get your SLA failure adjustment?

I haven't reread all the replies, but has IPsec Dead Peer Detection Periodic Message Option  been considered?

Wow, can you provide authoritative documentation proving your claim that QoS kicks in when the interface isn't 100% saturated? I can see where a default CIR has set shaping at a percentage, like we have here. I assume that is what you are mocking me for. When I look at the output of show policy-map multipoint Tunnel X, I see some tunnels have some total drops (queue limit) in class-default and ZERO bandwidth-exceeded drops on the daughter policies for RTP.

You don't need any of those timers so tight because you get a false negative when OSPF goes down because of said timers. Also, if the link flaps OSPF converges at the same time then reconverges when a hello is received by the neighbor who sends it immediately when the link comes up. 

You don't need tight timers to support voice, as the timers will unnecessarily flap your network when the tunnel or transport didn't actually flap.

The arrogance you exude is unnecessary. Especially when you think you're right and you're not.

"Wow, can you provide authoritative documentation proving your claim that QoS kicks in when the interface isn't 100% saturated?"

That's not what I wrote.  Which was:

 

"I don't see the interfaces hitting 100% in any view so QoS isn't kicking in."

If I only had a dollar for every time I've heard that.  (Laugh, actually I worked many years as a contracted network engineer, so I actually did better than getting a dollar to improve network performance issues, at least as seen by users, but which was "invisible" to network engineering staff based on typical network monitoring.)

Insufficient information to say whether QoS is, or isn't kicking in, for your setup, but very often you cannot tell by just examination of typical network monitoring stats.

 

Hmm, I assume you feel I'm personally mocking you because of the way I described how, so many times, network engineering staff infer the wrong thing from "typical network monitoring stats".  Well, mocking you actually wasn't my intent, and still isn't, but such has been my experience with misinterpretation of typical network monitoring stats.

I'm going to further explain, which I really don't know will benefit you if you've closed your mind to learning, but it may benefit other readers.

Anyway, the key point about a load percentage is that it's a capacity measurement over some time period.

Consider a "typical" measurement period of 5 minutes.  If a port continuously transmitted, i.e. frames back-to-back, for 5 minutes, and we measured bandwidth usage for the same 5 minutes, this would be reported at 100% utilization.

If, though, during a 10 minute period there was only the same 5 minutes of traffic, load would be reported, for the 10 minutes, as 50%.  Or, if we had two 5 minute blocks, back to back, and the 5 minutes of transmission consumed the last 2 minutes of the first 5 minute measurement period and the first 3 minutes of the second, the respective load stats would be 40% and 60%.

Or, going back to a 10 minutes measurement period, if the actual transmission of data was continuous for 1 minute, stopped for the next minute, and repeated, the overall 10 minute load would be, again, reported as 50%.

Hopefully, there's no confusion about the foregoing; it's just laying the groundwork as we get into more complicated measurements.

Now, next consider we have gig ingress and FE egress.  Ingress receives back-to-back frames for 6 seconds, then stops.  Since gig is 10x the "speed" of FE, what happens?  Well, if there's no queuing, the FE will also transmit for 6 seconds and drop 90% of the traffic.  (BTW, the load percentage for the FE will only reflect the 6 seconds of usage.)

The foregoing is far from ideal, so we add a queue to the FE interface, let's say enough to queue 6 seconds of gig traffic.  It will take the FE interface a full minute to transmit all the received traffic.  If that's all the traffic that's transmitted during a 5 minute measurement period, the load stat will show 20%.  Remember, though, we had to queue traffic for 54 seconds.

What QoS operates on is queued traffic (excluding policers and shapers).  So, in the foregoing, QoS can kick in, as we've queued traffic, but a 5 minute load stat only shows 20% utilization!

Certainly traffic was queued while the interface was running at its maximum rate, but my comment was directed at the assumption that QoS is not being engaged just because a load stat doesn't show 100%, which isn't always correct.  In my experience, I've seen this interpretation over and over and over again.

BTW, the converse common interpretation that a continuous load of 100%, or some other "high" value is always bad, is incorrect too, but I'm not going to discuss that.

Possibly, a more important factoid is, even when traffic is queued, and you have a QoS policy defined, not all the traffic is QoS processed!

Early on in my wrestling with effective QoS, I wasn't seeing the results I expected.  Well, I stumbled across this TechNote, which explains how QoS policies apply to traffic that overflows the interface hardware FIFO queue.  Although that TechNote is relative to ATM, it's not just an ATM interface consideration.  Further, some later documentation notes how some later IOS versions started to reduce the tx-ring-limit if a QoS interface policy was being used.
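As a sketch of the kind of tuning that TechNote discusses (the interface and value are hypothetical; the command is only available on certain platforms and interface types, and on ATM it is set per-PVC rather than per-interface):

interface Serial0/0/0
 tx-ring-limit 3
! shrink the hardware FIFO so packets back up into the software queues, where the service-policy can actually act on them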

Unfortunately, much QoS material often glosses over considerations like the above.  For example, you already knew of the impact of tx-ring-limit size on a QoS policy, right?

Or how RED/WRED is the next best thing to sliced bread, right?  Heck, it's shown in most Cisco QoS examples, and I recall (?) AutoQoS incorporates it too.  (Something else I wrestled with early on.  It can be useful, but I suggest only QoS experts use it.  Getting it to work well is considerably more complicated than Cisco's QoS documentation makes out.)

"You don't need any of those timers so tight because you get a false negative when OSPF goes down because of said timers."

Hasn't been a problem for me, but minimizing black holes has been an important design consideration.

For example, using OSPF and this topology: Rtr<>Sw<>Sw<>Rtr, how do you detect if the inter-switch link goes down?  OSPF hellos will detect it, but, assuming you do have an alternative path, is a 40 second outage okay?  What do you consider an acceptable outage time?

For you, it's not my call; do whatever you, or they, want.  For me, however, the users I've worked with didn't want a 40 second outage with ordinary apps, let alone VoIP or video conferencing.

Is sub-second needed?  Again, it depends on your service needs.  It was you who stated nobody in the US needs sub-second.  Well, again, my user base felt otherwise.  I never stated everyone should have sub-second.  Actually, my users didn't state they wanted sub-second; what they wanted was hitless (such that they didn't know there was any network issue at all).

Of course, if I had something like Rtr<>Rtr and could count on hardware downing the link, we didn't much care about OSPF hello intervals.  But, every now and then, we couldn't count on hardware, like the Rtr<>Sw<...>Sw<>Rtr kind of topology.

(As a side note, I had a case of Rtr SX<fiber>LX Rtr.  I thought, what!?  It turned out, under the covers, my fiber was actually connected to an optical network.  When it broke, my ports still stayed up.  Initially I thought I would need fast[er] OSPF hellos, but then found the optical network could, if there was a break, take down the end connections.  [Just had to ask, or remind, the optical team to enable that feature.])

I fully agree, you don't want OSPF taking down a neighbor needlessly!  Again, though, I never had that as an issue, even using sub-second OSPF hellos.  (If you ensure OSPF hellos are prioritized, and not dropped, then it should work.  However, you do have to allow for the underlying transport.  I don't see this as an issue on the LAN, or on private WAN clouds that meet their promised SLAs; the Internet, though, is a different matter.  [Oh, and yes, we've been discussing your DMVPN tunnel, but I don't recall recommending you use sub-second hellos.])

"The arrogance you exude is unnecessary. Especially when you think you right and not."

Yea, I've been accused of that before.  Although arrogance usually also implies it's based on an unfounded sense of superiority, which would be true if I were as wrong as you believe.  However, I don't believe I'm wrong; instead, perhaps you can gracefully point out my errors.

I believe I've answered your question showing how QoS can engage when typical network monitoring does not show 100% load.

Again, I fully agree with you that we don't want OSPF hello timers incorrectly dropping a neighbor.  Where we differ is that you believe it cannot be done with sub-second hellos, and I've stated I've done it without an issue (unless you believe I'm a liar too).  Anyway, with OSPF Fast Hellos a router can struggle, so I don't recall ever doing it on more than one interface, and I encountered issues if the timer went below 250 to 300 ms.  Using BFD, it works well across multiple interfaces, but I recall you don't want to go below 100-150 ms.

Again, where possible, I much prefer to rely on hardware detection or hardware redundancy (e.g. EtherChannel).

Mr @Joseph W. Doherty can help you with QoS; I am not so good with QoS.

To check whether OSPF hellos are being dropped by a queue:

Do debug ip ospf hello.

If one side sends a hello and the other side does not receive it, while the tunnel is stable at UP/UP,

then check for drops on the tunnel interface and on the tunnel source interface.

This gives us a hint that OSPF hellos are being dropped due to the queue being full.

MHM

show ip ospf traffic <- please share this for both routers

MHM

Hub#sh ip ospf traffic

OSPF statistics:
Last clearing of OSPF traffic counters never
Rcvd: 26835397 total, 0 checksum errors
7808686 hello, 17 database desc, 2 link state req
15929234 link state updates, 3097450 link state acks
Sent: 19044254 total
7158915 hello, 23 database desc, 3 link state req
6922762 link state updates, 1709045 link state acks

 

OSPF Router with ID (x.x.x.x) (Process ID 10)

OSPF queue statistics for process ID 10:

InputQ UpdateQ OutputQ
Limit 0 200 0
Drops 0 0 0
Max delay [msec] 50 49 2
Max size 5 5 3
Invalid 0 0 0
Hello 1 0 0
DB des 0 0 0
LS req 0 0 0
LS upd 4 5 0
LS ack 0 0 3
Current size 0 0 0
Invalid 0 0 0
Hello 0 0 0
DB des 0 0 0
LS req 0 0 0
LS upd 0 0 0
LS ack 0 0 0


Interface statistics:


Interface GigabitEthernet0/0/0.17

Last clearing of interface traffic counters never

OSPF packets received/sent
Type Packets Bytes
RX Invalid 0 0
RX Hello 1301775 67692292
RX DB des 4 268
RX LS req 1 36
RX LS upd 9323565 4958067512
RX LS ack 231083 19323672
RX Total 10856428 5045083780

TX Failed 0 0
TX Hello 650922 54677412
TX DB des 6 544
TX LS req 1 116
TX LS upd 317012 62347660
TX LS ack 1453806 125699524
TX Total 2421747 242725256

OSPF header errors
Length 0, Instance ID 0, Checksum 0, Auth Type 0,
Version 0, Bad Source 0, No Virtual Link 0,
Area Mismatch 0, No Sham Link 0, Self Originated 0,
Duplicate ID 0, Hello 0, MTU Mismatch 0,
Nbr Ignored 0, LLS 0, Unknown Neighbor 0,
Authentication 0, TTL Check Fail 0, Adjacency Throttle 0,
BFD 0, Test discard 0

OSPF LSA errors
Type 0, Length 0, Data 0, Checksum 0

 

Interface GigabitEthernet0/0/0.13

Last clearing of interface traffic counters never

OSPF packets received/sent
Type Packets Bytes
RX Invalid 0 0
RX Hello 0 0
RX DB des 0 0
RX LS req 0 0
RX LS upd 0 0
RX LS ack 0 0
RX Total 0 0

TX Failed 0 0
TX Hello 650865 49465740
TX DB des 0 0
TX LS req 0 0
TX LS upd 0 0
TX LS ack 0 0
TX Total 650865 49465740

OSPF header errors
Length 0, Instance ID 0, Checksum 0, Auth Type 0,
Version 0, Bad Source 0, No Virtual Link 0,
Area Mismatch 0, No Sham Link 0, Self Originated 0,
Duplicate ID 0, Hello 0, MTU Mismatch 0,
Nbr Ignored 0, LLS 0, Unknown Neighbor 0,
Authentication 0, TTL Check Fail 0, Adjacency Throttle 0,
BFD 0, Test discard 0

OSPF LSA errors
Type 0, Length 0, Data 0, Checksum 0

 

Interface GigabitEthernet0/0/0.12

Last clearing of interface traffic counters never

OSPF packets received/sent
Type Packets Bytes
RX Invalid 0 0
RX Hello 0 0
RX DB des 0 0
RX LS req 0 0
RX LS upd 0 0
RX LS ack 0 0
RX Total 0 0

TX Failed 0 0
TX Hello 650881 49466956
TX DB des 0 0
TX LS req 0 0
TX LS upd 0 0
TX LS ack 0 0
TX Total 650881 49466956

OSPF header errors
Length 0, Instance ID 0, Checksum 0, Auth Type 0,
Version 0, Bad Source 0, No Virtual Link 0,
Area Mismatch 0, No Sham Link 0, Self Originated 0,
Duplicate ID 0, Hello 0, MTU Mismatch 0,
Nbr Ignored 0, LLS 0, Unknown Neighbor 0,
Authentication 0, TTL Check Fail 0, Adjacency Throttle 0,
BFD 0, Test discard 0

OSPF LSA errors
Type 0, Length 0, Data 0, Checksum 0

 

Interface GigabitEthernet0/0/0.5

Last clearing of interface traffic counters never

OSPF packets received/sent
Type Packets Bytes
RX Invalid 0 0
RX Hello 0 0
RX DB des 0 0
RX LS req 0 0
RX LS upd 0 0
RX LS ack 0 0
RX Total 0 0

TX Failed 0 0
TX Hello 650887 49467412
TX DB des 0 0
TX LS req 0 0
TX LS upd 0 0
TX LS ack 0 0
TX Total 650887 49467412

OSPF header errors
Length 0, Instance ID 0, Checksum 0, Auth Type 0,
Version 0, Bad Source 0, No Virtual Link 0,
Area Mismatch 0, No Sham Link 0, Self Originated 0,
Duplicate ID 0, Hello 0, MTU Mismatch 0,
Nbr Ignored 0, LLS 0, Unknown Neighbor 0,
Authentication 0, TTL Check Fail 0, Adjacency Throttle 0,
BFD 0, Test discard 0

OSPF LSA errors
Type 0, Length 0, Data 0, Checksum 0

 

Interface GigabitEthernet0/0/0.4

Last clearing of interface traffic counters never

OSPF packets received/sent
Type Packets Bytes
RX Invalid 0 0
RX Hello 0 0
RX DB des 0 0
RX LS req 0 0
RX LS upd 0 0
RX LS ack 0 0
RX Total 0 0

TX Failed 0 0
TX Hello 650913 49469388
TX DB des 0 0
TX LS req 0 0
TX LS upd 0 0
TX LS ack 0 0
TX Total 650913 49469388

OSPF header errors
Length 0, Instance ID 0, Checksum 0, Auth Type 0,
Version 0, Bad Source 0, No Virtual Link 0,
Area Mismatch 0, No Sham Link 0, Self Originated 0,
Duplicate ID 0, Hello 0, MTU Mismatch 0,
Nbr Ignored 0, LLS 0, Unknown Neighbor 0,
Authentication 0, TTL Check Fail 0, Adjacency Throttle 0,
BFD 0, Test discard 0

OSPF LSA errors
Type 0, Length 0, Data 0, Checksum 0

 

Interface GigabitEthernet0/0/0.1

Last clearing of interface traffic counters never

OSPF packets received/sent
Type Packets Bytes
RX Invalid 0 0
RX Hello 0 0
RX DB des 0 0
RX LS req 0 0
RX LS upd 0 0
RX LS ack 0 0
RX Total 0 0

TX Failed 0 0
TX Hello 650935 49471060
TX DB des 0 0
TX LS req 0 0
TX LS upd 0 0
TX LS ack 0 0
TX Total 650935 49471060

OSPF header errors
Length 0, Instance ID 0, Checksum 0, Auth Type 0,
Version 0, Bad Source 0, No Virtual Link 0,
Area Mismatch 0, No Sham Link 0, Self Originated 0,
Duplicate ID 0, Hello 0, MTU Mismatch 0,
Nbr Ignored 0, LLS 0, Unknown Neighbor 0,
Authentication 0, TTL Check Fail 0, Adjacency Throttle 0,
BFD 0, Test discard 0

OSPF LSA errors
Type 0, Length 0, Data 0, Checksum 0

 

Interface Tunnel100

Last clearing of interface traffic counters never

OSPF packets received/sent
Type Packets Bytes
RX Invalid 0 0
RX Hello 6506915 403422604
RX DB des 13 7816
RX LS req 1 108
RX LS upd 6605675 3370050688
RX LS ack 2866367 202774648
RX Total 15978971 3976255864

TX Failed 0 0
TX Hello 3253513 273294852
TX DB des 17 8148
TX LS req 2 2152
TX LS upd 6605754 3502371724
TX LS ack 255239 33968076
TX Total 10114525 3809644952

OSPF header errors
Length 0, Instance ID 0, Checksum 0, Auth Type 0,
Version 0, Bad Source 0, No Virtual Link 0,
Area Mismatch 0, No Sham Link 0, Self Originated 0,
Duplicate ID 0, Hello 0, MTU Mismatch 0,
Nbr Ignored 0, LLS 0, Unknown Neighbor 8,
Authentication 0, TTL Check Fail 0, Adjacency Throttle 0,
BFD 0, Test discard 0

OSPF LSA errors
Type 0, Length 0, Data 0, Checksum 0

 

Summary traffic statistics for process ID 10:

OSPF packets received/sent

Type Packets Bytes
RX Invalid 0 0
RX Hello 7808690 471114896
RX DB des 17 8084
RX LS req 2 144
RX LS upd 15929240 8328118200
RX LS ack 3097450 222098320
RX Total 26835399 9021339644

TX Failed 0 0
TX Hello 7158916 575312820
TX DB des 23 8692
TX LS req 3 2268
TX LS upd 6922766 3564719384
TX LS ack 1709045 159667600
TX Total 15790753 4299710764

OSPF header errors
Length 0, Instance ID 0, Checksum 0, Auth Type 0,
Version 0, Bad Source 0, No Virtual Link 0,
Area Mismatch 0, No Sham Link 0, Self Originated 0,
Duplicate ID 0, Hello 0, MTU Mismatch 0,
Nbr Ignored 0, LLS 0, Unknown Neighbor 8,
Authentication 0, TTL Check Fail 0, Adjacency Throttle 0,
BFD 0, Test discard 0

OSPF LSA errors
Type 0, Length 0, Data 0, Checksum 0

Hub#

-------------------------------------------------------------------------------

Branch#

OSPF statistics:

Last clearing of OSPF traffic counters never
Rcvd: 983705437 total, 0 checksum errors
381354870 hello, 1560535 database desc, 613 link state req
412480231 link state updates, 188308792 link state acks
Sent: 1030766157 total
56654707 hello, 1698939 database desc, 99 link state req
482301741 link state updates, 165130644 link state acks

 

OSPF Router with ID (x.x.x.x) (Process ID 10)

OSPF queue statistics for process ID 10:

InputQ UpdateQ OutputQ
Limit 0 200 0
Drops 0 0 0
Max delay [msec] 54 88 7
Max size 26 24 15
Invalid 0 0 0
Hello 0 0 0
DB des 0 0 0
LS req 0 0 0
LS upd 13 24 15
LS ack 13 0 0
Current size 0 0 0
Invalid 0 0 0
Hello 0 0 0
DB des 0 0 0
LS req 0 0 0
LS upd 0 0 0
LS ack 0 0 0


Interface statistics:


Interface GigabitEthernet0/0/2

Last clearing of interface traffic counters never

OSPF packets received/sent
Type Packets Bytes
RX Invalid 0 0
RX Hello 0 0
RX DB des 0 0
RX LS req 0 0
RX LS upd 0 0
RX LS ack 0 0
RX Total 0 0

TX Failed 0 0
TX Hello 7071390 537425640
TX DB des 0 0
TX LS req 0 0
TX LS upd 0 0
TX LS ack 0 0
TX Total 7071390 537425640

OSPF header errors
Length 0, Instance ID 0, Checksum 0, Auth Type 0,
Version 0, Bad Source 0, No Virtual Link 0,
Area Mismatch 0, No Sham Link 0, Self Originated 0,
Duplicate ID 0, Hello 0, MTU Mismatch 0,
Nbr Ignored 0, LLS 0, Unknown Neighbor 0,
Authentication 0, TTL Check Fail 0, Adjacency Throttle 0,
BFD 0, Test discard 0

OSPF LSA errors
Type 0, Length 0, Data 0, Checksum 0

 

Interface GigabitEthernet0/0/1

Last clearing of interface traffic counters never

OSPF packets received/sent
Type Packets Bytes
RX Invalid 0 0
RX Hello 0 0
RX DB des 0 0
RX LS req 0 0
RX LS upd 0 0
RX LS ack 0 0
RX Total 0 0

TX Failed 0 0
TX Hello 7071215 537412340
TX DB des 0 0
TX LS req 0 0
TX LS upd 0 0
TX LS ack 0 0
TX Total 7071215 537412340

OSPF header errors
Length 0, Instance ID 0, Checksum 0, Auth Type 0,
Version 0, Bad Source 0, No Virtual Link 0,
Area Mismatch 0, No Sham Link 0, Self Originated 0,
Duplicate ID 0, Hello 0, MTU Mismatch 0,
Nbr Ignored 0, LLS 0, Unknown Neighbor 0,
Authentication 0, TTL Check Fail 0, Adjacency Throttle 0,
BFD 0, Test discard 0

OSPF LSA errors
Type 0, Length 0, Data 0, Checksum 0

 

Interface GigabitEthernet0/0/0

Last clearing of interface traffic counters never

OSPF packets received/sent
Type Packets Bytes
RX Invalid 0 0
RX Hello 21256468 1187963416
RX DB des 1555581 115699992
RX LS req 458 39696
RX LS upd 19846622 9000339384
RX LS ack 35883504 3132883416
RX Total 78542633 13436925904

TX Failed 0 0
TX Hello 7163664 630401524
TX DB des 1695163 1855575032
TX LS req 22 9884
TX LS upd 42982528 19396250784
TX LS ack 2677642 246016208
TX Total 54519019 22128253432

OSPF header errors
Length 0, Instance ID 0, Checksum 0, Auth Type 0,
Version 0, Bad Source 0, No Virtual Link 0,
Area Mismatch 0, No Sham Link 0, Self Originated 0,
Duplicate ID 0, Hello 0, MTU Mismatch 0,
Nbr Ignored 541, LLS 0, Unknown Neighbor 25,
Authentication 0, TTL Check Fail 0, Adjacency Throttle 0,
BFD 0, Test discard 0

OSPF LSA errors
Type 0, Length 0, Data 0, Checksum 0

 

Interface Tunnel100

Last clearing of interface traffic counters never

OSPF packets received/sent
Type Packets Bytes
RX Invalid 0 0
RX Hello 360098402 19599671908
RX DB des 4954 3264988
RX LS req 155 115776
RX LS upd 392633609 175429193392
RX LS ack 152425288 12371883072
RX Total 905162408 207404129136

TX Failed 0 0
TX Hello 35348438 4127579960
TX DB des 3776 2986004
TX LS req 77 6268
TX LS upd 439319213 203781584304
TX LS ack 162453002 15879423648
TX Total 637124506 223791580184

OSPF header errors
Length 0, Instance ID 0, Checksum 0, Auth Type 0,
Version 0, Bad Source 0, No Virtual Link 0,
Area Mismatch 0, No Sham Link 0, Self Originated 0,
Duplicate ID 0, Hello 0, MTU Mismatch 0,
Nbr Ignored 0, LLS 0, Unknown Neighbor 27,
Authentication 0, TTL Check Fail 0, Adjacency Throttle 0,
BFD 0, Test discard 0

OSPF LSA errors
Type 0, Length 0, Data 0, Checksum 0

 

Summary traffic statistics for process ID 10:

OSPF packets received/sent

Type Packets Bytes
RX Invalid 0 0
RX Hello 381354870 20787635324
RX DB des 1560535 118964980
RX LS req 613 155472
RX LS upd 412480231 184429532776
RX LS ack 188308792 15504766488
RX Total 983705041 220841055040

TX Failed 0 0
TX Hello 56654707 5832819464
TX DB des 1698939 1858561036
TX LS req 99 16152
TX LS upd 482301741 223177835088
TX LS ack 165130644 16125439856
TX Total 705786130 246994671596

OSPF header errors
Length 0, Instance ID 0, Checksum 0, Auth Type 0,
Version 0, Bad Source 0, No Virtual Link 0,
Area Mismatch 0, No Sham Link 0, Self Originated 0,
Duplicate ID 0, Hello 0, MTU Mismatch 0,
Nbr Ignored 541, LLS 0, Unknown Neighbor 52,
Authentication 0, TTL Check Fail 0, Adjacency Throttle 0,
BFD 0, Test discard 0

OSPF LSA errors
Type 0, Length 0, Data 0, Checksum 0

Branch#
