Output Discards on Interface, Without Outgoing Congestion

Verbatim
Level 1

rate.png

This is a C1100 series. The chart (rate.png) shows the issue: there is a spike in output discards that coincides exactly with a spike in input utilization on that interface. The interface is full duplex. There is hardly any outgoing traffic on the interface, yet drops are occurring as though there were congestion.

There is QoS configured on the interface; the drops are happening only in the class default queue:

(queue depth/total drops/no-buffer drops) 0/4359/0

(The above was retrieved immediately after the queue counters for the interface were cleared and the spike occurred.)

policy-map QOS-WAN-H
 class QOS-PRIORITY
  priority percent 20
 class QOS-REAL-TIME
  bandwidth remaining percent 20
  queue-limit 125 packets
 class QOS-SIGNALING
  bandwidth remaining percent 20
  queue-limit 125 packets
 class QOS-TIME-SENSITIVE
  bandwidth remaining percent 10
  queue-limit 125 packets
 class QOS-BULK
  bandwidth remaining percent 1
  queue-limit 125 packets
 class class-default
  bandwidth remaining percent 49
  queue-limit 4000 packets

 

Is it possible that what is occurring is an Out of Resources (OOR) drop? See Figure 4, "Memory Reserved for Priority Packets": https://www.cisco.com/c/en/us/td/docs/ios-xml/ios/qos_mqc/configuration/xe-16-11/qos-mqc-xe-16-11-book/qos-limits-wred.html

7 Replies

Joseph W. Doherty
Hall of Fame

"Is it possible what is occurring is an Out of Resources drop?"

Possibly, but your information shows a zero no-buffer count, and you don't provide any additional information, such as show policy-map interface statistics, relevant syslog messages, or output from the additional show commands noted in the document you referenced.
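For example, output along these lines would help (the interface name here is just a placeholder; substitute your actual WAN interface):

show policy-map interface GigabitEthernet0/0/0 output
show logging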

Personally, I suspect it's another cause.  Perhaps when input bandwidth is high, many small packets, such as TCP ACKs, are being generated for egress, which hits the packet limits for egress queuing (and the small packets wouldn't account for much bandwidth [i.e. one 1500-byte packet vs. twenty-three 64-byte packets]).  If that is happening, changing queue limits to bytes might mitigate the drops.
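A minimal sketch of that change, assuming the platform accepts queue-limit in bytes (the value shown is only illustrative and would need tuning):

policy-map QOS-WAN-H
 class class-default
  no queue-limit 4000 packets
  queue-limit 256000 bytes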

Would there be a nonzero no-buffer count if the shared memory for the queues ran out, as the document described?

"If a normal data packet arrives it is dropped because the OOR threshold has been reached. However, if a priority packet were to arrive it would still be enqueued as there is physical space available."  - Does this mean syslog entries would be recorded and the buffer count would be incremented?

The syslog shows no relevant log entries over that time frame. Would it make sense to turn on debugging? Which ones, if any?

Attempting to run:

show platform hardware qfp active bqs 0 packet-buffer utilization

caused an error, possibly because BQS is not available on this hardware?
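A related command that may be worth trying, if it exists on this platform (it does on other IOS-XE routers with a QFP forwarding plane, but availability on the C1100 is an assumption to verify), is:

show platform hardware qfp active statistics drop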

That's a plausible explanation: many small acknowledgements filling the output queue in response to the high input traffic. 4000 of them, though? Over the space of 5 minutes (the load interval)? A microburst might explain it.

Lots of (good) questions in your reply, which I don't know the answers to without some further research and perhaps access to a like platform to see how it actually works.

However, in answer to your question about filling a 4k queue, that has nothing to do with the load interval.  (BTW, at gig rate, it would only take about 2 ms to receive 4k 64-byte packets.)
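Roughly, the arithmetic behind that figure (ignoring the ~20 bytes of per-frame preamble and inter-frame gap, which would add about 30%):

4000 packets x 64 bytes x 8 bits/byte = 2,048,000 bits
2,048,000 bits / 1,000,000,000 bits/sec ≈ 2 ms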

How long it would take to fill would depend on ingress vs. egress bandwidths.  Also, I thought I read in your referenced document that the 4k setting might not actually set 4k.

As to using debug, often it can be very informative, but it can take a toll on performance.  Possibly a better way would be a wire capture.

This is a 20 Mbps WAN connection.

The default class queue limit was originally set to 1000 packets; it was increased to 2000, and then again to 4000.

Wire capture would probably be better, but we'd have to either recreate identical conditions in order to capture, or else capture for very long periods of time (without having someone constantly monitoring). This is happening on an "every few days" basis.
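One possible way around the long-capture problem, assuming this IOS-XE release supports Embedded Packet Capture with a circular buffer (the capture name, interface, buffer size, and export path below are placeholders), would be something like:

monitor capture WANCAP interface GigabitEthernet0/0/0 both
monitor capture WANCAP match any
monitor capture WANCAP buffer circular size 50
monitor capture WANCAP start

Then, once a drop spike has been seen:

monitor capture WANCAP stop
monitor capture WANCAP export bootflash:wancap.pcap

That way the capture could run unattended and only the most recent traffic would be kept.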

What's the LAN interface bandwidth?

Do you use a hierarchical policy to shape for 20Mbps?  If so, exactly for 20Mbps?

What's the WAN interface's physical bandwidth?

It looks like this router acts as a switch for the LAN; ports gig0/1/2 - gig0/1/7 are configured as access ports, with one currently showing up/up; gig0/1/0 is trunked to an AP and gig0/1/1 goes to a printer. So LAN bandwidth could easily exceed 20 Mbps.

The WAN interface has service-policy output QOS-WAN-M-20MB applied, which is:

policy-map QOS-WAN-M-20MB
 class class-default
  shape average 20000000
  service-policy QOS-WAN-H

 QOS-WAN-H is shown in its entirety in the first post.

"So LAN bandwidth could easily exceed 20m."

Indeed.

Thinking some more about this, it seems unlikely that it's TCP ACKs filling the egress queue, as I cannot see how the downstream data flows could trigger them at an excessive rate.

However, possibly some other traffic, with small packets, rapidly fills and exceeds even a queue depth of 4000, which for minimum-sized packets would only take about 2 ms if ingress is at gig rate.

It can be difficult to totally mitigate drops, as they are part of flow control for TCP and other traffic types.  Also, very deep queues create their own problems, from adding queuing latency, to actually creating more drops (when a major portion of a large send window is dropped), to slowing the overall transfer rate (due to causing a flow to time out waiting on ACKs and/or dropping a flow back to slow start).

What you might consider is using FQ within class-default (this should minimize the adverse effect of one "runaway" flow on the other flows in the class), setting queue depth in bytes (if supported), and setting the queue resource to about half of the BDP.
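A rough sketch of those three suggestions combined, assuming about a 50 ms RTT so that BDP ≈ 20 Mbps × 0.05 s ≈ 125 KB, half of which is roughly 62 KB (the RTT, and therefore the byte value, is an assumption to adjust):

policy-map QOS-WAN-H
 class class-default
  bandwidth remaining percent 49
  fair-queue
  queue-limit 62500 bytes

Note that with fair-queue enabled the queue-limit may be applied per flow queue rather than in aggregate, so verify the behavior on this platform.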

BTW, I believe in the past Cisco's shapers "count" packet bytes, not frame bytes, so if the WAN provider is policing on frame bytes, the Cisco shaper will overrun the provider's policer.  This can be mitigated, if it's happening, by shaping about 15% slower than the nominal CIR to allow for average L2 frame overhead.
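A minimal sketch of that adjustment, reusing your existing parent policy (20,000,000 × 0.85 = 17,000,000; the exact percentage is a judgment call):

policy-map QOS-WAN-M-20MB
 class class-default
  shape average 17000000
  service-policy QOS-WAN-H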
