ASR 1001-X Default Queue Congestion

jhaynes41
Level 1

Hi,
We are running an MPLS-based DMVPN to about 60 sites. We have a mix of bandwidths and are using per-tunnel QoS to match each site's bandwidth. Our 4 Mbps sites are experiencing issues where traffic destined for the default queue takes up most of the bandwidth at the site and locks out other traffic. Priority traffic like our VoIP seems to work just fine, but the other traffic classes get starved. We opened a case with Cisco and they suggested a couple of things. That still hasn't worked, and we have put in a workaround, but we would like to fundamentally address the problem.
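
For context, the per-tunnel QoS policies are attached roughly like this (the tunnel number and NHRP group name below are illustrative rather than our exact config; the 4Mbs policy is the shaper shown further down):

interface Tunnel0
 ip nhrp map group SPOKE-4Mbs service-policy output 4Mbs
!
! and on a 4 Mbps spoke:
interface Tunnel0
 ip nhrp group SPOKE-4Mbs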

 

Here was the original config:

 

policy-map WAN-EDGE-4Mbs
 class VOICE
  priority percent 23
 class INTERACTIVE-VIDEO
  priority percent 10
 class NETWORK-CONTROL
  bandwidth percent 5
 class SIGNALING
  bandwidth percent 2
 class CRITICAL-DATA
  bandwidth percent 24
  fair-queue
  random-detect dscp-based
 class MULTIMEDIASTREAMING
  bandwidth percent 10
  fair-queue
  random-detect dscp-based
 class SCAVENGER
  bandwidth percent 1
 class class-default
  bandwidth percent 25
  fair-queue
  random-detect dscp-based
  queue-limit 320 packets
!
policy-map 4Mbs
 class class-default
  shape average 4000000
  service-policy WAN-EDGE-4Mbs

 

 

As I said, Cisco recommended some changes, and the default queue was modified to:

 

class class-default
  random-detect dscp-based
  queue-limit 320 packets

 

thus removing the bandwidth allocation and the fair queuing.

 

This still has not solved the problem: something like a file transfer in the default queue still will not relinquish enough bandwidth for more interactive traffic to work.

 

We have tried defining a separate class for our most problematic traffic and shaping it within the policy, which helps, but it is not really a solution since any number of traffic types can cause this issue.
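
For illustration, the workaround looks roughly like this (the ACL, class name, and rate are made up for the example rather than our exact config):

ip access-list extended BULK-XFER
 permit tcp any any eq 445
!
class-map match-any BULK-DATA
 match access-group name BULK-XFER
!
policy-map WAN-EDGE-4Mbs
 class BULK-DATA
  shape average 1000000
! caps the identified hog traffic at roughly 1 Mbps within the 4 Mbps tunnel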

 

I'd appreciate any ideas.

 

Thanks in advance.

 

 


7 Replies

Joseph W. Doherty
Hall of Fame

If you have flow competition within a class, FQ should be a way to mitigate it.  Unclear why Cisco recommended its removal.  (BTW, I recommend against using WRED unless you're a QoS wizard, and I especially recommend against using it with FQ [unless perhaps using Cisco's FRED variant].)  Rather than WRED, you might see if the platform allows setting the flow queue lengths.  Decreasing them should tail drop bandwidth hog flows sooner.
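
A rough sketch of that idea (big assumption here: that this platform applies the queue-limit to each flow queue when fair-queue is configured - verify the behavior on your release before relying on it):

policy-map WAN-EDGE-4Mbs
 class class-default
  bandwidth percent 25
  fair-queue
  no random-detect dscp-based
  ! with fair-queue enabled, a smaller queue-limit should tail drop the
  ! deep (bandwidth hog) flow queues sooner while leaving shallow flows alone
  queue-limit 64 packets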

 

That said, you might have the issue of too many flows for the number of FQ flow queues, and/or the problem that there are so many bandwidth hog flows within FQ that the non-hog flows don't receive enough bandwidth to avoid being adversely impacted.

 

For the former condition, if the platform offers it, see if the number of per-class flow queues can be increased.  For the latter, you need to move the bandwidth hogs, if you can recognize them, into a different class.  (Which you're trying to do, although you're correct that the number of kinds of flows can make this difficult.  In the past, instead of looking at the "kind" of traffic, I would sometimes look at the transmission rate or other attributes of the traffic, such as TCP options and packet sizes, for "clues" about which flows are bandwidth hogs.)

 

(BTW, unsure what the default is on an ASR, but you also might have issues with FIFO queues before the class FQ.  For example, hardware interfaces have their tx-ring, which often needs to be decreased to keep traffic from FIFO queuing in it.  Unsure whether the shaper has its own FIFO queue.)

Hi Joseph,

 

Thanks for the quick reply.

 

So, just for your knowledge, we are using the 8-class WAN Edge QoS model in the Cisco Enterprise QoS SRND 4.0. In that guide they recommend using fair-queue and WRED together, with fair-queue acting as a pre-sorter before WRED ever gets engaged. What we kept seeing was WRED never being engaged and fair-queue not dropping enough traffic.

 

I am digging into the ASR 1001-X to see about adjusting the number of flow queues.
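
In case it helps anyone following along, these are the sort of commands I'm using to watch the per-tunnel policy and drops while I dig (output details vary by release):

show policy-map multipoint Tunnel0
! per-spoke instances of the per-tunnel policy with per-class queue/drop counters
show platform hardware qfp active statistics drop
! overall QFP drop reasons on the ASR, useful for spotting tail drops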

 

We actually have a fairly robust NetFlow deployment, so we are able to identify this traffic fairly quickly. It's just that a new flow always shows up after we've identified what we think is the rest, so we would like to see if we can isolate it to the default queue just to save configuration time and hassle.

 

Thanks,

Jim

"So, just for your knowledge we are using the 8 class WAN Edge QOS Model in the Cisco Enterprise QOS SRND 4.0. In that guide they recommend using the fair-queue and the WRED together and the fair-queue is supposed to act as a pre-sorter before the WRED ever gets engaged."

 

Hmm, I haven't read SRND 4.0 concerning ASRs.  Traditionally, WRED treats packets as they are being added to the class queue(s).  Might you have a reference to what you've read, above?

 

That aside, again, I recommend against using WRED unless you're a QoS expert.  It's surprisingly difficult to get it to work optimally.  Also, Cisco's defaults leave much to be desired (the latter is likely why you never saw it being engaged on your ASR).

 

Sure, I understand you can identify bandwidth hog flows with NetFlow, but, of course, who wants to keep doing that?  So, again, by using other traffic attributes, you often don't need to.

 

What got me on the path of using other traffic attributes was recognizing that the traffic "kind" is often not enough information.  For example, Microsoft traffic all uses the same ports, and other traffic "kinds" can have corner cases.  For example, list an Internet route table, not paged, over telnet, and effectively you have a bandwidth hog flow.  Or, for example, encrypted traffic might "hide" its contents, but it doesn't hide its bandwidth demand.  That provides a chance to distinguish SSH "telnet like" traffic from SCP traffic, even though from a "kind" perspective they "look" alike.
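
As a rough illustration of keying on attributes rather than "kind" (the port and length thresholds are made up - tune them to what you actually see):

ip access-list extended SSH-ALL
 permit tcp any any eq 22
 permit tcp any eq 22 any
!
class-map match-all SSH-BULK
 match access-group name SSH-ALL
 match packet length min 1000 max 1500
! mostly-full packets on TCP/22 are likely SCP/SFTP transfers; interactive
! SSH rides in small packets and falls outside this class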

Sure, here is the link:

 

https://www.cisco.com/c/en/us/td/docs/solutions/Enterprise/WAN_and_MAN/QoS_SRND_40/QoSWAN_40.html

 

I agree it's annoying to keep looking at the various kinds. We will explore the attributes as an option to key on.

 

Thanks for the link!

 

I didn't see a recommendation to specifically use class FQ and WRED together, although that guide does say "Therefore it is recommended to enable WRED on all TCP-based traffic classes," which could lead you to believe they might or should be used together.  But that's really not so.  With per-class FQ, it's often better to allow each flow to tail drop that flow's packets when that flow congests.

 

If you only had global FIFO, WRED could be a better option, but again, it's difficult to get it just right.  (Search the Internet for RED and see all the many "improved" variants - that's a clue it doesn't work as well as one might expect.  [One of the issues with the Cisco variant is that it drops at the tail of the queue, whereas RED works better if you drop at the front of the queue.])

Here is the part from the 8-class config where they have it enabled. There is also a blurb above the config where they discuss what they are doing. They actually have it enabled this way in a number of the classes.

 

Router(config-pmap-c)# class class-default
Router(config-pmap-c)# bandwidth percent 25
! Provisions 25% CBWFQ for default (Best-Effort) class
Router(config-pmap-c)# fair-queue
! Enables fair-queuing pre-sorter on default (Best-Effort) class
Router(config-pmap-c)# queue-limit 128 packets
! Expands queue-limit to 128 packets
Router(config-pmap-c)# random-detect dscp-based
! Enables DSCP-based WRED on default (Best-Effort) class
Router(config-pmap-c)# random-detect dscp default 100 128
! Tunes WRED min and max drop-thresholds for DF/0 to 100 and 128 packets

 

Let me check out the other variants that are out there. Very interesting.

Interesting!  They are setting the overall queue limit to 128 packets and setting RED's max threshold also to 128 packets.  One might think this aligns the two drop settings, but RED drops are based on a moving average, whereas queue limits are based on the current queue depth.  (One of the funky things about RED: since it uses a moving average, it might not drop from a deep actual queue that has rapidly burst to that size, and conversely it may drop from a shallow current queue that has just recently rapidly shrunk.)

 

Additionally, with FQ, you don't really want to bump up against the overall class queue limit; you want per-flow queue drops.  Besides FQ dequeuing multiple flow queues side by side, you generally want to drop packets from the deep flow queues rather than the shallow ones, and you may want to do this before the overall class queue fills.

 

Cisco rarely makes poor recommendations, but I believe this is one of them.
