653 Views · 2 Helpful · 14 Replies

QoS for High receive utilization

ravi bejjarapu
Level 1

Hi Team

I need some guidance on QoS. Our network is seeing high receive utilization at one of the branch sites; during business hours it crosses 95%. We have a legacy hub-and-spoke infrastructure, meaning all traffic must pass through the DC. The DC link is 1 Gbps, and QoS is applied on the DC link with a shaping average of 200 Mbps toward the branch site, but we still see high receive utilization at the branch. Policing is not agreed in our infra. Kindly suggest how to solve this without a link upgrade. Diagram attached.

 

Thanks,

Ravi.

14 Replies

M02@rt37
VIP

Hello @ravi bejjarapu 

Given your setup with a legacy hub-and-spoke infrastructure, all traffic from the branch must traverse the DC, where you've implemented a shaping policy of 200 Mbps on a 1 Gbps link, and you're experiencing high receive utilization (crossing 95%) at the branch site. Consider adjusting your QoS policies and potentially implementing traffic prioritization and classification. Start by analyzing the types of traffic that are consuming the most bandwidth. Identify essential versus non-essential traffic; for example, business-critical applications should be prioritized over less important traffic such as file downloads or streaming media. Adjust your QoS settings to prioritize and allocate bandwidth more effectively for high-priority traffic while limiting the bandwidth for less critical services.

Additionally, you might want to re-evaluate the current traffic shaping policy. The average shaping of 200Mbps might be insufficient during peak hours, leading to congestion and high utilization at the branch. Instead, consider dynamic shaping policies that can adapt to traffic patterns throughout the day, possibly increasing the bandwidth allocation during peak business hours and reducing it during off-peak times. Implementing techniques such as Weighted Fair Queuing (WFQ) can also help ensure fair bandwidth distribution among different types of traffic, preventing any single type from monopolizing the link. Traffic compression and optimization techniques can further help by reducing the overall volume of data transmitted. 

 

Best regards
.ı|ı.ı|ı. If This Helps, Please Rate .ı|ı.ı|ı.

Joseph W. Doherty
Hall of Fame

Would be helpful if you would explain the actual WAN technology being used, all WAN bandwidth limits (both physical [interface] and logical [CIR, if any]), and actual QoS policy you're using.

"During business hours its crossing the 95%."

"But still we see high receive utilization on branch site."

And so?  I.e. why is this a problem?

ravi bejjarapu
Level 1

M02@rt37   Prioritization of traffic is already in place for essential and non-essential traffic (business-critical, voice, video, business bulk, etc.). I will check on WFQ, and could you share a little more detail on "traffic compression" and optimization techniques?

@Joseph W. Doherty  The WAN technology is MPLS; we have eBGP with the PE.  BW limits: the interface speed is 1 Gbps and the CIR is 1 Gbps.

Classification and marking are at the LAN interface; queuing and shaping are at the WAN interface. A detailed diagram is attached.
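
For illustration, a minimal sketch of that kind of LAN-side classification and marking (the interface name, ACL, and DSCP value below are assumptions for the example, not our actual config):

ip access-list extended VOICE-TRAFFIC
 permit udp any any range 16384 32767
!
class-map match-all VOICE
 match access-group name VOICE-TRAFFIC
!
policy-map LanIngressMark
 class VOICE
  set dscp ef
!
interface GigabitEthernet0/0/1
 service-policy input LanIngressMark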

Due to link congestion, as the link is crossing 95%, users are experiencing slowness.

Could you suggest the points in my attached diagram?


Joseph W. Doherty
Hall of Fame

Branch interface and CIRs?

Egress policy at hub egress?

No spoke to spoke traffic at all?

ravi bejjarapu
Level 1

@Joseph W. Doherty 

Branch physical interface speed is 1 Gbps and CIR is 200 Mbps.

Yes, the egress policy is at the HUB egress interface (DC).

No communication between two spokes (i.e. no traffic or communication between branch sites).

 

Joseph W. Doherty
Hall of Fame

CIR is enforced in both directions?  (Some WAN cloud vendors only enforce upon ingress to cloud.)

Any QoS on branch?

Again, would like to see your actual QoS policy config.

I'm going to make some general suggestions; unfortunately, for lack of information, I cannot make specific ones.

Firstly, to manage downstream bandwidth limits, such as CIR, you're on the right track using shapers.  However, there's one major gotcha, which I believe is common with Cisco shapers: most of them don't account for L1/L2 overhead, but service providers often may.  Some of the later Cisco shapers do or can account for overhead, but since I don't know your specific platforms, or their running IOS versions, I cannot say if this is an issue for you or not.  If it is, and the router's shaper doesn't account for L1/L2 overhead, you can shape for a slower rate to allow for the overhead.  I've found shaping at about 15% slower than CIR usually works fairly well, but your mileage might vary.  (BTW, the L1/L2 overhead percentage varies per packet size: more as packet size decreases, least impact at maximum packet size.  You can shape for the worst case, which guarantees you'll not oversubscribe the CIR, but, on average, it usually "wastes" a good proportion of actually available bandwidth.)
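
To make the overhead arithmetic concrete, here's a worked sketch, assuming 38 bytes of L1/L2 overhead per packet (18 bytes L2 plus 20 bytes L1, per the figures later in this thread) and a shaper that counts only the IP packet:

! 1500-byte IP packet: (1500 + 38) / 1500 = ~1.025 -> ~2.5% overhead
!   64-byte IP packet: (64 + 38) / 64 = ~1.59 -> ~59% overhead
! Shaping ~15% below a 200 Mbps CIR:
!   200,000,000 bps x 0.85 = 170,000,000 bps -> shape average 170000000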

The big advantage of shaping is that you can determine whether there is recurring congestion.  If there is, there are two ways to deal with it: obtain additional bandwidth and/or use QoS to better manage your bandwidth usage.
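
To check for recurring congestion, the service-policy counters on the shaped interface can be examined (interface name here is assumed):

show policy-map interface GigabitEthernet0/0/0

Climbing "total drops" or a sustained queue depth under a class indicates recurring congestion in that class.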

Without getting into advanced/complicated QoS bandwidth management policies, often fair-queue, alone, is sufficient.

A representative QoS policy, on the hub, might be something like:


policy-map ChildFQ
 class class-default
  fair-queue

policy-map HubWanEgress
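 ! per-site shape rates set ~15% below each site's CIR to allow for
 ! L1/L2 overhead (e.g. 170000000 = 0.85 x a 200 Mbps CIR)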
 class site1
  shape average 170000000
  service-policy ChildFQ
 class site2
  shape average 85000000
  service-policy ChildFQ
 class site3
  shape average 161500000
  service-policy ChildFQ


On the spoke sites, they too should have a policy to limit themselves to their outbound CIR.

As an example for site1:

 

policy-map ChildFQ
 class class-default
  fair-queue

policy-map SpokeWanEgress
 class class-default
  shape average 170000000
  service-policy ChildFQ

 

Something also to consider: if the aggregate of all the spoke sites, either ingress or egress, can overrun the hub's capabilities, you should manage that too (see the sketch below).
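
For instance, a hypothetical sketch of capping the hub's aggregate egress, nesting the earlier per-site policy under an aggregate shaper (the 900 Mbps figure and interface name are assumptions, and this requires a platform supporting three-level hierarchical policies):

policy-map HubAggregate
 class class-default
  ! cap the aggregate at (or slightly below) what the hub link/CIR can carry
  shape average 900000000
  service-policy HubWanEgress

interface GigabitEthernet0/0/0
 service-policy output HubAggregate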

ravi bejjarapu
Level 1

@Joseph W. Doherty 

We are using similar configs to those you mentioned above for both HUB and spoke. One difference I found is that in "class default" we are using the default DSCP 0.

IOS versions:

HUB Router ISR 4451 : isr4400-universalk9.16.12.07.SPA.bin

Spoke Router ISR 4331: isr4300-universalk9.16.12.07.SPA.bin

Please let me know if I can share the configs separately for a review.

 

Joseph W. Doherty
Hall of Fame

What level licenses are you running on the 4Ks?  Especially for a 4331, even its performance license is a little undersized for 200 Mbps bi-directional.

Sure you can share configs separately.  However, I'm only really interested in your interface and QoS configs.  Might also be nice to see your interface QoS service policy stats too.

Unsure about what you're trying to convey about the class-default and DSCP 0 difference, but seeing your configs may make that clear.

BTW, it appears your IOS version does support (optional) Ethernet overhead accounting, see https://www.cisco.com/c/en/us/td/docs/ios-xml/ios/qos_plcshp/configuration/xe-16-12/qos-plcshp-xe-16-12-book/qos-plcshp-ether-ohead-actg.html

With the above feature, you should be able to shape for the nominal CIR, and not go over it, if the WAN provider is also accounting for such overhead.

This feature will use the bandwidth more optimally than shaping to a lesser rate.
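
As a minimal sketch of that option, reusing the earlier SpokeWanEgress example (rate shown for a 200 Mbps CIR; 38 bytes = 18 bytes L2 + 20 bytes L1):

policy-map SpokeWanEgress
 class class-default
  ! shape at the nominal CIR; the shaper adds 38 bytes per packet to its accounting
  shape average 200000000 account user-defined 38
  service-policy ChildFQ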

MHM Cisco World

Sorry, I have limited knowledge in QoS, but I am on my way to studying and solving issues related to QoS.

Anyway 

You mention the default class has DSCP 0.

The QoS policy, like an ACL, checks classes one by one and marks packets; if there is no match in any class, it uses the implicit class-default and marks, as you mention, with DSCP 0. This makes that traffic get best effort.

I.e. the traffic that is not confined in the above classes will always be forwarded.

This traffic can be anything, but the issue is it can affect control traffic, so try giving class-default something other than DSCP 0 and check.

MHM


Joseph W. Doherty
Hall of Fame

@MHM Cisco World wrote:

You mention the default class has DSCP 0.

The QoS policy, like an ACL, checks classes one by one and marks packets; if there is no match in any class, it uses the implicit class-default and marks, as you mention, with DSCP 0. This makes that traffic get best effort.

I.e. the traffic that is not confined in the above classes will always be forwarded.

This traffic can be anything, but the issue is it can affect control traffic, so try giving class-default something other than DSCP 0 and check.




Not quite the way it works.

There's only one class class-default, and it's always present, either explicitly defined or implicitly defined.

You're correct, any traffic that doesn't match another explicitly defined class, will be directed to class class-default.

There's no implicit relationship between DSCP BE (or IPPrec 0) and class class-default, nor is there any implicit handling of class class-default beyond the fact that, by default, it uses a class FIFO queue.

BTW, logically, the following 3 configuration snippets are the same:

interface x

==================================

policy-map x

interface x
 service-policy x

==================================

policy-map x
 class class-default

interface x
 service-policy x

ravi bejjarapu
Level 1

@Joseph W. Doherty 

Have sent the required configs separately.

 


Joseph W. Doherty
Hall of Fame

@ravi bejjarapu wrote:

@Joseph W. Doherty 

Have sent the required configs separately.


Got it.

Still lacking much information that I would like to see, but some suggestions . . .

Again, if we're going to shape to avoid bumping into the WAN provider's bandwidth limits, so that we can manage congestion, it's rather important we don't exceed those limits.  So, firstly, you might contact your provider and find out exactly what they are "counting" toward those limits.  If they are only counting packet bits, your current shaper settings should be fine.  But, if they are truly trying to provide the equivalent of wire bit rates, you want to allow for L1 and L2 overhead, which your IOS appears to be able to easily do.  Non-VLAN-tagged Ethernet frames have an additional 18 bytes of L2 overhead, and L1 adds an additional 20 bytes.  If you cannot get clarity from your WAN provider, try:  shape average # account user-defined 38

I see your hub policy uses multiple child policies (which can be fine), and the child policy you sent me is sort of a "QoS book" or AutoQoS kind of policy.  Such policies, IMO, are often overly complex and, if the device supports FQ, unnecessary.  (If you think about it, CBWFQ does its FQing between classes, but if classes support FQ, do you need as many class queues?)

I also see WRED being used in class-default.  I generally strongly recommend WRED NOT be used by other than QoS experts.  Class FQ, when available, IMO, is often a much superior choice.  If FQ is not available, possibly even WTD is a better choice.  (BTW, WRED isn't "bad", per se; it's more of a useful specialized technique for specific use cases, at least in modern networks.)

Here's my generic base QoS model:

policy-map BaseModel
 class real-time
  priority percent 35
 class hi-priority
  bandwidth remaining percent 81
  fair-queue
 class lo-priority
  bandwidth remaining percent 1
  fair-queue
 class class-default
  bandwidth remaining percent 9
  fair-queue

In the above model, almost all traffic should be in class-default, where FQ, alone, often does a very nice job of keeping bandwidth hogs from stomping all over lightweight traffic flows.

For some traffic that really needs priority guarantees, something like VoIP data and/or VidConf can be directed to the real-time class.  "Known" bandwidth hogs, something like email between servers, data backups, etc., might be deprioritized by directing them to the lo-priority (or background) class.  Traffic that really, really needs a priority boost, above and beyond what class-default's FQ provides, but doesn't need quite the service guarantee of something like VoIP, might be directed to the hi-priority (or foreground) class.  Ideally, only lightweight kinds of traffic should be placed into this class (e.g. screen-scraping apps).
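
A minimal classification sketch to pair with the model (the DSCP mappings here are illustrative assumptions; match on whatever your marking scheme actually uses):

class-map match-any real-time
 match dscp ef
class-map match-any hi-priority
 match dscp af21
class-map match-any lo-priority
 match dscp cs1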

The above model really is built around 4 priority levels: absolute priority over all else, normal priority, better than normal priority, and less than normal priority.  FQ within all but the absolute priority class generally precludes heavy-bandwidth flows from being adverse to light-bandwidth flows within the same class.  The 3 non-LLQ classes do have a major prioritization split between them, so a busy higher class can be adverse to a lower class.  (Often, ideally, you want actual class loads to be the converse of their bandwidth allocations, which have been set for dequeuing priority, not expected bandwidth utilization.  I.e. you want the hi-priority class to use the least amount of bandwidth, and bulk bandwidth flows to be in the lo-priority class.  NB: I had cases where, using a similar policy, total link utilization was running at 100% all day long during business hours, but that was because (laptop) client-to-server data backups, running in the lo-priority 1% class, were consuming 80% of the link.  The other 20%, though, was usually dequeued before such backup traffic.  Effectively, that 20% average usage had on-call 99% of the link's bandwidth, so it pretty much acted like there was no other traffic on the link.)

Here's an example I used to like to use to show the power of FQ.  I would set up a link between lab routers.  On the "local" router, I would telnet to the second router.  All worked just fine.

Next, also on the "local" side, I would send, using a traffic generator, 110% UDP traffic to the far side.  While that was happening, telnet to the far side was generally not at all usable.

Next, on the "local" side, with the same 110% data stream, I would enable FQ.  Then telnet started to respond, very similarly to when there was no UDP traffic being concurrently sent.

Also, besides telnet, I might also ping the "far" router.  Without FQ, when the 110% data stream was running, most pings failed or showed very high latency (compared to when the UDP wasn't running).  Again, with FQ and the data stream active, there was not a huge increase in ping times, and no lost pings.

Understand, the above policy is a base, and can be "tweaked" in various ways.  For instance, if I had streaming video, the default FQ queue depths might be inadequate, so I might need to increase per-flow and/or class queue depths for class-default (if supported), or create another class for streaming video, probably at about the same priority as class-default (with streaming video, latency and jitter requirements, because of inherent client buffering, are usually rather lax, but you don't want to drop packets).
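
For example, a queue-depth tweak might look something like this (the depth value is only an illustration; support and defaults vary by platform and IOS version):

policy-map ChildFQ
 class class-default
  fair-queue
  queue-limit 512 packets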

Also, BTW, I believe your routers may offer a two-level LLQ.  If so, you might place VoIP in L1 and VidConf in L2.
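
A hypothetical sketch of such a dual-level LLQ (class names and percentages are assumptions):

policy-map DualLLQ
 class voip
  priority level 1 percent 20
 class vidconf
  priority level 2 percent 15
 class class-default
  fair-queue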

If your topology is strictly hub and spoke, you should be able to manage your QoS needs on your devices.  If you start to do any spoke-to-spoke, you might check whether your MPLS vendor can provide any QoS support (often they can, and will, sometimes at extra cost, sometimes at no extra cost; but also, often their QoS support tends to be very basic).

Lastly, the one set of interface stats you provided (branch, correct?) showed a drop percentage of 0.088.  Under the old rule of thumb that anything less than 1% is acceptable, that looks fine.  But firstly, the old rule really applied to bulk data transfers, like FTP; and secondly, as that's an average over weeks, you may be having much, much higher drop spikes due to bouts of transient congestion, which is the kind of thing that makes users complain about the network being occasionally slow.

Unfortunately, network engineers usually only see stats at granularities of multiple seconds to multiple minutes, but transient congestion, which can cause users to perceive the network as slow, often happens down in the millisecond range.  This is one of the reasons I like FQ so much: it deals well with congestion down at the level of individual packets.  As TCP's slow start is particularly bursty, FQ protects other concurrent flows from it, often just dropping packets from such a bursty flow.

To summarize: shape such that you can manage any congestion, not allowing the provider to willy-nilly drop your packets; ensure traffic has the bandwidth it needs; and when there isn't sufficient bandwidth for all flows' "wants" concurrently, choose which flows' packets will be dropped.

If you take the above to heart, what you might do is create a new child policy and try it on just one site, on the HQ and/or branch side.  See how it works for you.

Again, with a policy using FQ, even if you use a traffic generator to saturate a link (using one or multiple flows), it shouldn't be too adverse to same-class FQ'ed traffic, nor to any higher-level classes.

Such a test often makes believers in the power of FQ and good traffic prioritization management.
