Is QoS needed at the Distribution layer (On two C9500K Stackwise)

freddycalderon85 · ‎05-10-2024

Hello! I was wondering if I could get advice/input when it comes to QoS at the distribution layer of a network topology. Here's a high-level overview of our network topology.

We will soon deploy two Cisco C9500k switches in a Stackwise HA configuration. They will have about 10 downstream links to IDF switches, connected over 10G ports from the distribution layer to the 10 access layer switches.

We do not use any Cisco IP phones or other IP phone brands. We use Yealink USB to integrate with MS Teams. Other end-users use straight-up MS Teams with their headset to make calls or video conferences. I haven't seen any kind of latency or jitter with the current Cisco 9200 switch acting as the distribution layer. Of course, the Cisco 9200 has a lot of access and 4x trunk ports. The C9500K will only have trunk/tagged ports since the network design will change.

Besides not having IP Phones, do I need to configure QoS for something else? These C9500k are very high-density switches so is QoS needed?

Thanks in advance.

Ramblin Tech · ‎05-10-2024

Wherever there is stat-muxing (i.e., over-subscription of links between nodes) in a packet network, there is the possibility of queueing. Wherever there is queueing, there is the possibility of traffic overrunning the queue (even very briefly) and packets being dropped. If you are OK defaulting to randomly tail-dropping all types of traffic with equal probability (weighted by traffic volume), then you do not have to do anything, as all traffic will default to a single FIFO egress queue which tail-drops any excess.

OTOH, if you want to set policies as to which traffic has priority, or gets de-prioritized (non-work related traffic), then you need to configure QoS policies. Does your enterprise use video calls (Webex, Zoom, Ooma, Skype, etc)? If so, then you might consider prioritizing that traffic over cat videos on Youtube that are not mission-critical. It all starts with having a policy, so if your official IT stance is that there is no QoS policy, well then that is your policy.

[Back to stat-muxing... unless you have equal bandwidth between your uplinks to the core and downlinks to the access layer, your distribution layer is aggregating and stat-muxing. If your distribution gozintas actually equals your gozoutas, then you are missing an opportunity to aggregate and you could probably just replace your distribution layer with some fiber jumpers!]

Disclaimer: I am long in CSCO

Joseph W. Doherty · ‎05-10-2024

Just to add another aspect to Jim's reply, whenever there's queuing, queues add packet delay. I.e. packet delay, alone, can be detrimental to some network apps.

Queues are finite, and as Jim further explains, can overflow. Packet drops can be even more detrimental to network apps.

That said, various network apps have various tolerances to additional delay and/or drops. As long as those tolerances are not exceeded, you really don't need QoS.

However, even when it appears QoS is not needed, QoS can be like insurance: it is there when an unexpected/unplanned need arises. Often QoS is worth considering when you need to "guarantee" certain network apps will work well, at all times or at any time.

Lastly, as Jim notes, any over-subscription opens up the need for queuing and over-subscription might be found anywhere in the network, i.e. it's not just a distribution level consideration. (BTW, it's nearly impossible to have a real network without oversubscription. Consider 3 hosts on the same switch, where any host to host communication may happen. Can you avoid oversubscription?)

Reza Sharifi · ‎05-11-2024

Joe,

I like the insurance policy analogy!

Reza

freddycalderon85 · ‎05-13-2024

Thank you, Joseph! I appreciate your input.

freddycalderon85 · ‎05-13-2024

Thanks, Ramblin, for your input. Very helpful. Yes, we do use Webex, Zoom, and MS Teams but have not seen any kind of traffic congestion on the network.

If I wanted to prioritize traffic, such as mission-critical, what QoS configuration would you recommend?

Thanks

Ramblin Tech · ‎05-13-2024

Hi @freddycalderon85

I would first start with some sort of written policy that everyone can stare at and then nod their heads (particularly management types), to hedge against someone coming along later demanding to know why their favorite app is not being treated as mission critical. The policy should list the mission-critical apps, the criteria for deeming an app to be mission-critical, and some broad statement about mission-critical app traffic receiving differentiated, preferential treatment over traffic that is not. I am taking the approach here that mission-critical apps will be explicitly identified as such, with all else defaulting to "other" status. You could just as well take the approach that explicitly identified non mission-critical apps are to be de-prioritized, with other traffic defaulting to a priority status. The important thing here is that you are identifying traffic classes, with criteria for apps to be associated with the classes. The "mission-critical" class criteria should get some management endorsement as the criteria might be somewhat subjective as opposed to classes based on quantifiable network performance requirement criteria (loss, latency, jitter, availability).

With the traffic classes identified, dig into Modular QoS CLI if you are not already familiar with it. MQC has essentially three interrelated steps:

1. Identify traffic through class-maps
2. Establish a policy as to what to do with those traffic classes via a policy-map
3. Apply the policy-map to an interface in an ingress or egress direction via the service-policy command

For class-map definitions (#1), you would focus on identifying deterministic criteria about mission-critical packets that allow you to match them to the traffic classes in your written policy, such as source or destination addresses, source or destination TCP/UDP port numbers, QoS markings (L2 CoS, L3 ToS/IPPrec/DiffServ), ingress interface, VLAN, etc. You would then use the class-map definitions in your policy-maps (#2) to establish differentiated treatments such as rate-limiting (policing/shaping), queueing (which queue? any active queue management such as WRED?), and scheduling from the queues for transmission (what order are the queues serviced? are queues serviced until exhaustion or for some quanta of time or data?).
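As a rough illustration of the three steps, here is a minimal MQC sketch that classifies traffic with an ACL and marks it on ingress. Everything named in it is an assumption for illustration only: the ACL name, the 10.10.20.0/24 server subnet, TCP port 443, the class and policy names, the AF21 marking, and the interface are all placeholders, and what your platform and release actually accept should be checked against the Cat9K documentation.

```
! Step 1: identify the traffic (hypothetical ACL, subnet, and port)
ip access-list extended ACL-MISSION-CRITICAL
 permit tcp any 10.10.20.0 0.0.0.255 eq 443
!
class-map match-any MISSION-CRITICAL
 match access-group name ACL-MISSION-CRITICAL
!
! Step 2: say what to do with that class (here: mark it AF21, an assumed value)
policy-map INGRESS-MARKING
 class MISSION-CRITICAL
  set dscp af21
!
! Step 3: attach the policy to an interface in the ingress direction
interface TenGigabitEthernet1/0/1
 service-policy input INGRESS-MARKING
```

Marking on ingress like this lets a separate egress policy match on the DSCP value alone when it assigns traffic to queues.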

While the MQC syntax is generalized, flexible, and Platform Independent, the actual QoS capabilities of a given system are highly Platform Dependent. In the case of the Cat9K family, forwarding and QoS take place in an NPU (Network Processing Unit). NPUs are much, much faster than software-based forwarding platforms, but they can also be far less flexible as their QoS capabilities are ultimately fixed in hardware, rather than being malleable in software. I recommend taking a look at the Cat9K forwarding and QoS presentation from CiscoLive! to get a feel for the supported QoS capabilities of the Cat9500 platform. From that you can start to see what you can actually match on in your class-maps and what rate-limiting/queueing/scheduling capabilities are possible. With those capabilities in mind, you can start to sketch out your MQC configs and bring those ideas back to this forum for review and advice.

https://www.ciscolive.com/on-demand/on-demand-library.html?zid=pp&search=catalyst%209000#/session/1701824091968001nPgJ

Disclaimer: I am long in CSCO

Joseph W. Doherty · ‎05-13-2024

@freddycalderon85 wrote:

Yes, we do use Webex, Zoom, and MS Teams but have not seen any kind of traffic congestion on the network.

If I wanted to prioritize traffic, such as mission-critical, what QoS configuration would you recommend?

Hope you don't mind my throwing my two bits worth in. ; )

So you have not seen congestion, eh? Well, most don't see it because it can often be very fleeting and also often very difficult to "see" with typical monitoring. This, for example, sometimes leads to users complaining about network performance, but "obviously" users don't know what they are talking about as none of our links show more than 20% utilization!
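One way to look for that kind of fleeting congestion is to check drop counters rather than utilization graphs. The commands below are a sketch only: the interface name is a placeholder, and the keywords of the platform-level command in particular are an assumption that varies by platform and release.

```
! Cumulative egress drops on one downlink (interface name is a placeholder)
show interfaces TenGigabitEthernet1/0/1 | include output drops
!
! Per-class counters; only populated once a service-policy is attached
show policy-map interface TenGigabitEthernet1/0/1 output
!
! Per-queue hardware counters; exact keywords depend on platform and release
show platform hardware fed switch active qos queue stats interface TenGigabitEthernet1/0/1
```

A drop counter that keeps incrementing on a link showing low average utilization is a typical sign of short bursts overrunning the egress queue.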

Personally, I consider there's congestion whenever a packet arrives at an interface and it cannot be immediately transmitted. This is quite common. Fortunately, congestion usually isn't adverse to network applications until you have "significant" queuing delay and/or "significant" packet drops. (I quote "significant" because that's very dependent on the network app, for example, VoIP and FTP have much different delay and drop tolerances. Conversely, though, by my definition of congestion, if there is none, generally any/all network apps are "happy".)

Anyway, real-time audio/video apps tend to be the least tolerant of network delay and/or drops.

Real-time audio and real-time video don't really vary much in their tolerances, but video tends to consume much more bandwidth than audio and might be much more variable in its bandwidth usage, moment to moment.

The recommended treatment for real-time audio is almost always first priority for dequeuing plus enough of a bandwidth guarantee to ensure it neither self-queues nor drops. Often audio will be placed in an LLQ or PQ to ensure both.

Real-time video is treated much the same as real-time audio, but it is often prioritized just below real-time audio. Some later CBWFQ implementations offer an LLQ/PQ with two priority levels, with audio directed to level 1 and video to level 2.
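A two-level priority arrangement of that kind might be sketched as follows. This is only an illustration under stated assumptions: it assumes audio arrives marked EF and video marked AF41 (the thread does not confirm how, or whether, this traffic is marked), and the class names, policy name, percentages, and interface are placeholders to be checked against what the platform and release support.

```
! Classes keyed on DSCP (EF for audio and AF41 for video are assumed markings)
class-map match-any REALTIME-AUDIO
 match dscp ef
class-map match-any REALTIME-VIDEO
 match dscp af41
!
! Audio is serviced first, video second; percentages are illustrative only
policy-map EGRESS-QUEUING
 class REALTIME-AUDIO
  priority level 1 percent 10
 class REALTIME-VIDEO
  priority level 2 percent 20
!
! Queuing policies are applied in the egress direction (placeholder interface)
interface TenGigabitEthernet1/0/1
 service-policy output EGRESS-QUEUING
```

Any traffic not matched by either class falls into class-default and is serviced after both priority levels.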

In the foregoing, notice I kept using "real-time". This is because not all audio (e.g. VoIP control) or video (e.g. streaming) traffic is real-time, or it may not be the actual "bearer" of the audio or video. Those traffic types have much more tolerance, usually for delay, but are still usually sensitive to drops.

Lastly, for "mission-critical", to me, that's often a term used because all traffic has been best-effort with FIFO queues and users have seen erratic performance (I know, they are {IMO not} crazy, because no link shows more than 20% utilization). So, a common QoS solution is to give such "mission-critical" traffic priority over other traffic (which often does improve how those apps work - of course, all other network apps then often tend to work even worse).

Leo Laohoo · ‎05-10-2024

For IOS-XE, the following command is mandatory:

qos queue-softmax-multiplier 1200

Joseph W. Doherty · ‎05-11-2024

I'm guessing, where Leo works, that QoS command is part of their standards.

Regardless, when Leo says it is mandatory, I would take that as a really, really (really) strong recommendation it should be used.

However, although that command is harmless almost all the time, it might be adverse in some cases. There are reasons why it's not the default to begin with. Again, understand that the probability of it being adverse is very low, or so I believe.

That command is briefly described in the Cisco Catalyst 9000 Switching Platforms: QoS and Queuing White Paper, which also notes some of its possible adverse effects.

Generally, I also recommend the command's usage with a 1200 value.