
How exactly does WRED help avoid congestion?

Mitrixsen
Level 1

Hello, everyone.

I understand what WRED is and what it does. I understand that once we exceed a certain threshold, it will start randomly dropping packets. I also understand that since it's weighted RED, it can be more strict towards low-priority traffic. This is all a basic explanation of course.

My question is this: WRED can drop packets more often from low-priority queues than from high-priority queues.

However, if we have two traffic classes, and therefore two queues - one for high-priority traffic and one for low-priority traffic - how would dropping something from the low-priority queue help the high-priority queue avoid congestion, if they are two separate classes/queues, each with its own % of BW reserved?

Or in other words, I would understand this if low- and high-priority traffic were in the same queue and the lower-priority traffic were dropped more often, but they're in separate queues, aren't they? So dropping something from the low-priority queue wouldn't really help the high-priority queue avoid congestion, would it?

Thank you.

David


11 Replies

Joseph W. Doherty
Hall of Fame

In the specific case you're asking about, dropping more packets from one queue than another (the relative priorities between the queues don't matter) doesn't matter, except for possibly avoiding exhaustion of all buffers, which would impact all queues.  Also, for that latter purpose, tail drop works just as well, and WRED also tail drops.  In fact, on some platforms, for any one queue, its packets might be WRED randomly dropped, WRED tail dropped, logical FIFO tail dropped, logical flow FIFO tail dropped, or dropped due to buffer exhaustion.

BTW, I generally recommend only QoS experts use WRED.  It's "advertised" as a great and simple queue management tool, but it's surprisingly difficult to get it to really work as desired.

I like to point out that RED's creator, Dr. Sally Floyd, quickly revised her original specification, as it didn't work as expected.

Further, if you delve into the subject, you'll find multiple variants to fix or improve RED.  Hmm, why?

Do I recommend never using WRED?  Not at all; for one, it's a way to provide weighted tail drop thresholds without using its random drop.

If you do use its random drop capability, you really need to set min and max thresholds, as needed, which Cisco defaults often do not (also, defaults can vary much between platforms, and possibly IOS versions too).  The other RED parameter options should probably be left alone.  (Even I generally don't touch the latter, as the theory for setting them is complex.  [Sort of like a CRC computation determination.])
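
To make those knobs concrete, here's a minimal MQC sketch (the class name, DSCP values, interface, and threshold numbers are all illustrative, and exact syntax and defaults vary by platform and IOS version):

```
class-map match-any BULK-DATA
 match dscp af11 af12 af13
!
policy-map WAN-EDGE-OUT
 class BULK-DATA
  bandwidth percent 30
  random-detect dscp-based
  ! explicit min threshold, max threshold (packets) and mark-probability
  ! denominator, rather than trusting the platform defaults
  random-detect dscp af11 30 40 10
  random-detect dscp af12 25 40 10
  random-detect dscp af13 20 40 10
!
interface Serial0/1/0
 service-policy output WAN-EDGE-OUT
```

The lower the min threshold for a DSCP, the earlier its packets become candidates for random drop; a denominator of 10 means the drop probability ramps up to at most 1 in 10 as the average queue depth reaches the max threshold.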

Oh, and as a side note, it might help to understand what problem RED was designed to address - a problem for which it actually works pretty well.

In the early days of the Internet, a router port's single FIFO queue would often overflow, which often caused drops across multiple flows.  That, in turn, often caused TCP global rate synchronization.  RED was also designed to be light on processing needs.

Enterprise networks usually don't carry the same volume of concurrent flows, and we can use multiple queues, e.g. dropping packets first from less important traffic to avoid running out of buffer resources.

Unfortunately, much QoS teaching material doesn't delve into any of this, and sort of leaves the impression it's a good idea to use it, without much consideration.

Adding to Joe's reference on sync of TCP rates... In theory, WRED can trigger individual TCP flows to throttle back their offered loads in a more gradual, graceful manner than having many flows all throttle back at once as in tail-drop. But the assumption with that is that the great majority of Internet traffic volume is still TCP. This assumption was validated by many traffic studies a number of years ago, but some studies now suggest that TCP does not dominate like it once did. Instead, DNS and QUIC traffic volume (both UDP based) might predominate in the future (or even now).

What would a shift in Internet traffic volume from TCP to UDP mean to the deployment of WRED, which depends on TCP flows playing fairly with each other (ie, not grabbing more than their fair share of the bandwidth)? I have no idea but I would be very interested in seeing academic studies on the effect of WRED on QUIC (I am about 4 years behind in reviewing ACM and IEEE journals in this area, so they may already be published).

Disclaimers: I am long in CSCO. Bad answers are my own fault as they are not AI generated.


@Ramblin Tech wrote:

In theory, WRED can trigger individual TCP flows to throttle back their offered loads in a more gradual, graceful manner than having many flows all throttle back at once as in tail-drop.


Yup, that's the theory.  Actually, it can do that for TCP flows.


@Ramblin Tech wrote:

What would a shift in Internet traffic volume from TCP to UDP mean to the deployment of WRED, which depends on TCP flows playing fairly with each other (ie, not grabbing more than their fair share of the bandwidth)?


Well, I actually read studies of TCP vs. UDP, and TCP usually loses out, but it's very variable, because some UDP-using apps have their own flow control, and some do not.

Using a network without any flow control often causes major performance issues.  TCP goes out of its way to try to be, indeed, fair.  Further, it addresses many other issues that network apps count on, like reliable delivery of data.  I.e. with UDP every app needed to "reinvent the wheel", while TCP provides the usually desired network reliability as part of its protocol.

Over the years, I've seen "special" data transfer apps that could transfer faster than TCP, but they generally didn't consider fairness, and so could also clobber your network.

Also, years (decades) ago, I was involved in studying some packet captures, for some network performance issue, and discovered Google servers didn't strictly adhere to TCP slow start.

I haven't studied QUIC at length, but it does appear to have some form of flow control, which may consider lost packets in its flow rate decision making.  It's unclear how well it handles really bulk/large data transfers relative to other concurrent traffic.  (My suspicion is: not well, but then of course the app could just use TCP - sure it will.)

The reason I mention Google not adhering to correct TCP slow start: it was to make their servers more "responsive" than competitors' servers.  I.e. fairness be damned.

QUIC may be just fine as long as you can keep feeding it bandwidth.

On the Internet, time-sensitive application traffic certainly continues to grow, not just for things like VoIP, but even for transactional kinds of applications, like web browsing.  Keeping things "quick" requires both bandwidth and low latency.

I know many Internet Providers focus on bandwidth, bandwidth, bandwidth, to avoid any queuing.  If there's never any queuing, generally there's no need for QoS.  If there is queuing, which is problematic, either you need more bandwidth and/or QoS to manage it.

Again, any QoS cannot prevent congestion; it only controls which packets will drop.

TCP can detect dropped packets (by packet sequence numbers) but UDP cannot; that's why QoS typically starts by dropping TCP packets.

QoS can help up to a point; after that you must start thinking of increasing the BW you get from the SP, or changing to a higher-throughput router.

MHM

 


@MHM Cisco World wrote:

Again, any QoS cannot prevent congestion; it only controls which packets will drop.

TCP can detect dropped packets (by packet sequence numbers) but UDP cannot; that's why QoS typically starts by dropping TCP packets.

QoS can help up to a point; after that you must start thinking of increasing the BW you get from the SP, or changing to a higher-throughput router.

MHM


As the above is worded, I disagree with all of it, because these seem to be sweeping general statements that are not always true.

For the first paragraph, we would need to clarify exactly what "congestion" and "QoS" mean.

For example, RED's purpose is to avoid a queue overflowing, which many would consider "congestion".

On the second paragraph: 100% correct, UDP alone doesn't care about drops as TCP does.  But the various apps using UDP often do care about drops.  So, whether a UDP drop is significant depends on the UDP app too.  In other words, you can't know the impact of dropping a UDP packet from its being UDP alone.

For the last paragraph, although QoS doesn't eliminate all need for additional bandwidth, it can often eliminate the need for additional bandwidth.  Conversely, it can be totally useless.  So, it's more complex than QoS just avoiding the need for additional bandwidth up to some point.

Again, your wording is very general and sweeping, also implying QoS offers little benefit.  There certainly are cases where all of the above is true, but more, I believe, where it's not.

OK, who uses QoS?

1- SPs, to control the BW for customers

2- VoIP engineers who want their VoIP to have good quality

Both cases do not prevent congestion; they control congestion.

There is a difference.

The ultimate solution to congestion is:

1- increase the memory of the interface and/or the internal buffer of the CPU

2- make the router forward traffic faster by increasing BW

Thanks, and correct me if I am wrong.

MHM

(Your solution) #1 often (study queueing theory) only works if average offered load is below 100%, and as you approach 100%, you can obtain a very, very deep queue.
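
(To put a standard queueing-theory result behind that - my illustration, the M/M/1 model, not anything cited above - the mean number of packets in the system at utilization rho is:)

```latex
L = \frac{\rho}{1-\rho}
\qquad \rho = 0.50 \Rightarrow L = 1,
\quad \rho = 0.90 \Rightarrow L = 9,
\quad \rho = 0.99 \Rightarrow L = 99
```

I.e. queue depth explodes as offered load approaches 100%, so "just add buffer memory" only chases the problem.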

Consider: I have a 1 TB file and a 100 Gbps link to a router which has a 64 Kbps egress link.  So, you propose the 64 Kbps interface should have a 1 TB buffer allocation?
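
Just to size that absurdity: a buffer holding the whole file would take

```latex
\frac{1\ \text{TB}}{64\ \text{Kbps}}
= \frac{8 \times 10^{12}\ \text{bits}}{6.4 \times 10^{4}\ \text{bits/s}}
= 1.25 \times 10^{8}\ \text{s} \approx 4\ \text{years}
```

to drain at line rate - clearly not a practical "solution" to congestion.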

Have you ever seen something like the foregoing in action?  What's the average transmission rate on the 100g link, using TCP?  Is it 100 Gbps, or something less, like closer to 64 Kbps?  If the latter, why?  Is there anything I can do, on the router, to influence TCP's transmission rate?  You appear to say not, correct?

Assuming I can influence the sender's transmission rate, like not having to take the whole 1 TB file at 100 Gbps, am I not also controlling congestion on the 64 Kbps interface?

Let's assume, concurrent with the foregoing, another host wants to send VoIP.  If I only have a single queue, and I can actually queue up 1 TB along with the VoIP, will VoIP work well?  Hopefully, you'll agree it won't.  But, if I can, in fact, influence a transmitter's send rate - say I get the data sender to send at only 8 Kbps - would VoIP work then?

As to (your solution) #2, sure, if we obtain enough bandwidth, i.e. there's no oversubscription, we're fine.  So, yup, let's replace our private p2p intercontinental 64 Kbps link with 100 Gbps.  Laugh - you'll cover the slight delta in cost?  Oh, did I mention we have more than one intercontinental link, not to mention links on the same continent?

If your experience is in SP, DC, or LAN environments, often bandwidth and/or simple QoS to support RT traffic like VoIP negates the need for extensive QoS.

WANs, though, are often much more bandwidth constrained, sometimes because the bandwidth just isn't physically available, and possibly more often because of cost.  With a lack of bandwidth, QoS can be very useful beyond supporting something like VoIP.

Decades ago, I worked at an international company that had offices all over the world.  Lowest end WAN links were 64 Kbps, highest, I had to work with were DS3/E3.  LAN connections were gig.  Let's say, congestion wasn't rare on the WAN links.

Over a decade, I spent much of my time working out effective QoS.  My QoS predated RT apps like VoIP (although it and VidConf, along with streaming video, were added later).

Was this worthwhile?  Well, it looked like I avoided at least $100,000 a month in data telecom charges by avoiding increasing bandwidths, and user complaints about the slow network greatly diminished.

Was the foregoing just my wishful thinking?  Not quite, as two other global regions believed QoS was only needed for VoIP and used your option two.  However, they constantly complained about the additional cost, especially as their users kept complaining about the slow network.

But, I'll now reveal a major "secret" of effective QoS - everything need not be fast, but it should be predictable/expectable/reasonable.

For example, if I pull a huge file from a server, I'm not surprised it takes longer than an email message saying "Lunch at noon?".  But, when the latter sometimes takes longer than the former, you have very unhappy users.

Joseph W. Doherty
Hall of Fame

@Mitrixsen as I replied earlier, WRED would be unlikely to be used for dealing with congestion between two queues.  However, it's a totally different situation within the same class queue, when using WRED (not RED)!

Suppose you had VoIP and FTP mixed in the same FIFO queue.  Probably VoIP will very much be adversely impacted when and if FTP tries to obtain all possible bandwidth.  The normal, and much better, approach would be to split VoIP and FTP into separate queues and prioritize the VoIP queue.  If you do that, as far as the VoIP queue is concerned, it doesn't matter if you use WRED or not in the FTP queue (it can impact FTP, though).

But again, for some strange reason, we're stuck with only having one queue, and our only tool is WRED, so VoIP isn't possible, at all, right?  Well, not necessarily.

Suppose we configure WRED to begin randomly dropping FTP packets when there are 2 or more in the queue, and to drop all FTP packets when there are more than 6.  So, more-or-less, we guarantee VoIP will only have somewhere between 2 and 6 packets ahead of it, not 40 or more.  Will VoIP work now?  That depends on things like the transmission bandwidth of the port, i.e. how long the VoIP packet might be delayed.  Remember, VoIP does have some tolerance for delay (latency), jitter and even drops.
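
As a sketch of that thought experiment (assuming VoIP is marked EF and FTP is left at DSCP default; the numbers are illustrative, and whether random-detect is accepted in class-default like this varies by platform and IOS version):

```
policy-map ONE-QUEUE
 class class-default
  random-detect dscp-based
  ! FTP (DSCP default): random drop starts at 2 packets; a denominator of 1
  ! ramps drop probability to 100% at 6, and above 6 WRED tail drops
  random-detect dscp default 2 6 1
  ! VoIP (EF): thresholds high enough that WRED effectively never touches it
  random-detect dscp ef 60 64 10
```

(Strictly, WRED compares those thresholds against a moving average of queue depth, not the instantaneous depth, so the 2-to-6 guarantee is approximate.)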

Oh, and a factoid (one that drove me crazy in my early QoS endeavors, as I didn't initially know about it): Cisco interfaces often have a hardware FIFO queue, and it's only that queue's overflow that obtains prioritized dequeuing by a QoS policy.

(NB: what drove me crazy, the embedded hardware FIFO queue caused my QoS policy not to work very well.  Once I found out about this hardware queue, ah, that explained the poor QoS effect.  Fortunately, that hardware queue can be considerably reduced from its default size, i.e. allowing the QoS policy good effect.  [Also note: this little factoid, often not much mentioned in Cisco QoS literature, possibly goes a long way in explaining why QoS isn't held in very high regard - as the hardware FIFO queue, if not downsized, so much negates the impact of a QoS policy.])
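
(For anyone hitting the same wall: on many software-forwarding IOS platforms that hardware FIFO is the interface transmit ring, and it can be shrunk so congestion backs up into the software queues where the service-policy acts.  A sketch; the command's availability, units, and sensible values vary by platform:)

```
interface Serial0/1/0
 ! shrink the hardware transmit ring so the QoS policy, not the FIFO,
 ! decides dequeuing order under congestion
 tx-ring-limit 3
```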

So, the foregoing is an example of using WRED to avoid congestion of the interface.  (By "avoid", for a TCP app like FTP, the drops will have the sender back off its transmission rate.  If the other app was something UDP based, which did not have any app flow rate control, ingress to the router wouldn't change, but the egress interface queue depth would still be small.  For interface queue depth, the latter would probably maintain an average of 6 non-VoIP packets in the queue, whereas with FTP, it would fluctuate with FTP's actual transmission rate.  Again, where you'll see a huge difference is the "offered" rate.)

BTW, I believe @MHM Cisco World appreciates that, in a scenario as above, QoS may "control" congestion, for either TCP or UDP.  However, QoS can also prevent congestion, by somehow "signaling" senders to slow their transmission rate.  Classical TCP did this by treating lost packets as all due to congestion (although they might have been lost for other reasons).  Modern variants of TCP now also watch for spikes in RTT, and assume those are due to congestion.  For either, TCP backs off its transmission rate, by half, or goes back into slow start.

Many other non-TCP apps have their own ways of detecting and dealing with congestion and responding.  For example, some streaming apps will drop to a lower quality, lower bandwidth video stream.

How well something like WRED will interact with them depends much on the app.  As noted by @Ramblin Tech, RED's goal was a nice back-off of one or a few flows, instead of slamming all the concurrent flows.  But, some of these non-TCP flows may consider the loss of just one packet to be due not to congestion, but to some other cause, such as in-flight data corruption or a transient misrouted packet.  For such, they won't slow down, and may, or may not, retransmit the lost packet's data.

I've often stated, QoS isn't really too complex, but it does require a wide breadth of knowledge to use it well.

I'm retired pushing 8 years now.  Off the top of my head, I have no idea how any kind of dropping methodology will impact Jim's QUIC, but if I were going to implement QoS for QUIC, I would find out before I wrote a QoS policy that handled that kind of traffic.

Sorry for late reply

QoS can signal the peer about queue congestion.

That's not totally right.

End-to-end QoS like RSVP is used to reserve a specific amount of a link's BW for VoIP; it is not used to prevent congestion.

Just want to clarify this point.

Thanks 

MHM


@MHM Cisco World wrote:

Sorry for late reply

QoS can signal the peer about queue congestion.

That's not totally right.

End-to-end QoS like RSVP is used to reserve a specific amount of a link's BW for VoIP; it is not used to prevent congestion.

Just want to clarify this point.

Thanks 

MHM


Sorry, at least to me, it's unclear what point you're trying to make clear.

If you're saying QoS always "signals" the sender that there's queue congestion, and that's not totally correct, I fully agree.  I didn't mean to imply that; I don't think I actually wrote that, either, but, again, I certainly didn't mean to imply it.

I am saying, when there's queue congestion, often QoS techniques can "influence" a sender to slow their transmission rate.

Understand, though, I've a very expansive definition of QoS.

For example, take the situation of a single global egress FIFO queue.  Can there be QoS considerations?

I believe there are, concerning how deep you permit the queue to be, i.e. the queue limit.

@MHM Cisco World how do you determine what a single global FIFO egress queue limit should be?  Do you ever care, or do you take whatever the device uses as a default?  Even if you use the device's default queue limit, hmm, do you think its value was just chosen randomly?
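(For concreteness, on classic IOS the knobs in question would be something like the following; the numbers are purely illustrative, and defaults and syntax differ by platform and IOS version:)

```
interface GigabitEthernet0/1
 hold-queue 128 out         ! interface output software queue depth; often defaults to 40
!
policy-map SINGLE-FIFO
 class class-default
  queue-limit 96 packets    ! per-class limit under MQC (older images: queue-limit 96)
```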

Then we can also discuss what truly constitutes "signaling".  Does it always need to be something explicit, such as ECN, or might loss of packets indirectly provide a signal (of congestion)?  The latter is a bit germane to the subject of WRED, because it drops packets to try to accomplish a specific result.  Did you think there's no "science" behind the values for min, max, drop percentage and timing computations?
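
For the curious, that science is laid out in Floyd and Jacobson's 1993 RED paper: the thresholds are compared against an exponentially weighted moving average of the queue depth, and the drop probability ramps between them:

```latex
avg \leftarrow (1 - w_q)\, avg + w_q\, q
\qquad
p_b = max_p \cdot \frac{avg - min_{th}}{max_{th} - min_{th}}
\qquad
p_a = \frac{p_b}{1 - count \cdot p_b}
```

where q is the instantaneous queue depth, w_q the averaging weight, and count the number of packets since the last drop - which is why those "other parameter options" are best left alone unless you've studied them.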

If WRED has some science behind it, do you believe it's not possible that even a FIFO queue limit might too?

If you're ignorant of such science, that's not unusual, because typical QoS is taught as: you need it for VoIP and video, and you can use WRED for "better" dropping; anything beyond that, "you need more bandwidth".

Again, since I have an expansive view of QoS, here's a real world problem I was involved with. . .

At the time, I was working for an international company that was doing weekend database "cloning" across the North Atlantic, with a timing constraint: the clone replica had to be started and completed within the weekend.

Initially they had multiple T1/E1 transatlantic channels, but, somewhat like Etherchannel, the replication only used one path, which was insufficient bandwidth.  So, they traded in their T1/E1 channels for a single T3/E3 channel, which, by their calculations, had ample bandwidth to allow the replication to complete within the required timeframe.

Problem was, the replication refused to use all the available bandwidth!  They were going crazy trying to figure out the problem, as everything on the network looked just fine.

When I was made aware of this problem, I immediately suspected what the issue was, and proposed a couple of ways to work around it.  They accepted my recommended approach, although they were initially aghast at what I suggested, but they did it anyway.  It did solve the problem.

Any idea what the root cause was and possible solutions?

root cause:

Spoiler
LFN (long fat network) BDP (bandwidth delay product)

the aghast recommendation:

Spoiler
increase TCP RWIN on receiving host from 16 KB (its default) to 64 KB
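
The arithmetic behind the spoilers (the 90 ms RTT is my assumption for a transatlantic path; the actual figure wasn't given): a single TCP flow's throughput is capped at RWIN/RTT, so

```latex
\frac{16\ \text{KB} \times 8}{0.09\ \text{s}} \approx 1.5\ \text{Mbps}
\qquad\text{vs.}\qquad
\frac{64\ \text{KB} \times 8}{0.09\ \text{s}} \approx 5.8\ \text{Mbps}
```

while the T3's bandwidth-delay product at that RTT is roughly 45 Mbps × 0.09 s ≈ 500 KB.  So neither window fills the pipe, but quadrupling RWIN quadruples throughput, which, per the story, was enough to meet the weekend deadline.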

Regarding RSVP precluding congestion: well, not necessarily.  All it really does is determine whether some specific amount of bandwidth is available end-to-end, and if so, guarantee it to the requestor.  The requestor is still, I believe, at liberty to oversubscribe the requested end-to-end bandwidth.

I've never used RSVP, although, conceptually, I understand its benefit.

Something similar could be done, statically (and dynamically?), with ATM using its PVCs, in specific modes, for specific traffic, like a PVC dedicated to VoIP.

However, where I had complete management of bandwidth, I found Cisco QoS support almost always sufficient to accomplish my service goals without using either.