
(Auto)QoS drops more packets on higher capacity uplink

jhenkenhaf
Level 1

Hi Everyone,

 

Recently I was asked to troubleshoot a networking issue that has me stumped. A customer I work with received complaints about slow file access, so we began digging, and after drilling down the problem seems to be related to QoS. The setup:

  • Fileserver attached to the backbone via 10G
  • Backbone/distribution connected via a 4x10G port channel
  • Access (2960X) connected to distribution via a 2x10G or 2x1G port channel
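As a rough one-line sketch of the path (the client sits behind the access switch on a 1G port):

Fileserver --10G-- Backbone ==4x10G== Distribution ==2x10G or 2x1G== Access (2960X) --1G-- Client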

The first thing we found was that the slow access is only measurable on access switches connected via 10G uplinks. So we went a little deeper and set up a test scenario:

  • Clean PC with CrystalDiskMark for measuring read/write rates
  • Access switch with no other ports in use, connected to the distribution layer (with either a 1G or a 10G uplink)
  • Pretty much a blank/basic configuration, only "auto qos trust dscp" enabled on the interface (see the sketch below)
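For reference, the test port configuration on the client-facing interface was essentially just this (a minimal sketch; the interface number is only an example):

interface GigabitEthernet1/0/10
 description test client
 auto qos trust dscp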

So starting from the exact same base we saw:

  • Write rates to the server: a flat 1G, no matter the setup
  • Read rates from the server:
    • a flat 1G when using the 1G uplink
    • only 300-600 Mbit/s when using the 10G uplink
  • Switching software versions on the switch (15.2(7)E1 and 15.2(4)E8) did not change anything

This was very confusing, but we kept searching and started suspecting QoS. For the next test we disabled QoS on our test switch and, poof, a flat 1G no matter which uplink was used. (We cannot disable QoS on the production infrastructure, though, as that would have negative effects on the VoIP setup.)
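For reference, disabling QoS globally on the 2960X is a single knob, so the test amounted to this sketch:

no mls qos

(and "mls qos" to turn it back on afterwards).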

 

So the digging continued, with my limited know-how on QoS:

"show mls qos int gi1/0/x statistics" as well as "show interface 1/0/x | i drops" and a wireshark setup were my tools. They showed me that outputs were dropped on the output queues (queue 2), on the interface (exact same number of output packet drops as queue drops) and wireshark told me about tcp-retransmissions, out of order, dup-ack and more.

 

My detective senses told me this made sense: output drops on the interface correlate with the read speed seen by the client. With TCP retransmissions and window-size adjustments, of course the read speed suffers.

 

My solution for the time being:

Okay, if packets in queue x are dropped, I should increase the bandwidth, buffers, and thresholds for this queue. So I experimented a little by adjusting (see the verification sketch after this list):

  • Interface parameters (mls/auto qos settings: no change; bandwidth share settings: no change)
  • Global QoS output buffer parameters (from 15 25 40 20 to 10 15 65 10, increasing queue 2: a little better)
  • Global QoS queue thresholds for the threshold that showed drops (from 100 100 100 400, step by step, to 800 800 100 1200: a lot better)
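To sanity-check what the switch actually allocated after each change, I looked at the queue-set and the per-port buffers (sketch; gi1/0/1 is a placeholder, and this assumes the ports are in queue-set 1, which is the default):

show mls qos queue-set 1
show mls qos interface gigabitethernet1/0/1 buffers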

Or in numbers, for the same read/write process (these were taken on the client's access interface; the uplink did not show drops in any situation):

Original QoS values:

output queues dropped:
queue:    threshold1   threshold2   threshold3
-----------------------------------------------
queue 0:           0           0           0
queue 1:           0           0           0
queue 2:           0           0       21516
queue 3:           0           0           0

Adjusted QoS values:

output queues dropped:
queue:    threshold1   threshold2   threshold3
-----------------------------------------------
queue 0:           0           0           0
queue 1:           0           0           0
queue 2:           0           0         284
queue 3:           0           0           0

So roughly 75 times fewer queue drops on our data queue, which of course resulted in much better read rates (pretty much a flat 1G).

 

Now I am left with a few questions:

  • Why does it behave differently depending on the uplink bandwidth?
    • Shouldn't a higher uplink bandwidth simply not affect my access port?
  • Are the values I experimented with reasonable? (Probably difficult to say, because of other dependencies in the network.)

    mls qos queue-set output 1 buffers 15 25 40 20
    to
    mls qos queue-set output 1 buffers 10 15 65 10

    mls qos queue-set output 1 threshold 3 100 100 100 400
    to
    mls qos queue-set output 1 threshold 3 800 800 100 1200
  • Is there a good way to monitor whether changes like this affect other traffic? (See the command sketch after this list.)
  • If I want to know more about why the drops happen:
    • Where can I go?
    • What debugging/troubleshooting commands can I use?
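For context, the closest thing to monitoring I have so far is clearing and re-sampling the per-interface queue counters around a change (sketch; gi1/0/1 is just a placeholder):

clear mls qos interface gigabitethernet1/0/1 statistics
show mls qos interface gigabitethernet1/0/1 statistics | begin output queues

That at least shows which queue and threshold are dropping, but not why, which is part of what I am asking about.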

The 10G vs. 1G behaviour puzzles me the most, but I am happy for any feedback I can get. Hopefully the information above is relayed in an understandable way; if not, I will happily elaborate.

 

Kind regards,

Jochen


4 Replies

Joseph W. Doherty
Hall of Fame
Increasing the logical queue thresholds (I sometimes configure max values) is often one of the most effective changes for reducing drops on the Catalyst 2K/3K switches. On the switches that "reserve" buffers for per-port egress queues, I've also found that reducing the per-port reservations (letting those buffers go into the shared pool) effectively reduces the drop rate (I sometimes go with the minimum possible port reservation).

As to why a higher bandwidth uplink might increase drops: a higher bandwidth link can allow a quicker build-up of a queue on a lower bandwidth link. In some situations I've suggested reducing bandwidth along a path so that the link speeds are more uniform end-to-end.
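As a rough sketch of what I mean on this platform family (the queue number and values are only examples; check the allowed ranges on your switch):

! Raise drop thresholds 1/2 and the queue maximum toward their upper limits
! (typically up to 3200 percent of the allocated buffer), and keep the per-port
! reserved percentage low so unused buffers fall back into the common pool.
mls qos queue-set output 1 threshold 3 3200 3200 50 3200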

Hi Joseph,

 

Thank you for the additional input and suggestions.

From my understanding of our port configuration (srr-queue bandwidth share 1 30 35 5), I understood that reservations do not matter (as long as there is traffic in only one queue). From what I read in the documentation, in shared mode the buffers get shared (duh) amongst the queues, so as long as only one queue is active, the full bandwidth should be available.

 

Your conclusion, that a higher uplink bandwidth allows a quicker queue build-up, is the one I suggested to the customer earlier this week. I asked him to test it by using the 1G uplink (which showed no issues) and forcing 100M on the access port (sketch below). That way we should have the same 10:1 ratio and maybe can replicate the drops.
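The test idea on the access port, as a sketch (the port number is only an example):

interface GigabitEthernet1/0/10
 speed 100
 duplex full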

 

One other question:

Do you have any effective ways of measuring/verifying whether other network operations are influenced by increasing the queue thresholds?

 

Thanks either way for the feedback; I will get back to labbing this as soon as possible :)

 

Kind regards,

Jochen

Yes, in shared mode any queue, or combination of queues, can use all the available bandwidth. However, on some platforms, when QoS is active, some buffers are actually reserved for port queues and cannot be shared. Such reservations can be changed via configuration.

Okay, it looks like I still have much to learn about the intricacies of QoS.

For now I'll be content with testing out the theory of 10:1 scaling in buffer behaviour. 

 

Mind sharing some more input about my question regarding monitoring?

So far I have only found the statistics output that displays drops per queue (but no reasons or logging messages).

Debugging the QoS features didn't show any results for me.

 

Thanks for your input so far; have a great day ahead.