07-29-2021 08:46 AM
I am tagging all user ingress traffic at their ports via input policy-maps and queueing egress traffic to trunks and user ports with an output policy. I'm seeing a bunch of dropped traffic for certain types of tagged traffic even when there is bandwidth still available on the port.
My output policy-map is this (on our 3650s):
policy-map OUR-QOS-POLICY-OUTPUT
 class OUR-QOS-OUTPUT-PRIORITY-QUEUE-1
  priority level 1
  police rate percent 15
 class OUR-QOS-OUTPUT-PRIORITY-QUEUE-2
  priority level 2
  police rate percent 5
 class OUR-QOS-OUTPUT-QUEUE-6
  !Network/Signaling/AD
  bandwidth remaining percent 25
  queue-buffers ratio 10
 class OUR-QOS-OUTPUT-QUEUE-5
  !Teams/WebEX/Jabber
  bandwidth remaining percent 5
  queue-buffers ratio 10
  queue-limit dscp af43 percent 80
  queue-limit dscp af42 percent 90
  queue-limit dscp af41 percent 100
 class OUR-QOS-OUTPUT-QUEUE-4
  !Mission Critical/SQL/Finance Apps
  bandwidth remaining percent 10
  queue-buffers ratio 10
  queue-limit dscp af33 percent 80
  queue-limit dscp af32 percent 90
  queue-limit dscp af31 percent 100
 class OUR-QOS-OUTPUT-QUEUE-3
  !Net Management
  bandwidth remaining percent 10
  queue-buffers ratio 10
  queue-limit dscp af23 percent 80
  queue-limit dscp af22 percent 90
  queue-limit dscp af21 percent 100
 class OUR-QOS-OUTPUT-QUEUE-2
  !Bulk/Scavenger
  bandwidth remaining percent 5
  queue-buffers ratio 10
  queue-limit dscp values af13 cs1 percent 80
  queue-limit dscp values af12 percent 90
  queue-limit dscp values af11 percent 100
 class class-default
  bandwidth remaining percent 25
  queue-buffers ratio 25
I'm seeing drops in OUR-QOS-OUTPUT-QUEUE-2 for AF13 traffic on a 1Gb user port when the volume of that type gets high (but not 800Mb high), and I see drops on OUR-QOS-OUTPUT-QUEUE-6 when the volume is nowhere near 1Gb.
My question is: if the bandwidth is there, why are those queues dropping traffic? Does the queue-buffers ratio not follow the same logic as the bandwidth, where a queue can use other queues' buffer space when those queues are empty?
I have tried it with "qos queue-softmax-multiplier 1200" both on and off; same thing, dropped traffic.
The other question is about Netflow: if I tag ingress traffic on a 4400 router via an input policy-map, why do the Netflow exports from that same router show the original ToS value of 0 (or whatever it may have been before) rather than the value the policy-map sets? I can only assume I'm stuck with it working that way because of the order in which the router processes the policy-maps and Netflow (hopefully by design for performance reasons rather than by oversight). Is there something I can set that will make Netflow show the traffic as processed by the policy-map?
I confirmed on a device downstream of the router that the traffic does in fact have the correct DSCP tags, but my Netflow analyzer still shows the majority of that traffic as CS0.
Any Ideas?
07-29-2021 11:11 AM - edited 07-29-2021 11:11 AM
"My question is why if the bandwidth is there, are those queues dropping traffic?"
Because, likely, the bandwidth is NOT there.
Drops happen down at the millisecond time scale, while bandwidth usage is often an average taken over many seconds to multiple minutes. Search the Internet on the subject of "micro bursts".
"Is there something I can set that will make Netflow show the traffic processed by the policy-map?"
Possibly. Netflow, I recall, works with the ingress side, which may account for the lack of markings you note. Later Netflow implementations, I also recall, support working with egress, but you need to enable that on the egress interface. I don't recall what the Netflow egress interface configuration command is, but check your IOS manuals for your version.
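If the 4400 is running Flexible NetFlow, I believe the egress attachment looks roughly like the sketch below (the monitor/exporter names, collector address, and interface are made up for illustration; verify the record and command syntax against your IOS-XE documentation). A monitor applied in the output direction should, as I understand it, see the DSCP after your input policy-map has remarked it, but verify that on your platform:

! names and addresses below are only examples
flow exporter FNF-EXPORTER
 destination 192.0.2.10
 transport udp 2055
!
flow monitor FNF-MON-OUT
 record netflow ipv4 original-output
 exporter FNF-EXPORTER
!
! attach in the OUTPUT direction on the egress interface
interface GigabitEthernet0/0/1
 ip flow monitor FNF-MON-OUT output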
08-02-2021 12:28 PM
You would think that if it was a bandwidth issue it would show in other queues, but it looks like it shows up primarily in the queue that is set the smallest; that is why it seems like it's not able to access the buffers of the other queues. Is it possible that the queue-limit goes into effect before the queue tries to use the other queues' available space? Something like: once the queue reaches 90% it uses available space elsewhere, but by then it has already started dropping the traffic above the lowest threshold, which is 80%?
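To illustrate what I mean with made-up numbers (I don't know the real per-queue buffer count): if QUEUE-2's allocation worked out to, say, 120 buffer units, then "queue-limit dscp values af13 cs1 percent 80" would put the AF13/CS1 drop threshold at roughly 0.8 x 120 = 96 units, so that traffic could start tail-dropping as soon as the queue depth passed about 96 units, even if the other queues still had plenty of unused buffer space.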
Service-policy output: OUR-QOS-POLICY-OUTPUT

  queue stats for all priority classes:
    Queueing
    priority level 1
    (total drops) 0
    (bytes output) 737146819

  queue stats for all priority classes:
    Queueing
    priority level 2
    (total drops) 0
    (bytes output) 3583348

  Class-map: OUR-QOS-OUTPUT-PRIORITY-QUEUE-1 (match-any)
    2433842 packets
    Match: dscp ef (46)
    Priority: Strict, Priority Level: 1
    police:
        rate 15 %
        rate 150000000 bps, burst 4687500 bytes
      conformed 430662602 bytes; actions: transmit
      exceeded 0 bytes; actions: drop
      conformed 0000 bps, exceeded 0000 bps

  Class-map: OUR-QOS-OUTPUT-PRIORITY-QUEUE-2 (match-any)
    44552 packets
    Match: dscp cs4 (32) cs5 (40)
    Priority: Strict, Priority Level: 2
    police:
        rate 5 %
        rate 50000000 bps, burst 1562500 bytes
      conformed 3583356 bytes; actions: transmit
      exceeded 0 bytes; actions: drop
      conformed 0000 bps, exceeded 0000 bps

  Class-map: OUR-QOS-OUTPUT-QUEUE-6 (match-any)
    5314615 packets
    Match: dscp cs2 (16) cs3 (24) cs6 (48) cs7 (56)
    Queueing
    (total drops) 0
    (bytes output) 1273500537
    bandwidth remaining 25%
    queue-buffers ratio 10

  Class-map: OUR-QOS-OUTPUT-QUEUE-5 (match-any)
    3156776 packets
    Match: dscp af41 (34) af42 (36) af43 (38)
    Queueing
    queue-limit dscp 34 percent 100
    queue-limit dscp 36 percent 90
    queue-limit dscp 38 percent 80
    (total drops) 0
    (bytes output) 1530405029
    bandwidth remaining 5%
    queue-buffers ratio 10

  Class-map: OUR-QOS-OUTPUT-QUEUE-4 (match-any)
    296755 packets
    Match: dscp af31 (26) af32 (28) af33 (30)
    Queueing
    queue-limit dscp 26 percent 100
    queue-limit dscp 28 percent 90
    queue-limit dscp 30 percent 80
    (total drops) 0
    (bytes output) 162607860
    bandwidth remaining 10%
    queue-buffers ratio 10

  Class-map: OUR-QOS-OUTPUT-QUEUE-3 (match-any)
    18696983 packets
    Match: dscp af21 (18) af22 (20) af23 (22)
    Queueing
    queue-limit dscp 18 percent 100
    queue-limit dscp 20 percent 90
    queue-limit dscp 22 percent 80
    (total drops) 0
    (bytes output) 5279799586
    bandwidth remaining 10%
    queue-buffers ratio 10

  Class-map: OUR-QOS-OUTPUT-QUEUE-2 (match-any)
    82035493 packets
    Match: dscp cs1 (8) af11 (10) af12 (12) af13 (14)
    Queueing
    queue-limit dscp 10 percent 100
    queue-limit dscp 12 percent 90
    queue-limit dscp 14 percent 80
    (total drops) 13995670
    (bytes output) 70412835892
    bandwidth remaining 5%
    queue-buffers ratio 10

  Class-map: class-default (match-any)
    101790019 packets
    Match: any
    Queueing
    (total drops) 0
    (bytes output) 20328511602
    bandwidth remaining 25%
    queue-buffers ratio 25
It's only dropping 1 in 5000 packets in that queue, but none of the other queues are dropping anything.
Any ideas? How would I be able to see if it is in fact bursty traffic?
Is there any documentation on when the queue-limits kick in relative to other queues' space becoming available?
Can queues share unused ratio space from other queues?
If it is the queue-limit percentage kicking in too early, should I try giving this one queue a higher ratio? And will a queue I take that ratio from still be able to use the space if it needs it while the busy queue happens to be idle? Something along the lines of the sketch below is what I'm considering.
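For example (the numbers are just a first guess; this shrinks class-default's ratio to free up buffers for the dropping queue, keeping the total ratio the same):

policy-map OUR-QOS-POLICY-OUTPUT
 class OUR-QOS-OUTPUT-QUEUE-2
  queue-buffers ratio 20
 class class-default
  queue-buffers ratio 15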
08-02-2021 05:10 PM - edited 08-02-2021 05:11 PM
"You would think if it was a bandwidth issue that it would show in other queues . . ."
Well, actually, often not, because one point of QoS is that you prioritize some traffic over other traffic. Basically, when you run out of bandwidth, traffic will queue; it's then a matter of what gets dropped first, which depends on the traffic and your QoS configuration.
". . . it seems like it's not able to access the buffers of the queues."
That may be the case. I'm unsure about the 3650/3850 series, as I've never used them, but the prior 3560/3750 series "reserved" buffers per interface and/or per interface queue. (NB: these reservations are often why, on the 3560/3750 series, overall drops would increase with QoS enabled [at defaults] versus with QoS disabled.)
"How would I be able to see if it is in fact bursty traffic?"
Generally, you need something like a packet sniffer to actually see bursts.
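If you don't have a SPAN session and an external sniffer handy, and if your IOS-XE release supports it, the switch's Embedded Packet Capture can grab traffic on the congested port. Roughly (the capture name, interface, and buffer size are only examples; check the exact syntax for your release):

monitor capture BURST interface GigabitEthernet1/0/1 out
monitor capture BURST match any
monitor capture BURST buffer size 10
monitor capture BURST start
monitor capture BURST stop
monitor capture BURST export flash:burst.pcap

Then open the exported pcap in Wireshark and use an I/O graph at a millisecond interval; low average utilization punctuated by millisecond spikes near line rate is the classic micro-burst signature.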
"Is there any documentation on where the queue-limits kick in when other queue space might be available?"
Might not be exactly what you're looking for, but have you seen this? https://www.cisco.com/c/en/us/support/docs/switches/catalyst-3850-series-switches/200594-Catalyst-3850-Troubleshooting-Output-dr.html
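That document leans on the platform queue commands to show each queue's hard/soft buffer allocation and drop counters. If I recall correctly, the syntax differs by release, roughly along these lines (the interface is just an example):

show platform qos queue config gigabitEthernet 1/0/1
show platform qos queue stats gigabitEthernet 1/0/1

or, on later releases, something like:

show platform hardware fed switch active qos queue config interface gigabitEthernet 1/0/1
show platform hardware fed switch active qos queue stats interface gigabitEthernet 1/0/1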
"If it is the queue-limit percentage kicking in too early should I try giving this one queue a higher ratio? Will one of the queues I take it from be able to use it again if needed and the busy queue doesn't happen to be busy at that time?"
The prior troubleshooting document might help you with these questions. BTW, your 3650 is, more or less, a "less powerful" 3850, but tech notes and such are often published for the "premier" platform of a family of like/similar platforms. (The same used to be true with the 3560 and 3750; for info on the 3560, research the 3750.)