
Cisco 9300 Output Drops and QoS

Jan Gilhooley
Level 1

 

I really hope someone can point me in the right direction on this one – in one form or another it's a longstanding problem and it's causing me quite a headache.

 

We are having problems with poor video conferencing performance in various parts of the network. Overall bandwidth is absolutely fine (both internally and on our internet connection), but I do see output drops on various client interfaces on the edge switches. Like many places, we are relying on video conferencing much more than we did in the past, so this is causing some operational difficulties.

 

The example in question is a stack of six Cisco 9300 switches running IOS-XE 16.12.3a (CAT9K_IOSXE).

 

This is the switch port:

TwoGigabitEthernet1/0/14 is up, line protocol is up (connected)

  Hardware is Two Gigabit Ethernet, address is dcf7.1951.ab0e (bia dcf7.1951.ab0e)

  MTU 1500 bytes, BW 100000 Kbit/sec, DLY 100 usec,

     reliability 255/255, txload 1/255, rxload 1/255

  Encapsulation ARPA, loopback not set

  Keepalive set (10 sec)

  Full-duplex, 100Mb/s, media type is 100/1000/2.5GBaseTX

  input flow-control is on, output flow-control is unsupported

  ARP type: ARPA, ARP Timeout 04:00:00

  Last input 00:00:18, output 00:00:00, output hang never

  Last clearing of "show interface" counters 23:39:36

  Input queue: 0/2000/0/0 (size/max/drops/flushes); Total output drops: 5164

  Queueing strategy: Class-based queueing

  Output queue: 0/40 (size/max)

  5 minute input rate 0 bits/sec, 0 packets/sec

  5 minute output rate 42000 bits/sec, 8 packets/sec

     418249 packets input, 111499295 bytes, 0 no buffer

     Received 3190 broadcasts (2696 multicasts)

     0 runts, 0 giants, 0 throttles

     0 input errors, 0 CRC, 0 frame, 0 overrun, 0 ignored

     0 watchdog, 2696 multicast, 0 pause input

     0 input packets with dribble condition detected

     1003190 packets output, 621482209 bytes, 0 underruns

     Output 81177 broadcasts (0 multicasts)

     0 output errors, 0 collisions, 0 interface resets

     0 unknown protocol drops

     0 babbles, 0 late collision, 0 deferred

     0 lost carrier, 0 no carrier, 0 pause output

     0 output buffer failures, 0 output buffers swapped out

SW#

 

And its config:

SW#sho run int Tw1/0/14                              

Building configuration...

 

Current configuration : 605 bytes

!

interface TwoGigabitEthernet1/0/14

 switchport access vlan 31

 switchport mode access

 switchport voice vlan 232

 switchport port-security maximum 2

 switchport port-security violation restrict

 switchport port-security aging time 2

 switchport port-security aging type inactivity

 switchport port-security

 device-tracking attach-policy IPDT_MAX_10

 trust device cisco-phone

 auto qos voip cisco-phone

 macro description cisco-phone

 spanning-tree portfast

 spanning-tree bpduguard enable

 service-policy input AutoQos-4.0-CiscoPhone-Input-Policy

 service-policy output AutoQos-4.0-Output-Policy

end

SW#

 

I'm pretty sure that we have a QoS problem. Generally, AutoQoS has been turned on for the edge switches and configured with an "auto qos voip cisco-phone" statement (we of course have a Cisco VoIP system). When I look into the QoS queues I get something like this:

 

SW#sho platform hardware fed switch 1 qos queue stats interface tw1/0/14

----------------------------------------------------------------------------------------------

AQM Global counters

GlobalHardLimit:  4076   |   GlobalHardBufCount: 0

GlobalSoftLimit: 15772   |   GlobalSoftBufCount: 0

 

----------------------------------------------------------------------------------------------

Asic:1 Core:0 DATA Port:13 Hardware Enqueue Counters

----------------------------------------------------------------------------------------------

 Q Buffers          Enqueue-TH0          Enqueue-TH1          Enqueue-TH2             Qpolicer

   (Count)              (Bytes)              (Bytes)              (Bytes)              (Bytes)

-- ------- -------------------- -------------------- -------------------- --------------------

 0       0                    0             89608573             62651856                    0

 1       0                    0               236854             23173730                    0

 2       0                    0                    0                    0                    0

 3       0                    0                    0                    0                    0

 4       0                    0                    0                    0                    0

 5       0                    0                    0                    0                    0

 6       0                    0                    0                    0                    0

 7       0                    0                    0           3955932288                    0

Asic:1 Core:0 DATA Port:13 Hardware Drop Counters

--------------------------------------------------------------------------------------------------------------------------------

 Q             Drop-TH0             Drop-TH1             Drop-TH2             SBufDrop              QebDrop         QpolicerDrop

                (Bytes)              (Bytes)              (Bytes)              (Bytes)              (Bytes)              (Bytes)

-- -------------------- -------------------- -------------------- -------------------- -------------------- --------------------

 0                    0                    0                    0                    0                    0                    0

 1                    0                    0                    0                    0                    0                    0

 2                    0                    0                    0                    0                    0                    0

 3                    0                    0                    0                    0                    0                    0

 4                    0                    0                    0                    0                    0                    0

 5                    0                    0                    0                    0                    0                    0

 6                    0                    0                    0                    0                    0                    0

 7                    0                    0             32468215                    0                    0                    0

SW#

 

This looks like most traffic is hitting the policy-map "class-default" and being placed in Q7. However, when I look at the queue configuration, it appears that 500 buffers are allocated to Q7, which is carrying most of the traffic (whereas a non-AutoQoS port shows 800 buffers in Q0 and 1200 in Q1 for all of the traffic):

SW#sho platform hardware fed switch 1 qos queue config interface tw1/0/14

Asic:1 Core:0 DATA Port:13 GPN:14 LinkSpeed:0x1

AFD:Disabled FlatAFD:Disabled QoSMap:0 HW Queues: 104 - 111

  DrainFast:Disabled PortSoftStart:5 - 750

   DTS  Hardmax  Softmax   PortSMin  GlblSMin  PortStEnd

  ----- --------  --------  --------  --------  ---------

 0   1  4    75   9    75   0     0   0     0   6  1000

 1   1  0     0  10   120  32   120  13    48   6  1000

 2   1  0     0  11   200  19   118   8    50   6  1000

 3   1  0     0  11   200  19   118   8    50   6  1000

 4   1  0     0  11   200  19   118   8    50   6  1000

 5   1  0     0  11   200  19   118   8    50   6  1000

 6   1  0     0  11   200  19   118   8    50   6  1000

 7   1  0     0  12   500  19   296   8   125   6  1000

 Priority   Shaped/shared   weight  shaping_step  sharpedWeight

 --------   -------------   ------  ------------   -------------

 0      1     Shaped          8500         255           0

 1      7     Shared           125           0           0

 2      7     Shared           125           0           0

 3      7     Shared           125           0           0

 4      7     Shared           312           0           0

 5      7     Shared          1250           0           0

 6      7     Shared           125           0           0

 7      7     Shared            50           0           0

 Port       Port            Port    Port

 Priority   Shaped/shared   weight  shaping_step

 --------   -------------   ------  ------------

        2     Shaped          2560         255

 

   Weight0 Max_Th0 Min_Th0 Weigth1 Max_Th1 Min_Th1  Weight2 Max_Th2 Min_Th2

   ------- ------- ------- ------- ------- -------  ------- ------- ------

 0       0     119       0       0     133       0       0     150       0

 1       0      95       0       0     106       0       0     120       0

 2       0     159       0       0     178       0       0     200       0

 3       0     159       0       0     178       0       0     200       0

 4       0     159       0       0     178       0       0     200       0

 5       0     159       0       0     178       0       0     200       0

 6       0     159       0       0     178       0       0     200       0

 7       0     398       0       0     445       0       0     500       0

SW#

 

So what I *think* I am seeing here is packet drops because Threshold2 on Q7 (500 buffers) is being exceeded – and no packets are being matched to queues 2–6. Is my reading of this correct?

 

While looking around I've also noticed packet drops due to a control-plane policer:

 

SW#sho pl ha fed sw 1 qo qu st intern cpu pol

 

                         CPU Queue Statistics                 

============================================================================================

                                              (default) (set)     Queue        Queue

QId PlcIdx  Queue Name                Enabled   Rate     Rate      Drop(Bytes)  Drop(Frames)

--------------------------------------------------------------------------------------------

<snip>

21   13     LOGGING                     Yes     1000      1000     5024815      5931      

22   7      Punt Webauth                Yes     1000      1000     0            0         

23   18     High Rate App               Yes     13000     13000    2341704124   2524214   

<snip>

 

* NOTE: CPU queue policer rates are configured to the closest hardware supported value

<snip>

#

 

My questions on this are:

  • Is the packet drop due to the “High Rate App” CPU queue rate of 13000 being exceeded separate from the Output drops I see on the edge ports?
  • Is this a “bad rate” or “normal” or whatever? Basically – should I worry about it?
  • Am I right in thinking that because this edge switch is operating at L2, all packets have to be sent (punted?) to the switch CPU for processing (L2 access mode on the client ports, a trunk port uplinking to the distribution switch that holds the VLAN SVIs, routing, etc.)? I.e. no CEF or similar? When I do sho interface stats, all traffic (except for the switch management IP on its management VLAN) shows as "Processor" under "Switching Path" – and only as "Pkts Out". I assume that means "send all packets to the switch CPU".

 

This is the first time I'm having to really look at and understand what is happening at a QoS level – it's certainly deep stuff. All I know is we have multiple very unhappy users who have problems with MS Teams (and other) video calls!

 

Thanks

 

Jan


9 Replies

Joseph W. Doherty
Hall of Fame

First, two disclaimers: my experience with Cisco edge switches more-or-less stops with the 3750-X series, and I haven't fully analyzed the information you've posted.

That said, starting with the 3750, and on the 2K and 3K Catalyst (user) edge switches that followed, and perhaps on the Catalyst 9300, Cisco's default QoS approach has been to "reserve" buffers to ports and to set buffer limits perhaps a tad low. So, these switches often "appear" to have an issue with high egress port drops, especially with bursty traffic.

The plain device defaults seem to be especially troublesome; Auto-QoS configurations, though, seem to be somewhat better. However, some adjustment of buffer settings can sometimes provide a huge reduction in output drops. (For example, I had a case where a couple of ports [on a 3750] were taking multiple drops per second, but after changing some buffer settings, those same ports only had a few drops per day!)

The first setting you normally want to increase is the logical buffer limits. I've often gone to max values.

The second setting (if applicable on a 9300) is to change how buffers are reserved. Rather than "reserving" them to the port, I decrease the port reservations to allow the buffer resources to go into the "shared" buffer pool. (The latter is more the QoS architecture approach of the Catalyst 3500 series, which often seems, to me, a better approach.)
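To illustrate on a 3750 (not the 9300 syntax, and with purely illustrative queue and interface values), the two adjustments described above would look roughly like this:

! raise the logical drop/max thresholds on one of the queue-set's queues and
! shrink its per-port reservation so more buffers fall into the shared pool
! (threshold <queue> <drop-th1%> <drop-th2%> <reserved%> <max%>)
mls qos queue-set output 1 threshold 3 3100 3100 50 3200
!
interface GigabitEthernet1/0/14
 queue-set 1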

There's more you might do, but further changes become a bit exotic.

Again, as I'm unfamiliar with the 9300, I'm unable to provide the actual commands to effect the above for that platform, but others will likely suggest similar changes. (If no one else provides them, post again to this thread and I'll research the commands for your platform.)

Leo Laohoo
Hall of Fame
qos queue-softmax-multiplier 1200

Add this.
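A minimal sketch of applying it and then re-checking the queue allocations, using the same show command as earlier in this thread:

SW(config)#qos queue-softmax-multiplier 1200
SW(config)#end
SW#show platform hardware fed switch 1 qos queue config interface tw1/0/14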

I believe Leo's recommendation addresses my first point. I.e., try it.

Hi,

 

Thanks for the responses - is there a simple explanation of what the qos queue-softmax-multiplier 1200 setting does?

 

On a test switch with a port just set up with AutoQoS it looks like this:

 

BLANKSWITCH#sho pl ha fe sw 1 q q con interface gi1/0/3
Asic:0 Core:1 DATA Port:2 GPN:3 LinkSpeed:0x1
AFD:Disabled FlatAFD:Disabled QoSMap:0 HW Queues: 16 - 23
  DrainFast:Disabled PortSoftStart:5 - 750
   DTS  Hardmax  Softmax   PortSMin  GlblSMin  PortStEnd
  ----- --------  --------  --------  --------  ---------
 0   1  4    75  12    75   0     0   0     0   6  1000
 1   1  0     0   9   120  32   120  13    48   6  1000
 2   1  0     0  10   200  19   118   8    50   6  1000
 3   1  0     0  10   200  19   118   8    50   6  1000
 4   1  0     0  10   200  19   118   8    50   6  1000
 5   1  0     0  10   200  19   118   8    50   6  1000
 6   1  0     0  10   200  19   118   8    50   6  1000
 7   1  0     0  13   500  19   296   8   125   6  1000
<snip>

When I enter qos queue-softmax-multiplier 1200 the same port looks like:

BLANKSWITCH#sho pl ha fe sw 1 q q con interface gi1/0/3
Asic:0 Core:1 DATA Port:2 GPN:3 LinkSpeed:0x1
AFD:Disabled FlatAFD:Disabled QoSMap:0 HW Queues: 16 - 23
  DrainFast:Disabled PortSoftStart:2 - 9000
   DTS  Hardmax  Softmax   PortSMin  GlblSMin  PortStEnd
  ----- --------  --------  --------  --------  ---------
 0   1  4    75   9    75   0     0   0     0   5 12000
 1   1  0     0   4   600   6   112   3    56   5 12000
 2   1  0     0   5  2400   2   150   1    75   5 12000
 3   1  0     0   5  2400   2   150   1    75   5 12000
 4   1  0     0   5  2400   2   150   1    75   5 12000
 5   1  0     0   5  2400   2   150   1    75   5 12000
 6   1  0     0   5  2400   2   150   1    75   5 12000
 7   1  0     0  10  6000   2   375   1   187   5 12000
<snip>

 

It really looks like I'm getting something for nothing there! I'm obviously a little wary of applying that to a live switch stack without more understanding of what it's doing and the implications of doing so. Seems too good to be true somehow :-)

 

Jan

 

It extends the logical buffer limits that can be used.

I didn't find a Cisco document for the 9300, but this document, for the earlier 3850, may help explain.
https://www.cisco.com/c/en/us/support/docs/switches/catalyst-3850-series-switches/200594-Catalyst-3850-Troubleshooting-Output-dr.html

 

PS:

It's sort of like the ISR interface command, queue-limit.
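For reference, on an ISR that knob normally lives under an MQC policy class rather than directly on the interface; a rough sketch with illustrative names and values:

policy-map EDGE-OUT
 class class-default
  fair-queue
  ! deeper per-flow queues to absorb bursts (value illustrative)
  queue-limit 512 packets
!
interface GigabitEthernet0/0/1
 service-policy output EDGE-OUT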

 

Also, it's not free, but usually you only benefit from it and don't experience the negatives.  (If that's true, you may be wondering why Cisco doesn't make it the default.  Well, Cisco is very conservative, and if you do bump into the negatives, they can be very bad.  So, Cisco plays it safe and uses default parameters that may not help as much as they often could with larger allocations, but conversely, they avoid really bad situations.  [Now you may be wondering about those "really bad situations" – yes, they exist, but unless you're running all your ports with heavy traffic concurrently, they're not very likely.  This was for end-user hosts, correct?])

Hi @Leo Laohoo 

 

According to the 3850 documentation, this command will only come into effect if the interface has a QoS service policy attached.

 

Do you know if this is also the case for Cat 9300?

 

The customer doesn't have a QoS config on the switch and I don't want to start doing QoS on top of all of this. In the long run it might be possible to implement AutoQoS for this switch - but for now I want to know if I will benefit by simply adding this global command.

 

Will it reboot the switch (or require a reboot)?

 


@Arne Bier wrote:

Do you know if this is also the case for Cat 9300?


The command is present in all IOS-XE releases: the 9300 has QoS enabled globally and there is no way to remove it.


@Arne Bier wrote:

Will it reboot the switch (or require a reboot)?


Only if you want to clear the counters or if the switch is hitting CSCvd38417.

NOTE:  

  • The command is very "version specific".  Switches running 3.X.X will not benefit from this command.
  • CSCvg89791 - Configuring "qos queue-softmax-multiplier" causes stackwise-virtual members to split or crash
  • CSCvs20038 - qos softmax setting doesn't take effect on Catalyst switch in Openflow mode

 

 

 

Joseph W. Doherty
Hall of Fame

"Am I right in thinking that because this edge switch is operating at L2 all packets have to be sent (punted?) to the switch CPU for processing . . ."

Generally, L2 frames don't ever need the CPU's attention – transit traffic is switched in hardware, and only control-plane traffic is punted to the CPU.

BTW, in general, real-time video's service requirements are much like VoIP bearer traffic's, i.e. minimal latency and jitter and no drops, but real-time video is often much more bandwidth intensive and often has much more variability in its bandwidth demand.

Because of the growth of real-time video, Cisco's later QoS implementations often have two PQ levels: one for VoIP bearer traffic and a second for real-time video.  This is to protect VoIP from the real-time video.
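A generic MQC sketch of that two-level idea (class names and values are illustrative, and this is not the AutoQoS policy applied on this switch):

policy-map TWO-LEVEL-PQ
 ! level-1 PQ drained first: VoIP bearer traffic
 class VOICE
  priority level 1
 ! level-2 PQ: real-time video, serviced after level 1
 class REALTIME-VIDEO
  priority level 2
 class class-default
  bandwidth remaining percent 100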

 
