High Output Drops

srpeters18
Level 1

First post in the forum here, really hoping for some assistance.

Here's the background:

Until last year, we were using Cisco 3560Gs as edge switches and 3750Es for distribution at each school site. No QoS was configured. No issues with the Polycom IP phones or the Meraki M34 access points were known or reported. This year (within the last three months) we have replaced all of our switches at all school sites. We're now using 4500Xs for the core at each site and 2960Xs for the edge, all running 10Gbps over SM fiber. We also upgraded our WAN links, provided via a metro Ethernet-type service, to 10Gbps. In addition, we have reduced the VLAN sizes from /16s for everything to a single /24 in each IDF and one /24 for voice, a /23 for APs, and much larger subnets for wireless clients.

All of this has been installed and configured by a VAR. We like our VAR, but have had some issues with the deployment.

Since making the changes above, we've had real problems with QoS. First, all access ports were only able to utilize 25Mbps even though they were hard-wired to 1Gbps ports. This includes the access points - they were trying to share 25Mbps among all users. This was without any congestion on the network. We removed the QoS configs from everything, since we didn't do QoS at all before (when we had 1Gbps uplinks and WAN links, instead of 10Gbps). This led to many problems with the phones. We went back and had the consulting engineer develop a QoS config for the access ports - he applied it to everything. This made the phones work quite well, along with everything else on the wired network. However, the AP ports, which are trunk ports, were having significant problems. We were back to users having trouble connecting or staying connected.

As a result, he (consulting engineer) removed the QoS configs from the AP ports. This hasn't seemed to help, as we're getting complaints of slowness across the district when users are on wireless. Looking at the switch ports, we are seeing very high output drops, even though they shouldn't be congested based on the throughput.

Output from one such interface is below (VLAN 104 is for the access point, the others are for clients):

interface GigabitEthernet1/0/35
description ** Access Points
switchport trunk native vlan 104
switchport trunk allowed vlan 104,2000,2004,2008,2024
switchport mode trunk

GigabitEthernet1/0/35 is up, line protocol is up (connected)
Hardware is Gigabit Ethernet, address is ac7e.8a21.5f23 (bia ac7e.8a21.5f23)
Description: ** Access Points
MTU 1500 bytes, BW 1000000 Kbit/sec, DLY 10 usec,
reliability 255/255, txload 1/255, rxload 1/255
Encapsulation ARPA, loopback not set
Keepalive set (10 sec)
Full-duplex, 1000Mb/s, media type is 10/100/1000BaseTX
input flow-control is off, output flow-control is unsupported
ARP type: ARPA, ARP Timeout 04:00:00
Last input 00:00:13, output 00:00:00, output hang never
Last clearing of "show interface" counters never
Input queue: 0/75/0/0 (size/max/drops/flushes); Total output drops: 1531223
Queueing strategy: fifo
Output queue: 0/40 (size/max)
5 minute input rate 0 bits/sec, 0 packets/sec
5 minute output rate 105000 bits/sec, 88 packets/sec
40200882 packets input, 6068261643 bytes, 0 no buffer
Received 1048121 broadcasts (403306 multicasts)
0 runts, 0 giants, 0 throttles
0 input errors, 0 CRC, 0 frame, 0 overrun, 0 ignored
0 watchdog, 403306 multicast, 0 pause input
0 input packets with dribble condition detected
2456332279 packets output, 862907532559 bytes, 0 underruns
0 output errors, 0 collisions, 4 interface resets
0 unknown protocol drops
0 babbles, 0 late collision, 0 deferred
0 lost carrier, 0 no carrier, 0 pause output
0 output buffer failures, 0 output buffers swapped out

And here's our qos config:

mls qos map cos-dscp 0 8 16 24 32 46 48 56
mls qos srr-queue output cos-map queue 1 threshold 3 5
mls qos srr-queue output cos-map queue 2 threshold 3 3 6 7
mls qos srr-queue output cos-map queue 3 threshold 3 2 4
mls qos srr-queue output cos-map queue 4 threshold 2 1
mls qos srr-queue output cos-map queue 4 threshold 3 0
mls qos srr-queue output dscp-map queue 1 threshold 3 40 41 42 43 44 45 46 47
mls qos srr-queue output dscp-map queue 2 threshold 3 24 25 26 27 28 29 30 31
mls qos srr-queue output dscp-map queue 2 threshold 3 48 49 50 51 52 53 54 55
mls qos srr-queue output dscp-map queue 2 threshold 3 56 57 58 59 60 61 62 63
mls qos srr-queue output dscp-map queue 3 threshold 3 16 17 18 19 20 21 22 23
mls qos srr-queue output dscp-map queue 3 threshold 3 32 33 34 35 36 37 38 39
mls qos srr-queue output dscp-map queue 4 threshold 1 8
mls qos srr-queue output dscp-map queue 4 threshold 2 9 10 11 12 13 14 15
mls qos srr-queue output dscp-map queue 4 threshold 3 0 1 2 3 4 5 6 7
mls qos queue-set output 1 threshold 1 138 138 92 138
mls qos queue-set output 1 threshold 2 138 138 92 400
mls qos queue-set output 1 threshold 3 36 77 100 318
mls qos queue-set output 1 threshold 4 20 50 67 400
mls qos queue-set output 2 threshold 1 149 149 100 149
mls qos queue-set output 2 threshold 2 118 118 100 235
mls qos queue-set output 2 threshold 3 41 68 100 272
mls qos queue-set output 2 threshold 4 42 72 100 242
mls qos queue-set output 1 buffers 10 10 26 54
mls qos queue-set output 2 buffers 16 6 17 61
mls qos

This is a standard access port interface:

interface GigabitEthernet2/0/48
description ** Hostport
switchport access vlan 1022
switchport mode access
switchport voice vlan 1110
srr-queue bandwidth share 10 10 60 20
queue-set 2
priority-queue out
snmp trap mac-notification change added
snmp trap mac-notification change removed
mls qos trust cos
auto qos voip trust
spanning-tree portfast

This is the uplink interface:

interface TenGigabitEthernet2/0/1
description ** PO to Core
switchport mode trunk
switchport nonegotiate
priority-queue out
mls qos trust dscp
channel-group 1 mode active

I, admittedly, know very little about QoS. The guy configuring it for us was a CCIE. Still, something strikes me as wrong about our QoS config. We have Polycom phones that mark packets the same way as Cisco phones. We don't have any voice traffic over wireless as we don't even configure voice applications in our VoIP environment (no Jabber or mobility equivalents). 

Users are extremely frustrated because we've been promising that the upgrades would be worth the hassle as we move all of the backbone hardware to 10Gbps. Instead, it's slower and they have more problems than with the old infrastructure.

5 Replies

Philip D'Ath
VIP Alumni

I feel like this will be a haystack, and we are looking for a needle.

Try removing the QoS policies on the 10Gb/s switch links as well for a test.  Are you getting packet drops on those circuits as well?
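For the test, on the edge uplink you posted, that would look something like the following (a sketch only, using your interface name; adjust to suit):

interface TenGigabitEthernet2/0/1
 no priority-queue out
 no mls qos trust
!
clear counters TenGigabitEthernet2/0/1

That simply negates the QoS commands shown on that port and re-baselines the counters so you can see whether the drops continue.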

The output drops are something very tangible.  So maybe we should start by trying to understand them.  Then we can build on our knowledge of the fault.  The link below is a guide to working out what the drops are in a QoS environment (assuming they are occurring because a QoS queue is being overwhelmed).

http://www.cisco.com/c/en/us/support/docs/switches/catalyst-3750-series-switches/116089-technote-switches-output-drops-qos-00.html

Once you have determined which queue/QoS marking is taking the drops, try to determine what traffic is in that queue.  Is the traffic meant to be there (perhaps traffic is not being marked correctly)?  If the traffic in the queue is correct, then the QoS queue configuration probably needs to be adjusted.
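On the 2960X/3750 platforms, the commands from that document are along these lines (using the AP port from your first post as an example):

show mls qos interface gigabitEthernet 1/0/35 statistics
show mls qos maps cos-output-q
show mls qos maps dscp-output-q
show mls qos queue-set

The statistics output shows enqueued/dropped counts per egress queue and threshold, and the map outputs tell you which markings land in which queue, so you can line the two up.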

There is also a chance you have a unicast/multicast/unknown flood happening.  To test this idea, plug a notebook into an ordinary network port.  Capture the traffic for 5 minutes.  Try not to generate any traffic on the notebook.  Once done, look at the top traffic types.  Is it broadcast or multicast traffic?  Or is it traffic to a destination the switch hasn't learned (unknown unicast, which gets flooded)?  Ideally you should only see traffic destined for your machine and not too much else.
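If you want to see exactly what the switch is sending towards one of the AP ports, you could also mirror that port to the notebook with a SPAN session instead; a sketch, where Gi1/0/10 is just a hypothetical spare port the notebook is plugged into:

monitor session 1 source interface gigabitEthernet 1/0/35 tx
monitor session 1 destination interface gigabitEthernet 1/0/10

Remove it with "no monitor session 1" when you are done.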

I am also concerned about the 10Gb/s backhaul links.  When you say users are experiencing poor throughput - is this over the Metro Ethernet circuits back to some data centre, or is it also within the same site?

First, thank you very much for responding. 

I found some additional information. First, the QoS configs are apparently only on the uplink port on the edge switches. The core 4500X switch at the school site does not have QoS on its uplink ports; it does have it on the WAN uplink. Of the two port configs below, the first is the uplink from the core to the edge switch, the second is the VPLS link. Below that, I've included the service-policy configs on the site core.

interface TenGigabitEthernet2/3
description ** PO to AHS-718IDF-01
switchport trunk allowed vlan 1,2,11,100-104,113-118,200,1022,1110,2000,2004
switchport trunk allowed vlan add 2008,2024
switchport mode trunk
switchport nonegotiate
channel-group 4 mode active

AHS-MDF-4500X-CORE#sh run int t2/8
Building configuration...

Current configuration : 191 bytes
!
interface TenGigabitEthernet2/8
description ** VPLS
no switchport
ip address 10.253.1.31 255.255.255.0
ip directed-broadcast 102
speed nonegotiate
service-policy output qos-OUT

This is the only QoS configuration on the core switch:

class-map match-any qos-out-SCAVENGER
match ip dscp cs1
class-map match-any qos-out-CALL-SIGNALING
match ip dscp cs3
class-map match-any qos-out-VOICE-BEARER
match ip dscp ef
!
policy-map qos-OUT
class qos-out-VOICE-BEARER
priority
class qos-out-CALL-SIGNALING
bandwidth remaining percent 2
class qos-out-SCAVENGER
bandwidth remaining percent 1
class class-default
bandwidth remaining percent 97

Here's the output from the commands recommended in the document you linked.

AHS-718IDF-01#sh mls qos int g1/0/15 stat
GigabitEthernet1/0/15 (All statistics are in packets)

dscp: incoming
-------------------------------

0 - 4 : 14020868 0 1 0 24955
5 - 9 : 0 0 0 231633 0
10 - 14 : 1 0 0 0 0
15 - 19 : 0 2 0 0 0
20 - 24 : 2 0 0 0 0
25 - 29 : 0 654 0 0 0
30 - 34 : 0 0 0 0 0
35 - 39 : 0 0 0 0 0
40 - 44 : 0 0 0 0 0
45 - 49 : 0 4312 0 13939 0
50 - 54 : 0 0 0 0 0
55 - 59 : 0 903 0 0 0
60 - 64 : 0 0 0 0
dscp: outgoing
-------------------------------

0 - 4 : 14469861 0 0 0 24
5 - 9 : 0 0 0 110167616 0
10 - 14 : 0 0 0 0 0
15 - 19 : 0 16640 0 0 0
20 - 24 : 0 0 0 0 0
25 - 29 : 0 9482 0 0 0
30 - 34 : 0 0 0 0 0
35 - 39 : 0 0 0 0 0
40 - 44 : 0 0 0 0 0
45 - 49 : 0 15081 0 2109813 0
50 - 54 : 0 0 0 0 0
55 - 59 : 0 0 0 0 0
60 - 64 : 0 0 0 0
cos: incoming
-------------------------------

0 - 4 : 74838467 0 0 0 0
5 - 7 : 0 0 0
cos: outgoing
-------------------------------

0 - 4 : 1813649360 82329468 527303 65759 238445
5 - 7 : 84576 2171128 8296670
output queues enqueued:
queue: threshold1 threshold2 threshold3
-----------------------------------------------
queue 0: 0 0 15081
queue 1: 16987276 62682246 10018924
queue 2: 0 0 16612
queue 3: 82262545 0 1749434246

output queues dropped:
queue: threshold1 threshold2 threshold3
-----------------------------------------------
queue 0: 0 0 0
queue 1: 21341 1 0
queue 2: 0 0 0
queue 3: 2306141 0 2381

Policer: Inprofile: 0 OutofProfile: 0

AHS-718IDF-01#sh mls qos maps dscp-output-q
Dscp-outputq-threshold map:
d1 :d2 0 1 2 3 4 5 6 7 8 9
------------------------------------------------------------
0 : 04-03 04-03 04-03 04-03 04-03 04-03 04-03 04-03 04-01 04-02
1 : 04-02 04-02 04-02 04-02 04-02 04-02 03-03 03-03 03-03 03-03
2 : 03-03 03-03 03-03 03-03 02-03 02-03 02-03 02-03 02-03 02-03
3 : 02-03 02-03 03-03 03-03 03-03 03-03 03-03 03-03 03-03 03-03
4 : 01-03 01-03 01-03 01-03 01-03 01-03 01-03 01-03 02-03 02-03
5 : 02-03 02-03 02-03 02-03 02-03 02-03 02-03 02-03 02-03 02-03
6 : 02-03 02-03 02-03 02-03


AHS-718IDF-01#sh mls qos que
AHS-718IDF-01#sh mls qos queue-set
Queueset: 1
Queue : 1 2 3 4
----------------------------------------------
buffers : 10 10 26 54
threshold1: 138 138 36 20
threshold2: 138 138 77 50
reserved : 92 92 100 67
maximum : 138 400 318 400
Queueset: 2
Queue : 1 2 3 4
----------------------------------------------
buffers : 16 6 17 61
threshold1: 149 118 41 42
threshold2: 149 118 68 72
reserved : 100 100 100 100
maximum : 149 235 272 242

I really am unfamiliar with QoS, so if there's something here that's glaring, please feel free to point it out. To my untrained eye, it looks like we're only using two of the four queues available. Since there aren't any mls qos commands on the access point port, however, I would have assumed that it would divide all traffic across all four queues, which does not appear to be the case.
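If it would help, I can also grab the output of these for the AP port; I believe they show the trust state, queue-set, and SRR settings actually in effect on the interface, even with no per-port mls qos commands configured:

show mls qos interface gigabitEthernet 1/0/35
show mls qos interface gigabitEthernet 1/0/35 queueing
show mls qos interface gigabitEthernet 1/0/35 buffers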

I can get to the site with a laptop to do a capture tomorrow afternoon, at the earliest.

We only seem to be having problems with throughput at the site, and really only on the links that have access points on them. I ran a quick speed test from speedtest.net this afternoon from a server plugged into the core switch at one site, and was getting 1.2 Gbps across the WAN. I think this was a limitation of the remote web server and not the WAN link.

So to be crystal clear, the performance issue is only on the WiFi connections?

If a single person is connected to the WiFi access point does the same problem happen?

Being Meraki, you definitely have not enabled SSID or per-user bandwidth shaping?

Yes, the issues we are seeing are only related to the ports with access points plugged into them. Those ports have no QoS config applied.

We do have traffic shaping enabled per SSID so that multiple users don't overwhelm the 1Gbps access port, but we're seeing far less actual throughput than the throttles are set to. It's better when one or two people are associated with the AP, but still not great.

While I was away from the office yesterday, one of my network admins decided to test some configuration changes on our local switch. When he removed the QoS configuration from the uplink port on the IDF in my office, we saw immediate improvement in output drops. We reset the counters on several of our access point ports. After almost 20 hours, we still have zero drops on the uplink interface (so it hasn't seen a performance degradation). On the access point ports, though, we've seen vast improvement. Most of them have zero drops over the last 20 hours; the worst one has just over 1200.
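For reference, this is all we're doing to watch it (a sketch using one of our AP ports; repeated per port):

clear counters GigabitEthernet1/0/35
show interfaces GigabitEthernet1/0/35 | include Total output drops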

I'm going to monitor that for a few days and see what the effects on our local network are. To me, again having a very basic understanding of QoS, it seems that QoS is not set up properly. 

We are also working with our VAR to get another engineer to take a look at our configuration to help resolve these issues.

Joseph W. Doherty
Hall of Fame

Disclaimer

The Author of this posting offers the information contained within this posting without consideration and with the reader's understanding that there's no implied or expressed suitability or fitness for any purpose. Information provided is for informational purposes only and should not be construed as rendering professional advice of any kind. Usage of this posting's information is solely at reader's own risk.

Liability Disclaimer

In no event shall Author be liable for any damages whatsoever (including, without limitation, damages for loss of use, data or profit) arising out of the use or inability to use the posting's information even if Author has been advised of the possibility of such damage.

Posting

I suspect there are several factors contributing to the drops you're seeing now that you hadn't seen before.

First, moving to 10g makes it easier for aggregated traffic to overrun egress gig ports.

Second, 2960X edge switches might only provide 2 MB of buffer RAM for all the device's ports, whereas your former 3Ks provide 2 MB of buffer RAM for each bank of 24 copper ports and for the 2 or 4 uplink ports.

Third, 2960 and 3750 QoS, when enabled vs. disabled, is quite different in its buffer management.  It's very easy to run out of buffers when QoS is enabled because buffers can be "reserved" to interfaces, even when unneeded.  On 3Ks, I have had much success with custom buffer settings that allow the "common pool" to have almost all the buffer RAM and that greatly increase the drop thresholds per interface.  This allows individual ports to handle transient bursts well.  (It would be problematic if you have sustained congestion or lots of concurrent port bursts.)
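As a sketch only (the numbers are illustrative, not a recommendation for your hardware; if your software won't accept 3200, use whatever maximum it allows, e.g. 400), that kind of tuning looks like:

! minimize per-port reserved buffers so the common pool holds most of the RAM,
! and raise the drop/maximum thresholds so a port can borrow heavily for bursts
mls qos queue-set output 1 buffers 25 25 25 25
mls qos queue-set output 1 threshold 1 3200 3200 1 3200
mls qos queue-set output 1 threshold 2 3200 3200 1 3200
mls qos queue-set output 1 threshold 3 3200 3200 1 3200
mls qos queue-set output 1 threshold 4 3200 3200 1 3200

The threshold syntax is queue-set, then queue number, then drop threshold 1, drop threshold 2, reserved, and maximum, all as percentages of the queue's allocated buffers.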
