Solved: 3850 Output Queue Drops

Otaku78 · ‎02-06-2018

Hi all, I'm experiencing an issue with a WS-C3850-24S switch where a small percentage of packets are being dropped from the output queues on about half of the trunks. I've done some research and found that it can be caused by an ingress interface that offers much more bandwidth than the egress interface. To be honest, I'm not entirely sure this is what's causing it.

Ingress on this switch is a 2 x 10Gb EtherChannel trunk. Firmware version is 03.06.06E.

I've also tried some of the fixes described in these forums for changing the queue parameters, but I have been unsuccessful in making those changes. I would really appreciate it if someone could help me out!

Trunk Interface stats:

reliability 255/255, txload 4/255, rxload 1/255

Encapsulation ARPA, loopback not set
Keepalive not set
Full-duplex, 1000Mb/s, link type is auto, media type is 1000BaseBX-10U SFP
input flow-control is off, output flow-control is unsupported
ARP type: ARPA, ARP Timeout 04:00:00
Last input 00:00:18, output never, output hang never
Last clearing of "show interface" counters 3w5d
Input queue: 0/2000/0/0 (size/max/drops/flushes); Total output drops: 959239
Queueing strategy: fifo
Output queue: 0/40 (size/max)
30 second input rate 6068000 bits/sec, 2450 packets/sec
30 second output rate 18567000 bits/sec, 1655 packets/sec
     147799652 packets input, 78574317037 bytes, 0 no buffer
     Received 405149 broadcasts (136421 multicasts)
     0 runts, 0 giants, 0 throttles
     0 input errors, 0 CRC, 0 frame, 0 overrun, 0 ignored
     0 watchdog, 136421 multicast, 0 pause input
     0 input packets with dribble condition detected

     237563053 packets output, 252406070491 bytes, 0 underruns
     959239 output errors, 0 collisions, 0 interface resets

QoS Queue stats:

-------------------------------
Queue Buffers Enqueue-TH0 Enqueue-TH1 Enqueue-TH2
----- ------- ----------- ----------- -----------
    0       0           0           0 1192754861
    1       0           0           0   664465155
    2       0           0           0           0
    3       0           0           0           0
    4       0           0           0           0
    5       0           0           0           0
    6       0           0           0           0
    7       0           0           0           0
DATA Port:21 Drop Counters
-------------------------------
Queue Drop-TH0    Drop-TH1    Drop-TH2    SBufDrop    QebDrop
----- ----------- ----------- ----------- ----------- -----------
    0           0           0           0           0           0
    1           0           0     3619155           0           0
    2           0           0           0           0           0
    3           0           0           0           0           0
    4           0           0           0           0           0
    5           0           0           0           0           0
    6           0           0           0           0           0
    7           0           0           0           0           0
AQM Broadcast Early WTD COUNTERS(In terms of Bytes)
--------------------------------------------------
PORT TYPE          ENQUEUE             DROP
--------------------------------------------------
UPLINK PORT-0        N/A               0
UPLINK PORT-1        N/A               0
UPLINK PORT-2        N/A               0
UPLINK PORT-3        N/A               0
NETWORK PORTS    21024980          140441674
RCP PORTS               0                  0
CPU PORT                0                  0
Note: Queuing stats are in bytes
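To put "a small percentage" into numbers, here is a rough Python sketch using only the counters pasted above. Two things are assumptions: whether "Total output drops" counts packets or bytes on this release (both readings are computed), and that the enqueue and drop tables describe the same port.

```python
# Counters copied from the outputs in the post above.
total_output_drops = 959_239
packets_output = 237_563_053
bytes_output = 252_406_070_491
q1_drop_th2 = 3_619_155
q1_enqueue_th2 = 664_465_155


def percent(part: int, whole: int) -> float:
    """Return part as a percentage of whole."""
    return 100.0 * part / whole


# Reading 1: the interface drop counter counts packets.
drops_vs_packets = percent(total_output_drops, packets_output)
# Reading 2 (assumption): the interface drop counter counts bytes.
drops_vs_bytes = percent(total_output_drops, bytes_output)
# Queue 1: Drop-TH2 relative to Enqueue-TH2 (assumes both tables are for the same port).
q1_drops_vs_enqueued = percent(q1_drop_th2, q1_enqueue_th2)

print(f"drops / packets output: {drops_vs_packets:.3f}%")
print(f"drops / bytes output:   {drops_vs_bytes:.5f}%")
print(f"Q1 Drop-TH2 / Enqueue-TH2: {q1_drops_vs_enqueued:.3f}%")
```

Either way the ratio is well under 1%, which fits short bursts or a counter problem better than a permanently saturated link.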

burleyman · ‎02-15-2018

I have seen this at a couple of our customers after they upgraded to these new switches. If you do a Wireshark capture and see two or more identical ARP packets in a row, it could be this bug.
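If the capture is large, the check for back-to-back ARP packets could be scripted. Below is a minimal Python sketch, not a Wireshark feature: the `Frame` record, the `find_duplicate_arps` helper and the 1 ms window are all hypothetical choices for illustration, and it assumes the capture has already been exported into (timestamp, EtherType, raw bytes) records.

```python
from typing import List, NamedTuple, Optional

ARP_ETHERTYPE = 0x0806


class Frame(NamedTuple):
    timestamp: float  # capture time in seconds
    ethertype: int    # 0x0806 for ARP
    payload: bytes    # raw frame bytes


def find_duplicate_arps(frames: List[Frame], window: float = 0.001) -> List[int]:
    """Return indexes of ARP frames that are byte-identical to the
    previous ARP frame and arrived within `window` seconds of it."""
    duplicates: List[int] = []
    previous: Optional[Frame] = None
    for index, frame in enumerate(frames):
        if frame.ethertype != ARP_ETHERTYPE:
            continue  # ignore non-ARP traffic
        if (previous is not None
                and frame.payload == previous.payload
                and frame.timestamp - previous.timestamp <= window):
            duplicates.append(index)
        previous = frame
    return duplicates


# Made-up sample records, only to show the call.
sample = [
    Frame(0.000000, 0x0806, b"arp-request-A"),
    Frame(0.000020, 0x0806, b"arp-request-A"),  # identical and 20 us later
    Frame(0.500000, 0x0800, b"some-ip-packet"),
    Frame(1.000000, 0x0806, b"arp-request-A"),  # identical, but a second later
    Frame(1.000010, 0x0806, b"arp-reply-B"),
]
print(find_duplicate_arps(sample))
```

A normal ARP retry from a host looks similar but arrives much later, which is why the sketch uses a short time window.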

3850 duplicates pass-through ARP packets

Bug ID: CSCur30273

Description

Symptom:
When ARP request/reply packets enter an access or trunk interface and are L2 switched to another access or trunk interface in the same VLAN, the ARP request/reply packets get duplicated. This means that when one ARP packet enters the switch, two identical ARP packets exit the switch. We have not seen IP packets getting duplicated. This only seems to affect ARP packets.

Conditions:
L2 switching with lanbase license

Workaround:
The issue is only observed with "lanbase" license. Issue is not seen with "ipbase" license.

Enable IPDT

3850-STACK#sh ip device tracking all

Global IP Device Tracking for clients = Disabled >>>>> IPDT is disabled by default on lanbase
-----------------------------------------------------------------------------------------------
IP Address MAC Address Vlan Interface Probe-Timeout State Source
-----------------------------------------------------------------------------------------------

3850-STACK(config)#int range gig1/0/19 , gig2/0/39

3850-STACK(config-if-range)#ip device tracking maximum ?
<0-65535> Maximum devices (0 means disabled)

3850-STACK(config-if-range)#ip device tracking maximum 20
3850-STACK(config-if-range)#end
3850-STACK#sh ip device tracking all
Global IP Device Tracking for clients = Enabled >>>>> Make sure IPDT is enabled
Global IP Device Tracking Probe Count = 3
Global IP Device Tracking Probe Interval = 30
Global IP Device Tracking Probe Delay Interval = 0
-----------------------------------------------------------------------------------------------
IP Address MAC Address Vlan Interface Probe-Timeout State Source
-----------------------------------------------------------------------------------------------

Total number interfaces enabled: 2
Enabled interfaces:
Gi1/0/19, Gi2/0/39
Further Problem Description:
none

Unicast ARP packets are duplicated

Bug ID: CSCuv78424

Description

Symptom:
3650/3850 duplicates unicast ARP request packets destined for its IP and sends back 2 replies for one ARP request packet sent by the host.

Conditions:
NA

Workaround:
NA

Further Problem Description:

Details

Last Modified: Jun 15, 2016
Status: Fixed
Severity: 2 Severe
Product: Cisco Catalyst 3850 Series Switches
Support Cases: 2
Known Affected Releases: 15.2(3)E
Known Fixed Releases: 15.2(2)E4, 15.2(2)E5, 15.2(3)E3, 16.1(1.15), 16.1.2, 16.2(0.151), 3.6(4)E, 3.6(5)E, 3.7(3)E, Denali-16.1.2

Francesco Molino · ‎02-06-2018

Hi

Can you share your config and a quick sketch to see where this interface having drops is connected to?

Thanks
Francesco
PS: Please don't forget to rate and select as validated answer if this answered your question

Otaku78 · ‎02-06-2018

Hi Francesco, hopefully this image can shed some light on the topology.

The config on the trunk interfaces linking the 3850 to the access closet switch is quite simple:

switchport trunk native vlan x
switchport trunk allowed vlan x,x,x,x
switchport mode trunk
switchport nonegotiate

Mark Malone · ‎02-07-2018

You could be overutilizing the link: traffic that spikes at times can flood the buffer and cause the drop counter to increment.

There's a specific Cisco troubleshooting doc for output drops on 3850s that may help you identify and fix the issue. You also have a bit of a bottleneck there, coming down to a 1Gb link from 4 x 1Gb links with a 20Gb link behind it.

https://www.cisco.com/c/en/us/support/docs/switches/catalyst-3850-series-switches/200594-Catalyst-3850-Troubleshooting-Output-dr.html
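To get a feel for the size of that bottleneck, here is a back-of-the-envelope Python sketch. It takes the 2 x 10Gb ingress and the 1000Mb/s egress from the first post; the worst-case assumption that a burst arrives at the full 20Gb/s and the 1 ms burst length are made up for illustration, and `excess_bytes` is a hypothetical helper.

```python
INGRESS_BPS = 20_000_000_000  # 2 x 10Gb EtherChannel from the first post
EGRESS_BPS = 1_000_000_000    # 1000Mb/s trunk from the first post


def excess_bytes(ingress_bps: int, egress_bps: int, burst_us: int) -> int:
    """Bytes that pile up (and must be buffered or dropped) while a burst
    arrives faster than the egress port can drain it."""
    surplus_bps = max(ingress_bps - egress_bps, 0)
    # bits/s * microseconds / 1_000_000 = bits; / 8 = bytes
    return surplus_bps * burst_us // 8_000_000


# Hypothetical 1 ms burst at full ingress line rate.
backlog = excess_bytes(INGRESS_BPS, EGRESS_BPS, burst_us=1_000)
print(f"backlog after a 1 ms burst: {backlog} bytes")
```

Real traffic rarely bursts at the full ingress rate, so treat the result as an upper bound for that burst length rather than a measurement.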

Francesco Molino · ‎02-07-2018

I don't know if you have applied any QoS. However, based on your design, you do have a bottleneck (I agree with Mark).

What's the utilization of the 20G links?

Thanks
Francesco
PS: Please don't forget to rate and select as validated answer if this answered your question

Otaku78 · ‎02-14-2018

Thanks Mark and Francesco, that makes total sense. I thought that was the case.

I haven't applied any QoS in my network yet, it's all best effort.

I'm not using VoIP or video conferencing, so it's not absolutely critical at this stage, but at some point I do need to start classifying and marking the more important traffic between some servers and the client workstations.

Would it be essential that I start applying QoS on my traffic or simply change the queue buffers for best effort?

Sorry QoS is not my specialty so I apologise if I sound silly here.

Otaku78 · ‎02-14-2018

Here is the utilisation on the 2 x 10Gb EtherChannel links.

(sh controllers util)

Port Receive Utilization Transmit Utilization

Te1/1/3 0 0
Te1/1/4 0 0

It always looks like this when I'm viewing it, so I'm assuming it's just occasional bursty traffic that's causing the output drops on the egress port. I do have a 30 second load-interval set on the port-channel interface that manages the links.

Tx and Rx Load always look very low.

TenGigabitEthernet1/1/3 is up, line protocol is up (connected)
Hardware is Ten Gigabit Ethernet, address is 547c.6966.ae9f (bia 547c.6966.ae9f)
MTU 1500 bytes, BW 10000000 Kbit/sec, DLY 10 usec,
     reliability 255/255, txload 1/255, rxload 1/255
Encapsulation ARPA, loopback not set
Keepalive not set
Full-duplex, 10Gb/s, link type is auto, media type is SFP-10GBase-CX1
input flow-control is off, output flow-control is unsupported
ARP type: ARPA, ARP Timeout 04:00:00
Last input 00:00:01, output never, output hang never
Last clearing of "show interface" counters 4w6d
Input queue: 0/2000/0/0 (size/max/drops/flushes); Total output drops: 0
Queueing strategy: fifo
Output queue: 0/40 (size/max)
5 minute input rate 2353000 bits/sec, 602 packets/sec
5 minute output rate 15850000 bits/sec, 1623 packets/sec
     3975312442 packets input, 5107283679684 bytes, 0 no buffer
     Received 34726967 broadcasts (27553546 multicasts)
     0 runts, 0 giants, 0 throttles
     0 input errors, 0 CRC, 0 frame, 0 overrun, 0 ignored
     0 watchdog, 27553546 multicast, 0 pause input
     0 input packets with dribble condition detected
     5482773919 packets output, 6821538322629 bytes, 0 underruns
     0 output errors, 0 collisions, 0 interface resets
     0 unknown protocol drops
     0 babbles, 0 late collision, 0 deferred
     0 lost carrier, 0 no carrier, 0 pause output
     0 output buffer failures, 0 output buffers swapped out
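Low averaged counters like the ones above do not rule out bursts. The Python sketch below compares the reported 30 second output rate on the 1Gb trunk from the first post with what a single burst would add to that average; the 100 ms burst at line rate is a hypothetical figure, and `average_utilization_percent` is just an illustrative helper.

```python
LINK_BPS = 1_000_000_000          # 1000Mb/s trunk from the first post
REPORTED_OUTPUT_BPS = 18_567_000  # "30 second output rate" from the first post
INTERVAL_MS = 30_000              # 30 second load-interval


def average_utilization_percent(busy_ms: int, interval_ms: int) -> float:
    """Average utilization over the interval if the link runs at line
    rate for busy_ms and is idle for the rest of the interval."""
    return 100.0 * busy_ms / interval_ms


reported_percent = 100.0 * REPORTED_OUTPUT_BPS / LINK_BPS
# Hypothetical: the port is completely saturated for 100 ms, once per interval.
burst_percent = average_utilization_percent(busy_ms=100, interval_ms=INTERVAL_MS)

print(f"reported average utilization: {reported_percent:.2f}%")
print(f"a 100 ms line-rate burst adds only: {burst_percent:.2f}%")
```

So a port can be saturated for a tenth of a second, long enough to overflow a queue, and still show as nearly idle in txload and the utilization table.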

Mark Malone · ‎02-15-2018

There's a way to track burst traffic with Wireshark to confirm whether it is temporarily flooding the buffer:

https://www.cisco.com/c/en/us/support/docs/lan-switching/switched-port-analyzer-span/116260-technote-wireshark-00.html

Joseph W. Doherty · ‎02-15-2018

I'm unfamiliar with the 3650/3850 queuing architecture, but those stats do appear to show many drops for Q1 at threshold 2 (Drop-TH2).

If this is caused by sustained congestion, there is often not much you can do on a switch other than drop packets "smarter". If it is caused by transient congestion, increasing queue sizes will often decrease the drop rate.
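The difference between the two cases can be shown with a toy model. This is a simplified Python sketch of a FIFO tail-drop queue, not the 3850's actual buffer scheme; the tick-based model, the traffic patterns and the queue limits are all made up for illustration.

```python
from typing import Iterable


def simulate_tail_drop(arrivals: Iterable[int], drain_per_tick: int, queue_limit: int) -> int:
    """Return how many units a FIFO tail-drop queue discards.

    Each tick, new arrivals are accepted until the queue is full,
    the rest are dropped, then the port drains drain_per_tick units.
    """
    depth = 0
    dropped = 0
    for arriving in arrivals:
        accepted = min(arriving, queue_limit - depth)
        dropped += arriving - accepted
        depth += accepted
        depth = max(depth - drain_per_tick, 0)
    return dropped


# Transient congestion: a short burst, then silence.
transient = [0] * 5 + [20] * 3 + [0] * 50
transient_small = simulate_tail_drop(transient, drain_per_tick=1, queue_limit=10)
transient_large = simulate_tail_drop(transient, drain_per_tick=1, queue_limit=60)

# Sustained congestion: arrivals permanently exceed the drain rate.
sustained = [2] * 1000
sustained_small = simulate_tail_drop(sustained, drain_per_tick=1, queue_limit=10)
sustained_large = simulate_tail_drop(sustained, drain_per_tick=1, queue_limit=60)

print("transient:", transient_small, "->", transient_large)
print("sustained:", sustained_small, "->", sustained_large)
```

In this model the larger queue absorbs the short burst completely, while under sustained overload it only delays the point at which drops start.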

Otaku78 · ‎06-06-2018

Apologies for the late reply, but you were right, burleyman: it was a bug-related issue. I upgraded to a newer IOS release, 03.06.08.E.152-2.E8, and all of the output queues now report no packet drops after a 48 hour period. I'm confident that the issue has been resolved.

Thanks to all you awesome people for your help!

Jacqueline_2016 · ‎08-20-2019

I have this problem too.

Please, how did you fix it?

Thank you

Jackie

Otaku78 · ‎08-20-2019

Hi Jackie, this issue was related to the IOS version we were using. The error counters were not actually reflecting real drops; it was all bug related.

I simply upgraded to a later IOS version and the problem went away. We are currently using image version 03.06.08E, an older image but very stable for us.

Jacqueline_2016 · ‎08-21-2019

Hi Otaku,

Thank you so much for your reply.

I am running 03.07.02E and am still getting the same error. I am worried because I have 3850s across almost the entire network.
Cisco IOS Software, IOS-XE Software, Catalyst L3 Switch Software (CAT3K_CAA-UNIVERSALK9-M), Version 03.07.02E RELEASE SOFTWARE (fc1)

Jackie

Joseph W. Doherty · ‎08-21-2019

If it's not a bug, as noted in other posts, it might just be due to congestion on the port.