Catalyst 3850 high Total output drops and output errors

Antony Pasteris
Level 1

We have put into service a Catalyst 3850-12XS running version 03.07.03E and have noticed that certain ports show high output drops. Auto QoS is configured on this switch, and we tried removing the QoS config from the ports having the problem, but it didn't change anything. From a performance point of view the switch is currently running without problems... there is no network outage.

TenGigabitEthernet1/0/1 is up, line protocol is up (connected)
Hardware is Ten Gigabit Ethernet, address is 00cc.fc68.f681 (bia 00cc.fc68.f681)
Description: VLAN 599 XXXXXXXX
MTU 1500 bytes, BW 1000000 Kbit/sec, DLY 10 usec,
reliability 128/255, txload 16/255, rxload 10/255
Encapsulation ARPA, loopback not set
Keepalive not set
Full-duplex, 1000Mb/s, link type is auto, media type is 10/100/1000BaseTX SFP
input flow-control is off, output flow-control is unsupported
ARP type: ARPA, ARP Timeout 04:00:00
Last input 00:00:19, output never, output hang never
Last clearing of "show interface" counters 10:12:51
Input queue: 0/2000/0/0 (size/max/drops/flushes); Total output drops: 249067352
Queueing strategy: Class-based queueing
Output queue: 0/40 (size/max)
5 minute input rate 41672000 bits/sec, 6894 packets/sec
5 minute output rate 65358000 bits/sec, 8267 packets/sec
69357766 packets input, 54278801831 bytes, 0 no buffer
Received 1362 broadcasts (1226 multicasts)
0 runts, 0 giants, 0 throttles
0 input errors, 0 CRC, 0 frame, 0 overrun, 0 ignored
0 watchdog, 1226 multicast, 0 pause input
0 input packets with dribble condition detected
97533479 packets output, 108964531997 bytes, 0 underruns
249067352 output errors, 0 collisions, 0 interface resets
0 unknown protocol drops
0 babbles, 0 late collision, 0 deferred
0 lost carrier, 0 no carrier, 0 pause output
0 output buffer failures, 0 output buffers swapped out

As you can see from the ping below, which sends traffic through the link described above, there seem to be no connectivity issues, even though the output drops counter suggests otherwise.

XXXXXX#ping 4.2.2.1 repeat 1000 size 100
Type escape sequence to abort.
Sending 1000, 100-byte ICMP Echos to 4.2.2.1, timeout is 2 seconds:
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
!!!!!!!!!!!!!!!!!!!!
Success rate is 100 percent (1000/1000), round-trip min/avg/max = 10/14/30 ms

Could this be a bug?
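
In case it is useful for anyone comparing notes, the per-queue breakdown behind the "Total output drops" counter can be pulled with the platform command below; the Drop-TH and SBufDrop columns should show which egress queue is actually discarding the traffic:

show platform qos queue stats tenGigabitEthernet 1/0/1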


Same symptoms here. 100 Mb hard-set link, 10 Mb upstream WAN pipe (bandwidth set to 10 Mb). Same type of output errors. It started after moving from a 3750X to a 3850 running 3.06.04E.

mseanmiller
Level 1

Inserting the fix/comment at the end of the thread.

Robert Hillcoat
Level 1

Hi everyone, a document was released on the 30th of September 2016 with an in-depth explanation of this issue.

http://www.cisco.com/c/en/us/support/docs/switches/catalyst-3850-series-switches/200594-Catalyst-3850-Troubleshooting-Output-dr.html

It resolved the issues I had on the 3850 switch.

Thanks Robert! This has been driving me crazy for some time now, since we moved to 3.6.4 and 3.6.5 across our 3850s. Would you mind sharing the configuration that you used on your side?
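
For anyone who cannot open the link, the heart of that document, as I read it, is the global soft-buffer multiplier. A minimal sketch, assuming the 1200 value the document describes as the maximum (the interface name below is just an example; verify the command range on your release):

conf t
 qos queue-softmax-multiplier 1200
end
show platform qos queue config tenGigabitEthernet 1/0/1
show platform qos queue stats tenGigabitEthernet 1/0/1

The two show commands are the same ones used elsewhere in this thread to confirm whether the Softmax values have grown and whether the Drop-TH counters stop incrementing.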

Thanks for the help ... the issue is solved now ...

Regards,

 Ralf

I have the same issue on a 3650 and inserted the service policy and the global QoS multiplier.

It worked fine for some ports, but there are still some ports reporting errors.

The doc states:

Note: In real life, this kind of scenario may not be possible as other interfaces may also use the buffer, but this can definitely help in reducing the packet drops to a certain level.

The maximum soft buffer available for an interface can be increased using this command; however, you should also keep in mind that this is available only if no other interface is using these buffers.

Is there a good middle ground, such as changing the value to 600 or so?

Michael Please rate all helpful posts
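
If it helps, the multiplier appears to accept any value in its range, so a middle setting can be tried and then checked against the actual queue allocation and the drop counters, roughly (the interface is just a placeholder; use the ports that are still reporting drops):

conf t
 qos queue-softmax-multiplier 600
end
show platform qos queue config gigabitEthernet 1/0/1
show platform qos queue stats gigabitEthernet 1/0/1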

Hi,

 

We are pretty much experiencing the same issue as in this thread. We have a 3850 on 3.6.6.

 

Scenario: we are observing output errors on Gig port-channel interfaces, so we can't play with the speed.

We created class maps and applied a service policy without any classification or action, and also inserted the global qos multiplier command. It looked to be working at first, but we started seeing output drops again.

 

Is configuring the bandwidth or priority command important, or can we just create class and policy maps along with the multiplier?

 

Thanks

Sajid

One other thing: do we apply the service-policy on the individual interfaces or on the port-channel interface?

Thanks
Sajid
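
Not authoritative, but here is a rough sketch of the kind of configuration being discussed, with made-up policy and interface names. My understanding is that on these switches the egress queuing policy is attached to the physical member interfaces rather than to the port-channel itself, so it is worth confirming that against the document linked above:

qos queue-softmax-multiplier 1200
!
policy-map PM-EGRESS-TEST
 class class-default
!
interface TenGigabitEthernet1/0/1
 service-policy output PM-EGRESS-TEST
interface TenGigabitEthernet1/0/2
 service-policy output PM-EGRESS-TEST

Whether adding bandwidth or priority statements under the classes changes the buffer behaviour is exactly the open question in this post, so the sketch leaves class-default empty, matching what was described above.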

Not to dredge up an old topic, but I see replies as recent as 10/10/2017, so hopefully it's okay to continue on this thread. We see the high output drops/errors, and they always match, by the way: the errors and the drops. The reliability is 255/255, and I can see that the speed/duplex match the attached device. When I talk to the owners of the servers or devices where these errors appear to be happening, they indicate they are not having any problems. No performance issues, nothing. Now, if we are seeing over 3 million drops in an hour and no one's complaining, in my book that has to be a false positive. Thus, it doesn't make sense to me to start tinkering with the configs and putting on all kinds of policies and whatnot...

 

We are at version 3.06.05E.  I'm seeing lots of folks on this forum stating the issue exists with even higher versions.

 

First, is this a cosmetic bug, at least in some instances? Second, in what version of IOS can I expect to get past this problem? We use these 3850s all over our data center and all over our branch offices; it is not a trivial task to get the downtime required to upgrade the IOS. If I go that route, am I going to get past this problem? Otherwise I don't want to bother.

 

The only real problem I have right now is that the high error count is messing up daily reports from SolarWinds. Higher-ups are looking at this massive number of errors and demanding I "fix" something. But when I talk to the administrators of the attached devices, they are saying: we don't have a problem...

 

Is the IOS reporting these errors incorrectly by any chance?

 

Thoughts?


@suelange wrote:

Is the IOS reporting these errors incorrectly by any chance?


It is probably linked to the known bug CSCvb65304, explained here:
https://bst.cloudapps.cisco.com/bugsearch/bug/CSCvb65304/?referring_site=bugquickviewclick

 

I have the same issue on the 10 Gig interfaces. Also, there isn't a massive amount of data being pushed. Both ends are connected at 10 Gb, and I see transfer speeds of 150 Mb/s tops. The bug article doesn't seem to apply, as I don't see the Drop-TH counter increasing with the output drops.

 

TenGigabitEthernet2/1/4 is up, line protocol is up (connected)
Hardware is Ten Gigabit Ethernet, address is b4e9.b041.f9a0 (bia b4e9.b041.f9a0)
MTU 9014 bytes, BW 10000000 Kbit/sec, DLY 10 usec,
reliability 255/255, txload 1/255, rxload 5/255
Encapsulation ARPA, loopback not set
Keepalive not set
Full-duplex, 10Gb/s, link type is auto, media type is SFP-10GBase-SR
input flow-control is off, output flow-control is unsupported
ARP type: ARPA, ARP Timeout 04:00:00
Last input 00:00:02, output never, output hang never
Last clearing of "show interface" counters 18:52:24
Input queue: 0/2000/0/0 (size/max/drops/flushes); Total output drops: 128358301
Queueing strategy: fifo
Output queue: 0/40 (size/max)
5 minute input rate 211123000 bits/sec, 3614 packets/sec
5 minute output rate 43939000 bits/sec, 3646 packets/sec
179315605 packets input, 1156571093772 bytes, 0 no buffer
Received 2747 broadcasts (2424 multicasts)
0 runts, 0 giants, 0 throttles
0 input errors, 0 CRC, 0 frame, 0 overrun, 0 ignored
0 watchdog, 2424 multicast, 2624 pause input
0 input packets with dribble condition detected
214740048 packets output, 877585688746 bytes, 0 underruns
128358301 output errors, 0 collisions, 0 interface resets
0 unknown protocol drops
0 babbles, 0 late collision, 0 deferred
0 lost carrier, 0 no carrier, 0 pause output
0 output buffer failures, 0 output buffers swapped out

 

 

TenGigabitEthernet1/1/4 is up, line protocol is up (connected)
Hardware is Ten Gigabit Ethernet, address is b4e9.b041.fa20 (bia b4e9.b041.fa20)
MTU 9014 bytes, BW 10000000 Kbit/sec, DLY 10 usec,
reliability 255/255, txload 5/255, rxload 1/255
Encapsulation ARPA, loopback not set
Keepalive not set
Full-duplex, 10Gb/s, link type is auto, media type is SFP-10GBase-SR
input flow-control is off, output flow-control is unsupported
ARP type: ARPA, ARP Timeout 04:00:00
Last input 00:00:03, output never, output hang never
Last clearing of "show interface" counters 18:53:31
Input queue: 0/2000/0/0 (size/max/drops/flushes); Total output drops: 4957190
Queueing strategy: fifo
Output queue: 0/40 (size/max)
5 minute input rate 39383000 bits/sec, 3449 packets/sec
5 minute output rate 208199000 bits/sec, 3598 packets/sec
208654700 packets input, 876154464163 bytes, 0 no buffer
Received 2749 broadcasts (2427 multicasts)
0 runts, 0 giants, 0 throttles
0 input errors, 0 CRC, 0 frame, 0 overrun, 0 ignored
0 watchdog, 2427 multicast, 2234 pause input
0 input packets with dribble condition detected
181792082 packets output, 1157854911924 bytes, 0 underruns
4957190 output errors, 0 collisions, 0 interface resets
0 unknown protocol drops
0 babbles, 0 late collision, 0 deferred
0 lost carrier, 0 no carrier, 0 pause output
0 output buffer failures, 0 output buffers swapped out

 

sho platform qos queue stats tenGigabitEthernet 1/1/4
DATA Port:3 Enqueue Counters
-------------------------------
Queue Buffers Enqueue-TH0 Enqueue-TH1 Enqueue-TH2
----- ------- ----------- ----------- -----------
0 0 0 0 15703949049
1 0 0 0 7377808052
2 0 0 0 0
3 0 0 0 0
4 0 0 0 0
5 0 0 0 0
6 0 0 0 0
7 0 0 0 0
DATA Port:3 Drop Counters
-------------------------------
Queue Drop-TH0 Drop-TH1 Drop-TH2 SBufDrop QebDrop
----- ----------- ----------- ----------- ----------- -----------
0 0 0 279656 0 0
1 0 0 5698055516 0 0
2 0 0 0 0 0
3 0 0 0 0 0
4 0 0 0 0 0
5 0 0 0 0 0
6 0 0 0 0 0
7 0 0 0 0 0
AQM Broadcast Early WTD COUNTERS(In terms of Bytes)
--------------------------------------------------
PORT TYPE ENQUEUE DROP
--------------------------------------------------
UPLINK PORT-0 N/A 0
UPLINK PORT-1 N/A 0
UPLINK PORT-2 N/A 0
UPLINK PORT-3 N/A 0
NETWORK PORTS 50896 50896
RCP PORTS 0 0
CPU PORT 88711794 177389774
Note: Queuing stats are in bytes

Hmmm... well, first of all, which Drop-TH counter are we talking about? Mine are all clean except TH2, and it appears to increase as the output drops increase (sorry for the format, it didn't cut and paste with the tabs as expected).

 

RACK-A1#show int gi 1/0/40
GigabitEthernet1/0/40 is up, line protocol is up (connected)
Hardware is Gigabit Ethernet, address is 706b.b9f8.86a8 (bia 706b.b9f8.86a8)
Description: ARTEMIS SHB1
MTU 1500 bytes, BW 1000000 Kbit/sec, DLY 10 usec,
reliability 216/255, txload 57/255, rxload 1/255
Encapsulation ARPA, loopback not set
Keepalive set (10 sec)
Full-duplex, 1000Mb/s, media type is 10/100/1000BaseTX
input flow-control is off, output flow-control is unsupported
ARP type: ARPA, ARP Timeout 04:00:00
Last input never, output never, output hang never
Last clearing of "show interface" counters 10w0d
Input queue: 0/2000/0/0 (size/max/drops/flushes); Total output drops: 3522625478
Queueing strategy: fifo
Output queue: 0/40 (size/max)
5 minute input rate 5335000 bits/sec, 9680 packets/sec
5 minute output rate 225296000 bits/sec, 19022 packets/sec
4630702390 packets input, 619130054205 bytes, 0 no buffer
Received 592 broadcasts (0 multicasts)
0 runts, 0 giants, 0 throttles
0 input errors, 0 CRC, 0 frame, 0 overrun, 0 ignored
0 watchdog, 0 multicast, 0 pause input
0 input packets with dribble condition detected
7868946190 packets output, 9608893151593 bytes, 0 underruns
3522625478 output errors, 0 collisions, 0 interface resets
0 unknown protocol drops
0 babbles, 0 late collision, 0 deferred
0 lost carrier, 0 no carrier, 0 pause output
0 output buffer failures, 0 output buffers swapped out
RACK-A1#sh pl qos queue stats gigabitEthernet 1/0/40
DATA Port:18 Enqueue Counters
-------------------------------
Queue Buffers Enqueue-TH0 Enqueue-TH1 Enqueue-TH2
----- ------- ----------- ----------- -----------
0 0 0 1423742739 1523581295
1 0 0 0 33976055625
2 0 0 0 0
3 0 0 0 0
4 0 0 0 0
5 0 0 0 0
6 0 0 0 0
7 0 0 0 0
DATA Port:18 Drop Counters
-------------------------------
Queue Drop-TH0 Drop-TH1 Drop-TH2 SBufDrop QebDrop
----- ----------- ----------- ----------- ----------- -----------
0 0 0 0 0 0
1 0 0 6531314493 0 0
2 0 0 0 0 0
3 0 0 0 0 0
4 0 0 0 0 0
5 0 0 0 0 0
6 0 0 0 0 0
7 0 0 0 0 0
AQM Broadcast Early WTD COUNTERS(In terms of Bytes)
--------------------------------------------------
PORT TYPE ENQUEUE DROP
--------------------------------------------------
UPLINK PORT-0 N/A 0
UPLINK PORT-1 N/A 0
UPLINK PORT-2 N/A 0
UPLINK PORT-3 N/A 0
NETWORK PORTS 9289776564 40111214359
RCP PORTS 0 0
CPU PORT 0 0
Note: Queuing stats are in bytes

 

But at the end of the day, I find the text of the bug report hard to understand.

 

"Symptom:
Output drops and Output errors increment simultaneously in show interfaces when only output drops are expected.

Conditions:
To confirm the output drops are because of egress buffer drops use "sh pl qos queue stats gigabitEthernet x/y/z" and look for "Drop-TH" counters. This counter should increment the same amount as the output drops counter in show interface."

 

"when only output drops are expected"...I don't really expect output drops unless I've done something like misconfigure the link or if I know that the traffic is such that it will overrun the link capacity.  So  I'm not sure how this applies to my situation.  

 

Further, it says "This counter should increment the same amount as the output drops counter in show interface." By "should", are they saying that having the counters increment together is normal and I don't have the problem, or are they saying that if I have the bug, I should see the counters increment at the same time because that is the indication of the bug?

 

 

satish.txt1
Level 1

I am having the same issue. We have an 8x10G module on this switch, and on one of the trunk links I am getting output errors when traffic hits a certain level, like 6 Gbps. We don't have any QoS configured; we just installed the switch as-is, without touching QoS.

 

This is the queue configuration for my interface. What should I do to fix it, and will changing QoS on a production switch cause any issues?

 

#show platform qos queue config TenGigabitEthernet 1/1/4
DATA Port:1 GPN:56 AFD:Disabled QoSMap:0 HW Queues: 8 - 15
  DrainFast:Disabled PortSoftStart:1 - 1080
----------------------------------------------------------
  DTS Hardmax   Softmax  PortSMin GlblSMin  PortStEnd
  --- --------  -------- -------- --------- ---------
 0   1  5   120  6   480  6   320   0     0   3  1440
 1   1  4     0  7   720  3   480   2   180   3  1440
 2   1  4     0  5     0  5     0   0     0   3  1440
 3   1  4     0  5     0  5     0   0     0   3  1440
 4   1  4     0  5     0  5     0   0     0   3  1440
 5   1  4     0  5     0  5     0   0     0   3  1440
 6   1  4     0  5     0  5     0   0     0   3  1440
 7   1  4     0  5     0  5     0   0     0   3  1440
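
Not a definitive answer, but in that output the per-queue Softmax values look small, which is what the troubleshooting document linked earlier in the thread addresses with the global multiplier. A minimal sketch, assuming that document applies here:

conf t
 qos queue-softmax-multiplier 1200
end
show platform qos queue config TenGigabitEthernet 1/1/4
show platform qos queue stats TenGigabitEthernet 1/1/4

If the multiplier takes effect, the Softmax column should grow roughly in proportion to the configured value (as I read the doc), and the Drop-TH counters should stop climbing if soft-buffer exhaustion was the cause. It is a global command, so applying it in a maintenance window is probably the safer option for a production switch.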

We are having the same issue: 3.4 billion drops on an interface that rarely gets above 500 pps, all in about 2 minutes. That's impossible.

I am having the same issue.

We are running 03.07.05E on the C3850-12X48U model.

I am not seeing interface errors with my issue, just very high output queue-drop numbers over a short period of time, on interfaces that get very little traffic.
