Unexplained Output Drops

Hi all, 

 

Hope you can help me

I have two interfaces having output drops and I really can't figure out why. These are all gigabit interfaces, and no matter what I do, those two interfaces that go to two access switches keep having output drops.

Please see image:

https://ibb.co/0cRg3gY

 

Any idea on what this can be please?

 


18 Replies

Hello,

 

the interfaces are configured with class based queuing. Post the running configuration of your device.

Joseph W. Doherty
Hall of Fame

It appears (?) you're using some form of CBWFQ, so with that, and depending on platform, there are lots of possible reasons why you have so many drops.  You haven't provided enough information, but broadly speaking, when "offered" data cannot be transmitted fast enough, buffering/queuing resources are exceeded and drops result.

I do see that the one interface has about a 50% transmission load average, over 5 minutes.  If the average load is that high, it's not uncommon for short duration bursts to cause drops.

Exactly. Basically I have the following on the interfaces, and funny enough I was going to open a post to ask whether what I did was correct.

I have the following configuration created:

ip access-list extended 100
10 permit udp any any
!
class-map match-any HBAC_OUT
match dscp ef
!
class-map match-any VIDEO_IN
match access-group 100
class-map match-any VIDEO_OUT
match dscp cs4
!
policy-map VIDEO_IN
class VIDEO_IN
set dscp cs4
!
policy-map VIDEO_OUT_HBAC_OUT
class HBAC_OUT
bandwidth percent 5
class VIDEO_OUT
bandwidth percent 90

 

Basically what I want to achieve is for the TCP traffic not to encroach on the bandwidth available for the UDP, hence the 90% bandwidth and the DSCP CS4 marking. At the same time, I have a radio on the system that has priority over everything else, and I'm reserving 5% of the bandwidth for it.

Then I applied the service policies below on the interfaces:

service-policy output VIDEO_OUT_HBAC_OUT
service-policy input VIDEO_IN 

 

Specifically on those two ports I've applied the service policy OUTPUT, since my workstations are on those two access switches.

Am I correct in doing this?

Just to clarify: all my traffic is video being transmitted as UDP unicast. Only on some special occasions will the workstations request TCP traffic, and this TCP traffic can't steal bandwidth from the video being transmitted, otherwise the pictures start breaking up.

Could this CBWFQ be the cause of the issue? I tried disabling it on these two ports and I was still getting drops.

 

Please post the two interfaces' service-policy stats.
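
(On most IOS-XE switches, something along these lines per interface should produce them; the interface name below is just an example:)

show policy-map interface TenGigabitEthernet1/0/23 output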

Is your video buffered streaming, real-time, or both?

We'll know more when we see your stats, but if we're lucky, the solution might be as simple as increasing a class's queue depth.  (Which might have still been the problem, w/o CBWFQ, as the default egress queue is only 40.)

Basically it's both: I receive live streams from the cameras and recorded video from the recorders.

I just checked the stats, and I get this on these two ports:

TenGigabitEthernet1/0/23

Service-policy output: VIDEO_OUT_HBAC_OUT

Class-map: HBAC_OUT (match-any)
0 packets
Match: dscp ef (46)
Queueing

(total drops) 0
(bytes output) 8125159
bandwidth 5% (50000 kbps)

Class-map: VIDEO_OUT (match-any)
0 packets
Match: dscp cs4 (32)
Queueing

(total drops) 0
(bytes output) 40961291715
bandwidth 90% (900000 kbps)

Class-map: class-default (match-any)
0 packets
Match: any


(total drops) 695119468
(bytes output) 696530583836

Service-policy output: VIDEO_OUT_HBAC_OUT

Class-map: HBAC_OUT (match-any)
0 packets
Match: dscp ef (46)
Queueing

(total drops) 0
(bytes output) 1900925225
bandwidth 5% (50000 kbps)

Class-map: VIDEO_OUT (match-any)
0 packets
Match: dscp cs4 (32)
Queueing

(total drops) 0
(bytes output) 31202406288
bandwidth 90% (900000 kbps)

Class-map: class-default (match-any)
0 packets
Match: any


(total drops) 8450184
(bytes output) 493730198005

 

From what I understand, the QoS is not dropping anything, but still the interface is having output drops...

"For what I understand the qos is not dropping anything, but still the interface is having output drops..."

No, it appears your implicit class-default is where all the drops are happening.

Two things you might try.

You might try expanding the queue depth for class-default (not all CBWFQ versions allow this), and/or enable FQ in class-default.  (Some CBWFQ versions also allow defining FQ flow queue depths.)

Of the two, I would suggest first trying FQ.  This may, or may not, reduce your overall drops in this class, but more likely the drops will fall on the heavy-bandwidth flows.  It should also better share bandwidth across flows in that class.  (I.e. light-bandwidth flows' "performance" will generally improve.  However, the converse, that heavy-bandwidth flows' performance will degrade, isn't always true.)
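
For example, a minimal sketch of both options against your existing policy (the 128 is just an illustrative value, and not every platform/IOS version accepts these commands in class-default):

policy-map VIDEO_OUT_HBAC_OUT
class class-default
queue-limit 128 packets
fair-queue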

I think the only option will be to create a port channel. The load on the link will only increase in the foreseeable future, and playing with the QoS will only be a temporary fix. When they start putting more bandwidth-heavy cameras (HD) on the system, those two links are going to start "crying".

Don't be so fast to ignore what QoS can do, and it doesn't need to be overly complex to often greatly improve how traffic is serviced.

For video, it's often very bursty.  For buffered video (like streaming video), increasing queue depth can often ensure such video works fine.  What you need to ensure is that the video's average bandwidth requirements, and a little more, are met.

Real-time video, like video conferencing, is another story.  For that, with QoS, you need to ensure it gets the bandwidth and low latency it needs.  Its treatment needs to be much like VoIP's, but it's generally both much more bandwidth-demanding, and often highly variable in its bandwidth demands at any point in time (somewhat like VoIP using a compressed codec).  Such traffic requires sufficient bandwidth so that it's seldom queued.

If you have both video kinds, ideally, they should be mapped into different service classes, because their service needs are different.

Many kinds of TCP traffic can be very elastic in their bandwidth usage.  Bulk TCP data flows, by design, will try to obtain all the available bandwidth they can.  This can be detrimental to other data flows, and even to the flow itself, as it sometimes slows considerably when it slams into bandwidth limits.

FQ in QoS, alone, often handles many kinds of traffic very nicely.

Again, try:

policy-map VIDEO_OUT_HBAC_OUT
class HBAC_OUT
priority percent 5 !BTW, generally EF is given LLQ treatment
class VIDEO_OUT
bandwidth percent 90
class class-default
bandwidth percent 5
fair-queue

Yes, a port-channel may help, especially long term if you add cameras, but Etherchannel has its own foibles.

Thank you Joseph!

I was able to apply that configuration, except for the "priority percent 5", which I couldn't apply because the switch doesn't allow it; it states that if bandwidth is already used in the policy, priority can't be used.

Regarding class-default, this 3650 doesn't have the "fair-queue" option. Still, I tuned the "queue-buffers" and set a ratio of 30, and I think the output drops stopped... will need to check a little bit later.

Ah, a 3650.  Yes, it won't support FQ.

As to changing the HBAC_OUT class, delete the class (in the policy) and re-add it (with priority); that may allow the change.  (Bandwidth classes are sequence dependent, but PQ should, I believe, come first regardless of where you add it to the policy.)
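
For example, something along these lines (policy and class names taken from your config; exact behavior may vary by release):

policy-map VIDEO_OUT_HBAC_OUT
no class HBAC_OUT
class HBAC_OUT
priority percent 5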

For later Catalyst 3Ks, a standing recommendation is to add the command "qos queue-softmax-multiplier 1200", which you may have already done when you tuned buffers.

If you haven't already seen Catalyst 3850: Troubleshooting Output drops, it should also apply to a 3650.

Well, I tried the approach of increasing the "pipe" and created two port-channels (one for each access switch). So now I have LACP with 2Gbps per port-channel, but I'm still seeing Out-Discards, as seen below. (Port-channel 1 is an aggregation of port 1/0/23 with 1/1/3, and Port-channel 2 is an aggregation of port 1/0/24 and 1/1/4.)

 

Port-channel1 is up, line protocol is up (connected)
Hardware is EtherChannel, address is d4c9.3c5b.2398 (bia d4c9.3c5b.2398)
MTU 1500 bytes, BW 2000000 Kbit/sec, DLY 10 usec,
reliability 255/255, txload 36/255, rxload 8/255
Encapsulation ARPA, loopback not set
Keepalive set (10 sec)
Full-duplex, 1000Mb/s, link type is auto, media type is N/A
input flow-control is on, output flow-control is unsupported
Members in this channel: Te1/0/24 Te1/1/3
ARP type: ARPA, ARP Timeout 04:00:00
Last input 00:00:01, output 00:00:00, output hang never
Last clearing of "show interface" counters 01:59:28
Input queue: 0/2000/0/0 (size/max/drops/flushes); Total output drops: 2126488
Queueing strategy: fifo
Output queue: 0/80 (size/max)
5 minute input rate 65624000 bits/sec, 7177 packets/sec
5 minute output rate 286678000 bits/sec, 28180 packets/sec
58864265 packets input, 69809220246 bytes, 0 no buffer
Received 96654 broadcasts (16480 multicasts)
0 runts, 0 giants, 0 throttles
0 input errors, 0 CRC, 0 frame, 0 overrun, 0 ignored
0 watchdog, 16480 multicast, 0 pause input
0 input packets with dribble condition detected
185764338 packets output, 235630412889 bytes, 0 underruns
0 output errors, 0 collisions, 0 interface resets
0 unknown protocol drops
0 babbles, 0 late collision, 0 deferred
0 lost carrier, 0 no carrier, 0 pause output
0 output buffer failures, 0 output buffers swapped out

Port-channel2 is up, line protocol is up (connected)
Hardware is EtherChannel, address is d4c9.3c5b.2397 (bia d4c9.3c5b.2397)
MTU 1500 bytes, BW 2000000 Kbit/sec, DLY 10 usec,
reliability 255/255, txload 31/255, rxload 11/255
Encapsulation ARPA, loopback not set
Keepalive set (10 sec)
Full-duplex, 1000Mb/s, link type is auto, media type is N/A
input flow-control is on, output flow-control is unsupported
Members in this channel: Te1/0/23 Te1/1/4
ARP type: ARPA, ARP Timeout 04:00:00
Last input 02:42:37, output 00:00:00, output hang never
Last clearing of "show interface" counters 02:03:00
Input queue: 0/2000/0/0 (size/max/drops/flushes); Total output drops: 3757391
Queueing strategy: fifo
Output queue: 0/80 (size/max)
5 minute input rate 86421000 bits/sec, 8741 packets/sec
5 minute output rate 247078000 bits/sec, 24903 packets/sec
66832118 packets input, 82572982688 bytes, 0 no buffer
Received 84084 broadcasts (2296 multicasts)
0 runts, 0 giants, 0 throttles
0 input errors, 0 CRC, 0 frame, 0 overrun, 0 ignored
0 watchdog, 2296 multicast, 0 pause input
0 input packets with dribble condition detected
192117801 packets output, 237804894332 bytes, 0 underruns
0 output errors, 0 collisions, 0 interface resets
0 unknown protocol drops
0 babbles, 0 late collision, 0 deferred
0 lost carrier, 0 no carrier, 0 pause output
0 output buffer failures, 0 output buffers swapped out

Port Align-Err FCS-Err Xmit-Err Rcv-Err UnderSize OutDiscards
Gi1/0/1 0 0 0 0 0 0
Gi1/0/2 0 0 0 0 0 0
Gi1/0/3 0 0 0 0 0 0
Gi1/0/4 0 0 0 0 0 0
Gi1/0/5 0 0 0 0 0 0
Gi1/0/6 0 0 0 0 0 0
Gi1/0/7 0 0 0 0 0 0
Gi1/0/8 0 0 0 0 0 0
Gi1/0/9 0 0 0 0 0 0
Gi1/0/10 0 0 0 0 0 0

Port Align-Err FCS-Err Xmit-Err Rcv-Err UnderSize OutDiscards
Gi1/0/11 0 0 0 0 0 0
Gi1/0/12 0 0 0 0 0 0
Gi1/0/13 0 0 0 0 0 0
Gi1/0/14 0 0 0 0 0 0
Gi1/0/15 0 0 0 0 0 0
Gi1/0/16 0 0 0 0 0 0
Te1/0/17 0 0 0 0 0 0
Te1/0/18 0 0 0 0 0 0
Te1/0/19 0 0 0 0 0 0
Te1/0/20 0 0 0 0 0 14500
Te1/0/21 0 0 0 0 0 0

Port Align-Err FCS-Err Xmit-Err Rcv-Err UnderSize OutDiscards
Te1/0/22 0 0 0 0 0 0
Te1/0/23 0 0 0 0 0 3101662
Te1/0/24 0 0 0 0 0 2037347
Te1/1/1 0 0 0 0 0 0
Te1/1/2 0 0 0 0 0 0
Te1/1/3 0 0 0 0 0 282694
Te1/1/4 0 0 0 0 0 655729
Po1 0 0 0 0 0 2320041
Po2 0 0 0 0 0 3757391

 

Any ideas please? 

"Port-channel 1 is an aggregation of port 1/0/23 with 1/1/3"

Port-channel1 is up, line protocol is up (connected)
Members in this channel: Te1/0/24 Te1/1/3

?

"Port-channel 2 is an aggregation of port 1/0/24 and 1/1/4"

Port-channel2 is up, line protocol is up (connected)
Members in this channel: Te1/0/23 Te1/1/4

?

Ah, one of the foibles of Etherchannel is how well member links share the load.  From the member link drops, I suspect one of the pair is getting much more traffic than the other. What's the hashing algorithm being used (and platform options)?
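
(For reference, on many Catalyst switches you can check and change the hash with something like the following; the available load-balance options vary by platform and software version:)

show etherchannel load-balance
configure terminal
port-channel load-balance src-dst-ip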

BTW, if you do try what I suggest in my prior post, what are the policy-map's class command options (for class-default)?

 

Continuing from what I posted above, these are the options I receive for class-default:

bandwidth         Bandwidth
drop              Drop all packets
encap-sequence    MCMLP encapsulate sequence
exit              Exit from class action configuration mode
netflow-sampler   NetFlow action
no                Negate or set default values of a command
police            Police
priority          Strict Scheduling Priority for this Class
queue-buffers     queue buffer
queue-limit       Queue Max Threshold for Tail Drop
service-policy    Configure QoS Service Policy
set               Set QoS values
shape             Traffic Shaping

 

Besides increasing the buffer ratio to 30, I increased the queue-limit to 128 packets as well.
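
For reference, the changes were roughly along these lines, under the output policy's class-default (exact syntax from memory):

policy-map VIDEO_OUT_HBAC_OUT
class class-default
queue-buffers ratio 30
queue-limit 128 packets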

So far it seems stable...

Are you aware of BDP (bandwidth delay product)?  If so, the egress queue for TCP traffic on routers should optimally be about half that.  Possibly this value is different from your 128 packets.  Calculate it and see what value you get.
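
For example (purely illustrative numbers, since I don't know your actual RTT): on a 1 Gbps link with a 2 ms round-trip time,

BDP = 1,000,000,000 b/s x 0.002 s = 2,000,000 bits = 250,000 bytes, or roughly 166 full-size 1500-byte packets

so about half of that, roughly 83 packets, would be the starting point to compare against your 128.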
