6711 Views, 25 Helpful, 11 Replies

C3650 output queue drops (OQD) problem

inb
Level 1

Hi.
I'm asking for help with a C3650 output drops issue.
Port G1/0/4 on the switch is the uplink, and the bandwidth is set to 80 Mbps.
If you check the actual equipment, speed/duplex is connected at 100 Mbps / Full.
The port's output drops counter climbs very quickly in real time.

 

Through TAC support I set the output queue size to 100%, but the symptom is still there.
TAC says that because the C3650's buffer is 6 MB, there is no further action that can be taken.

 

I captured the Tx packets of G1/0/4 and analyzed them with Wireshark.
Looking at the traffic in 1/1000-second intervals, bursts above 100 Mbps were present.
The burst packets (over 100 Mbps) were mostly identified as response packets from TCP 9988 (an internal user service downstream of the switch).
I asked the customer to apply QoS to TCP 9988, but they said they couldn't.

 

Am I defining a burst packet incorrectly?
Should the reference be the speed (100 Mbps) or the bandwidth (80 Mbps)?
I defined a burst as traffic over 100 Mbps, i.e. the speed.

 

I am looking for another workaround.
The C9200 switch has the same 6 MB buffer.
The C9300 switch has a 16 MB buffer, about 2.5 times larger.
Could the problem be solved by replacing it with a C9300 unit?

I don't know much about QoS.
Please help.

 

(attached screenshot: inb_0-1616039572721.png)

 

WS-C3650-24TS-E
OS version: 16.6.5
Buffer: 6 MB

 

// config

qos queue-softmax-multiplier 1200

 

policy-map QUEUE_POLICY
 class class-default
  bandwidth percent 100
  queue-buffers ratio 100

 

interface GigabitEthernet1/0/4
 description 80M
 no switchport
 bandwidth 80000
 ip address 1.1.1.1 255.255.255.252
 no ip proxy-arp
 load-interval 30
 service-policy output QUEUE_POLICY


switch# show platform hardware fed switch 1 qos queue stats interface gi1/0/4

DATA Port:6 Enqueue Counters
-------------------------------
Queue Buffers Enqueue-TH0 Enqueue-TH1 Enqueue-TH2
----- ------- ----------- ----------- -----------
0 0 0 7081380812 13178495596
1 0 0 0 0
2 0 0 0 0
3 0 0 0 0
4 0 0 0 0
5 0 0 0 0
6 0 0 0 0
7 0 0 0 0

DATA Port:6 Drop Counters
-------------------------------
Queue Drop-TH0 Drop-TH1 Drop-TH2 SBufDrop QebDrop
----- ----------- ----------- ----------- ----------- -----------
0 0 20838724278 0 0 0
1 0 0 0 0 0
2 0 0 0 0 0
3 0 0 0 0 0
4 0 0 0 0 0
5 0 0 0 0 0
6 0 0 0 0 0
7 0 0 0 0 0

Note: Queuing stats are in bytes

switch# show platform hardware fed switch 1 qos queue config interface gi1/0/4
DATA Port:6 GPN:4 AFD:Disabled QoSMap:3 HW Queues: 48 - 55
DrainFast:Disabled PortSoftStart:2 - 10000
----------------------------------------------------------
DTS Hardmax Softmax PortSMin GlblSMin PortStEnd
----- -------- -------- -------- -------- ---------
0 1 4 0 6 10000 7 800 3 300 3 10000
1 1 4 0 5 0 5 0 0 0 3 10000
2 1 4 0 5 0 5 0 0 0 3 10000
3 1 4 0 5 0 5 0 0 0 3 10000
4 1 4 0 5 0 5 0 0 0 3 10000
5 1 4 0 5 0 5 0 0 0 3 10000
6 1 4 0 5 0 5 0 0 0 3 10000
7 1 4 0 5 0 5 0 0 0 3 10000
Priority Shaped/shared weight shaping_step
-------- ------------- ------ ------------
0 7 Shared 50 0
1 0 Shared 10000 0
2 0 Shared 10000 0
3 0 Shared 10000 0
4 0 Shared 10000 0
5 0 Shared 10000 0
6 0 Shared 10000 0
7 0 Shared 10000 0

Weight0 Max_Th0 Min_Th0 Weigth1 Max_Th1 Min_Th1 Weight2 Max_Th2 Min_Th2
------- ------- ------- ------- ------- ------- ------- ------- ------
0 0 7968 0 0 8906 0 0 10000 0
1 0 0 0 0 0 0 0 0 0
2 0 0 0 0 0 0 0 0 0
3 0 0 0 0 0 0 0 0 0
4 0 0 0 0 0 0 0 0 0
5 0 0 0 0 0 0 0 0 0
6 0 0 0 0 0 0 0 0 0
7 0 0 0 0 0 0 0 0 0

 

switch# show int gi1/0/4
GigabitEthernet1/0/4 is up, line protocol is up (connected)
Hardware is Gigabit Ethernet, address is 00c8.8bda.4776 (bia 00c8.8bda.4776)
Description: 80M
Internet address is 1.1.1.1/30
MTU 1500 bytes, BW 80000 Kbit/sec, DLY 100 usec,
reliability 255/255, txload 18/255, rxload 27/255
Encapsulation ARPA, loopback not set
Keepalive set (10 sec)
Full-duplex, 100Mb/s, media type is 10/100/1000BaseTX
input flow-control is on, output flow-control is unsupported
ARP type: ARPA, ARP Timeout 04:00:00
Last input never, output 00:00:01, output hang never
Last clearing of "show interface" counters 9w1d
Input queue: 0/375/0/0 (size/max/drops/flushes); Total output drops: 3118913078
Queueing strategy: Class-based queueing
Output queue: 0/40 (size/max)
30 second input rate 8524000 bits/sec, 2954 packets/sec
30 second output rate 5681000 bits/sec, 5325 packets/sec
5021160325 packets input, 2051802315243 bytes, 0 no buffer
Received 0 broadcasts (0 IP multicasts)
0 runts, 0 giants, 0 throttles
0 input errors, 0 CRC, 0 frame, 0 overrun, 0 ignored
0 watchdog, 0 multicast, 0 pause input
0 input packets with dribble condition detected
14905240498 packets output, 3293735255327 bytes, 0 underruns
0 output errors, 0 collisions, 0 interface resets
0 unknown protocol drops
0 babbles, 0 late collision, 0 deferred
0 lost carrier, 0 no carrier, 0 pause output
0 output buffer failures, 0 output buffers swapped out

 


11 Replies

Giuseppe Larosa
Hall of Fame

Hello @inb ,

a burst is defined as a sequence of consecutive packets at line speed

In your case:

>> speed/duplex is connected at 100Mbps / Full.

So a burst is a sequence of packets at 100 Mbps, separated only by the inter-frame gap that Ethernet Layer 1 needs to distinguish one frame from another.

 

>> Port G1/0/4 on the switch is the uplink, and the bandwidth is set to 80 Mbps

 

On the wire there cannot be packets at a speed over 100 Mbps.

>> I defined a burst as traffic over 100 Mbps, i.e. the speed.

This is not possible as the line speed is 100 Mbps.

 

Burst traffic and micro-burst traffic stress the port buffers.

So either you replace the current switch with a C9300, or you pay for a bandwidth upgrade to the full 100 Mbps so that the shaping effort is minimized.

 

Hope to help

Giuseppe

 

Thank you for the answer.
My question was: if the configured bandwidth is 80 Mbps and the actual physical connection is 100 Mbps,
should traffic above 80 Mbps be treated as a burst, or only traffic at 100 Mbps or more?
From your answer I understand the following:
packets exceeding 100 Mbps are burst packets.

Joseph W. Doherty
Hall of Fame

Disclaimer: I have no hands-on experience with 3650/3850, so I'm far, far from an expert on them.

BTW, do you believe your bandwidth statement of 80 Mbps is limiting egress traffic to that rate?  If so, I don't believe that's true.  If you need to limit bandwidth to 80 Mbps, you would need to activate a policer or shaper.
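Not from your thread config, just a hedged sketch of that idea: the interface-level "bandwidth 80000" statement only influences routing metrics and load calculations, so an actual ~80 Mbps cap would need something like a port shaper.  The policy name below is made up, and the hierarchical form (a parent shaper with your existing QUEUE_POLICY as the child) assumes your platform/IOS release supports it.

policy-map SHAPE_80M
 class class-default
  shape average 80000000          ! cap egress at roughly 80 Mbps (value in bps)
  service-policy QUEUE_POLICY     ! keep the existing queue/buffer settings as the child policy
!
interface GigabitEthernet1/0/4
 no service-policy output QUEUE_POLICY
 service-policy output SHAPE_80M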

Yes, you might consider a burst to be a sequence of packets whose arrival rate exceeds the egress capacity.  When that happens, those packets need to be queued, otherwise they are dropped.  When queue resources are exceeded, packets are dropped as well.

However, that (i.e. exceeding physical queue resources) might not actually be happening to you, as sometimes packets are dropped when a logical queue limit is exceeded.  If the latter, extending the logical queue limit may mitigate drops.

In your stats I see all the drops are happening at threshold 1.  This is a logical limit of about 90%.  You might want to try extending TH 1 to match TH 2.  (Doubtful it will cure your issue, but may mitigate it some, might also make it worse.)
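As a sketch only of that suggestion (the CoS-to-threshold mapping is my assumption, based on your stats where everything lands in queue 0): matching threshold 1 to threshold 2 could be done by adding a queue-limit line to the existing policy, which is in fact what gets tested later in this thread.

policy-map QUEUE_POLICY
 class class-default
  bandwidth percent 100
  queue-buffers ratio 100
  queue-limit cos 0 percent 100   ! raise TH1 from its ~90% default to 100%, i.e. the same as TH2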

When it comes to too much traffic for an interface, we ideally want to identify whether it's short-term bursts or long-term volume.  The former can sometimes be mitigated with additional queuing resources.  The latter, though, either requires more bandwidth or "smarter" drop management; smarter drop management attempts to signal congestion "earlier" to the senders in such a way that they slow their transmission rate.  Something like WRED is an example of this approach.
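On the "additional queuing resources" point, for what it's worth: the global softmax multiplier already present in your posted config is, as I understand it, the usual 3650/3850 knob for letting port queues absorb short-term bursts by borrowing more from the shared buffer pool.

qos queue-softmax-multiplier 1200   ! already in the posted config; default is 100, maximum is 1200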

Another way to manage a sender's transmission rate (besides dropping packets) might be via something like ECN or "spoofing" TCP's RWIN or "shaping" traffic ACKs.

The goal of these approaches, even dropping packets, is to somehow get senders to self-throttle their transmission rates so they don't constantly "overflow" the available bandwidth.

"Could the problem be solved by replacing it with a C9300 unit?"

Insufficient information to say, as we don't know which "kind" of situation you have, i.e. short-term bursts or long-term volume.  Additionally, I'm unsure the 9300 has a huge increase in buffer resources.  Even if it did, huge queues create their own problems, i.e. additional queuing latency, possibly highly variable.

An "ideal" solution would likely be better traffic rate (from senders) management.

Thank you for the answer.
The customer is rejecting a QoS policy because the traffic (TCP 9988) is important.
That traffic exceeds 100 Mbps when measured in 1/1000-second intervals, and the pattern repeats every 24 hours.
I guess this is what increments the output drops counter on the C3650 switch.
Through TAC support, the software queue was set to 100 percent of the output queue size of the G1/0/4 port, but the problem persists.
Are you saying my case is long-term volume?
Simply looking at the datasheets, the C3650 buffer is 6 MB and the C9300 buffer is known to be 16 MB.

The actual users' service is not affected even though the output drops increase significantly.
I also checked for OS bugs, but there are none.
I have a lot of concerns.

"Customer is rejecting QoS policy because the traffic (tcp 9988) is important."

A customer might well reject a particular QoS policy if they feel it's incorrect, but hopefully they don't reject QoS policies in general.  QoS can often improve and/or guarantee service levels.

"Through TAC support, the software queue is defined as 100 percent of the output queue size of the G1/0/4 port, but the problem persists."

Yes, but your stats show drops at threshold 1, rather than threshold 2.  Threshold 1, in your config, drops at the (default) 90% of your buffer resources.  Increasing the threshold 1 queue-limit to 100% will provide a bit more buffer space before packets are dropped.

"Are you saying your answer is long-term volume?"

I'm saying that if the problem is long-term volume, even an "infinite" sized queue (buffers) would not cure the problem.  With a large enough buffer, you might avoid drops, but exchange them for queuing latency.

For long-term volume, you either need to: a) increase the provided bandwidth to meet service needs, or b) "better" manage network bandwidth so that traffic flows work within the limits of the bandwidth that's available.

One method of traffic management is better management of drops.  Done well, fewer drops can actually result in better "goodput".  However, there are other traffic-management techniques that may not require dropping traffic at all to "manage" a flow's bandwidth usage.  For example, later Linux TCP stacks regulate their transmission rate when they notice a jump in RTT.  Some TCP stacks support ECN.  Spoofing a TCP receiver's RWIN and/or shaping TCP ACKs can also regulate a TCP sender's bandwidth usage without dropping packets.

Thank you to everyone who answered.
It was very helpful.
I would like to adjust the QoS or queue size according to the situation.
Thank you very much.

If your platform/IOS supports the queue-limit command, that would allow you to change your logical limits.

See: https://www.cisco.com/c/en/us/td/docs/switches/lan/catalyst3650/software/release/3se/qos/configuration_guide/b_qos_3se_3650_cg/b_qos_3se_3650_cg_chapter_010.html#task_76777AC787C84236AD681BB964B6DBD7

If the command is supported, also see whether you can assign 400% to threshold 1.

 

Thank you.
I'll test that part.
Have a nice weekend.

Hi.
I added the following to the existing policy-map.


queue-limit cos 0 percent 100

 

The value of Max_Th1 increased from 8906 to 10000.
Does this mean Th1 is now 100%?

QoS is too difficult.
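For reference, the Max_Th1 value quoted here comes from the per-queue config output shown earlier in the thread; re-running that command after changing the policy is how the new threshold shows up (a pointer to what to look for, not output posted here):

switch# show platform hardware fed switch 1 qos queue config interface gi1/0/4

For queue 0, the Max_Th1 column should now read 10000 rather than 8906, i.e. threshold 1 now equals threshold 2 (100%).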

"The value of Max_Th1 increased from 8906 to 10000.
Is this what Th1 becomes 100%?"

Believe so.

"QoS is too difficult."

Differences in QoS features and syntax between platforms, especially on switches, help make it "difficult".


Thank you for the answer.
I'm testing several things on the C3650, but I can't get Th1 to 400%.
I'll keep trying.
Thank you.
