
Switch Port discards

Nehru Becirovic
Level 1

On one of the ports on our recently upgraded switch we are getting a high number of discards. The switch we use is a WS-C3650-24PD-E. We use SolarWinds for switch/port monitoring, and these discards are happening at random intervals. For the last few days there were none, and then yesterday and today there are over a million discards on the given port.

 

Port Configuration:

 

interface GigabitEthernet1/0/19
switchport access vlan 2
switchport mode access
speed 1000
duplex full
end

 

Show Interface:

 

GigabitEthernet1/0/19 is up, line protocol is up (connected)
MTU 1500 bytes, BW 1000000 Kbit/sec, DLY 10 usec,
reliability 255/255, txload 1/255, rxload 1/255
Encapsulation ARPA, loopback not set
Keepalive set (10 sec)
Full-duplex, 1000Mb/s, media type is 10/100/1000BaseTX
input flow-control is off, output flow-control is unsupported
ARP type: ARPA, ARP Timeout 04:00:00
Last input never, output 00:00:00, output hang never
Last clearing of "show interface" counters 2w4d
Input queue: 0/2000/0/0 (size/max/drops/flushes); Total output drops: 2279088
Queueing strategy: fifo
Output queue: 0/40 (size/max)
5 minute input rate 700000 bits/sec, 286 packets/sec
5 minute output rate 3388000 bits/sec, 480 packets/sec
136094155 packets input, 51417445035 bytes, 0 no buffer
Received 190438 broadcasts (4216 multicasts)
0 runts, 0 giants, 0 throttles
0 input errors, 0 CRC, 0 frame, 0 overrun, 0 ignored
0 watchdog, 4216 multicast, 0 pause input
0 input packets with dribble condition detected
218477298 packets output, 186197856830 bytes, 0 underruns
0 output errors, 0 collisions, 0 interface resets
0 unknown protocol drops
0 babbles, 0 late collision, 0 deferred
0 lost carrier, 0 no carrier, 0 pause output
0 output buffer failures, 0 output buffers swapped out

 

As you can see, the total output drops count is very high. Before I set up a port monitor and Wireshark, does anyone have any idea what might be causing this?
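(For reference, the SPAN session for that capture would look something like the lines below; GigabitEthernet1/0/20 is just a placeholder for whichever port the capture PC ends up on.)

monitor session 1 source interface GigabitEthernet1/0/19
monitor session 1 destination interface GigabitEthernet1/0/20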

9 Replies

Joseph W. Doherty
Hall of Fame
Can the same device's ingress ports overrun this gig port's egress? I.e., could multiple gig ports, or any 10G ports, send data to this port? If so, congestion is always a possibility, and with enough port congestion, drops tend to occur.

To add to J. Doherty's comment: particularly bursty data could cause this, especially in cases with higher-rate incoming uplinks (10 Gig or more). With bursty data, the gig port becomes the bottleneck during the initial phase of the transmission.

One thing to do is to change the load interval on the port to 30 seconds for a more accurate utilization reading, and then see whether the drops increment slowly or come in bursts.
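For example, something along these lines should set it (assuming the same interface as in the original post):

interface GigabitEthernet1/0/19
 load-interval 30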

Chris's suggestion to set the load interval to 30 is great, as it may indeed provide better insight into short-term port usage, but keep in mind that even at that setting, a lot can happen in 30 seconds, especially with gig. Microbursts are in the microsecond to millisecond range. Also, a high drop rate can cause the transfer rate to drop too; i.e., low utilization can sometimes actually be indicative of (too) high utilization.

For your measured time period, again using the 30 seconds Chris suggests, try plotting both utilization and drops.

That's the thing. The highest burst of traffic I've seen was 28 Mbps; it's usually well below that, and the port never comes close to being fully utilized. I've looked at SolarWinds and there are no high bursts of traffic at all.

Understood, but to elaborate on J. Doherty's earlier comment: these could be sub-second microbursts. I doubt you would see them in SolarWinds, nor with the 30-second load interval setting. For example, prior to the 28 Mbps reading, there may have been a much higher burst rate that caused the drops.

The key is to identify whether the drops are happening at a constant rate (which may be an issue) or whether they happen occasionally and coincide with increases in traffic.

One other thing: I noticed you have the port speed and duplex hard-coded. With gig you really shouldn't have to do that. Unless it is necessary, you may want to set the port back to auto (it will bounce when you do). I doubt it will have an effect on the drops, but it certainly couldn't hurt.
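If you do go back to auto, removing the hard-coded values is enough (again assuming the same interface):

interface GigabitEthernet1/0/19
 no speed
 no duplex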

Regards

Hi,

 

This is likely a microburst thing (as others have mentioned). You say you don't see over 28 Mbps, but that measurement is taken over what timeframe? 60 seconds? Microbursts are short-lived and quite difficult to detect unless you measure transfer stats over small periods (i.e. 1 second).

 

What shows if you issue this:

 

 

show interface IF_NAME counters errors

 

If you have a lot of Xmit-Err, that could prove it's a microburst problem, or the port being overrun by speed conversions.
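For instance, against the port from the original post that would be:

show interface GigabitEthernet1/0/19 counters errors

with the Xmit-Err column being the one to watch.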

 

Cisco's documentation https://www.cisco.com/c/en/us/support/docs/switches/catalyst-6500-series-switches/12027-53.html

"""

Xmit-Err: This is an indication that the internal send (Tx) buffer is full.

Common Causes: A common cause of Xmit-Err can be traffic from a high bandwidth link that is switched to a lower bandwidth link, or traffic from multiple inbound links that are switched to a single outbound link. For example, if a large amount of bursty traffic comes in on a gigabit interface and is switched out to a 100Mbps interface, this can cause Xmit-Err to increment on the 100Mbps interface. This is because the output buffer of the interface is overwhelmed by the excess traffic due to the speed mismatch between the inbound and outbound bandwidths.

"""

 

It's for the 6500, but I believe this is the same across the whole Catalyst line.

 

Also, I'd start checking the switch closely for output drops, first by resetting the counters:

clear counters IF_NAME

And then checking every second whether this increases. If you see something like:

 

1.- No drops at all

2.- Sudden drops (a lot)

3.- No drops at all

4.- No drops at all

5.- Sudden drops (a lot)
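
For example, after clearing the counters you could repeat something like this every second or so (same interface assumed) and watch how the counter moves:

show interfaces GigabitEthernet1/0/19 | include Total output drops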

 

That would also point to a microburst thing. If that's the case, you either increase the capacity of that link or increase the buffer size.

 

I had to troubleshoot an issue like this on a 3850, where the peak rate over a single 1GE interface never went over 300 Mbps but a lot of drops were seen. It turned out this interface was used as the uplink for 4 other devices, each using a 1GE port, and upon further inspection we saw it was due to microbursting.

 

We ended up using this command to increase the buffer allocation, and the drops ended:

 

qos queue-softmax-multiplier 1200

here's the documentation for this: https://www.cisco.com/c/en/us/td/docs/switches/lan/catalyst3850/software/release/3e/qos/command_reference/b_qos_3e_3850_cr/b_qos_3e_3850_cr_chapter_010.html#wp4043987550

 

Although I'm not sure whether this applies to the 3650 as well.
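If it does apply, it's a global configuration command, so applying it would look roughly like this:

configure terminal
 qos queue-softmax-multiplier 1200
end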

 

Now, if the issue isn't microburst-related, then it could probably be a hardware fault, in which case you should open a TAC case.

 

HTH

Please remember to rate useful posts

Hello,

 

On a side note, what is connected to this access port?

It's our SolarWinds server.

Running on what? Windows Server (X)?
