Re: 3650 Buffers / Output Drops

GRANT3779 · ‎10-05-2016

Does anyone have experience with the above switches? I have multiple stacks (no more than 5 switches per stack) and am seeing a very high number of output errors on a lot of ports. The total number of output drops exactly matches the output errors, e.g.

Input queue: 0/2000/0/0 (size/max/drops/flushes); Total output drops: 17890212

17890212 output errors, 0 collisions, 1 interface resets

Ports are running full duplex at 1 Gbps.

Checking the CPU history, I also see that the switch has spiked to 100% on multiple occasions over a 72-hour period.

I have read that this could be a buffering issue, possibly due to bursty traffic, and that not much can actually be done about it.

Does this sound like a potential switch performance problem?

abpsoft · ‎10-06-2016

Hi,

Same issue here, on every 3650 I have access to. Output errors count output discards (which seems wrong; I never saw that before on Cisco gear, and I've deployed plenty). Example from SNMP:

IF-MIB::ifOutDiscards.16 = Counter32: 652000252
IF-MIB::ifOutDiscards.17 = Counter32: 3619677524
IF-MIB::ifOutDiscards.18 = Counter32: 3116752094
IF-MIB::ifOutDiscards.19 = Counter32: 4169394348
IF-MIB::ifOutDiscards.20 = Counter32: 1617800266
IF-MIB::ifOutDiscards.21 = Counter32: 3103643133
IF-MIB::ifOutDiscards.22 = Counter32: 3295355258
IF-MIB::ifOutDiscards.23 = Counter32: 4029261027
[...]
IF-MIB::ifOutErrors.16 = Counter32: 652000252
IF-MIB::ifOutErrors.17 = Counter32: 3619677524
IF-MIB::ifOutErrors.18 = Counter32: 3116752094
IF-MIB::ifOutErrors.19 = Counter32: 4169394348
IF-MIB::ifOutErrors.20 = Counter32: 1617800266
IF-MIB::ifOutErrors.21 = Counter32: 3103643133
IF-MIB::ifOutErrors.22 = Counter32: 3295355258
IF-MIB::ifOutErrors.23 = Counter32: 4029261027

Please note how ifOutErrors exactly follows the ifOutDiscards counters on these ports. I consider this unnatural: discards are not errors, and they've got their own counters for a reason. It's also unclear whether the rate of discards (and resulting errors) is really correct. My monitoring sometimes reports that loaded interfaces have an error rate of 50% or so, which would be entirely fatal, but at the same time a ping going through that port is perfectly loss-free.

I'm seeing this on standalone 3650s as well as stacked ones, on port-channel members as well as individual switchports, and on ports manually set to a lower speed (e.g. 100Mbps) as well as on loaded 1G ports. IOS-XE is 03.07.04.E.
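For anyone who wants to check their own switches for the same pattern, the comparison of the two counter columns can be scripted. The following is a minimal sketch in Python, assuming the snmpwalk output is available as text in the format shown in this post. The function names `parse_counters` and `mirrored_ports` are made up for illustration and are not part of any Cisco or Net-SNMP tooling.

```python
import re

# Matches lines such as "IF-MIB::ifOutErrors.16 = Counter32: 652000252".
LINE_RE = re.compile(r"IF-MIB::(\w+)\.(\d+) = Counter32: (\d+)")


def parse_counters(text):
    """Return {object_name: {ifIndex: value}} from snmpwalk-style text."""
    counters = {}
    for match in LINE_RE.finditer(text):
        name = match.group(1)
        index = int(match.group(2))
        value = int(match.group(3))
        counters.setdefault(name, {})[index] = value
    return counters


def mirrored_ports(counters):
    """List ifIndexes whose non-zero ifOutErrors equals ifOutDiscards."""
    discards = counters.get("ifOutDiscards", {})
    errors = counters.get("ifOutErrors", {})
    return sorted(
        idx for idx, value in discards.items()
        if value != 0 and errors.get(idx) == value
    )


# Two of the ports quoted in this post, used as sample input.
SAMPLE = """\
IF-MIB::ifOutDiscards.16 = Counter32: 652000252
IF-MIB::ifOutDiscards.17 = Counter32: 3619677524
IF-MIB::ifOutErrors.16 = Counter32: 652000252
IF-MIB::ifOutErrors.17 = Counter32: 3619677524
"""

if __name__ == "__main__":
    print(mirrored_ports(parse_counters(SAMPLE)))  # prints [16, 17]
```

A port that shows up in this list is not necessarily healthy or unhealthy; the check only says that the error counter carries no information beyond the discard counter, so a monitoring rule might reasonably alert on discards alone for those ports.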

According to show int Gi1/0/X controller, the only error counter that increments here is Excess Defer frames:

[...]
            0 Late collision frames                 0 SymbolErr frames         
  14502702154 Excess Defer frames                   0 Collision fragments      
            0 Good (1 coll) frames                  0 ValidUnderSize frames    
            0 Good (>1 coll) frames                 0 InvalidOverSize frames   
            0 Deferred frames                       0 ValidOverSize frames     
            0 Gold frames dropped                   0 FcsErr frames
[...]

I can't find any documentation on what excess defer might be (at least in the context of non-CSMA/CD aka Full Duplex Ethernet). Is this counter abused to count queue tail drops? This drives us crazy due to the monitoring false positives (interface errors are something serious, output discards usually aren't).
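One piece of arithmetic on the numbers quoted in this post may bear on that question. The Excess Defer value is larger than a 32-bit counter can hold, and reducing it modulo 2^32 gives exactly the ifOutDiscards value listed above for index 20. The sketch below only does that arithmetic. The post does not say which port the controller output was taken from, so pairing it with ifIndex 20 is an assumption, and the match suggests (but does not prove) that both counters are fed from the same source.

```python
# A Counter32 wraps at 2**32, so a wider hardware counter exported through
# a 32-bit SNMP object would appear reduced modulo this value.
COUNTER32_MODULUS = 2 ** 32


def fold_to_counter32(value):
    """Reduce a wider counter value to what a Counter32 would show."""
    return value % COUNTER32_MODULUS


# Values copied from this post.
EXCESS_DEFER_FRAMES = 14502702154   # from "show int Gi1/0/X controller"
IF_OUT_DISCARDS_20 = 1617800266     # IF-MIB::ifOutDiscards.20

if __name__ == "__main__":
    folded = fold_to_counter32(EXCESS_DEFER_FRAMES)
    print(folded)                         # prints 1617800266
    print(folded == IF_OUT_DISCARDS_20)   # prints True
```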

What's wrong here? Is this just this year's version of Cisco Counter Fun, or am I missing something? Replacing 3560s with 3650s got harder now that we have new failure modes...

TIA & HTH,
Andre.

Michael Muenz · ‎12-17-2016

Hi,

I can confirm this issue. IMHO this happens because we have massive traffic going from 10G out to 1G.
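To make the 10G-to-1G point concrete: while a burst arrives faster than the egress port can send, the egress queue grows at the difference between the two rates, and tail drops begin once the queue is full. The sketch below is back-of-the-envelope arithmetic only. The function name and the buffer size are hypothetical and chosen for illustration; the 1,000,000-byte figure is not the actual per-port buffer of a 3650, which this thread does not state.

```python
def seconds_until_queue_full(ingress_bps, egress_bps, buffer_bytes):
    """Time for an empty egress queue to fill under a sustained burst.

    Returns None when the egress link can keep up, in which case the
    queue never fills.
    """
    surplus_bps = ingress_bps - egress_bps
    if surplus_bps <= 0:
        return None
    return buffer_bytes * 8 / surplus_bps


# HYPOTHETICAL buffer size, used only to make the arithmetic concrete.
EXAMPLE_BUFFER_BYTES = 1_000_000

if __name__ == "__main__":
    t = seconds_until_queue_full(10e9, 1e9, EXAMPLE_BUFFER_BYTES)
    print(f"queue full after about {t * 1000:.2f} ms")
```

With these assumed numbers the queue fills in under a millisecond, which is far shorter than typical polling intervals. That would be consistent with drops accumulating on ports whose average utilisation looks low.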

Funny thing is, this happens only on a Stack with 3.7.3 but not on a Stack with 3.3.5?

If you have a test machine try to downgrade it.

There's also a known bug where output drops and output discards rise with the same values, but it is only fixed in 16.5 or a 3.6.5 interim release.

Michael

cianpobrien · ‎08-09-2018

This could be Bug CSCvb31906. It definitely affects 03.06.05, but I'd be very surprised if this doesn't happen with many versions. If you run #clear counters GigabitEthernet x/x/x followed by #clear platform qos queue stats interface GigabitEthernet x/x/x, the output drops figure for that interface goes through the roof. It's cosmetic: the drops aren't real, and I've replicated this issue in production. The 3650 platform is incredibly buggy. I've seen other instances of interface output drops incrementing without these commands being run, even though no packet loss is being seen in production.