01-15-2020 03:55 PM - edited 01-15-2020 03:56 PM
Dear All,
Our database team is complaining about poor Oracle replication performance between two hosts. Both are connected at 1 Gbps to a Nexus 9k. The interfaces are error free, and we see no congestion between the two hosts.
The only potential problem I can see is the OutDiscards on the server access ports (this happens on most access ports):
switch# show interface counter errors

--------------------------------------------------------------------------------
Port         Align-Err    FCS-Err   Xmit-Err    Rcv-Err  UnderSize  OutDiscards
--------------------------------------------------------------------------------
[...]
Eth1/16              0          0          0          0          0     11566590
These appear to be tail drops in queue 0:
chges-d-falsc-03# show queuing interface eth1/16

slot  1
=======

Egress Queuing for Ethernet1/16 [System]
------------------------------------------------------------------------------
QoS-Group#  Bandwidth%  PrioLevel         Shape                       QLimit
                                    Min         Max        Units
------------------------------------------------------------------------------
      3           -         1     100000000   100000000    bps        0(D)
      2           -         2     200000000   200000000    bps        0(D)
      1           1         -         -           -         -         9(D)
      0          99         -         -           -         -         9(D)

+-------------------------------------------------------------+
|                         QOS GROUP 0                         |
+-------------------------------------------------------------+
|                            |       Unicast     | Multicast  |
+-------------------------------------------------------------+
|                    Tx Pkts |      106082188409 |    2820940 |
|                    Tx Byts |   156539424371562 |  265176863 |
| WRED/AFD & Tail Drop Pkts  |          11566590 |          0 |
| WRED/AFD & Tail Drop Byts  |       17766072481 |          0 |
|               Q Depth Byts |                 0 |          0 |
|        WD & Tail Drop Pkts |          11566590 |          0 |
+-------------------------------------------------------------+
|                         QOS GROUP 1                         |
+-------------------------------------------------------------+
|                            |       Unicast     | Multicast  |
+-------------------------------------------------------------+
|                    Tx Pkts |                 0 |          0 |
|                    Tx Byts |                 0 |          0 |
| WRED/AFD & Tail Drop Pkts  |                 0 |          0 |
| WRED/AFD & Tail Drop Byts  |                 0 |          0 |
|               Q Depth Byts |                 0 |          0 |
|        WD & Tail Drop Pkts |                 0 |          0 |
+-------------------------------------------------------------+
[... all QoS groups up to group 7 have 0 drops ...]
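For scale, a quick back-of-the-envelope calculation from the unicast counters above shows the drop rate on this queue. Even a rate this small (around 0.01%) can noticeably hurt a single long-lived TCP flow such as replication, because each loss triggers retransmission and a congestion-window reduction:

```python
# Drop rate for QoS group 0 unicast traffic on Eth1/16,
# using the counters from the "show queuing" output above.
tx_pkts = 106_082_188_409      # Tx Pkts (delivered)
dropped_pkts = 11_566_590      # WD & Tail Drop Pkts

drop_rate = dropped_pkts / (tx_pkts + dropped_pkts)
print(f"{drop_rate:.4%}")  # roughly 0.011% of offered packets tail-dropped
```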
I have a queuing policy applied at the system level:
policy-map type queuing PM-OUT-QUEUE
  class type queuing c-out-q3
    priority level 1
    queue-limit dynamic 0
    shape min 100 mbps max 100 mbps
  class type queuing c-out-q2
    priority level 2
    queue-limit dynamic 0
    shape min 200 mbps max 200 mbps
  class type queuing c-out-q1
    bandwidth remaining percent 1
  class type queuing c-out-q-default     <-- I think this is QOS GROUP 0
    bandwidth remaining percent 99
    random-detect threshold burst-optimized

system qos
  service-policy type queuing output PM-OUT-QUEUE
  service-policy type network-qos PM-BOTH-JUMBO
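Since the drops are all in c-out-q-default (QoS group 0), one avenue worth testing is letting that queue claim a larger share of the shared buffer during bursts. The sketch below raises the dynamic queue-limit (the buffer "alpha", shown as 9(D) in the queuing output) for the default class. This is an illustration only; the exact keyword syntax, valid range, and default value depend on the Nexus 9000 model and NX-OS release, so verify it in the QoS configuration guide for your platform before applying:

```
policy-map type queuing PM-OUT-QUEUE
  class type queuing c-out-q-default
    bandwidth remaining percent 99
    ! Hypothetical tuning: raise the dynamic buffer threshold above the
    ! default so bursty 1G access ports can absorb larger microbursts.
    queue-limit dynamic 10
```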
Does anyone know what causes these output drops? I see them on most server interfaces, but not on the 40G uplinks. Could they be hurting DB replication performance?
Thanks in advance for any assistance.
01-16-2020 05:29 AM
Hello j.a.m.e.s,
Output drops can occur for various reasons:
1) A VLAN mismatch between the two sides.
2) The port sending pause frames (when it receives more traffic than it can process).
3) Your QoS policy hitting its limit.
Please see the link below.
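Points 2 and 3 can be checked directly on the switch. A sketch of commands that may help (output format and option names can vary by NX-OS release, so treat this as a starting point rather than exact syntax):

```
show interface eth1/16 flowcontrol      ! pause frames sent/received on the port
show interface eth1/16                  ! link state, errors, input/output discards
show queuing interface eth1/16          ! per-QoS-group tail drops, as posted above
```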
Please rate and mark posts accordingly if you have found any of the information provided useful.
It will hopefully assist others with similar issues in the future.
Best regards,
Lucas Freitas
01-26-2020 04:58 AM
Is there any problem with using an N9k to host busy endpoints like this? Obviously the N9k is targeted at data centres, but compared with, say, a Catalyst, are the buffers sufficient for a busy host such as a DB server or NAS?