05-23-2018 08:08 AM - edited 03-08-2019 03:06 PM
I have a Cisco Nexus 3064PQ switch running the latest image, "nxos.7.0.3.I4.7.bin", used purely as an L3 device: my ISP's 10Gbps link terminates on it and it is currently carrying about 10Gbps of live traffic. I have noticed periodic packet loss and am not sure where it is coming from. After some digging I found that the packet loss lines up with CPU spikes of up to 70%.
In the output below you can see a spike to about 60%, and at that same moment I saw one ping drop. How do I debug this and find out what is causing the spike and why it happens periodically? It recurs roughly every 1 minute 30 seconds, constantly.
[show processes cpu history output omitted: the per-second graph (last 60 seconds, # = average CPU%) shows a single spike to roughly 60-65% against an average of about 10%, while the per-minute graph (last 60 minutes, * = maximum CPU%, # = average CPU%) shows maximums repeatedly reaching 50-90% with the average staying around 10-20%.]
CPU process table output during a ~50% spike:
# show processes cpu sort | ex 0.00

 PID    Runtime(ms)   Invoked      uSecs  1Sec    Process
 -----  -----------   ----------   -----  ------  -----------
 12624  641746706     456010193    1407   7.00%   t2usd
 27     1262455737    1006811706   1253   4.00%   ksmd
 11145  288596961     111352447    2591   2.00%   pfmclnt
 11367  113           253          448    1.00%   arp
 11402  200           349          575    1.00%   netstack

CPU util  :  51.33% user,  9.62% kernel,  39.03% idle
Please note that only processes from the requested vdc are shown above
xdist5e1# show processes cpu sort | ex 0.00

 PID    Runtime(ms)   Invoked      uSecs  1Sec    Process
 -----  -----------   ----------   -----  ------  -----------
 12624  641774351     456031502    1407   26.50%  t2usd
 27     1262516503    1006859899   1253   8.00%   ksmd
 11367  113           253          448    2.00%   arp
 11371  149           106          1406   2.00%   pktmgr
 12764  5010346       18794402     266    2.00%   ipfib
 11356  79            43           1838   1.00%   adjmgr
 11402  200           349          575    1.00%   netstack
 12261  116           65           1799   1.00%   rpm
 12271  3321325       27299929     121    1.00%   ipfib
 12334  23532716      29888867     787    1.00%   l2fm

CPU util  :  57.83% user,  5.40% kernel,  36.75% idle
Please note that only processes from the requested vdc are shown above
We are not running BGP or any other routing protocol; the switch just uses simple SVIs and static routing.
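For anyone debugging a similar spike, a minimal sketch of the usual next checks on NX-OS, to see what traffic is actually being punted to the CPU (the frame limit below is an arbitrary value):

# show policy-map interface control-plane | inc class|drop
# ethanalyzer local interface inband limit-captured-frames 100

The first command shows which CoPP class is dropping packets, and the second captures a sample of the frames hitting the supervisor.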
05-25-2018 04:47 AM - edited 05-25-2018 09:06 AM
What is the t2usd process, and why is it always the one eating high CPU during the spike? I am also seeing many log entries like the following; any idea what is going on?
# show system internal feature-mgr event-history errors
...
935) Event:E_DEBUG, length:165, at 822813 usecs after Fri May 25 11:47:11 2018
    [101] fm_handle_cmi_get_feature_status(1101): (vdc=1, pid=11175): Failed to get the current status of service with uuid 0. Error code: 0x401e0005 (service not found)

936) Event:E_DEBUG, length:174, at 820098 usecs after Fri May 25 11:47:11 2018
    [101] fm_handle_cmi_get_feature_status(1101): (vdc=1, pid=11175): Failed to get the current status of service with uuid 1342177584. Error code: 0x401e0005 (service not found)

937) Event:E_DEBUG, length:174, at 819929 usecs after Fri May 25 11:47:11 2018
    [101] fm_handle_cmi_get_feature_status(1101): (vdc=1, pid=11175): Failed to get the current status of service with uuid 1325400368. Error code: 0x401e0005 (service not found)

938) Event:E_DEBUG, length:174, at 819751 usecs after Fri May 25 11:47:11 2018
    [101] fm_handle_cmi_get_feature_status(1101): (vdc=1, pid=11175): Failed to get the current status of service with uuid 1308623152. Error code: 0x401e0005 (service not found)

939) Event:E_DEBUG, length:174, at 817898 usecs after Fri May 25 11:47:11 2018
    [101] fm_handle_cmi_get_feature_status(1101): (vdc=1, pid=11175): Failed to get the current status of service with uuid 1291845936. Error code: 0x401e0005 (service not found)

940) Event:E_DEBUG, length:174, at 817730 usecs after Fri May 25 11:47:11 2018
    [101] fm_handle_cmi_get_feature_status(1101): (vdc=1, pid=11175): Failed to get the current status of service with uuid 1275068720. Error code: 0x401e0005 (service not found)
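In case it helps with decoding those messages, the uuid values can usually be mapped back to service names with something like this (assuming the command is available in this NX-OS release):

# show system internal sysmgr service all

which lists each running service together with its UUID and PID.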
05-28-2018 08:21 AM
Can no Cisco expert here give any advice?
06-20-2022 10:46 AM
Did you ever find details about the t2usd high CPU? I too have this on a Nexus switch.
05-23-2019 08:40 PM
Hi,
I know it's been a while but I'm curious if you ever figured this out. We're seeing something similar.
05-28-2019 10:11 AM
Yes, I figured it out; the issue was spanning-tree related.
During the investigation I found CoPP (Control Plane Policing) dropping lots of glean and ARP packets, with drop counts in the millions, which is in no way normal. Glean traffic is directly related to ARP flooding, which gave me the clue to look at the ARP tables.
# show policy-map interface control-plane
...
...
  class-map copp-s-glean (match-any)
    police pps 500
      OutPackets    3371
      DropPackets   19911477
I found the ARP table being flushed roughly every 85 seconds for all hosts in the connected subnets, which is **NOT** normal. You can see it in the Age column of the following output: every host has the same ARP age because the whole table was flushed, and this happened about every 85 seconds.
# sh ip arp
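Illustratively (the addresses and MACs below are placeholders, not my real output; the telling sign is that every entry shows the same Age):

Address         Age       MAC Address      Interface
10.0.0.11       00:00:41  xxxx.xxxx.xxxx   Vlan100
10.0.0.12       00:00:41  xxxx.xxxx.xxxx   Vlan100
10.0.0.13       00:00:41  xxxx.xxxx.xxxx   Vlan100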
Let's look at spanning-tree. Notice that e1/36 just had a topology change; that is the interesting lead to dig into:
# show spanning-tree detail | inc ieee|occurr|from
Number of topology changes 4 last change occurred 3287:50:33 ago
from port-channel1
Number of topology changes 139 last change occurred 141:18:14 ago
from Ethernet1/47
Number of topology changes 139 last change occurred 309:32:43 ago
from Ethernet1/47
Number of topology changes 5867 last change occurred 260:38:12 ago
from Ethernet1/47
Number of topology changes 154 last change occurred 309:32:42 ago
from Ethernet1/47
Number of topology changes 118639 last change occurred 0:01:06 ago
from Ethernet1/36
Number of topology changes 124315 last change occurred 0:01:06 ago
from Ethernet1/36
Number of topology changes 137 last change occurred 309:32:42 ago
from Ethernet1/47
I found that e1/36 did not have spanning-tree port type edge configured, and the host connected to it kept rebooting (bad hardware), so every reboot triggered a topology change.
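For anyone hitting the same thing, the fix on a host-facing port is to mark it as an edge port so a link flap there no longer generates a topology change notification. Roughly (the interface matches my case; adjust to yours):

interface Ethernet1/36
  spanning-tree port type edge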