05-23-2018 08:08 AM - edited 03-08-2019 03:06 PM
I have a Cisco Nexus 3064PQ switch running the latest image, "nxos.7.0.3.I4.7.bin", used purely as an L3 device: my ISP's 10Gbps link terminates on it and it is currently carrying about 10Gbps of live traffic. I have noticed periodic packet loss and am not sure where it is coming from. After some digging I found that the packet loss lines up with CPU spikes of up to 70%.
In the output below you can see a spike to about 60%, and at that same moment I saw one ping drop. How do I debug this and find out what is causing the spike and why it happens periodically? It recurs roughly every 1 minute 30 seconds, constantly.
[show processes cpu history output omitted: the per-second graph (last 60 seconds, # = average CPU%) shows a single spike to roughly 60-65% against an average of about 10%, while the per-minute graph (last 60 minutes, * = maximum CPU%, # = average CPU%) shows maximums repeatedly reaching 50-90% with the average staying around 10-20%.]
CPU process table output during a ~50% spike:
# show processes cpu sort | ex 0.00

 PID    Runtime(ms)   Invoked      uSecs  1Sec    Process
 -----  -----------   ----------   -----  ------  -----------
 12624  641746706     456010193    1407   7.00%   t2usd
 27     1262455737    1006811706   1253   4.00%   ksmd
 11145  288596961     111352447    2591   2.00%   pfmclnt
 11367  113           253          448    1.00%   arp
 11402  200           349          575    1.00%   netstack

CPU util  :  51.33% user,  9.62% kernel,  39.03% idle
Please note that only processes from the requested vdc are shown above
xdist5e1# show processes cpu sort | ex 0.00

 PID    Runtime(ms)   Invoked      uSecs  1Sec    Process
 -----  -----------   ----------   -----  ------  -----------
 12624  641774351     456031502    1407   26.50%  t2usd
 27     1262516503    1006859899   1253   8.00%   ksmd
 11367  113           253          448    2.00%   arp
 11371  149           106          1406   2.00%   pktmgr
 12764  5010346       18794402     266    2.00%   ipfib
 11356  79            43           1838   1.00%   adjmgr
 11402  200           349          575    1.00%   netstack
 12261  116           65           1799   1.00%   rpm
 12271  3321325       27299929     121    1.00%   ipfib
 12334  23532716      29888867     787    1.00%   l2fm

CPU util  :  57.83% user,  5.40% kernel,  36.75% idle
Please note that only processes from the requested vdc are shown above
We are not running BGP or any other routing protocol; the switch just uses simple SVIs and static routing.
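For anyone debugging a similar spike, a minimal sketch of the usual next checks on NX-OS, to see what traffic is actually being punted to the CPU (the frame limit below is an arbitrary value):

# show policy-map interface control-plane | inc class|drop
# ethanalyzer local interface inband limit-captured-frames 100

The first command shows which CoPP class is dropping packets, and the second captures a sample of the frames hitting the supervisor.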
05-25-2018 04:47 AM - edited 05-25-2018 09:06 AM
What is the t2usd process, and why is it always the one eating high CPU during the spike? I am also seeing many log entries like the following; any idea what is going on?
# show system internal feature-mgr event-history errors
...
935) Event:E_DEBUG, length:165, at 822813 usecs after Fri May 25 11:47:11 2018
    [101] fm_handle_cmi_get_feature_status(1101): (vdc=1, pid=11175): Failed to get the current status of service with uuid 0. Error code: 0x401e0005 (service not found)

936) Event:E_DEBUG, length:174, at 820098 usecs after Fri May 25 11:47:11 2018
    [101] fm_handle_cmi_get_feature_status(1101): (vdc=1, pid=11175): Failed to get the current status of service with uuid 1342177584. Error code: 0x401e0005 (service not found)

937) Event:E_DEBUG, length:174, at 819929 usecs after Fri May 25 11:47:11 2018
    [101] fm_handle_cmi_get_feature_status(1101): (vdc=1, pid=11175): Failed to get the current status of service with uuid 1325400368. Error code: 0x401e0005 (service not found)

938) Event:E_DEBUG, length:174, at 819751 usecs after Fri May 25 11:47:11 2018
    [101] fm_handle_cmi_get_feature_status(1101): (vdc=1, pid=11175): Failed to get the current status of service with uuid 1308623152. Error code: 0x401e0005 (service not found)

939) Event:E_DEBUG, length:174, at 817898 usecs after Fri May 25 11:47:11 2018
    [101] fm_handle_cmi_get_feature_status(1101): (vdc=1, pid=11175): Failed to get the current status of service with uuid 1291845936. Error code: 0x401e0005 (service not found)

940) Event:E_DEBUG, length:174, at 817730 usecs after Fri May 25 11:47:11 2018
    [101] fm_handle_cmi_get_feature_status(1101): (vdc=1, pid=11175): Failed to get the current status of service with uuid 1275068720. Error code: 0x401e0005 (service not found)
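In case it helps with decoding those messages, the uuid values can usually be mapped back to service names with something like this (assuming the command is available in this NX-OS release):

# show system internal sysmgr service all

which lists each running service together with its UUID and PID.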
05-28-2018 08:21 AM
Can no Cisco expert here give any advice?
06-20-2022 10:46 AM
Did you ever find details about the t2usd high CPU? I too have this on a Nexus switch.
05-23-2019 08:40 PM
Hi,
I know it's been a while but I'm curious if you ever figured this out. We're seeing something similar.
05-28-2019 10:11 AM
Yes, I figured it out; the issue was spanning-tree related.
During the investigation I found CoPP (Control Plane Policing) dropping lots of glean and ARP packets, with drop counts in the millions, which is in no way normal. Glean traffic is directly related to ARP flooding, which gave me the clue to look at the ARP tables.
# show policy-map interface control-plane
...
...
  class-map copp-s-glean (match-any)
    police pps 500
      OutPackets    3371
      DropPackets   19911477
I found the ARP table being flushed roughly every 85 seconds for all hosts in the connected subnets, which is **NOT** normal. You can see it in the Age column of the following output: every host has the same ARP age because the whole table was flushed, and this happened about every 85 seconds.
# sh ip arp
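Illustratively (the addresses and MACs below are placeholders, not my real output; the telling sign is that every entry shows the same Age):

Address         Age       MAC Address      Interface
10.0.0.11       00:00:41  xxxx.xxxx.xxxx   Vlan100
10.0.0.12       00:00:41  xxxx.xxxx.xxxx   Vlan100
10.0.0.13       00:00:41  xxxx.xxxx.xxxx   Vlan100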
Let's look at spanning-tree. Notice that e1/36 just had a topology change; that is the interesting lead to dig into:
# show spanning-tree detail | inc ieee|occurr|from
Number of topology changes 4 last change occurred 3287:50:33 ago
from port-channel1
Number of topology changes 139 last change occurred 141:18:14 ago
from Ethernet1/47
Number of topology changes 139 last change occurred 309:32:43 ago
from Ethernet1/47
Number of topology changes 5867 last change occurred 260:38:12 ago
from Ethernet1/47
Number of topology changes 154 last change occurred 309:32:42 ago
from Ethernet1/47
Number of topology changes 118639 last change occurred 0:01:06 ago
from Ethernet1/36
Number of topology changes 124315 last change occurred 0:01:06 ago
from Ethernet1/36
Number of topology changes 137 last change occurred 309:32:42 ago
from Ethernet1/47
I found that e1/36 did not have spanning-tree port type edge configured, and the host connected to it kept rebooting (bad hardware), so every reboot triggered a topology change.
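For anyone hitting the same thing, the fix on a host-facing port is to mark it as an edge port so a link flap there no longer generates a topology change notification. Roughly (the interface matches my case; adjust to yours):

interface Ethernet1/36
  spanning-tree port type edge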