11-29-2021 01:40 AM - edited 11-29-2021 02:06 AM
Hi all,
I have a stack of 4 Catalyst 3850-24X switches working as a distribution switch, sitting between a Nexus 7K core switch and 34 C3850 access switches/stacks. I have been struggling with high CPU utilization on the distribution switch. I upgraded the firmware from 16.03.06 to 16.12.5b without a noticeable change in CPU levels.
The processes that consume the most CPU are SISF Switcher Th, Spanning Tree, and Crimson flush tr. Sometimes MATM RP Shim Pro and VMATM Callback spike sharply, driving the switch to 100% and eventually causing a network outage for a considerable amount of time.
I reviewed the STP configuration on the entire network to make sure there isn't a misconfig somewhere.
Here is show version output:
Cisco IOS XE Software, Version 16.12.05b
Cisco IOS Software [Gibraltar], Catalyst L3 Switch Software (CAT3K_CAA-UNIVERSALK9-M), Version 16.12.5b, RELEASE SOFTWARE (fc3)
Technical Support: http://www.cisco.com/techsupport
Copyright (c) 1986-2021 by Cisco Systems, Inc.
Compiled Thu 25-Mar-21 13:09 by mcpre

Cisco IOS-XE software, Copyright (c) 2005-2021 by cisco Systems, Inc.
All rights reserved. Certain components of Cisco IOS-XE software are
licensed under the GNU General Public License ("GPL") Version 2.0. The
software code licensed under GPL Version 2.0 is free software that comes
with ABSOLUTELY NO WARRANTY. You can redistribute and/or modify such
GPL code under the terms of GPL Version 2.0. For more details, see the
documentation or "License Notice" file accompanying the IOS-XE software,
or the applicable URL provided on the flyer accompanying the IOS-XE
software.

ROM: IOS-XE ROMMON
BOOTLDR: CAT3K_CAA Boot Loader (CAT3K_CAA-HBOOT-M) Version 4.78, RELEASE SOFTWARE (P)

Switch uptime is 15 hours, 12 minutes
Uptime for this control processor is 15 hours, 15 minutes
System returned to ROM by Reload Command at 19:44:44 UTC Sun Nov 28 2021
System restarted at 19:50:28 UTC Sun Nov 28 2021
System image file is "flash:cat3k_caa-universalk9.16.12.05b.SPA.bin"
Last reload reason: Reload Command

This product contains cryptographic features and is subject to United
States and local country laws governing import, export, transfer and
use. Delivery of Cisco cryptographic products does not imply
third-party authority to import, export, distribute or use encryption.
Importers, exporters, distributors and users are responsible for
compliance with U.S. and local country laws. By using this product you
agree to comply with applicable laws and regulations. If you are unable
to comply with U.S. and local laws, return this product immediately.

A summary of U.S. laws governing Cisco cryptographic products may be found at:
http://www.cisco.com/wwl/export/crypto/tool/stqrg.html

If you require further assistance please contact us by sending email to
export@cisco.com.

Technology Package License Information:

------------------------------------------------------------------------------
Technology-package                                     Technology-package
Current                        Type                       Next reboot
------------------------------------------------------------------------------
ipservicesk9                   Smart License              ipservicesk9
None                           Subscription Smart License None

Smart Licensing Status: UNREGISTERED/EVAL MODE

cisco WS-C3850-24XS (MIPS) processor (revision J0) with 794888K/6147K bytes of memory.
Processor board ID FCW2025F017
4 Virtual Ethernet interfaces
128 Ten Gigabit Ethernet interfaces
8 Forty Gigabit Ethernet interfaces
2048K bytes of non-volatile configuration memory.
4194304K bytes of physical memory.
255037K bytes of Crash Files at crashinfo:.
255037K bytes of Crash Files at crashinfo-2:.
255037K bytes of Crash Files at crashinfo-3:.
255037K bytes of Crash Files at crashinfo-4:.
3417161K bytes of Flash at flash:.
3417161K bytes of Flash at flash-2:.
3417161K bytes of Flash at flash-3:.
3417161K bytes of Flash at flash-4:.
0K bytes of WebUI ODM Files at webui:.

Base Ethernet MAC Address          : 00:56:2b:d9:18:00
Motherboard Assembly Number        : 73-16649-06
Motherboard Serial Number          : FOC20237ZEH
Model Revision Number              : J0
Motherboard Revision Number        : A0
Model Number                       : WS-C3850-24XS
System Serial Number               : FCW2025F017

Switch Ports Model              SW Version        SW Image              Mode
------ ----- -----              ----------        ----------            ----
*    1 34    WS-C3850-24XS      16.12.05b         CAT3K_CAA-UNIVERSALK9 BUNDLE
     2 34    WS-C3850-24XS      16.12.05b         CAT3K_CAA-UNIVERSALK9 BUNDLE
     3 34    WS-C3850-24XS      16.12.05b         CAT3K_CAA-UNIVERSALK9 BUNDLE
     4 34    WS-C3850-24XS      16.12.05b         CAT3K_CAA-UNIVERSALK9 BUNDLE

Switch 02
---------
Switch uptime                      : 15 hours, 15 minutes
Base Ethernet MAC Address          : 00:56:2b:fb:b3:80
Motherboard Assembly Number        : 73-16649-06
Motherboard Serial Number          : FOC20237ZF2
Model Revision Number              : J0
Motherboard Revision Number        : A0
Model Number                       : WS-C3850-24XS
System Serial Number               : FCW2025C0KA
Last reload reason                 : Reload Command

Switch 03
---------
Switch uptime                      : 15 hours, 15 minutes
Base Ethernet MAC Address          : 00:56:2b:d9:71:80
Motherboard Assembly Number        : 73-16649-06
Motherboard Serial Number          : FOC20237ZG0
Model Revision Number              : J0
Motherboard Revision Number        : A0
Model Number                       : WS-C3850-24XS
System Serial Number               : FCW2025C09R
Last reload reason                 : Reload Command

Switch 04
---------
Switch uptime                      : 15 hours, 15 minutes
Base Ethernet MAC Address          : 00:56:2b:d8:cf:00
Motherboard Assembly Number        : 73-16649-06
Motherboard Serial Number          : FOC20237ZNA
Model Revision Number              : J0
Motherboard Revision Number        : A0
Model Number                       : WS-C3850-24XS
System Serial Number               : FOC2024X19X
Last reload reason                 : Reload Command

Configuration register is 0x102
A snapshot of CPU utilization:
CPU utilization for five seconds: 94%/18%; one minute: 94%; five minutes: 90%
 PID Runtime(ms)    Invoked  uSecs    5Sec    1Min    5Min TTY Process
 355    22207257   31706826    700  25.91%  22.36%  21.51%   0 SISF Switcher Th
 100     6240314     300554  20762  20.47%   6.02%   6.38%   0 Crimson flush tr
 250     9225912   11647469    792   9.67%  10.67%  12.53%   0 Spanning Tree
 356     5010254    7654994    654   9.11%   5.43%   5.58%   0 SISF Main Thread
  52     3076582   10132512    303   8.39%   3.35%   2.93%   0 ARP Snoop
 126     3088462   29867557    103   3.03%   3.72%   3.61%   0 IOSXE-RP Punt Se
 324      798535   10202444     78   1.43%   2.52%   2.16%   0 DAI Packet Proce
 174      335234     539281    621   0.95%   4.32%   2.99%   0 MATM RP Shim Pro
  80      525196    1199488    437   0.87%   2.43%   2.09%   0 IOSD ipc task
 222      199145     223304    891   0.23%   0.23%   0.23%   0 CDP Protocol
 539      136525     385569    354   0.15%   0.16%   0.16%   0 LLDP Protocol
 305      287384    1185684    242   0.15%   3.31%   1.86%   0 IGMPSN
 398       60295    1230894     48   0.15%   0.06%   0.06%   0 MMA DB TIMER
  98       97747     555629    175   0.15%   0.58%   0.46%   0 cpf_process_tpQ
 431       60906    1230847     49   0.15%   0.05%   0.06%   0 MMA DP TIMER
 432       58013    2445117     23   0.15%   0.04%   0.05%   0 MMON MENG
  15       41144     402680    102   0.07%   0.02%   0.03%   0 DB Lock Manager
 538       30671     397301     77   0.07%   0.04%   0.02%   0 ONEP Network Ele
 149       15627      32888    475   0.07%   0.02%   0.00%   0 SFF8472
 204       60996    1230802     49   0.07%   0.04%   0.05%   0 VRRS Main thread
Any help would be highly appreciated.
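For anyone comparing snapshots like the one above over time, here is a minimal Python sketch that pulls the per-process percentages out of pasted "show processes cpu sorted" rows and lists the processes whose 5-minute average is at or above a cutoff. The sample rows are copied from the snapshot; the function names and the 5% cutoff are assumptions chosen for illustration, not anything defined by IOS XE.

```python
import re

# Rows copied from the CPU snapshot above. Columns: PID, Runtime(ms),
# Invoked, uSecs, 5Sec, 1Min, 5Min, TTY, Process.
SAMPLE = """\
 355    22207257   31706826    700  25.91%  22.36%  21.51%   0 SISF Switcher Th
 100     6240314     300554  20762  20.47%   6.02%   6.38%   0 Crimson flush tr
 250     9225912   11647469    792   9.67%  10.67%  12.53%   0 Spanning Tree
 356     5010254    7654994    654   9.11%   5.43%   5.58%   0 SISF Main Thread
  52     3076582   10132512    303   8.39%   3.35%   2.93%   0 ARP Snoop
"""

# One process row: four integer columns, three percentages, TTY, then the name.
ROW = re.compile(
    r"^\s*(?P<pid>\d+)\s+\d+\s+\d+\s+\d+\s+"
    r"(?P<s5>[\d.]+)%\s+(?P<m1>[\d.]+)%\s+(?P<m5>[\d.]+)%\s+"
    r"\d+\s+(?P<name>.+?)\s*$"
)


def parse_cpu_rows(text):
    """Return a list of (process name, 5-second %, 5-minute %) tuples."""
    rows = []
    for line in text.splitlines():
        m = ROW.match(line)
        if m:  # header and summary lines do not match and are skipped
            rows.append((m["name"], float(m["s5"]), float(m["m5"])))
    return rows


def sustained_hogs(rows, threshold=5.0):
    """Names of processes whose 5-minute average is >= threshold percent.

    The 5% default is an arbitrary illustration value.
    """
    return [name for name, _s5, m5 in rows if m5 >= threshold]


if __name__ == "__main__":
    for name in sustained_hogs(parse_cpu_rows(SAMPLE)):
        print(name)
```

Filtering on the 5-minute column rather than the 5-second column separates sustained load (SISF, Spanning Tree) from short bursts.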
12-03-2021 03:55 AM - edited 12-03-2021 06:39 AM
Thanks to everyone who tried to help.
I found a post here that solved the high CPU utilization problem: https://community.cisco.com/t5/cisco-bug-discussions/cscvk32439-ipv6-sisf-main-thread-consumes-high-cpu-dhcpv6-icmpv6/td-p/3778970
The root cause of the issue was DHCP snooping. Although I had disabled it globally with the command "no ip dhcp snooping", that did not help until I also ran "no ip dhcp snooping vlan 1-4094". CPU utilization then dropped significantly, from 85%+ to about 25%. I hope this helps anyone who has a similar problem.
[CPU history graphs posted after the change: CPU% per second (last 60 seconds), CPU% per minute (last 60 minutes), and CPU% per hour (last 72 hours); * = maximum CPU%, # = average CPU%. The per-second readings range from 17% to 37%, while the hourly maximums for most of the preceding 72 hours are at 90% or higher.]
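Since the fix above hinged on per-VLAN snooping staying active after the global command was removed, a quick way to check a saved configuration for leftover per-VLAN entries can be sketched in Python as follows. This assumes the running configuration shows per-VLAN snooping as lines beginning with "ip dhcp snooping vlan"; the config excerpt, hostname, and VLAN numbers are hypothetical and not taken from this thread.

```python
def vlan_snooping_lines(config_text):
    """Return the 'ip dhcp snooping vlan ...' lines present in a config.

    Lines starting with 'no ' are not matched, so a negated command is
    never reported as active.
    """
    return [
        line.strip()
        for line in config_text.splitlines()
        if line.strip().startswith("ip dhcp snooping vlan ")
    ]


# Hypothetical excerpt for illustration only: the global "ip dhcp snooping"
# line is absent, but a per-VLAN line is still configured.
EXAMPLE_CONFIG = """\
hostname DIST-STACK
ip dhcp snooping vlan 10,20,30
no ip dhcp snooping information option
!
interface Vlan10
"""

if __name__ == "__main__":
    for line in vlan_snooping_lines(EXAMPLE_CONFIG):
        print("still configured:", line)
```

An empty result on the distribution stack's saved configuration would suggest the per-VLAN entries are gone; any hit is a candidate for "no ip dhcp snooping vlan ...".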
11-29-2021 04:17 AM
Hello,
Which switch(es) in your network is/are the root switch(es) for your VLAN(s)?
There are numerous bugs related to SISF and Crimson flush. However, you are running the recommended version, which supposedly includes fixes for these bugs.
That said, check if the below (bug) might apply:
Crash due to "Crimson flush transactions Process"
CSCvt76409
Symptom:
Crash due to Crimson flush transactions Process.
Conditions:
Seeing sisf mac update record error due to Not enough space.
Workaround:
- enable service internal
- device-tracking tdl-disable
Further Problem Description:
This happens only when device-tracking is enabled, which may be explicit (cli) or implicit (started by some other feature like lisp, ip dhcp snooping, dot1x, etc.)
11-29-2021 04:31 AM
Hello George,
Thanks for your response.
The root bridge for all the VLANs is the core switch (Nexus 7k).
We are using dot1x and DHCP snooping on the access layer but not on the distribution layer. If I apply the suggested workaround on the distribution switch, would it impact dot1x or DHCP snooping on the access layer?
11-29-2021 04:51 AM
Hello,
I have no idea if that will impact the distribution switch to be honest. You might want to test this after hours...
11-29-2021 05:56 PM - edited 11-29-2021 05:59 PM
@melsayeh wrote:
WS-C3850-24XS 16.12.05b CAT3K_CAA-UNIVERSALK9 BUNDLE
The stack is in Bundle Mode. Read Cisco 3850: IOS-XE/Firmware Upgrade.
Post the complete output of the command "sh platform software status control-processor brief".
11-30-2021 02:48 AM
Hi Leo,
Can bundle mode cause CPU issues?
Here is the output of "sh platform software status control-processor brief":
Load Average
 Slot  Status  1-Min  5-Min 15-Min
1-RP0 Healthy   2.54   2.62   2.52
2-RP0 Healthy   0.24   0.25   0.31
3-RP0 Healthy   0.09   0.28   0.30
4-RP0 Healthy   0.16   0.20   0.19

Memory (kB)
 Slot  Status    Total     Used (Pct)     Free (Pct) Committed (Pct)
1-RP0 Healthy  3965144  2625524 (66%)  1339620 (34%)   3569596 (90%)
2-RP0 Healthy  3965144  2488204 (63%)  1476940 (37%)   3495812 (88%)
3-RP0 Healthy  3965144  1857480 (47%)  2107664 (53%)   2378336 (60%)
4-RP0 Healthy  3965144  1854232 (47%)  2110912 (53%)   2374248 (60%)

CPU Utilization
 Slot  CPU   User System   Nice   Idle    IRQ   SIRQ IOwait
1-RP0    0  14.40   6.20   0.00  79.30   0.00   0.10   0.00
         1  17.71   5.10   0.00  77.17   0.00   0.00   0.00
         2  28.22  10.91   0.00  60.76   0.00   0.10   0.00
         3  15.26   5.68   0.00  78.94   0.00   0.09   0.00
         4  11.10   3.10   0.00  85.80   0.00   0.00   0.00
         5  56.60   5.70   0.00  37.70   0.00   0.00   0.00
2-RP0    0   3.40   0.70   0.00  95.90   0.00   0.00   0.00
         1   3.30   1.20   0.00  95.49   0.00   0.00   0.00
         2   1.70   0.00   0.00  98.29   0.00   0.00   0.00
         3   3.50   1.60   0.00  94.90   0.00   0.00   0.00
         4   2.99   0.59   0.00  96.40   0.00   0.00   0.00
         5   3.89   1.39   0.00  94.70   0.00   0.00   0.00
3-RP0    0   3.00   1.60   0.00  95.39   0.00   0.00   0.00
         1   1.90   0.10   0.00  97.99   0.00   0.00   0.00
         2   2.90   1.80   0.00  95.29   0.00   0.00   0.00
         3   1.20   0.30   0.00  98.49   0.00   0.00   0.00
         4   4.80   2.80   0.00  92.40   0.00   0.00   0.00
         5   1.60   1.50   0.00  96.89   0.00   0.00   0.00
4-RP0    0   1.99   0.29   0.00  97.70   0.00   0.00   0.00
         1   0.09   0.09   0.00  99.80   0.00   0.00   0.00
         2   0.49   0.39   0.00  99.10   0.00   0.00   0.00
         3   1.80   0.50   0.00  97.60   0.00   0.10   0.00
         4   4.60   1.90   0.00  93.50   0.00   0.00   0.00
         5   2.59   0.69   0.00  96.70   0.00   0.00   0.00
11-30-2021 06:13 AM - edited 11-30-2021 06:14 AM
@melsayeh wrote:
2-RP0 Healthy 3965144 2488204 (63%) 1476940 (37%) 3495812 (88%)
That is high. Ideally, non-master stack members should be operating at 45% memory utilization or less. Anything higher than 50% is not good.
Post the output of the command "sh proc memory sort location switch 2 r0". The first page of output is enough.
@melsayeh wrote:
Can bundle mode cause CPU issues?
No, it cannot. However, convert to Install Mode, because the stack may need to be rebooted within the next 4 weeks. Otherwise, the memory leak will cause switch 2 to crash.
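The rule of thumb above (non-master members ideally at 45% or less, anything over 50% a concern) can be applied to the "Memory (kB)" rows posted earlier with a short Python sketch. It assumes the threshold refers to the Used column and that switch 1 is the active member (it carries the "*" in the show version output); both are interpretations, and the function name is made up for this example.

```python
import re

# "Memory (kB)" rows copied from the control-processor output earlier
# in the thread. Columns: Slot, Status, Total, Used (Pct), Free (Pct),
# Committed (Pct).
MEMORY_ROWS = """\
1-RP0 Healthy 3965144 2625524 (66%) 1339620 (34%) 3569596 (90%)
2-RP0 Healthy 3965144 2488204 (63%) 1476940 (37%) 3495812 (88%)
3-RP0 Healthy 3965144 1857480 (47%) 2107664 (53%) 2378336 (60%)
4-RP0 Healthy 3965144 1854232 (47%) 2110912 (53%) 2374248 (60%)
"""

# Captures the slot number, total kB and used kB from one row.
ROW = re.compile(r"^\s*(\d+)-RP0\s+\S+\s+(\d+)\s+(\d+)\s+\(\d+%\)")


def members_over(text, active_slot=1, limit_pct=50.0):
    """Return {slot: used %} for non-active members above limit_pct.

    Assumption: the rule of thumb is applied to Used, not Committed.
    """
    flagged = {}
    for line in text.splitlines():
        m = ROW.match(line)
        if not m:
            continue
        slot, total, used = (int(g) for g in m.groups())
        pct = 100.0 * used / total
        if slot != active_slot and pct > limit_pct:
            flagged[slot] = round(pct, 1)
    return flagged


if __name__ == "__main__":
    print(members_over(MEMORY_ROWS))
```

With the rows above, only switch 2 is flagged, which matches the member singled out in the reply.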
11-30-2021 11:32 AM
Hi Leo,
The command "sh proc memory sort location switch 2 r0" was not recognized.
However, here is the output of "sh proc memory sorted":
Processor Pool Total:  813826352 Used:  312854864 Free:  500971488
reserve P Pool Total:     102404 Used:         88 Free:     102316
 lsmpi_io Pool Total:    6295128 Used:    6294296 Free:        832

 PID TTY   Allocated       Freed   Holding   Getbufs   Retbufs Process
   0   0   290592664    57994880 204275608         0         0 *Init*
   4   0    25572152     1224672  22540744         0         0 RF Slave Main Th
  80   0   452286528    73508304  11363328     13668         0 IOSD ipc task
 355   0 12438335696   311122432   8836784         0  67346784 SISF Switcher Th
   0   0   487840880   482370424   5379712  17618099    809532 *Dead*
 305   0   144177520   128407752   5104416   6040608   4342812 IGMPSN
 469   0     4228096      196880   4088216    849828         0 EEM ED Syslog
 541   0     5484504     2866992   2590368         0     16884 LACP Protocol
 356   0  1325106448 13733605088   2516280   3516688         0 SISF Main Thread
  10   0   693515560   325412072   2377368 279462916 221759564 Pool Manager
   0   0           0           0   1904896         0         0 *MallocLite*
 486   0     1851192      180736   1700048      9448         0 EEM Server
 273   0     1657752      324968   1349800         0         0 XDR receive
 423   0     2214888     1115288   1101696         0         0 Crypto CA
   1   0    10921616     9885832   1094672         0         0 Chunk Manager
 332   0     1018728      124784    961872         0         0 CEF: IPv4 proces
 230   0      810920           0    879920         0         0 IP ARP Adjacency
  73   0     2492232      274144    712200      7236         0 Net Background
 174   0  1578403512  1565096304    702280   3728220         0 MATM RP Shim Pro
 413   0      462720         896    514824         0         0 EST Client
 303   0     3556936     5870688    482536     20100         0 IGMPSN L2MCM
  44   0      975840      496568    471992         0         0 Entity MIB API
 234   0      844320      360952    458888         0         0 mDNS
 317   0     1999000     3883088    444424         0         0 MLDSN L2MCM
 470   0      388144        5680    439464     72316         0 EEM ED Generic
  31   0     9144104       84736    431912         0         0 IPC Seat RX Cont
 142   0      408384      108664    428560         0         0 SAMsgThread
 435   0      398096         728    376856     17808         0 Crypto IKEv2
 100   0  9894701176  9894354624    325920         0         0 Crimson flush tr
 277   0      246344       34072    279336         0         0 CEF background p
 539   0   944201384   254691544    279080      5220     45024 LLDP Protocol
 274   0      232512         896    276616         0         0 IPC LC Message H
 260   0      153728           0    270856         0         0 st_pw_oam
 359   0        1824           0    262824         0         0 COPS
 323   0  3753840624  3687928664    245680  13511176         0 VMATM Callback
 396   0      192560           0    237560         0         0 mDNS snooping
 525   0      167984         448    236536         0         0 MRIB Process
 536   0      220608        1424    230936         0         0 ONEP Network Ele
 478   0     2399384     4054624    217080         0         0 PM Callback
 101   0       98736       86544    215736         0         0 DBAL EVENTS
 495   0      262928      107696    209432         0         0 Call Home proces
 222   0   680775792   276761976    207352         0     10452 CDP Protocol
 351   0      140416           0    185416         0         0 L2FIB Event Disp
 529   3     1958536     1842512    182920         0         0 SSH Process
  89   0      283544        1328    179536         0         0 REDUNDANCY FSM
 538   0      226432      106016    173120         0         0 RADIUS
 102   0      752688      253256    170904     33308         0 EM_SHIM_TASK
 228   0       98736           0    167736         0         0 IPAM Manager
 554   0       50600        1992    167600         0         0 LICENSE AGENT
 311   0       38096         448    154648         0         0 AN
 232   0       83656      217272    151408         0     27836 IP Input
 398   0      102792           0    147792         0         0 MMA DB TIMER
 288   2     1956288     1877200    146912         0         0 SSH Process
  37   0    34648208    34685848    145744         0     12936 ARP Input
 178   0      100608           0    145608         0         0 radius radsec cl
 206   0       26864           0    143864         0         0 PKI_SSL LSC Enro
 --More--
12-02-2021 02:16 AM
Hi Leo,
was the information above helpful?
12-05-2021 03:32 PM
I provided the wrong command earlier. The correct one is: sh proc memory platform sorted location switch 2 r0
12-06-2021 03:51 AM
Hi Leo,
Thanks for this, here is the first page of the command output: "sh proc memory platform sorted location switch 2 r0"
System memory: 3965144K total, 2477816K used, 1487328K free,
Lowest: 1480120K

  Pid   Text   Data Stack Dynamic    RSS Name
----------------------------------------------------------------------
 7671 204098 600072   136     360 600072 linux_iosd-imag
12588    247 336552   128   96192 336552 fed main event
 8550   1088 124924   132    2396 124924 platform_mgr
 8112    418 122384   128    2488 122384 sif_mgr
26381   8794  78544   132    2372  78544 fman_rp
20889    179  77432   128    5520  77432 sessmgrd
20800   9875  77052   132   23612  77052 fman_fp_image
25682    911  65980   132   10560  65980 smand
26699    227  58468   132    2344  58468 dbm
28469    108  42012   132    1460  42012 pubd
21342    637  32256   128    2604  32256 repm
24427      8  25904   136    7380  25904 python2.7
27464    100  20308   128    1908  20308 psd
27069     76  18664   128      76  18664 cli_agent
 7911    534  17112   128    1768  17112 stack_mgr
 9938    600  14308   476    2524  14308 hman
10280     99  14068   128    1364  14068 bt_logger
 9041    202  13084   128    1824  13084 lman
10518    248  12760   128    1652  12760 btman
25981    306  12124   132    3204  12124 tms
12835    248  11096   128    1480  11096 btman
 9244    147  10836   128    1284  10836 keyman
13706     89  10340   128    2208  10340 tams_proc
 8794   1096  10008   404    7808  10008 ncd.sh
10912   1096   9804   400    7624   9804 auto_upgrade_cl
 9458   1096   9760   404    7612   9760 issu_stack.sh
 4189    545   9600   132     132   9600 libvirtd
 6751   1096   9096   400    5276   9096 rollback_timer.
15804   1096   8804   404    7480   8804 issu_stack.sh
15796   1096   8748   404    7480   8748 issu_stack.sh
21631    112   8704   132    1660   8704 plogd
26909    170   8200   128    1124   8200 cmm
 8313    345   8048   132     268   8048 nif_mgr
15229    120   7908   132    1104   7908 epc_ws_liaison
11478   1096   7404   408    5264   7404 periodic.sh
 7010   1096   7256   404    3624   7256 psvp.sh
 4195   1096   6884   400    3072   6884 droputil.sh
 4485   1096   6832   400    3072   6832 reflector.sh
13607    130   6812   136     920   6812 tamd_proc
    1   1450   6512   132    1616   6512 systemd
13471     76   6080   136     672   6080 tam_svcs_ng3k_c
 7074   1096   5668   400    3452   5668 pvp.sh
 4534   1096   5492   396    1680   5492 iptbl.sh
28575     82   5436   136     416   5436 pttcd
11172   1096   5388   396    3204   5388 pvp.sh
20484   1096   5056   400    2968   5056 brelay_console.
  467    252   4752   132     132   4752 dbus-daemon
21750   1096   4508   404    2412   4508 btelnet.sh
21024   1096   4500   400    2400   4500 brelay.sh
 4170    687   4412   132     132   4412 virtlogd
 4458     10   4240   132     268   4240 rotee
 6922     10   4216   132     268   4216 rotee
 4643     10   4216   132     268   4216 rotee
 4677     10   4156   132     268   4156 rotee
 4344     10   4132   132     268   4132 rotee
12-06-2021 04:51 PM
Long-story-short, avoid using 16.12.X. PERIOD.
12-07-2021 12:58 AM
What version would you recommend?
12-07-2021 01:22 AM
Latest 16.6.X or 16.9.X.
10-11-2022 11:12 AM
I am here because of a lagging switch, seemingly caused by high CPU from spanning tree. We use DHCP snooping and it seems to be affecting our Extreme (Aerohive) APs. The switch is currently on IOS 16.6.7. All the other 3650s are on 3.06.06E. I'm thinking of loading that build on this one to see if it resolves the problems.