06-28-2023 04:54 AM
Hi
I have a Cat 9300 that I upgraded to 17.06.05, but I keep getting a CPU warning for the RP in the log. According to the log, it is happening all the time.
070030: Jun 27 14:34:37.513: %PLATFORM-4-ELEMENT_WARNING: Switch 1 R0/0: smand: 1/RP/0: 5-Minute Load Average value 6.19 exceeds warning level 5.00.
070031: Jun 27 14:34:57.510: %PLATFORM-4-ELEMENT_WARNING: Switch 2 R0/0: smand: 1/RP/0: 5-Minute Load Average value 6.20 exceeds warning level 5.00.
070032: Jun 27 14:44:47.528: %PLATFORM-4-ELEMENT_WARNING: Switch 1 R0/0: smand: 1/RP/0: 5-Minute Load Average value 6.15 exceeds warning level 5.00.
070033: Jun 27 14:45:07.528: %PLATFORM-4-ELEMENT_WARNING: Switch 2 R0/0: smand: 1/RP/0: 5-Minute Load Average value 6.16 exceeds warning level 5.00.
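For anyone who wants to track these warnings over time, here is a minimal Python sketch that pulls the switch number, slot, load value and threshold out of log lines shaped like the ones above. The regular expression is only matched against the four lines in this post, and the function and field names are my own, so treat it as an assumption that other releases format the message the same way.

```python
import re

# Matches the load-average warning lines quoted above.
# Group names are illustrative labels, not Cisco terminology.
WARNING_RE = re.compile(
    r"%PLATFORM-4-ELEMENT_WARNING: Switch (?P<switch>\d+) \S+ smand: "
    r"(?P<slot>\S+): 5-Minute Load Average value (?P<value>\d+\.\d+) "
    r"exceeds warning level (?P<threshold>\d+\.\d+)"
)


def parse_load_warnings(log_text):
    """Return one dict per load-average warning found in log_text."""
    warnings = []
    for match in WARNING_RE.finditer(log_text):
        warnings.append({
            "switch": int(match.group("switch")),
            "slot": match.group("slot"),
            "value": float(match.group("value")),
            "threshold": float(match.group("threshold")),
        })
    return warnings
```

Feeding it the saved log text returns a list that can be sorted or plotted, for example to confirm that the value stays just above 6 rather than spiking.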
Looking at the processor I do not see any CPU usage over the ordinary.
#show proc cpu his
211111222221111111111111111111111111111112222222222111111111
100
90
80
70
60
50
40
30
20
10
0....5....1....1....2....2....3....3....4....4....5....5....6
0 5 0 5 0 5 0 5 0 5 0
CPU% per second (last 60 seconds)
But if I look into the RP, it is running at full CPU on some install processes. It has been 13 weeks since I last updated.
#show platform software process slot switch active r0 monito
top - 13:43:31 up 132 days, 10:24, 0 users, load average: 6.23, 6.22, 6.19
Tasks: 356 total, 7 running, 349 sleeping, 0 stopped, 0 zombie
%Cpu(s): 61.2 us, 18.2 sy, 0.0 ni, 20.6 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
MiB Mem : 7568.8 total, 172.8 free, 1752.8 used, 5643.1 buff/cache
MiB Swap: 0.0 total, 0.0 free, 0.0 used. 5401.0 avail Mem
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
2068 root 20 0 15164 14620 2988 R 100.0 0.2 184492:48 install_e+
6003 root 20 0 15164 14556 2924 R 100.0 0.2 190206:28 install_e+
5939 root 20 0 15156 14504 2876 R 94.7 0.2 190056:34 install_e+
17095 root 20 0 15160 14580 2948 R 94.7 0.2 190325:55 install_e+
9887 root 20 0 15164 14580 2948 R 89.5 0.2 183095:30 install_e+
29521 root 20 0 15164 14652 3020 R 89.5 0.2 183156:26 install_e+
9125 root 20 0 1922764 73844 60044 S 5.3 1.0 4769:41 sif_mgr
31833 root 20 0 4152 2816 2332 R 5.3 0.0 0:00.05 top
1 root 20 0 15980 11856 7524 S 0.0 0.2 11:46.99 systemd
2 root 20 0 0 0 0 S 0.0 0.0 0:01.39 kthreadd
3 root 0 -20 0 0 0 I 0.0 0.0 0:00.00 rcu_gp
4 root 0 -20 0 0 0 I 0.0 0.0 0:00.00 rcu_par_gp
6 root 0 -20 0 0 0 I 0.0 0.0 0:00.00 kworker/0+
8 root 0 -20 0 0 0 I 0.0 0.0 0:54.88 kworker/0+
9 root 0 -20 0 0 0 I 0.0 0.0 0:00.00 mm_percpu+
10 root 20 0 0 0 0 S 0.0 0.0 4:27.80 ksoftirqd+
Any idea what the install_e+ process does, and why it is using all the CPU% ?
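A side note on the TIME+ column above: assuming it follows the usual Linux top convention of cumulative CPU time written as minutes:seconds (with hundredths of a second when the value is small), the six install_e+ entries each work out to roughly 127 to 132 days of CPU time, against an uptime of 132 days. The load average of about 6 also lines up with six processes that are always runnable. A small Python sketch of the conversion (the function name is my own):

```python
def cpu_time_to_days(time_plus):
    """Convert a top TIME+ value such as '184492:48' or '11:46.99' to days.

    Assumes the minutes:seconds[.hundredths] layout; values that top prints
    with an hours suffix or other scaling are not handled here.
    """
    minutes_text, seconds_text = time_plus.split(":")
    total_seconds = int(minutes_text) * 60 + float(seconds_text)
    return total_seconds / 86400.0
```

For example, `cpu_time_to_days("184492:48")` gives about 128.12 days for PID 2068.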
06-28-2023 05:10 AM
Hi
Your CPU seems just fine, looking at the past minute.
These logs could be caused by this bug:
https://bst.cisco.com/bugsearch/bug/CSCvj38738
06-28-2023 05:24 AM - edited 06-28-2023 05:27 AM
I don't think this is the bug I am hitting, as the platform output shows 100% CPU usage on 6 of the 8 cores.
#show processes cpu platform sorted
CPU utilization for five seconds: 76%, one minute: 77%, five minutes: 77%
Core 0: CPU utilization for five seconds: 100%, one minute: 100%, five minutes: 64%
Core 1: CPU utilization for five seconds: 6%, one minute: 8%, five minutes: 22%
Core 2: CPU utilization for five seconds: 100%, one minute: 100%, five minutes: 82%
Core 3: CPU utilization for five seconds: 5%, one minute: 8%, five minutes: 79%
Core 4: CPU utilization for five seconds: 100%, one minute: 100%, five minutes: 92%
Core 5: CPU utilization for five seconds: 100%, one minute: 100%, five minutes: 100%
Core 6: CPU utilization for five seconds: 100%, one minute: 100%, five minutes: 76%
Core 7: CPU utilization for five seconds: 100%, one minute: 100%, five minutes: 100%
Pid PPid 5Sec 1Min 5Min Status Size Name
--------------------------------------------------------------------------------
29521 29518 100% 99% 99% R 14652 install_engine.
9887 9884 100% 100% 99% R 14580 install_engine.
5939 5936 100% 99% 99% R 14504 install_engine.
2068 2065 100% 100% 99% R 14620 install_engine.
17095 17092 99% 99% 99% R 14580 install_engine.
6003 6000 99% 99% 99% R 14556 install_engine.
19296 18139 3% 3% 3% S 292404 fed main event
8874 8120 3% 3% 3% S 903412 linux_iosd-imag
9125 8501 2% 2% 2% S 73844 sif_mgr
28269 27379 1% 1% 1% S 114244 fman_fp_image
31967 2 0% 0% 0% I 0 kworker/u17:2-xprtio
06-28-2023 05:26 AM
Looking further into this I have found a command that shows it using 100% CPU on 6/8 CPU cores:
#show processes cpu platform sorted
CPU utilization for five seconds: 76%, one minute: 77%, five minutes: 77%
Core 0: CPU utilization for five seconds: 100%, one minute: 100%, five minutes: 64%
Core 1: CPU utilization for five seconds: 6%, one minute: 8%, five minutes: 22%
Core 2: CPU utilization for five seconds: 100%, one minute: 100%, five minutes: 82%
Core 3: CPU utilization for five seconds: 5%, one minute: 8%, five minutes: 79%
Core 4: CPU utilization for five seconds: 100%, one minute: 100%, five minutes: 92%
Core 5: CPU utilization for five seconds: 100%, one minute: 100%, five minutes: 100%
Core 6: CPU utilization for five seconds: 100%, one minute: 100%, five minutes: 76%
Core 7: CPU utilization for five seconds: 100%, one minute: 100%, five minutes: 100%
Pid PPid 5Sec 1Min 5Min Status Size Name
--------------------------------------------------------------------------------
29521 29518 100% 99% 99% R 14652 install_engine.
9887 9884 100% 100% 99% R 14580 install_engine.
5939 5936 100% 99% 99% R 14504 install_engine.
2068 2065 100% 100% 99% R 14620 install_engine.
17095 17092 99% 99% 99% R 14580 install_engine.
6003 6000 99% 99% 99% R 14556 install_engine.
19296 18139 3% 3% 3% S 292404 fed main event
8874 8120 3% 3% 3% S 903412 linux_iosd-imag
9125 8501 2% 2% 2% S 73844 sif_mgr
28269 27379 1% 1% 1% S 114244 fman_fp_image
31967 2 0% 0% 0% I 0 kworker/u17:2-xprtio
30684 30636 0% 0% 0% S 6812 journalctl
30636 29854 0% 0% 0% S 15808 plogd
Maybe I am hitting this bug: https://quickview.cloudapps.cisco.com/quickview/bug/CSCvu01190
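To check the per-core picture without reading the output by eye, a short Python sketch like the following can list the cores that are pegged. It only assumes the "Core N: CPU utilization for five seconds: ..." line layout shown in the output above; the function name and the 95% cut-off are my own choices.

```python
import re

# Matches the per-core lines from "show processes cpu platform sorted".
CORE_RE = re.compile(
    r"Core (\d+): CPU utilization for five seconds:\s*(\d+)%, "
    r"one minute:\s*(\d+)%, five minutes:\s*(\d+)%"
)


def saturated_cores(output, threshold=95):
    """Return core numbers whose five-second utilization is >= threshold."""
    return [
        int(core)
        for core, five_sec, _one_min, _five_min in CORE_RE.findall(output)
        if int(five_sec) >= threshold
    ]
```

Run against the output in this post, it should report cores 0, 2, 4, 5, 6 and 7, which matches the six install_engine processes at the top of the list.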
06-28-2023 05:39 AM
It could be either of those bugs or a new one. Bug CSCvu01190 has a specific condition:
Conditions: Upgrading a Catalyst 9200 device via DNAC 1.3.3.
Do you have DNAC, and did you upgrade this switch with it?
06-28-2023 06:23 AM
I can't remember how the switch got upgraded, but on closer inspection CSCvu01190 is fixed in 17.6.5, which is the version we are running now.
06-28-2023 05:43 AM - edited 06-28-2023 05:57 AM
Please post the complete output to the following commands:
sh platform resources
sh platform soft status con brief
And upgrade to 17.9.3.
NOTE: 17.9.4 releases at the end of July 2023.
Have a look at the picture below.
06-28-2023 06:25 AM
Could you tell me what the picture is showing and what issue I am hitting?
#sh platform resources
**State Acronym: H - Healthy, W - Warning, C - Critical
Resource                 Usage                 Max             Warning         Critical        State
----------------------------------------------------------------------------------------------------
 Control Processor       77.14%                100%            90%             95%             H
  DRAM                   2960MB(39%)           7568MB          85%             90%             H
  TMPFS                  209MB(2%)             7568MB          40%             50%             H
#sh platform soft status con brief
Load Average
Slot Status 1-Min 5-Min 15-Min
1-RP0 Warning 6.18 6.31 6.34
2-RP0 Healthy 0.33 0.21 0.13
3-RP0 Healthy 0.04 0.08 0.08
Memory (kB)
Slot Status Total Used (Pct) Free (Pct) Committed (Pct)
1-RP0 Healthy 7750428 3032340 (39%) 4718088 (61%) 3484352 (45%)
2-RP0 Healthy 7750428 2881364 (37%) 4869064 (63%) 3335940 (43%)
3-RP0 Healthy 7750436 2223412 (29%) 5527024 (71%) 1679644 (22%)
CPU Utilization
Slot CPU User System Nice Idle IRQ SIRQ IOwait
1-RP0 0 76.20 23.80 0.00 0.00 0.00 0.00 0.00
1 58.10 18.50 0.00 23.40 0.00 0.00 0.00
2 25.77 8.29 0.00 65.83 0.00 0.09 0.00
3 6.09 4.19 0.00 89.51 0.00 0.19 0.00
4 81.80 18.20 0.00 0.00 0.00 0.00 0.00
5 81.68 18.31 0.00 0.00 0.00 0.00 0.00
6 79.30 20.70 0.00 0.00 0.00 0.00 0.00
7 79.70 20.30 0.00 0.00 0.00 0.00 0.00
2-RP0 0 0.50 0.30 0.00 99.19 0.00 0.00 0.00
1 0.90 0.50 0.00 98.60 0.00 0.00 0.00
2 1.20 0.50 0.00 98.30 0.00 0.00 0.00
3 1.30 0.50 0.00 98.20 0.00 0.00 0.00
4 1.00 0.40 0.00 98.59 0.00 0.00 0.00
5 0.90 0.40 0.00 98.70 0.00 0.00 0.00
6 0.90 0.50 0.00 98.59 0.00 0.00 0.00
7 0.50 0.30 0.00 99.19 0.00 0.00 0.00
3-RP0 0 0.50 0.30 0.00 99.20 0.00 0.00 0.00
1 0.89 0.39 0.00 98.70 0.00 0.00 0.00
2 0.59 0.29 0.00 99.10 0.00 0.00 0.00
3 1.60 0.40 0.00 97.99 0.00 0.00 0.00
4 0.79 0.19 0.00 99.00 0.00 0.00 0.00
5 0.49 0.29 0.00 99.20 0.00 0.00 0.00
6 0.50 0.30 0.00 99.20 0.00 0.00 0.00
7 0.59 0.49 0.00 98.90 0.00 0.00 0.00
06-28-2023 04:03 PM
@rasmus.elmholt wrote: Control Processor 77.14%
Whoa! That is very high!
@rasmus.elmholt wrote:
CPU Utilization
Slot CPU User System Nice Idle IRQ SIRQ IOwait
1-RP0 0 76.20 23.80 0.00 0.00 0.00 0.00 0.00
1 58.10 18.50 0.00 23.40 0.00 0.00 0.00
2 25.77 8.29 0.00 65.83 0.00 0.09 0.00
3 6.09 4.19 0.00 89.51 0.00 0.19 0.00
4 81.80 18.20 0.00 0.00 0.00 0.00 0.00
5 81.68 18.31 0.00 0.00 0.00 0.00 0.00
6 79.30 20.70 0.00 0.00 0.00 0.00 0.00
7 79.70 20.30 0.00 0.00 0.00 0.00 0.00
Something in the control-plane is grinding nearly all the CPUs (except CPU #2 & 3). (Look at the column next to "CPU".)
Use the same command "sh platform software status con brief" to see if the grind is still happening. If it is not, wait until it is. If it is, then issue the next command:
sh processes cpu platform sorted location switch 1 r0
NOTE: This output is very long. Please post only the "first page".
06-28-2023 04:21 PM
@rasmus.elmholt wrote:
Could you tell me what the picture is showing
I have a stack of six (6) 9300s. The picture is the control-plane memory utilization of a single switch, the switch master.
IOS-XE is very buggy, and memory leaks are just one of the problems. The picture shows that when the stack was on 17.6.4, the memory leak was so severe that I had to reboot the stack every 3 to 4 months. After upgrading to 17.9.3, the memory leak still occurs, but it is less severe.
06-28-2023 06:32 AM
To me it actually looks like I am hitting CSCwb13852, since I am pretty sure the upgrade process was terminated when I was disconnected from the device.
How do I kill the right process as the workaround suggests?
07-09-2023 11:44 PM
After looking into this some more, I either need a token from TAC to get into the shell and kill the process, or reboot the switch.