Solved: Re: 9300 issue with %PLATFORM-4-ELEMENT_WARNING: Switch 1 R0/0: smand:

rasmus.elmholt · ‎06-28-2023

Hi

I have a Cat 9300 that I have upgraded to 17.06.05, but I keep getting a CPU warning in the log on the RP. According to the log, it happens all the time.

070030: Jun 27 14:34:37.513: %PLATFORM-4-ELEMENT_WARNING: Switch 1 R0/0: smand: 1/RP/0: 5-Minute Load Average value 6.19 exceeds warning level 5.00.
070031: Jun 27 14:34:57.510: %PLATFORM-4-ELEMENT_WARNING: Switch 2 R0/0: smand: 1/RP/0: 5-Minute Load Average value 6.20 exceeds warning level 5.00.
070032: Jun 27 14:44:47.528: %PLATFORM-4-ELEMENT_WARNING: Switch 1 R0/0: smand: 1/RP/0: 5-Minute Load Average value 6.15 exceeds warning level 5.00.
070033: Jun 27 14:45:07.528: %PLATFORM-4-ELEMENT_WARNING: Switch 2 R0/0: smand: 1/RP/0: 5-Minute Load Average value 6.16 exceeds warning level 5.00.
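
For anyone who wants to track how often this warning fires, here is a minimal Python sketch that pulls the switch number, metric, value, and threshold out of lines like the ones above. The regular expression and the function name `parse_element_warning` are assumptions based only on these four log lines; other releases may format the message differently.

```python
import re

# Pattern derived only from the log lines quoted in this thread (an assumption,
# not an official message format).
WARNING_RE = re.compile(
    r"%PLATFORM-4-ELEMENT_WARNING: Switch (?P<switch>\d+) \S+ smand: "
    r"(?P<element>\S+): (?P<metric>.+?) value (?P<value>[\d.]+) "
    r"exceeds warning level (?P<level>[\d.]+)\."
)

def parse_element_warning(line):
    """Return a dict of fields from one ELEMENT_WARNING line, or None if no match."""
    match = WARNING_RE.search(line)
    if match is None:
        return None
    fields = match.groupdict()
    fields["switch"] = int(fields["switch"])
    fields["value"] = float(fields["value"])
    fields["level"] = float(fields["level"])
    return fields

sample = (
    "070030: Jun 27 14:34:37.513: %PLATFORM-4-ELEMENT_WARNING: Switch 1 R0/0: "
    "smand: 1/RP/0: 5-Minute Load Average value 6.19 exceeds warning level 5.00."
)
print(parse_element_warning(sample))
```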

Looking at the processor, I do not see any CPU usage out of the ordinary.

#show proc cpu his
                                                                  
                                                                  
                                                                  
                                                                  
      211111222221111111111111111111111111111112222222222111111111
  100                                                           
   90                                                           
   80                                                           
   70                                                           
   60                                                           
   50                                                           
   40                                                           
   30                                                           
   20                                                           
   10                                                           
     0....5....1....1....2....2....3....3....4....4....5....5....6
               0    5    0    5    0    5    0    5    0    5    0
               CPU% per second (last 60 seconds)

But if I look into the RP, it is running at full CPU on some install processes. It is 13 weeks since I last updated.

#show platform software process slot switch active r0 monito
top - 13:43:31 up 132 days, 10:24,  0 users,  load average: 6.23, 6.22, 6.19
Tasks: 356 total,   7 running, 349 sleeping,   0 stopped,   0 zombie
%Cpu(s): 61.2 us, 18.2 sy,  0.0 ni, 20.6 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
MiB Mem :   7568.8 total,    172.8 free,   1752.8 used,   5643.1 buff/cache
MiB Swap:      0.0 total,      0.0 free,      0.0 used.   5401.0 avail Mem 

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
 2068 root      20   0   15164  14620   2988 R 100.0   0.2 184492:48 install_e+
 6003 root      20   0   15164  14556   2924 R 100.0   0.2 190206:28 install_e+
 5939 root      20   0   15156  14504   2876 R  94.7   0.2 190056:34 install_e+
17095 root      20   0   15160  14580   2948 R  94.7   0.2 190325:55 install_e+
 9887 root      20   0   15164  14580   2948 R  89.5   0.2 183095:30 install_e+
29521 root      20   0   15164  14652   3020 R  89.5   0.2 183156:26 install_e+
 9125 root      20   0 1922764  73844  60044 S   5.3   1.0   4769:41 sif_mgr
31833 root      20   0    4152   2816   2332 R   5.3   0.0   0:00.05 top
    1 root      20   0   15980  11856   7524 S   0.0   0.2  11:46.99 systemd
    2 root      20   0       0      0      0 S   0.0   0.0   0:01.39 kthreadd
    3 root       0 -20       0      0      0 I   0.0   0.0   0:00.00 rcu_gp
    4 root       0 -20       0      0      0 I   0.0   0.0   0:00.00 rcu_par_gp
    6 root       0 -20       0      0      0 I   0.0   0.0   0:00.00 kworker/0+
    8 root       0 -20       0      0      0 I   0.0   0.0   0:54.88 kworker/0+
    9 root       0 -20       0      0      0 I   0.0   0.0   0:00.00 mm_percpu+
   10 root      20   0       0      0      0 S   0.0   0.0   4:27.80 ksoftirqd+

Any idea what the install_e+ process does, and why it is using all the CPU% ?
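
The TIME+ column gives a hint about how long these processes have been busy. Assuming it follows the usual Linux top convention of minutes:seconds of accumulated CPU time (with hundredths dropped for large values), the sketch below converts the six install_e+ values to days. The helper name `top_time_to_days` is made up for illustration.

```python
def top_time_to_days(time_plus):
    """Convert a top TIME+ value ('MMM:SS' or 'MMM:SS.hh') to days of CPU time."""
    minutes, seconds = time_plus.split(":")
    total_seconds = int(minutes) * 60 + float(seconds)
    return total_seconds / 86400  # 86400 seconds per day

# TIME+ values copied from the install_e+ rows of the top output above.
install_times = ["184492:48", "190206:28", "190056:34",
                 "190325:55", "183095:30", "183156:26"]

for value in install_times:
    print(f"{value:>10} -> {top_time_to_days(value):6.1f} days of CPU time")
```

Under that assumption, each process has accumulated roughly 127 to 132 days of CPU time against an uptime of 132 days, which would mean they have been spinning for nearly the whole time the switch has been up.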

Flavio Miranda · ‎06-28-2023

Hi

Your CPU seems just fine, looking at the past minute.

These logs could be caused by this bug:

https://bst.cisco.com/bugsearch/bug/CSCvj38738

rasmus.elmholt · ‎06-28-2023

I don't think this is the bug I am hitting. As the platform shows I am using 100% CPU on 6 of the 8 cores.

#show processes cpu platform sorted
CPU utilization for five seconds: 76%, one minute: 77%, five minutes: 77%
Core 0: CPU utilization for five seconds: 100%, one minute: 100%, five minutes: 64%
Core 1: CPU utilization for five seconds:  6%, one minute:  8%, five minutes: 22%
Core 2: CPU utilization for five seconds: 100%, one minute: 100%, five minutes: 82%
Core 3: CPU utilization for five seconds:  5%, one minute:  8%, five minutes: 79%
Core 4: CPU utilization for five seconds: 100%, one minute: 100%, five minutes: 92%
Core 5: CPU utilization for five seconds: 100%, one minute: 100%, five minutes: 100%
Core 6: CPU utilization for five seconds: 100%, one minute: 100%, five minutes: 76%
Core 7: CPU utilization for five seconds: 100%, one minute: 100%, five minutes: 100%
   Pid    PPid    5Sec    1Min    5Min  Status        Size  Name                  
--------------------------------------------------------------------------------
 29521   29518    100%     99%     99%  R            14652  install_engine.       
  9887    9884    100%    100%     99%  R            14580  install_engine.       
  5939    5936    100%     99%     99%  R            14504  install_engine.       
  2068    2065    100%    100%     99%  R            14620  install_engine.       
 17095   17092     99%     99%     99%  R            14580  install_engine.       
  6003    6000     99%     99%     99%  R            14556  install_engine.       
 19296   18139      3%      3%      3%  S           292404  fed main event        
  8874    8120      3%      3%      3%  S           903412  linux_iosd-imag       
  9125    8501      2%      2%      2%  S            73844  sif_mgr               
 28269   27379      1%      1%      1%  S           114244  fman_fp_image         
 31967       2      0%      0%      0%  I                0  kworker/u17:2-xprtio

rasmus.elmholt · ‎06-28-2023

Looking further into this I have found a command that shows it using 100% CPU on 6/8 CPU cores:

#show processes cpu platform sorted
CPU utilization for five seconds: 76%, one minute: 77%, five minutes: 77%
Core 0: CPU utilization for five seconds: 100%, one minute: 100%, five minutes: 64%
Core 1: CPU utilization for five seconds:  6%, one minute:  8%, five minutes: 22%
Core 2: CPU utilization for five seconds: 100%, one minute: 100%, five minutes: 82%
Core 3: CPU utilization for five seconds:  5%, one minute:  8%, five minutes: 79%
Core 4: CPU utilization for five seconds: 100%, one minute: 100%, five minutes: 92%
Core 5: CPU utilization for five seconds: 100%, one minute: 100%, five minutes: 100%
Core 6: CPU utilization for five seconds: 100%, one minute: 100%, five minutes: 76%
Core 7: CPU utilization for five seconds: 100%, one minute: 100%, five minutes: 100%
   Pid    PPid    5Sec    1Min    5Min  Status        Size  Name                  
--------------------------------------------------------------------------------
 29521   29518    100%     99%     99%  R            14652  install_engine.       
  9887    9884    100%    100%     99%  R            14580  install_engine.       
  5939    5936    100%     99%     99%  R            14504  install_engine.       
  2068    2065    100%    100%     99%  R            14620  install_engine.       
 17095   17092     99%     99%     99%  R            14580  install_engine.       
  6003    6000     99%     99%     99%  R            14556  install_engine.       
 19296   18139      3%      3%      3%  S           292404  fed main event        
  8874    8120      3%      3%      3%  S           903412  linux_iosd-imag       
  9125    8501      2%      2%      2%  S            73844  sif_mgr               
 28269   27379      1%      1%      1%  S           114244  fman_fp_image         
 31967       2      0%      0%      0%  I                0  kworker/u17:2-xprtio  
 30684   30636      0%      0%      0%  S             6812  journalctl            
 30636   29854      0%      0%      0%  S            15808  plogd

Maybe I am hitting this bug: https://quickview.cloudapps.cisco.com/quickview/bug/CSCvu01190
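
The load average in the log and this process table tell the same story. On Linux, the load average is roughly the average number of tasks that are runnable or in uninterruptible sleep, so six processes permanently in state R would by themselves hold it near 6, above the 5.00 warning level. A minimal Python sketch that counts such processes from the table rows; the regular expression and the `busy_runnable` helper are assumptions based only on the output pasted above.

```python
import re

# Rows copied from the output above.
# Columns: Pid PPid 5Sec 1Min 5Min Status Size Name
ROWS = """\
 29521   29518    100%     99%     99%  R            14652  install_engine.
  9887    9884    100%    100%     99%  R            14580  install_engine.
  5939    5936    100%     99%     99%  R            14504  install_engine.
  2068    2065    100%    100%     99%  R            14620  install_engine.
 17095   17092     99%     99%     99%  R            14580  install_engine.
  6003    6000     99%     99%     99%  R            14556  install_engine.
 19296   18139      3%      3%      3%  S           292404  fed main event
  8874    8120      3%      3%      3%  S           903412  linux_iosd-imag
"""

ROW_RE = re.compile(
    r"^\s*(?P<pid>\d+)\s+(?P<ppid>\d+)\s+(?P<s5>\d+)%\s+(?P<m1>\d+)%\s+"
    r"(?P<m5>\d+)%\s+(?P<status>\S)\s+(?P<size>\d+)\s+(?P<name>.+?)\s*$"
)

def busy_runnable(text, min_pct=90):
    """Return (pid, name) for rows in state R whose 5-minute CPU is >= min_pct."""
    hits = []
    for line in text.splitlines():
        m = ROW_RE.match(line)
        if m and m["status"] == "R" and int(m["m5"]) >= min_pct:
            hits.append((int(m["pid"]), m["name"]))
    return hits

WARNING_LEVEL = 5.00  # threshold quoted in the ELEMENT_WARNING log lines
spinning = busy_runnable(ROWS)
print(len(spinning), "always-runnable processes; warning level is", WARNING_LEVEL)
```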

Flavio Miranda · ‎06-28-2023

It could well be one of those bugs, or a new one. CSCvu01190 has a specific condition:

Conditions: Upgrading a Catalyst 9200 device via DNAC 1.3.3.

Do you have DNAC, and did you use it to upgrade this switch?

rasmus.elmholt · ‎06-28-2023

I can't remember how the switch got upgraded, but on closer inspection CSCvu01190 is fixed in 17.6.5, which is the version we are running now.

Leo Laohoo · ‎06-28-2023

Please post the complete output of the following commands:

sh platform resources
sh platform soft status con brief

And upgrade to 17.9.3.

NOTE: 17.9.4 releases at the end of July 2023.

Have a look at the picture below.

[Picture: 6 x 9300]

rasmus.elmholt · ‎06-28-2023

Could you tell me what the picture is showing and what issue I am hitting?

#sh platform resources
**State Acronym: H - Healthy, W - Warning, C - Critical                                             
Resource                 Usage                 Max             Warning         Critical        State
----------------------------------------------------------------------------------------------------
 Control Processor       77.14%                100%            90%             95%             H    
  DRAM                   2960MB(39%)           7568MB          85%             90%             H    
  TMPFS                  209MB(2%)             7568MB          40%             50%             H    
#sh platform soft status con brief
Load Average
 Slot  Status  1-Min  5-Min 15-Min
1-RP0 Warning   6.18   6.31   6.34
2-RP0 Healthy   0.33   0.21   0.13
3-RP0 Healthy   0.04   0.08   0.08

Memory (kB)
 Slot  Status    Total     Used (Pct)     Free (Pct) Committed (Pct)
1-RP0 Healthy  7750428  3032340 (39%)  4718088 (61%)   3484352 (45%)
2-RP0 Healthy  7750428  2881364 (37%)  4869064 (63%)   3335940 (43%)
3-RP0 Healthy  7750436  2223412 (29%)  5527024 (71%)   1679644 (22%)

CPU Utilization
 Slot  CPU   User System   Nice   Idle    IRQ   SIRQ IOwait
1-RP0    0  76.20  23.80   0.00   0.00   0.00   0.00   0.00
         1  58.10  18.50   0.00  23.40   0.00   0.00   0.00
         2  25.77   8.29   0.00  65.83   0.00   0.09   0.00
         3   6.09   4.19   0.00  89.51   0.00   0.19   0.00
         4  81.80  18.20   0.00   0.00   0.00   0.00   0.00
         5  81.68  18.31   0.00   0.00   0.00   0.00   0.00
         6  79.30  20.70   0.00   0.00   0.00   0.00   0.00
         7  79.70  20.30   0.00   0.00   0.00   0.00   0.00
2-RP0    0   0.50   0.30   0.00  99.19   0.00   0.00   0.00
         1   0.90   0.50   0.00  98.60   0.00   0.00   0.00
         2   1.20   0.50   0.00  98.30   0.00   0.00   0.00
         3   1.30   0.50   0.00  98.20   0.00   0.00   0.00
         4   1.00   0.40   0.00  98.59   0.00   0.00   0.00
         5   0.90   0.40   0.00  98.70   0.00   0.00   0.00
         6   0.90   0.50   0.00  98.59   0.00   0.00   0.00
         7   0.50   0.30   0.00  99.19   0.00   0.00   0.00
3-RP0    0   0.50   0.30   0.00  99.20   0.00   0.00   0.00
         1   0.89   0.39   0.00  98.70   0.00   0.00   0.00
         2   0.59   0.29   0.00  99.10   0.00   0.00   0.00
         3   1.60   0.40   0.00  97.99   0.00   0.00   0.00
         4   0.79   0.19   0.00  99.00   0.00   0.00   0.00
         5   0.49   0.29   0.00  99.20   0.00   0.00   0.00
         6   0.50   0.30   0.00  99.20   0.00   0.00   0.00
         7   0.59   0.49   0.00  98.90   0.00   0.00   0.00
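
To compare stack members at a glance, the CPU Utilization table can be reduced to the CPUs with no idle time per slot. A minimal Python sketch, assuming the column layout shown above (Slot, CPU, User, System, Nice, Idle, IRQ, SIRQ, IOwait); the function name `saturated_cpus` and the 1% idle cutoff are illustrative choices, and the sample is an excerpt of the rows above.

```python
SAMPLE = """\
 Slot  CPU   User System   Nice   Idle    IRQ   SIRQ IOwait
1-RP0    0  76.20  23.80   0.00   0.00   0.00   0.00   0.00
         1  58.10  18.50   0.00  23.40   0.00   0.00   0.00
         2  25.77   8.29   0.00  65.83   0.00   0.09   0.00
         3   6.09   4.19   0.00  89.51   0.00   0.19   0.00
         4  81.80  18.20   0.00   0.00   0.00   0.00   0.00
         5  81.68  18.31   0.00   0.00   0.00   0.00   0.00
         6  79.30  20.70   0.00   0.00   0.00   0.00   0.00
         7  79.70  20.30   0.00   0.00   0.00   0.00   0.00
2-RP0    0   0.50   0.30   0.00  99.19   0.00   0.00   0.00
         1   0.90   0.50   0.00  98.60   0.00   0.00   0.00
"""

def saturated_cpus(text, max_idle=1.0):
    """Map slot name -> list of CPU numbers whose Idle column is <= max_idle."""
    result = {}
    slot = None
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 9:        # row that names a slot, e.g. "1-RP0 0 ..."
            candidate, fields = parts[0], parts[1:]
        elif len(parts) == 8:      # continuation row for the current slot
            candidate, fields = slot, parts
        else:
            continue
        try:
            cpu, idle = int(fields[0]), float(fields[4])  # fields[4] is Idle
        except ValueError:
            continue               # header or other non-numeric row
        slot = candidate
        if slot is not None and idle <= max_idle:
            result.setdefault(slot, []).append(cpu)
    return result

print(saturated_cpus(SAMPLE))
```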

Leo Laohoo · ‎06-28-2023

@rasmus.elmholt wrote:
Control Processor       77.14% 

Whoa! That is very high!

@rasmus.elmholt wrote:

CPU Utilization
 Slot  CPU   User System   Nice   Idle    IRQ   SIRQ IOwait
1-RP0    0  76.20  23.80   0.00   0.00   0.00   0.00   0.00
         1  58.10  18.50   0.00  23.40   0.00   0.00   0.00
         2  25.77   8.29   0.00  65.83   0.00   0.09   0.00
         3   6.09   4.19   0.00  89.51   0.00   0.19   0.00
         4  81.80  18.20   0.00   0.00   0.00   0.00   0.00
         5  81.68  18.31   0.00   0.00   0.00   0.00   0.00
         6  79.30  20.70   0.00   0.00   0.00   0.00   0.00
         7  79.70  20.30   0.00   0.00   0.00   0.00   0.00

Something in the control-plane is grinding nearly all the CPUs (except CPU #2 & 3). (Look at the column next to "CPU".)

Use the same command, "sh platform software status con brief", to see if the grind is still happening. If it is not, wait until it is. If it is, then issue the next command:

sh processes cpu platform sorted location switch 1 r0

NOTE: This output is very long. Please post only the "first page".

Leo Laohoo · ‎06-28-2023

@rasmus.elmholt wrote:
Could you tell me what the picture is showing

I have a stack made up of six (6) 9300s. The picture shows the control-plane memory utilization of a single switch, the switch master.

IOS-XE is very buggy, and memory leaks are just one of the problems. The picture shows that when the stack was on 17.6.4, the memory leak was so severe that I had to reboot the stack every 3 to 4 months. After upgrading to 17.9.3, the memory leak still occurs, but the rate is not as severe.

rasmus.elmholt · ‎06-28-2023

To me it actually seems like I am hitting CSCwb13852, as I am pretty sure the upgrade process was terminated when I was disconnected from the device.

How do I kill the right process as the workaround suggests?

rasmus.elmholt · ‎07-09-2023

After looking into this some more, I either need a token from TAC to get into the shell and kill the process, or reboot the switch.
