Re: Cat. 9200 CPU overloaded with SISF Switcher Th process

mitard · ‎06-20-2023

Hi guys,

We have a three-switch stack (Cat. 9200) running IOS XE 17.6.5 whose CPU is overloaded by the SISF Switcher Th process:

PID Runtime(ms) Invoked  uSecs   5Sec   1Min   5Min TTY Process
433  1401525420 72375221 19364 79.83% 81.16% 82.11%   0 SISF Switcher Th

As per bug CSCvk32439, I implemented an IPv6 filter on the trunk ports, but the issue remained. In the end I disabled DHCP snooping on all VLANs, but this can only be a short-term workaround.

Could anyone please suggest a long-term remediation?

Regards, Vincent

Leo Laohoo · ‎06-20-2023

Please post the complete output of the command "sh platform software status con brief".

mitard · ‎06-20-2023

Here it is (please note that DHCP snooping is currently disabled, so the CPU is not overloaded):

chun-hdie-blssrg-dsw1#sh platform software status control-processor brief
Load Average
 Slot  Status 1-Min 5-Min 15-Min
1-RP0 Healthy  0.72  0.50   0.45
2-RP0 Healthy  0.25  0.24   0.19
3-RP0 Healthy  0.45  0.27   0.21

Memory (kB)
 Slot  Status   Total   Used (Pct)    Free (Pct) Committed (Pct)
1-RP0 Healthy 4028724 990376 (25%) 3038348 (75%)   1772980 (44%)
2-RP0 Healthy 4028728 951664 (24%) 3077064 (76%)   1753108 (44%)
3-RP0 Healthy 4028728 790316 (20%) 3238412 (80%)    987476 (25%)

CPU Utilization
 Slot CPU  User System Nice  Idle  IRQ SIRQ IOwait
1-RP0   0  7.70   3.59 0.00 87.97 0.61 0.10   0.00
        1  5.87   3.81 0.00 89.58 0.51 0.20   0.00
        2  4.76   3.10 0.00 91.40 0.51 0.20   0.00
        3  4.67   3.63 0.00 91.06 0.51 0.10   0.00
2-RP0   0  2.27   2.89 0.00 94.11 0.61 0.10   0.00
        1  2.90   1.65 0.00 94.81 0.51 0.10   0.00
        2  2.80   2.90 0.00 93.66 0.51 0.10   0.00
        3  2.15   2.35 0.00 94.87 0.51 0.10   0.00
3-RP0   0  2.22   1.92 0.00 95.23 0.50 0.10   0.00
        1  1.22   2.24 0.00 96.02 0.40 0.10   0.00
        2  2.65   2.55 0.00 94.38 0.30 0.10   0.00
        3  2.22   1.92 0.00 95.55 0.30 0.00   0.00
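
For anyone who wants to read the "CPU Utilization" section at a glance, here is a minimal Python sketch that turns the Idle column into a per-core busy percentage (100 minus Idle). It assumes the command output has been saved as plain text. The function name per_core_busy and the SAMPLE string (two rows copied from the output above) are illustrative only and not part of any Cisco tool.

```python
def per_core_busy(text):
    """Return {(slot, core): busy_percent} from the 'CPU Utilization' section.

    Busy is computed as 100 - Idle. Rows without a slot name inherit the
    slot from the most recent row that had one.
    """
    busy = {}
    slot = None
    in_cpu_section = False
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            in_cpu_section = False          # a blank line ends the section
            continue
        if fields[:2] == ["CPU", "Utilization"]:
            in_cpu_section = True
            continue
        if not in_cpu_section or fields[0] == "Slot":
            continue                        # other sections and the header row
        if len(fields) == 9:                # row starts with a slot, e.g. 1-RP0
            slot, fields = fields[0], fields[1:]
        if len(fields) != 8 or slot is None:
            continue
        # Remaining columns: CPU User System Nice Idle IRQ SIRQ IOwait
        core = int(fields[0])
        idle = float(fields[4])
        busy[(slot, core)] = round(100.0 - idle, 2)
    return busy


# Two rows copied from the output above (assumption: saved as plain text).
SAMPLE = """\
CPU Utilization
Slot CPU User System Nice Idle IRQ SIRQ IOwait
1-RP0 0 7.70 3.59 0.00 87.97 0.61 0.10 0.00
1 5.87 3.81 0.00 89.58 0.51 0.20 0.00
"""

if __name__ == "__main__":
    for (slot_name, core_id), pct in sorted(per_core_busy(SAMPLE).items()):
        print(f"{slot_name} core {core_id}: {pct:.2f}% busy")
```

Splitting on whitespace means the sketch works whether or not the pasted columns are aligned.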

Leo Laohoo · ‎06-20-2023

Nothing wrong with the control-plane.

Please post the first page of the output of "sh proc cpu platform sort location switch act r0".

mitard · ‎06-20-2023

Here is the output of the command:

chun-hdie-blssrg-dsw1#show processes cpu platform sorted location switch active r0
CPU utilization for five seconds: 9%, one minute: 10%, five minutes: 9%
Core 0: CPU utilization for five seconds: 10%, one minute: 10%, five minutes: 10%
Core 1: CPU utilization for five seconds: 8%, one minute: 10%, five minutes: 9%
Core 2: CPU utilization for five seconds: 11%, one minute: 10%, five minutes: 9%
Core 3: CPU utilization for five seconds: 10%, one minute: 9%, five minutes: 9%
  Pid  PPid 5Sec 1Min 5Min Status   Size Name
--------------------------------------------------------------------------------
 5786  5771  17%  16%  16% S       99408 fed main event
 4610  4356  13%  12%  12% S      220328 linux_iosd-imag
   36     2   7%   6%   6% S           0 ksmd
17361 17313   1%   1%   1% S       33204 fman_fp_image
 4789  4779   1%   1%   1% S       16408 sif_mgr
29151 29132   0%   0%   0% S       30368 python3
29132  5148   0%   0%   0% S        2572 pman
27761 27754   0%   0%   0% S       14008 cli_agent
27754  3592   0%   0%   0% S        2584 pman
27643 27636   0%   0%   0% S        3892 cmm
27636  3592   0%   0%   0% S        2584 pman
27522 27509   0%   0%   0% S       26288 dbm
27509  3592   0%   0%   0% S        2588 pman
27281 27268   0%   0%   0% S       29928 fman_rp
27268  3592   0%   0%   0% S        2580 pman
26862 26851   0%   0%   0% S        6516 tms
26851  3592   0%   0%   0% S        2580 pman
26579 26570   0%   0%   0% S       29496 smand
26570  3592   0%   0%   0% S        2584 pman
26210 26175   0%   0%   0% S       10392 psd
26175  3592   0%   0%   0% S        2588 pman
25732  9220   0%   0%   0% S         424 sleep
25332 12031   0%   0%   0% S         424 sleep
25287 25275   0%   0%   0% S         664 sntp
25275     1   0%   0%   0% S        1220 stack_sntp.sh
24649     1   0%   0%   0% S        1824 rotee
24478 24474   0%   0%   0% S        2444 iosd_console_at
24474 24320   0%   0%   0% S        1632 bexec.sh
24320 24319   0%   0%   0% S        1524 runin_exec_proc
24319 24090   0%   0%   0% S        2464 in.telnetd
24281 12260   0%   0%   0% S         424 sleep
24090  8550   0%   0%   0% S        1528 runin_exec_proc
23586     1   0%   0%   0% S        1820 rotee
23375 23348   0%   0%   0% S        2444 iosd_console_at

Leo Laohoo · ‎06-20-2023

Nothing wrong with the Data Plane, either.

Is the SISF Switcher process still high?

mitard · ‎06-20-2023

Hi Leo,

As mentioned, since I disabled DHCP snooping there is no longer any CPU overload. However, this is a workaround that cannot remain in place long term.

Regards, Vincent

Leo Laohoo · ‎06-20-2023

Upgrade to 17.9.3 and enable DHCP snooping. See if it is better.

MHM Cisco World · ‎06-20-2023

Is there a lot of traffic being punted to the CPU?

https://www.cisco.com/c/en/us/support/docs/ios-nx-os-software/ios-xe-gibraltar-16121/216746-configure-punt-inject-fed-packet-capture.html

Check this guide to see which frames are being punted to your CPU.

Be careful to start and stop the debug explicitly.

You already have high CPU, and running a debug on top of that can cause problems.

mitard · ‎06-20-2023

Hi guys,

Thanks for the suggestions. Since the CPU is already overloaded, I will first try upgrading to release 17.9.3 and then see how to move forward.

Vincent

mitard · ‎06-20-2023

Hi,

I have now upgraded to 17.9.3, but the behavior remains the same. I was able to run the commands suggested by Leo on another device where DHCP snooping is not yet disabled. Here is the output.

chun-hdie-sr54-dsw1#show platform software status control-processor brief
Load Average
 Slot  Status 1-Min 5-Min 15-Min
1-RP0 Healthy  3.00  2.22   2.00
2-RP0 Healthy  0.26  0.24   0.21

Memory (kB)
 Slot  Status   Total   Used (Pct)    Free (Pct) Committed (Pct)
1-RP0 Healthy 4014172 943136 (23%) 3071036 (77%)   1780016 (44%)
2-RP0 Healthy 4014172 904236 (23%) 3109936 (77%)   1766228 (44%)

CPU Utilization
Slot  CPU  User System Nice  Idle  IRQ SIRQ IOwait
1-RP0   0 28.18   5.67 0.00 64.56 1.26 0.31   0.00
        1 33.08   6.05 0.00 59.60 0.93 0.31   0.00
        2 39.02   4.57 0.00 55.35 0.83 0.20   0.00
        3 39.24   5.01 0.00 54.59 0.83 0.31   0.00
2-RP0   0  1.96   2.58 0.00 94.82 0.51 0.10   0.00
        1  2.15   2.87 0.00 94.35 0.51 0.10   0.00
        2  2.60   2.60 0.00 94.27 0.41 0.10   0.00
        3  2.69   3.21 0.00 93.66 0.41 0.00   0.00

chun-hdie-sr54-dsw1#show processes cpu platform sorted location switch active r0
CPU utilization for five seconds: 42%, one minute: 42%, five minutes: 42%
Core 0: CPU utilization for five seconds: 38%, one minute: 42%, five minutes: 42%
Core 1: CPU utilization for five seconds: 48%, one minute: 42%, five minutes: 43%
Core 2: CPU utilization for five seconds: 29%, one minute: 41%, five minutes: 42%
Core 3: CPU utilization for five seconds: 52%, one minute: 43%, five minutes: 41%
  Pid  PPid 5Sec 1Min 5Min Status   Size Name
--------------------------------------------------------------------------------
 3894  3835 106% 105% 104% R      222112 linux_iosd-imag
 5172  5147  35%  35%  35% S       97868 fed main event
   35     2   7%   7%   7% S           0 ksmd
 5505  5470   3%   3%   3% S        8428 btman
  728     2   2%   2%   1% S           0 lsmpi-xmit
18420 18412   1%   2%   2% S       15064 repm
16073 16040   1%   1%   1% S       33316 fman_fp_image
 7796     1   1%   1%   1% S        5452 chasync.sh
 7518  7509   1%   2%   2% S       13848 btman
 4195  4180   1%   1%   1% S       16688 sif_mgr
  729     2   1%   1%   1% S           0 lsmpi-rx
31911     2   0%   0%   0% I           0 kworker/1:0-pm
31136     2   0%   0%   0% I           0 kworker/u8:3-kverity
31124  8521   0%   0%   0% S         428 sleep
30639 11209   0%   0%   0% S         424 sleep
30146     2   0%   0%   0% I           0 kworker/u8:0-kverity
29438 11480   0%   0%   0% S         424 sleep
29198     2   0%   0%   0% I           0 kworker/0:0H-mmc_com
28930     2   0%   0%   0% I           0 kworker/3:1H
28662     2   0%   0%   0% I           0 kworker/2:0H
28535     2   0%   0%   0% I           0 kworker/3:0-cgroup_d
24251     2   0%   0%   0% S           0 SarIosdMond
23602 23589   0%   0%   0% S       30036 python3
23589  4440   0%   0%   0% S        2772 pman
23362 23334   0%   0%   0% S       15436 cli_agent
23334  3127   0%   0%   0% S        2780 pman
23191 23186   0%   0%   0% S        4004 cmm
23186  3127   0%   0%   0% S        2776 pman
23076 23070   0%   0%   0% S       26724 dbm
23070  3127   0%   0%   0% S        2780 pman
22838 22828   0%   0%   0% S       27880 fman_rp
22828  3127   0%   0%   0% S        2776 pman
22483 22464   0%   0%   0% S        6572 tms
22464  3127   0%   0%   0% S        2780 pman
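
To pick the heavy hitters out of a listing like this without reading every row, here is a minimal Python sketch that keeps only the processes whose 5Sec value is at or above a threshold. It assumes the output has been saved as plain text. The function name top_processes, the default threshold of 10%, and the SAMPLE string (rows copied from the output above) are illustrative choices, not anything provided by IOS XE.

```python
def top_processes(text, threshold=10.0):
    """Return [(name, pid, five_sec_pct)] for rows at or above threshold.

    Expected columns: Pid PPid 5Sec 1Min 5Min Status Size Name.
    The Name column may contain spaces (e.g. 'fed main event'), so only
    the first seven columns are split off.
    """
    hits = []
    for line in text.splitlines():
        fields = line.split(None, 7)
        if len(fields) != 8 or not fields[0].isdigit():
            continue                      # summary lines, header, separator
        five_sec = fields[2]
        if not five_sec.endswith("%"):
            continue
        pct = float(five_sec.rstrip("%"))
        if pct >= threshold:
            hits.append((fields[7].strip(), int(fields[0]), pct))
    # Highest 5-second utilization first.
    return sorted(hits, key=lambda hit: hit[2], reverse=True)


# Rows copied from the output above (assumption: saved as plain text).
SAMPLE = """\
  Pid  PPid 5Sec 1Min 5Min Status   Size Name
--------------------------------------------------------------------------------
 3894  3835 106% 105% 104% R      222112 linux_iosd-imag
 5172  5147  35%  35%  35% S       97868 fed main event
   35     2   7%   7%   7% S           0 ksmd
"""

if __name__ == "__main__":
    for name, pid, pct in top_processes(SAMPLE):
        print(f"{name} (pid {pid}): {pct:.0f}%")
```

With the sample rows, only linux_iosd-imag and fed main event clear the 10% threshold, which matches the two processes discussed in this thread.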

I'll try the debug later, during a low-activity window.

Regards, Vincent

Leo Laohoo · ‎06-21-2023

@mitard wrote:

3894  3835 106% 105% 104% R      222112 linux_iosd-imag

Is there any SNMP monitoring going on? If so, how many systems are polling?
Is DNAC polling this stack?

MHM Cisco World · ‎06-24-2023

Cat9k#show platform software fed switch active punt packet-capture brief

As I mentioned before, there is some server or host scanning and flooding packets across your whole network.
First, share the output of the command above.
Second, run show interface and check the input traffic on each interface; look for the port whose input (unicast, broadcast, multicast) counters are increasing rapidly.
Lastly, try shutting ports one at a time while monitoring the CPU %.
That's it.
35% fed <<- is too high
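
The second step (finding the port whose input counters climb fastest) can be sketched in Python as below. This is only an illustration under assumptions: the two snapshots are dictionaries of input packet counts that you have collected yourself, some seconds apart, and the interface names and numbers in the example are hypothetical, not taken from this stack. The function name rank_by_input_rate is also made up for this sketch.

```python
def rank_by_input_rate(before, after, interval_s):
    """Rank interfaces by input packets per second between two snapshots.

    before/after map interface name -> cumulative input packet count.
    Interfaces missing from either snapshot, or whose counter went
    backwards (cleared or wrapped), are skipped.
    """
    if interval_s <= 0:
        raise ValueError("interval_s must be positive")
    rates = []
    for ifname, start in before.items():
        if ifname not in after:
            continue
        delta = after[ifname] - start
        if delta < 0:
            continue
        rates.append((ifname, delta / interval_s))
    # Fastest-growing counter first.
    return sorted(rates, key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    # Hypothetical counters, taken 60 seconds apart.
    snapshot_1 = {"Gi1/0/1": 1000, "Gi1/0/2": 5000}
    snapshot_2 = {"Gi1/0/1": 1600, "Gi1/0/2": 65000}
    for ifname, pps in rank_by_input_rate(snapshot_1, snapshot_2, 60):
        print(f"{ifname}: {pps:.1f} input packets/s")
```

The port at the top of the list is the first candidate to shut while watching the CPU.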

mitard · ‎06-21-2023

Yes, we have two SNMP monitoring systems polling the device (we are in the middle of a monitoring migration, so both the legacy and the new system poll it), but we do not have DNA Center in our infrastructure.

Leo Laohoo · ‎06-21-2023

Temporarily stop SNMP and watch if the CPU cycles drop.