
Cat. 9200 CPU overloaded with SISF Switcher Th process

mitard
Level 1

Hi guys,

We have a stack of three Catalyst 9200 switches running IOS XE 17.6.5, and its CPU is overloaded by the SISF Switcher Th process:

PID  Runtime(ms)  Invoked   uSecs  5Sec    1Min    5Min    TTY  Process
433  1401525420   72375221  19364  79.83%  81.16%  82.11%  0    SISF Switcher Th
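When this check has to be repeated on several stacks, the row format above can be parsed in a script. The following is a minimal Python sketch, not something posted in the thread: the function names and the 50% cut-off are hypothetical, and the regular expression only assumes the column layout shown in the output above.

```python
import re

# Column layout taken from the output above:
# PID Runtime(ms) Invoked uSecs 5Sec 1Min 5Min TTY Process
ROW = re.compile(
    r"^\s*(\d+)\s+(\d+)\s+(\d+)\s+(\d+)\s+"
    r"([\d.]+)%\s+([\d.]+)%\s+([\d.]+)%\s+(\d+)\s+(.+?)\s*$"
)

def parse_row(line):
    """Return a dict for one process row, or None if the line is not a row."""
    m = ROW.match(line)
    if m is None:
        return None
    pid, runtime_ms, invoked, usecs, s5, m1, m5, tty, name = m.groups()
    return {
        "pid": int(pid),
        "runtime_ms": int(runtime_ms),
        "invoked": int(invoked),
        "usecs": int(usecs),
        "cpu_5sec": float(s5),
        "cpu_1min": float(m1),
        "cpu_5min": float(m5),
        "tty": int(tty),
        "process": name,
    }

def busy_processes(output, threshold=50.0):
    """List rows whose 5-minute CPU figure is at or above threshold.

    The 50% default is an arbitrary, hypothetical cut-off.
    """
    rows = (parse_row(line) for line in output.splitlines())
    return [r for r in rows if r and r["cpu_5min"] >= threshold]

sample = """PID Runtime(ms) Invoked  uSecs   5Sec   1Min   5Min TTY Process
433 1401525420 72375221 19364 79.83% 81.16% 82.11% 0 SISF Switcher Th"""

for row in busy_processes(sample):
    print(row["pid"], row["process"], row["cpu_5min"])
```

The header line is skipped because it does not start with a numeric PID, so the pasted output can be fed in unmodified.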

As per issue CSCvk32439, I implemented an IPv6 filter on the trunk ports, but the issue remained. Eventually I disabled DHCP snooping on all VLANs, but this can only be a short-term workaround.

Could anyone please suggest a long-term remediation?

Regards, Vincent


Leo Laohoo
Hall of Fame

Please post the complete output of the command "sh platform software status con brief".

Here it is (please note that DHCP snooping is currently disabled, so the CPU is not overloaded):

chun-hdie-blssrg-dsw1#sh platform software status control-processor brief
Load Average
Slot Status 1-Min 5-Min 15-Min
1-RP0 Healthy 0.72 0.50 0.45
2-RP0 Healthy 0.25 0.24 0.19
3-RP0 Healthy 0.45 0.27 0.21

Memory (kB)
Slot Status Total Used (Pct) Free (Pct) Committed (Pct)
1-RP0 Healthy 4028724 990376 (25%) 3038348 (75%) 1772980 (44%)
2-RP0 Healthy 4028728 951664 (24%) 3077064 (76%) 1753108 (44%)
3-RP0 Healthy 4028728 790316 (20%) 3238412 (80%) 987476 (25%)

CPU Utilization
Slot   CPU  User   System  Nice  Idle   IRQ   SIRQ  IOwait
1-RP0  0    7.70   3.59    0.00  87.97  0.61  0.10  0.00
       1    5.87   3.81    0.00  89.58  0.51  0.20  0.00
       2    4.76   3.10    0.00  91.40  0.51  0.20  0.00
       3    4.67   3.63    0.00  91.06  0.51  0.10  0.00
2-RP0  0    2.27   2.89    0.00  94.11  0.61  0.10  0.00
       1    2.90   1.65    0.00  94.81  0.51  0.10  0.00
       2    2.80   2.90    0.00  93.66  0.51  0.10  0.00
       3    2.15   2.35    0.00  94.87  0.51  0.10  0.00
3-RP0  0    2.22   1.92    0.00  95.23  0.50  0.10  0.00
       1    1.22   2.24    0.00  96.02  0.40  0.10  0.00
       2    2.65   2.55    0.00  94.38  0.30  0.10  0.00
       3    2.22   1.92    0.00  95.55  0.30  0.00  0.00

Nothing wrong with the control-plane. 
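To compare stack members at a glance, the "CPU Utilization" section of this output can be reduced to one busy figure per slot (100 minus the mean Idle value across cores). This is a minimal Python sketch, assuming only the column order shown above; the function name is hypothetical and the sample is copied from slots 1 and 2 of the posted output.

```python
def cpu_busy_by_slot(output):
    """Map each slot (e.g. '1-RP0') to its mean busy percentage across cores."""
    idle = {}
    slot = None
    in_cpu = False
    for line in output.splitlines():
        fields = line.split()
        if line.startswith("CPU Utilization"):
            in_cpu = True
            continue
        # Skip everything before the CPU section, blank lines and the header.
        if not in_cpu or not fields or fields[0] == "Slot":
            continue
        if len(fields) == 9:
            # First core of a slot: the slot name precedes the 8 columns.
            slot = fields[0]
            fields = fields[1:]
        if len(fields) != 8 or slot is None:
            continue
        # Columns: CPU User System Nice Idle IRQ SIRQ IOwait
        idle.setdefault(slot, []).append(float(fields[4]))
    return {s: round(100.0 - sum(v) / len(v), 2) for s, v in idle.items()}

sample = """Memory (kB)
Slot Status Total Used (Pct) Free (Pct) Committed (Pct)
1-RP0 Healthy 4028724 990376 (25%) 3038348 (75%) 1772980 (44%)

CPU Utilization
Slot CPU User System Nice Idle IRQ SIRQ IOwait
1-RP0 0 7.70 3.59 0.00 87.97 0.61 0.10 0.00
1 5.87 3.81 0.00 89.58 0.51 0.20 0.00
2 4.76 3.10 0.00 91.40 0.51 0.20 0.00
3 4.67 3.63 0.00 91.06 0.51 0.10 0.00
2-RP0 0 2.27 2.89 0.00 94.11 0.61 0.10 0.00
1 2.90 1.65 0.00 94.81 0.51 0.10 0.00
2 2.80 2.90 0.00 93.66 0.51 0.10 0.00
3 2.15 2.35 0.00 94.87 0.51 0.10 0.00"""

for slot, busy in cpu_busy_by_slot(sample).items():
    print(slot, busy)
```

With DHCP snooping disabled, both slots in the sample come out at roughly 10% busy or less, which matches the "healthy" reading above.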

Please post the "1st page" of the output of "sh proc cpu platform sort location switch act r0".

Here is the result of the command :

chun-hdie-blssrg-dsw1#show processes cpu platform sorted location switch active r0
CPU utilization for five seconds: 9%, one minute: 10%, five minutes: 9%
Core 0: CPU utilization for five seconds: 10%, one minute: 10%, five minutes: 10%
Core 1: CPU utilization for five seconds: 8%, one minute: 10%, five minutes: 9%
Core 2: CPU utilization for five seconds: 11%, one minute: 10%, five minutes: 9%
Core 3: CPU utilization for five seconds: 10%, one minute: 9%, five minutes: 9%
Pid PPid 5Sec 1Min 5Min Status Size Name
--------------------------------------------------------------------------------
5786 5771 17% 16% 16% S 99408 fed main event
4610 4356 13% 12% 12% S 220328 linux_iosd-imag
36 2 7% 6% 6% S 0 ksmd
17361 17313 1% 1% 1% S 33204 fman_fp_image
4789 4779 1% 1% 1% S 16408 sif_mgr
29151 29132 0% 0% 0% S 30368 python3
29132 5148 0% 0% 0% S 2572 pman
27761 27754 0% 0% 0% S 14008 cli_agent
27754 3592 0% 0% 0% S 2584 pman
27643 27636 0% 0% 0% S 3892 cmm
27636 3592 0% 0% 0% S 2584 pman
27522 27509 0% 0% 0% S 26288 dbm
27509 3592 0% 0% 0% S 2588 pman
27281 27268 0% 0% 0% S 29928 fman_rp
27268 3592 0% 0% 0% S 2580 pman
26862 26851 0% 0% 0% S 6516 tms
26851 3592 0% 0% 0% S 2580 pman
26579 26570 0% 0% 0% S 29496 smand
26570 3592 0% 0% 0% S 2584 pman
26210 26175 0% 0% 0% S 10392 psd
26175 3592 0% 0% 0% S 2588 pman
25732 9220 0% 0% 0% S 424 sleep
25332 12031 0% 0% 0% S 424 sleep
25287 25275 0% 0% 0% S 664 sntp
25275 1 0% 0% 0% S 1220 stack_sntp.sh
24649 1 0% 0% 0% S 1824 rotee
24478 24474 0% 0% 0% S 2444 iosd_console_at
24474 24320 0% 0% 0% S 1632 bexec.sh
24320 24319 0% 0% 0% S 1524 runin_exec_proc
24319 24090 0% 0% 0% S 2464 in.telnetd
24281 12260 0% 0% 0% S 424 sleep
24090 8550 0% 0% 0% S 1528 runin_exec_proc
23586 1 0% 0% 0% S 1820 rotee
23375 23348 0% 0% 0% S 2444 iosd_console_at

Nothing wrong with the Data Plane, either. 

Is the SISF Switcher process still high?  

Hi Leo,

As mentioned, since I disabled DHCP snooping there is no longer any CPU overload. However, this workaround cannot remain in place in the long term.

Regards, Vincent

Upgrade to 17.9.3 and enable DHCP snooping.  See if it is better.

Is a lot of traffic being punted to the CPU?

https://www.cisco.com/c/en/us/support/docs/ios-nx-os-software/ios-xe-gibraltar-16121/216746-configure-punt-inject-fed-packet-capture.html

Check this guide to see which frames are punted to your CPU.

Be careful to start and stop the debug explicitly.

You already have high CPU, and running the debug on top of it can cause problems.

mitard
Level 1

Hi guys,

Thanks for the suggestions. Since there is already a CPU overload, I will first try upgrading to release 17.9.3, then I'll see how to move forward.

Vincent

mitard
Level 1

Hi,

I eventually upgraded to 17.9.3, but the behavior remains the same. I was able to run the commands suggested by Leo on another device where DHCP snooping has not yet been disabled. Here is the result.

chun-hdie-sr54-dsw1#show platform software status control-processor brief
Load Average
Slot Status 1-Min 5-Min 15-Min
1-RP0 Healthy 3.00 2.22 2.00
2-RP0 Healthy 0.26 0.24 0.21

Memory (kB)
Slot Status Total Used (Pct) Free (Pct) Committed (Pct)
1-RP0 Healthy 4014172 943136 (23%) 3071036 (77%) 1780016 (44%)
2-RP0 Healthy 4014172 904236 (23%) 3109936 (77%) 1766228 (44%)

CPU Utilization
Slot   CPU  User   System  Nice  Idle   IRQ   SIRQ  IOwait
1-RP0  0    28.18  5.67    0.00  64.56  1.26  0.31  0.00
       1    33.08  6.05    0.00  59.60  0.93  0.31  0.00
       2    39.02  4.57    0.00  55.35  0.83  0.20  0.00
       3    39.24  5.01    0.00  54.59  0.83  0.31  0.00
2-RP0  0    1.96   2.58    0.00  94.82  0.51  0.10  0.00
       1    2.15   2.87    0.00  94.35  0.51  0.10  0.00
       2    2.60   2.60    0.00  94.27  0.41  0.10  0.00
       3    2.69   3.21    0.00  93.66  0.41  0.00  0.00

chun-hdie-sr54-dsw1#show processes cpu platform sorted location switch active r0
CPU utilization for five seconds: 42%, one minute: 42%, five minutes: 42%
Core 0: CPU utilization for five seconds: 38%, one minute: 42%, five minutes: 42%
Core 1: CPU utilization for five seconds: 48%, one minute: 42%, five minutes: 43%
Core 2: CPU utilization for five seconds: 29%, one minute: 41%, five minutes: 42%
Core 3: CPU utilization for five seconds: 52%, one minute: 43%, five minutes: 41%
Pid PPid 5Sec 1Min 5Min Status Size Name
--------------------------------------------------------------------------------
3894 3835 106% 105% 104% R 222112 linux_iosd-imag
5172 5147 35% 35% 35% S 97868 fed main event
35 2 7% 7% 7% S 0 ksmd
5505 5470 3% 3% 3% S 8428 btman
728 2 2% 2% 1% S 0 lsmpi-xmit
18420 18412 1% 2% 2% S 15064 repm
16073 16040 1% 1% 1% S 33316 fman_fp_image
7796 1 1% 1% 1% S 5452 chasync.sh
7518 7509 1% 2% 2% S 13848 btman
4195 4180 1% 1% 1% S 16688 sif_mgr
729 2 1% 1% 1% S 0 lsmpi-rx
31911 2 0% 0% 0% I 0 kworker/1:0-pm
31136 2 0% 0% 0% I 0 kworker/u8:3-kverity
31124 8521 0% 0% 0% S 428 sleep
30639 11209 0% 0% 0% S 424 sleep
30146 2 0% 0% 0% I 0 kworker/u8:0-kverity
29438 11480 0% 0% 0% S 424 sleep
29198 2 0% 0% 0% I 0 kworker/0:0H-mmc_com
28930 2 0% 0% 0% I 0 kworker/3:1H
28662 2 0% 0% 0% I 0 kworker/2:0H
28535 2 0% 0% 0% I 0 kworker/3:0-cgroup_d
24251 2 0% 0% 0% S 0 SarIosdMond
23602 23589 0% 0% 0% S 30036 python3
23589 4440 0% 0% 0% S 2772 pman
23362 23334 0% 0% 0% S 15436 cli_agent
23334 3127 0% 0% 0% S 2780 pman
23191 23186 0% 0% 0% S 4004 cmm
23186 3127 0% 0% 0% S 2776 pman
23076 23070 0% 0% 0% S 26724 dbm
23070 3127 0% 0% 0% S 2780 pman
22838 22828 0% 0% 0% S 27880 fman_rp
22828 3127 0% 0% 0% S 2776 pman
22483 22464 0% 0% 0% S 6572 tms
22464 3127 0% 0% 0% S 2780 pman
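The 106% shown for linux_iosd-imag can look impossible next to an overall figure of 42%. One plausible reading, which is an assumption and not something confirmed in this thread, is that per-process figures are relative to a single core (as Linux top reports them), while the summary line is averaged over all four cores. A minimal Python sketch of that arithmetic, using the non-zero 5Sec values from the output above (the helper name is hypothetical):

```python
CORES = 4  # Core 0 to Core 3 in the output above

# (pid, name, 5Sec %) for the non-zero rows of the output above
five_sec = [
    (3894, "linux_iosd-imag", 106),
    (5172, "fed main event", 35),
    (35, "ksmd", 7),
    (5505, "btman", 3),
    (728, "lsmpi-xmit", 2),
    (18420, "repm", 1),
    (16073, "fman_fp_image", 1),
    (7796, "chasync.sh", 1),
    (7518, "btman", 1),
    (4195, "sif_mgr", 1),
    (729, "lsmpi-rx", 1),
]

def share_of_total(per_core_pct, cores=CORES):
    """Convert a single-core percentage into a share of the whole CPU.

    Assumes per-process figures are relative to one core.
    """
    return per_core_pct / cores

estimated_total = sum(pct for _, _, pct in five_sec) / CORES

print("linux_iosd-imag share of total:", share_of_total(106))
print("estimated overall utilization:", estimated_total)
```

Under that assumption, linux_iosd-imag accounts for about a quarter of the total CPU and the listed processes add up to roughly 40%, which is close to the reported 42%.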

I'll try the debug later, during a low-activity window.

Regards, Vincent


@mitard wrote:
3894  3835 106% 105% 104% R      222112 linux_iosd-imag

Is there any SNMP monitoring going on? If so, how many pollers?
Is DNAC polling this stack?

Cat9k#show platform software fed switch active punt packet-capture brief

As I mentioned before, some server or host is probably scanning or flooding packets across your network.
First, share the output of the command above.
Second, run "show interface" and check the input traffic on each interface to see which port's input counters (unicast, broadcast, multicast) are increasing rapidly.
Lastly, shut the ports one by one and monitor the CPU %.
That's it.
35% for fed is too high.
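The second step (finding the port whose input counters grow fastest) is easier with two counter snapshots taken a known interval apart. The following is a minimal Python sketch of that comparison. The interface names and counter values are hypothetical illustration data, not taken from this stack, and collecting the snapshots from "show interface" is left out.

```python
def fastest_growing(before, after, interval_s, top=3):
    """Rank interfaces by input packets per second between two snapshots."""
    rates = {}
    for port, new in after.items():
        old = before.get(port)
        if old is None or new < old:
            # Port missing from the first snapshot, or counter cleared/wrapped.
            continue
        rates[port] = (new - old) / interval_s
    return sorted(rates.items(), key=lambda kv: kv[1], reverse=True)[:top]

# Hypothetical input-packet counters, 60 seconds apart
before = {"Gi1/0/1": 1000, "Gi1/0/2": 5000, "Gi1/0/3": 200}
after = {"Gi1/0/1": 1600, "Gi1/0/2": 65000, "Gi1/0/3": 260}

for port, pps in fastest_growing(before, after, interval_s=60):
    print(port, pps)
```

The port at the top of the list is the first candidate to shut while watching the CPU, as suggested above.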

mitard
Level 1

Yes, we have two SNMP monitoring systems polling the device (we are in the middle of a monitoring migration, so both the legacy and the new system poll it), but we do not have DNA Center in our infrastructure.

Temporarily stop SNMP and see whether the CPU utilization drops.