
Cisco 9800 High CPU every 15 minutes caused by SAMsgThread process

DATHOZ
Level 1

We have 6 pairs of 9800-80 in HA, and on all pairs a process (SAMsgThread) runs every 15 minutes and drives up the 9800 controller CPU. SAMsgThread is the process responsible for Smart Licensing operations. Depending on the time of day, the CPU hits 100%, which may affect client transactions depending on the number of APs hosted on the controller.


The controller is now running 17.6.4; a couple of weeks ago it was on 17.6.2.

I have a ticket opened but they are not helping much. Has anybody experienced this issue?

Looking at the output of "show license eventlog", it displays the following every 15 minutes:

2022-09-30 04:19:53.428 MST SAEVT_HA_MESSAGE messageType="SmartAgentHaMsgTSFileChange"
2022-09-30 04:34:53.428 MST SAEVT_HA_MESSAGE messageType="SmartAgentHaMsgTSFileChange"
2022-09-30 04:49:53.560 MST SAEVT_HA_MESSAGE messageType="SmartAgentHaMsgTSFileChange"
2022-09-30 05:04:53.486 MST SAEVT_HA_MESSAGE messageType="SmartAgentHaMsgTSFileChange"
2022-09-30 05:19:53.554 MST SAEVT_HA_MESSAGE messageType="SmartAgentHaMsgTSFileChange"
2022-09-30 05:34:53.561 MST SAEVT_HA_MESSAGE messageType="SmartAgentHaMsgTSFileChange"
2022-09-30 05:49:53.464 MST SAEVT_HA_MESSAGE messageType="SmartAgentHaMsgTSFileChange"
2022-09-30 06:04:53.214 MST SAEVT_HA_MESSAGE messageType="SmartAgentHaMsgTSFileChange"
2022-09-30 06:19:53.462 MST SAEVT_HA_MESSAGE messageType="SmartAgentHaMsgTSFileChange"
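To confirm the cadence from a saved copy of the event log, a minimal Python sketch along these lines could compute the gap between consecutive events. The function name `event_intervals` and the embedded sample are illustrative assumptions; the parsing only assumes the line layout shown above (date, time, time zone, event name).

```python
from datetime import datetime

# Three lines copied from the "show license eventlog" output above.
EVENTLOG = """\
2022-09-30 04:19:53.428 MST SAEVT_HA_MESSAGE messageType="SmartAgentHaMsgTSFileChange"
2022-09-30 04:34:53.428 MST SAEVT_HA_MESSAGE messageType="SmartAgentHaMsgTSFileChange"
2022-09-30 04:49:53.560 MST SAEVT_HA_MESSAGE messageType="SmartAgentHaMsgTSFileChange"
"""

def event_intervals(text, event="SAEVT_HA_MESSAGE"):
    """Return the gaps, in whole seconds, between consecutive matching events."""
    stamps = []
    for line in text.splitlines():
        # Tolerate a stray quote pasted around the log excerpt.
        parts = line.strip().strip('"').split()
        if len(parts) < 4 or parts[3] != event:
            continue
        # The time zone abbreviation (parts[2]) is ignored; all lines share it.
        stamps.append(datetime.strptime(f"{parts[0]} {parts[1]}",
                                        "%Y-%m-%d %H:%M:%S.%f"))
    return [round((b - a).total_seconds()) for a, b in zip(stamps, stamps[1:])]

if __name__ == "__main__":
    print(event_intervals(EVENTLOG))
```

A steady run of 900-second gaps would match the 15-minute CPU spikes described above.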

 


Leo Laohoo
Hall of Fame

Post the complete output to the following commands: 

  1. sh platform resources
  2. sh platform software status control-processor brief

 

sh platform resources
**State Acronym: H - Healthy, W - Warning, C - Critical
Resource                 Usage                 Max             Warning         Critical        State
----------------------------------------------------------------------------------------------------
RP0 (ok, active)                                                                               H
 Control Processor       9.10%                 100%            80%             90%             H
  DRAM                   9898MB(15%)           62892MB         88%             93%             H
  harddisk               0MB(0%)               0MB             80%             85%             H
ESP0(ok, active)                                                                               H
 QFP                                                                                           H
  TCAM                   100cells(0%)          1048576cells    65%             85%             H
  DRAM                   679250KB(16%)         4194304KB       85%             95%             H
  IRAM                   14764KB(11%)          131072KB        85%             95%             H
  CPU Utilization        2.00%                 100%            90%             95%             H

sh platform software status control-processor brief
Load Average
 Slot   Status   1-Min  5-Min  15-Min
 1-RP0  Healthy   1.88   1.77    1.98
 2-RP0  Healthy   1.17   1.00    1.06

Memory (kB)
 Slot   Status      Total      Used (Pct)      Free (Pct)  Committed (Pct)
 1-RP0  Healthy  64402204  10121344 (16%)  54280860 (84%)   18299220 (28%)
 2-RP0  Healthy  64402204   7306124 (11%)  57096080 (89%)   16134092 (25%)

CPU Utilization
Slot CPU User System Nice Idle IRQ SIRQ IOwait
1-RP0 0 3.00 1.10 0.00 95.90 0.00 0.00 0.00
1 5.39 1.19 0.00 93.30 0.00 0.09 0.00
2 6.50 1.20 0.00 92.30 0.00 0.00 0.00
3 4.50 1.50 0.00 93.99 0.00 0.00 0.00
4 19.91 1.30 0.00 78.77 0.00 0.00 0.00
5 21.50 2.80 0.00 75.70 0.00 0.00 0.00
6 7.20 2.10 0.00 90.60 0.00 0.10 0.00
7 6.29 1.49 0.00 92.20 0.00 0.00 0.00
8 9.50 2.10 0.00 88.40 0.00 0.00 0.00
9 6.20 1.40 0.00 92.39 0.00 0.00 0.00
10 8.30 2.40 0.00 89.30 0.00 0.00 0.00
11 6.30 1.50 0.00 92.20 0.00 0.00 0.00
12 1.40 1.10 0.00 93.19 0.00 4.30 0.00
13 2.50 1.20 0.00 95.79 0.00 0.50 0.00
14 8.59 2.39 0.00 88.91 0.00 0.09 0.00
15 4.30 1.20 0.00 94.40 0.00 0.10 0.00
16 6.50 1.70 0.00 91.80 0.00 0.00 0.00
17 3.49 1.59 0.00 94.90 0.00 0.00 0.00
18 7.80 3.30 0.00 88.78 0.00 0.10 0.00
19 4.70 1.30 0.00 94.00 0.00 0.00 0.00
20 5.30 1.90 0.00 92.40 0.00 0.40 0.00
21 7.70 1.90 0.00 90.39 0.00 0.00 0.00
22 8.50 1.70 0.00 88.70 0.00 1.10 0.00
23 10.38 3.39 0.00 86.21 0.00 0.00 0.00
2-RP0 0 0.30 0.20 0.00 99.50 0.00 0.00 0.00
1 12.01 6.20 0.00 81.78 0.00 0.00 0.00
2 0.30 0.20 0.00 99.49 0.00 0.00 0.00
3 0.70 0.40 0.00 98.90 0.00 0.00 0.00
4 2.79 0.89 0.00 96.30 0.00 0.00 0.00
5 6.90 2.80 0.00 90.30 0.00 0.00 0.00
6 1.20 0.30 0.00 98.50 0.00 0.00 0.00
7 2.90 0.80 0.00 96.30 0.00 0.00 0.00
8 3.20 0.60 0.00 96.20 0.00 0.00 0.00
9 5.80 1.10 0.00 93.09 0.00 0.00 0.00
10 1.00 0.20 0.00 98.79 0.00 0.00 0.00
11 2.09 0.39 0.00 97.40 0.00 0.09 0.00
12 1.60 0.40 0.00 98.00 0.00 0.00 0.00
13 0.19 0.39 0.00 99.40 0.00 0.00 0.00
14 3.40 0.80 0.00 95.80 0.00 0.00 0.00
15 1.39 1.99 0.00 96.60 0.00 0.00 0.00
16 1.00 0.30 0.00 98.70 0.00 0.00 0.00
17 0.89 0.19 0.00 98.80 0.00 0.09 0.00
18 1.20 0.30 0.00 98.50 0.00 0.00 0.00
19 4.60 0.90 0.00 94.50 0.00 0.00 0.00
20 0.30 0.20 0.00 99.50 0.00 0.00 0.00
21 6.70 2.80 0.00 88.10 0.00 2.40 0.00
22 2.20 1.10 0.00 96.50 0.00 0.20 0.00
23 0.89 0.69 0.00 98.40 0.00 0.00 0.00

Raise a TAC Case. 

Memory utilization (>16%) is abnormally high.

Rich R
VIP

I'll be interested to know the outcome, as we're due to upgrade production WLCs from 17.6.2 to 17.6.4 in the next 2 weeks.

I have 9800-80 HA SSO in lab not showing any of those CPU spikes and not seeing any of those events in logs either.  Lab WLC has very few APs and clients though so that might be the difference.

How do you have smart licensing configured? Ours has call-home service disabled and reporting direct to CSSM using smart transport.

I have about 2900 APs on the controller. You won't see the CPU issue unless you have 2k-plus APs running on the controller. My test controller does not have that issue. By the way, I had the issue on 17.6.2 too.

Yeah, I tried with license direct and disable. It did not make a change on the CPU spike.

Beazle
Level 1

I have a pair of 9800-80s in HA and am also seeing the CPU spike every 15 minutes. How were you able to find which process was causing the issue?

Start with the complete output to the following commands: 

  1. sh version (remove the hostname)
  2. sh platform resources
  3. sh platform software status control-processor brief

Hey @Leo Laohoo

Please see attached. 

I am not seeing anything wrong with the output. 

I can, however, see the spike of CPU every 15 minutes.  Please rerun the following command when the CPUs spike because I want to take a snapshot of which CPU is actually hot spinning.

So every 15, 30, 45 or top of the hour, re-run the command several times.  
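When comparing several snapshots, it can help to rank the cores by how busy they are. The Python sketch below is one possible way to do that for the "CPU Utilization" section of "sh platform software status control-processor brief". The function name `busiest_cores` is a hypothetical helper, and the parsing assumes the column layout shown earlier in this thread (Slot, CPU, User, System, Nice, Idle, IRQ, SIRQ, IOwait), with the slot name printed only on the first row of each slot.

```python
# Excerpt of the CPU Utilization section posted earlier in the thread.
SNAPSHOT = """\
CPU Utilization
Slot CPU User System Nice Idle IRQ SIRQ IOwait
1-RP0 0 3.00 1.10 0.00 95.90 0.00 0.00 0.00
1 5.39 1.19 0.00 93.30 0.00 0.09 0.00
2 6.50 1.20 0.00 92.30 0.00 0.00 0.00
2-RP0 0 0.30 0.20 0.00 99.50 0.00 0.00 0.00
1 12.01 6.20 0.00 81.78 0.00 0.00 0.00
"""

def busiest_cores(cpu_section, top=3):
    """Return up to `top` tuples (busy_pct, slot, core), busiest first."""
    rows, slot = [], None
    for line in cpu_section.splitlines():
        fields = line.split()
        if len(fields) == 9:        # row that starts with a slot name
            candidate, values = fields[0], fields[1:]
        elif len(fields) == 8:      # continuation row for the current slot
            candidate, values = slot, fields
        else:
            continue                # titles and blank lines
        try:
            core = int(values[0])
            idle = float(values[4])  # Idle is the fifth value after the slot
        except ValueError:
            continue                # header row
        if candidate is None:
            continue
        slot = candidate
        rows.append((round(100.0 - idle, 2), slot, core))
    return sorted(rows, reverse=True)[:top]

if __name__ == "__main__":
    for busy, slot, core in busiest_cores(SNAPSHOT):
        print(f"{slot} core {core}: {busy}% busy")
```

Running this against snapshots taken just before and during a spike would show whether the same core is hot each time.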

Hey @Leo Laohoo 

Does this capture what you are looking for? In the last command that I ran it looks like the linux_iosd-imag process spikes for a bit.

 


@Beazle wrote:
 I ran it looks like the linux_iosd-imag process spikes for a bit.

linux_iosd-imag is a process related or attributed to telemetry.  Is there an SNMP server (and how many) and/or DNAC? 

@Leo Laohoo 

We have an SNMP server that polls bandwidth and traffic every 5 minutes. We also have Cisco Prime and DNAC set up for telemetry. Do you think that could be too many devices polling the controller and causing a CPU spike?



@Beazle wrote:
Do you think that could be too many devices polling the controller and causing a CPU spike?

No, DNAC is. 

Try it.  Remove DNAC from polling the stack for 48 hours and compare the results.
