Cisco 9800 High CPU every 15 minutes caused by SAMsgThread process
10-04-2022 02:49 PM - edited 10-04-2022 04:00 PM
We have six pairs of 9800-80 controllers in HA, and on every pair a process (SAMsgThread) runs every 15 minutes and drives up the controller CPU. SAMsgThread is the process responsible for Smart Licensing operations. Depending on the time of day, the CPU hits 100%, which may affect client transactions depending on the number of APs hosted on the controller.
The controllers are now running 17.6.4; a couple of weeks ago they were on 17.6.2.
I have a ticket open, but they are not helping much. Has anybody experienced this issue?
The output of "show license eventlog" shows the following every 15 minutes:
2022-09-30 04:19:53.428 MST SAEVT_HA_MESSAGE messageType="SmartAgentHaMsgTSFileChange"
2022-09-30 04:34:53.428 MST SAEVT_HA_MESSAGE messageType="SmartAgentHaMsgTSFileChange"
2022-09-30 04:49:53.560 MST SAEVT_HA_MESSAGE messageType="SmartAgentHaMsgTSFileChange"
2022-09-30 05:04:53.486 MST SAEVT_HA_MESSAGE messageType="SmartAgentHaMsgTSFileChange"
2022-09-30 05:19:53.554 MST SAEVT_HA_MESSAGE messageType="SmartAgentHaMsgTSFileChange"
2022-09-30 05:34:53.561 MST SAEVT_HA_MESSAGE messageType="SmartAgentHaMsgTSFileChange"
2022-09-30 05:49:53.464 MST SAEVT_HA_MESSAGE messageType="SmartAgentHaMsgTSFileChange"
2022-09-30 06:04:53.214 MST SAEVT_HA_MESSAGE messageType="SmartAgentHaMsgTSFileChange"
2022-09-30 06:19:53.462 MST SAEVT_HA_MESSAGE messageType="SmartAgentHaMsgTSFileChange"
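To double-check the 15-minute cadence rather than eyeballing it, the gaps between consecutive events can be computed from the eventlog text. This is only a rough sketch: the function name `event_intervals` is made up for illustration, it assumes each line starts with a date and a time in the layout shown above, and it ignores the time zone field.

```python
from datetime import datetime

# First three lines copied from the "show license eventlog" output above.
SAMPLE = """\
2022-09-30 04:19:53.428 MST SAEVT_HA_MESSAGE messageType="SmartAgentHaMsgTSFileChange"
2022-09-30 04:34:53.428 MST SAEVT_HA_MESSAGE messageType="SmartAgentHaMsgTSFileChange"
2022-09-30 04:49:53.560 MST SAEVT_HA_MESSAGE messageType="SmartAgentHaMsgTSFileChange"
"""


def event_intervals(text, event="SAEVT_HA_MESSAGE"):
    """Return the gaps, in seconds, between consecutive matching events."""
    stamps = []
    for line in text.splitlines():
        # Tolerate a stray leading quote from copy/paste.
        parts = line.strip().lstrip('"').split()
        if len(parts) < 4 or parts[3] != event:
            continue
        # parts[0] is the date, parts[1] the time; parts[2] (time zone) is ignored.
        stamps.append(
            datetime.strptime(f"{parts[0]} {parts[1]}", "%Y-%m-%d %H:%M:%S.%f")
        )
    return [(b - a).total_seconds() for a, b in zip(stamps, stamps[1:])]


gaps = event_intervals(SAMPLE)
print([round(g / 60, 2) for g in gaps])  # gaps in minutes
```

Pasting the full eventlog into `SAMPLE` would show whether every gap sits at roughly 900 seconds.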
10-04-2022 07:47 PM
Post the complete output of the following commands:
- sh platform resources
- sh platform software status control-processor brief
10-05-2022 07:50 AM
sh platform resources
**State Acronym: H - Healthy, W - Warning, C - Critical
Resource            Usage          Max           Warning  Critical  State
----------------------------------------------------------------------------------------------------
RP0 (ok, active)                                                    H
 Control Processor  9.10%          100%          80%      90%       H
 DRAM               9898MB(15%)    62892MB       88%      93%       H
 harddisk           0MB(0%)        0MB           80%      85%       H
ESP0(ok, active)                                                    H
 QFP                                                                H
  TCAM              100cells(0%)   1048576cells  65%      85%       H
  DRAM              679250KB(16%)  4194304KB     85%      95%       H
  IRAM              14764KB(11%)   131072KB      85%      95%       H
  CPU Utilization   2.00%          100%          90%      95%       H
sh platform software status control-processor brief
Load Average
 Slot  Status  1-Min  5-Min 15-Min
1-RP0 Healthy   1.88   1.77   1.98
2-RP0 Healthy   1.17   1.00   1.06
Memory (kB)
 Slot  Status    Total     Used (Pct)     Free (Pct) Committed (Pct)
1-RP0 Healthy 64402204 10121344 (16%) 54280860 (84%)  18299220 (28%)
2-RP0 Healthy 64402204  7306124 (11%) 57096080 (89%)  16134092 (25%)
CPU Utilization
 Slot  CPU   User System   Nice   Idle    IRQ   SIRQ IOwait
1-RP0    0   3.00   1.10   0.00  95.90   0.00   0.00   0.00
         1   5.39   1.19   0.00  93.30   0.00   0.09   0.00
         2   6.50   1.20   0.00  92.30   0.00   0.00   0.00
         3   4.50   1.50   0.00  93.99   0.00   0.00   0.00
         4  19.91   1.30   0.00  78.77   0.00   0.00   0.00
         5  21.50   2.80   0.00  75.70   0.00   0.00   0.00
         6   7.20   2.10   0.00  90.60   0.00   0.10   0.00
         7   6.29   1.49   0.00  92.20   0.00   0.00   0.00
         8   9.50   2.10   0.00  88.40   0.00   0.00   0.00
         9   6.20   1.40   0.00  92.39   0.00   0.00   0.00
        10   8.30   2.40   0.00  89.30   0.00   0.00   0.00
        11   6.30   1.50   0.00  92.20   0.00   0.00   0.00
        12   1.40   1.10   0.00  93.19   0.00   4.30   0.00
        13   2.50   1.20   0.00  95.79   0.00   0.50   0.00
        14   8.59   2.39   0.00  88.91   0.00   0.09   0.00
        15   4.30   1.20   0.00  94.40   0.00   0.10   0.00
        16   6.50   1.70   0.00  91.80   0.00   0.00   0.00
        17   3.49   1.59   0.00  94.90   0.00   0.00   0.00
        18   7.80   3.30   0.00  88.78   0.00   0.10   0.00
        19   4.70   1.30   0.00  94.00   0.00   0.00   0.00
        20   5.30   1.90   0.00  92.40   0.00   0.40   0.00
        21   7.70   1.90   0.00  90.39   0.00   0.00   0.00
        22   8.50   1.70   0.00  88.70   0.00   1.10   0.00
        23  10.38   3.39   0.00  86.21   0.00   0.00   0.00
2-RP0    0   0.30   0.20   0.00  99.50   0.00   0.00   0.00
         1  12.01   6.20   0.00  81.78   0.00   0.00   0.00
         2   0.30   0.20   0.00  99.49   0.00   0.00   0.00
         3   0.70   0.40   0.00  98.90   0.00   0.00   0.00
         4   2.79   0.89   0.00  96.30   0.00   0.00   0.00
         5   6.90   2.80   0.00  90.30   0.00   0.00   0.00
         6   1.20   0.30   0.00  98.50   0.00   0.00   0.00
         7   2.90   0.80   0.00  96.30   0.00   0.00   0.00
         8   3.20   0.60   0.00  96.20   0.00   0.00   0.00
         9   5.80   1.10   0.00  93.09   0.00   0.00   0.00
        10   1.00   0.20   0.00  98.79   0.00   0.00   0.00
        11   2.09   0.39   0.00  97.40   0.00   0.09   0.00
        12   1.60   0.40   0.00  98.00   0.00   0.00   0.00
        13   0.19   0.39   0.00  99.40   0.00   0.00   0.00
        14   3.40   0.80   0.00  95.80   0.00   0.00   0.00
        15   1.39   1.99   0.00  96.60   0.00   0.00   0.00
        16   1.00   0.30   0.00  98.70   0.00   0.00   0.00
        17   0.89   0.19   0.00  98.80   0.00   0.09   0.00
        18   1.20   0.30   0.00  98.50   0.00   0.00   0.00
        19   4.60   0.90   0.00  94.50   0.00   0.00   0.00
        20   0.30   0.20   0.00  99.50   0.00   0.00   0.00
        21   6.70   2.80   0.00  88.10   0.00   2.40   0.00
        22   2.20   1.10   0.00  96.50   0.00   0.20   0.00
        23   0.89   0.69   0.00  98.40   0.00   0.00   0.00
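With 24 cores per chassis it is easy to miss the busy ones in a snapshot like this. A minimal sketch for picking them out, assuming the text fed in is only the "control-processor brief" output and that the CPU rows use the column order shown above (CPU, User, System, Nice, Idle, IRQ, SIRQ, IOwait). The function name `busy_cores` and the 80% idle threshold are arbitrary choices for illustration, not anything from Cisco.

```python
# A few rows copied from the CPU Utilization section above.
SAMPLE = """\
CPU Utilization
Slot CPU User System Nice Idle IRQ SIRQ IOwait
1-RP0 0 3.00 1.10 0.00 95.90 0.00 0.00 0.00
4 19.91 1.30 0.00 78.77 0.00 0.00 0.00
5 21.50 2.80 0.00 75.70 0.00 0.00 0.00
2-RP0 0 0.30 0.20 0.00 99.50 0.00 0.00 0.00
1 12.01 6.20 0.00 81.78 0.00 0.00 0.00
"""


def busy_cores(text, idle_below=80.0):
    """Return (slot, core, idle%) for each core whose Idle value is below the threshold."""
    hits = []
    slot = None
    in_cpu = False
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue
        if fields == ["CPU", "Utilization"]:
            in_cpu = True  # the per-core table starts after this heading
            continue
        if not in_cpu or fields[0] == "Slot":
            continue
        # The first row of each slot carries the slot name (9 fields);
        # the remaining rows for that slot have 8 fields.
        if len(fields) == 9:
            slot, fields = fields[0], fields[1:]
        if len(fields) != 8:
            continue
        try:
            core, idle = int(fields[0]), float(fields[4])
        except ValueError:
            continue  # not a per-core row
        if idle < idle_below:
            hits.append((slot, core, idle))
    return hits


print(busy_cores(SAMPLE))
```

Running this against several snapshots taken during a spike would show whether the same cores are hot each time.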
10-05-2022 03:15 PM
Raise a TAC Case.
Memory utilization (>16%) is abnormally high.
10-06-2022 08:21 AM - edited 06-27-2023 07:21 AM
I'll be interested to know the outcome, as we're due to upgrade our production WLCs from 17.6.2 to 17.6.4 in the next 2 weeks.
I have a 9800-80 HA SSO pair in the lab that is not showing any of those CPU spikes, and I'm not seeing any of those events in the logs either. The lab WLC has very few APs and clients, though, so that might be the difference.
How do you have smart licensing configured? Ours has the call-home service disabled and reports directly to CSSM using smart transport.
10-06-2022 03:35 PM
I have about 2900 APs on the controller. You won't see the CPU issue unless you have 2,000 or more APs on the controller; my test controller does not have the issue. By the way, I had the issue on 17.6.2 too.
10-06-2022 03:37 PM
Yeah, I tried both with licensing set to direct and with it disabled. Neither made any difference to the CPU spike.
06-26-2023 11:44 AM
I have a pair of 9800-80s in HA and am also seeing the CPU spike every 15 minutes. How were you able to find which process was causing the issue?
06-26-2023 04:39 PM
Start with the complete output of the following commands:
- sh version (remove the hostname)
- sh platform resources
- sh platform software status control-processor brief
06-27-2023 05:25 AM
Hey @Leo Laohoo,
Please see attached.
06-27-2023 06:07 AM
I am not seeing anything wrong with the output.
I can, however, see the CPU spike every 15 minutes. Please re-run the following command while the CPUs are spiking, because I want a snapshot of which CPU is actually hot spinning.
So at 15, 30 and 45 minutes past the hour, and at the top of the hour, re-run the command several times.
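If it helps with timing the re-runs, the upcoming quarter-hour marks can be worked out ahead of time. This is just a sketch: the helper name `next_quarter_marks` is made up, and it assumes the spikes line up with :00, :15, :30 and :45. If the spikes on a given controller land at some other offset, the marks would need shifting by that offset.

```python
from datetime import datetime, timedelta


def next_quarter_marks(now, count=4):
    """Return the next `count` quarter-hour boundaries strictly after `now`."""
    base = now.replace(minute=0, second=0, microsecond=0)
    step = timedelta(minutes=15)
    # Jump to the first boundary after the current 15-minute slot.
    first = base + step * (now.minute // 15 + 1)
    return [first + step * i for i in range(count)]


for mark in next_quarter_marks(datetime.now()):
    print(mark.strftime("%H:%M"), "- run the command a few times around this mark")
```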
06-27-2023 06:44 AM
Hey @Leo Laohoo
Does this capture what you are looking for? In the last command that I ran it looks like the linux_iosd-imag process spikes for a bit.
06-27-2023 04:17 PM
@Beazle wrote:
I ran it looks like the linux_iosd-imag process spikes for a bit.
linux_iosd-imag is a process related, or attributed, to telemetry. Is there an SNMP server (and how many) and/or DNAC?
06-28-2023 05:12 AM
We have an SNMP server that polls bandwidth and traffic every 5 minutes. We also have Cisco Prime and DNAC set up for telemetry. Do you think that could be too many devices polling the controller and causing a CPU spike?
06-28-2023 05:55 AM
@Beazle wrote:
Do you think that could be too many devices polling the controller and causing a CPU spike?
No, DNAC is.
Try it. Remove DNAC from polling the stack for 48 hours and compare the results.
