Re: 9800 wncmgrd high CPU

eglinsky2012 · ‎09-08-2023

We have two SSO pairs of 9800-80 WLCs running 17.9.4. This semester is the first time we've used them in production (we are a university). For now, we only have 1,400 APs on each. I decided to check the WNCD processes and found that wncmgrd is using 99% on both controllers.

I'm not able to find much information on this process. What does it do, and is there anything I can do to reduce that usage? Would it being high affect client connectivity?

Admittedly, a significant portion of the APs are in the default site tag due to filtering issues. (I discussed this on another thread and have to wait until after a no-change period to resolve.) So, I understand the wncd_x processes could be affected during periods of high mobility, but I haven't read anything about wncmgrd specifically.

Example outputs:

9800-Pair1#show processes cpu platform sorted | i wnc|Name
Pid PPid 5Sec 1Min 5Min Status Size Name
19462 19453 99% 99% 99% R 1441524 wncmgrd
20415 20407 18% 19% 19% S 888876 wncd_7
19725 19717 17% 16% 16% S 1034624 wncd_1
19840 19832 13% 12% 12% S 830236 wncd_2
20185 20177 12% 11% 11% S 957740 wncd_5
19610 19602 9% 9% 11% S 877984 wncd_0
20301 20292 8% 8% 9% S 936452 wncd_6
20070 20062 6% 6% 6% S 696384 wncd_4
19955 19947 0% 11% 11% S 846048 wncd_3

9800-Pair2#show processes cpu platform sorted | i wnc|Name
Pid PPid 5Sec 1Min 5Min Status Size Name
19444 19435 98% 98% 98% R 1557576 wncmgrd
19592 19584 20% 21% 21% S 807160 wncd_0
20052 20044 15% 13% 12% S 876928 wncd_4
19707 19699 15% 15% 13% S 860964 wncd_1
20284 20274 12% 12% 11% R 901612 wncd_6
19822 19814 12% 11% 10% S 831764 wncd_2
20167 20159 11% 9% 8% S 813364 wncd_5
20396 20389 8% 9% 8% S 820512 wncd_7
19937 19929 0% 15% 16% S 901700 wncd_3
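
For anyone who wants to check this output automatically instead of by eye, here is a minimal sketch in Python. It assumes the CLI text has already been captured as a string, for example pasted from a terminal session. The function name `busy_processes` and the threshold of 90 are illustrative choices, not part of any Cisco tool.

```python
import re

# Matches data rows such as: "19462 19453 99% 99% 99% R 1441524 wncmgrd"
ROW = re.compile(
    r"^\s*(?P<pid>\d+)\s+(?P<ppid>\d+)\s+"
    r"(?P<s5>\d+)%\s+(?P<m1>\d+)%\s+(?P<m5>\d+)%\s+"
    r"(?P<status>\S+)\s+(?P<size>\d+)\s+(?P<name>\S+)\s*$"
)

def busy_processes(cli_output, threshold=90):
    """Return (name, 5Min CPU %) for rows whose 5Min value is >= threshold.

    Hypothetical helper: the threshold is an arbitrary illustration,
    not a Cisco-recommended limit.
    """
    hits = []
    for line in cli_output.splitlines():
        m = ROW.match(line)  # the header row does not match and is skipped
        if m and int(m.group("m5")) >= threshold:
            hits.append((m.group("name"), int(m.group("m5"))))
    return hits

sample = """\
Pid PPid 5Sec 1Min 5Min Status Size Name
19462 19453 99% 99% 99% R 1441524 wncmgrd
20415 20407 18% 19% 19% S 888876 wncd_7
19955 19947 0% 11% 11% S 846048 wncd_3
"""

print(busy_processes(sample))  # [('wncmgrd', 99)]
```

Using the 5Min column rather than 5Sec avoids flagging short bursts, such as the momentary 0% reading on wncd_3 above.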

Rasika Nayanajith · ‎09-08-2023

Constant high CPU on that process does not look good. If you have TAC support, it would be best to reach out to them to find the root cause of the high CPU.

The output of the CLI command below may provide some hints:

show logging process wncmgrd internal

There are a few other forum threads that may help as well:

https://community.cisco.com/t5/wireless/wlc-9800-cpu-utilize-issue/td-p/4871579
https://community.cisco.com/t5/wireless/9800-40-iosd-high-cpu-utilization-1000-aps-and-ap-snmp-requests/td-p/4640366

HTH
Rasika

Leo Laohoo · ‎09-08-2023

Post the complete output of the following commands:

sh platform resources
sh platform software status con brief

eglinsky2012 · ‎09-09-2023

Hi Leo,

9800-Pair1#sh platform resources
**State Acronym: H - Healthy, W - Warning, C - Critical
Resource                 Usage                 Max             Warning         Critical        State
----------------------------------------------------------------------------------------------------
RP0 (ok, active)                                                                               H
 Control Processor       8.69%                 100%            80%             90%             H
  DRAM                   10781MB(17%)          62892MB         88%             93%             H
  harddisk               0MB(0%)               0MB             80%             85%             H
ESP0(ok, active)                                                                               H
 QFP                                                                                           H
  TCAM                   78cells(0%)           1048576cells    65%             85%             H
  DRAM                   655404KB(15%)         4194304KB       85%             95%             H
  IRAM                   14764KB(11%)          131072KB        85%             95%             H
  CPU Utilization        0.00%                 100%            90%             95%             H

9800-Pair1#sh platform software status con brief
Load Average
Slot Status 1-Min 5-Min 15-Min
1-RP0 Healthy 2.37 2.44 2.35
2-RP0 Healthy 0.84 0.98 0.88

Memory (kB)
Slot Status Total Used (Pct) Free (Pct) Committed (Pct)
1-RP0 Healthy 64402224 11040252 (17%) 53361972 (83%) 18810548 (29%)
2-RP0 Healthy 64402224 7722344 (12%) 56679880 (88%) 15624048 (24%)

CPU Utilization
Slot CPU User System Nice Idle IRQ SIRQ IOwait
1-RP0 0 2.59 1.09 0.00 96.30 0.00 0.00 0.00
1 2.10 2.70 0.00 95.20 0.00 0.00 0.00
2 0.00 0.00 0.00 100.00 0.00 0.00 0.00
3 1.80 0.40 0.00 97.70 0.00 0.10 0.00
4 3.00 1.10 0.00 95.89 0.00 0.00 0.00
5 3.10 1.60 0.00 95.30 0.00 0.00 0.00
6 2.80 1.20 0.00 95.99 0.00 0.00 0.00
7 4.10 1.90 0.00 94.00 0.00 0.00 0.00
8 2.69 0.99 0.00 96.30 0.00 0.00 0.00
9 1.99 0.49 0.00 97.50 0.00 0.00 0.00
10 4.60 1.30 0.00 94.10 0.00 0.00 0.00
11 3.30 1.00 0.00 95.69 0.00 0.00 0.00
12 1.39 0.89 0.00 96.70 0.00 0.99 0.00
13 3.60 0.50 0.00 95.89 0.00 0.00 0.00
14 78.70 20.80 0.00 0.50 0.00 0.00 0.00
15 6.19 1.49 0.00 92.30 0.00 0.00 0.00
16 1.60 0.70 0.00 97.60 0.00 0.10 0.00
17 2.40 0.80 0.00 96.80 0.00 0.00 0.00
18 3.70 1.10 0.00 95.20 0.00 0.00 0.00
19 3.29 1.19 0.00 95.00 0.00 0.49 0.00
20 14.18 1.29 0.00 84.51 0.00 0.00 0.00
21 2.69 0.69 0.00 96.60 0.00 0.00 0.00
22 4.20 0.80 0.00 95.00 0.00 0.00 0.00
23 2.70 0.40 0.00 96.79 0.00 0.10 0.00
2-RP0 0 2.50 0.20 0.00 97.29 0.00 0.00 0.00
1 2.29 0.59 0.00 97.10 0.00 0.00 0.00
2 1.50 0.50 0.00 98.00 0.00 0.00 0.00
3 0.69 0.29 0.00 99.00 0.00 0.00 0.00
4 0.79 0.49 0.00 98.70 0.00 0.00 0.00
5 0.30 0.30 0.00 99.39 0.00 0.00 0.00
6 0.59 0.09 0.00 99.20 0.00 0.09 0.00
7 0.20 0.20 0.00 99.60 0.00 0.00 0.00
8 1.40 0.60 0.00 98.00 0.00 0.00 0.00
9 0.60 1.10 0.00 98.30 0.00 0.00 0.00
10 3.10 9.70 0.00 87.20 0.00 0.00 0.00
11 2.20 3.10 0.00 94.70 0.00 0.00 0.00
12 1.50 0.30 0.00 98.19 0.00 0.00 0.00
13 1.90 0.70 0.00 97.40 0.00 0.00 0.00
14 0.50 0.30 0.00 99.19 0.00 0.00 0.00
15 0.70 0.30 0.00 98.99 0.00 0.00 0.00
16 1.60 0.30 0.00 98.09 0.00 0.00 0.00
17 1.10 0.50 0.00 98.40 0.00 0.00 0.00
18 0.50 0.30 0.00 99.20 0.00 0.00 0.00
19 1.10 0.40 0.00 98.49 0.00 0.00 0.00
20 0.90 0.20 0.00 98.89 0.00 0.00 0.00
21 1.30 0.60 0.00 98.09 0.00 0.00 0.00
22 5.60 2.30 0.00 90.79 0.00 1.30 0.00
23 0.90 0.20 0.00 98.80 0.00 0.10 0.00
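
In the per-core table above, core 14 on 1-RP0 is at 0.50% idle while every other core is mostly idle. That would be consistent with a single process saturating one core while the overall Control Processor figure stays low (8.69% here). A minimal Python sketch for picking such cores out of the output follows. The function name `pegged_cores` and the 5% idle cutoff are illustrative assumptions, and the parser relies only on the column order shown in the header (CPU, User, System, Nice, Idle, ...).

```python
def pegged_cores(cli_output, max_idle=5.0):
    """Return (slot, core, idle %) for cores whose Idle column is <= max_idle.

    Hypothetical helper for 'show platform software status control-processor
    brief' output; the cutoff is an arbitrary illustration.
    """
    hits = []
    slot = None
    for line in cli_output.splitlines():
        tokens = line.split()
        if len(tokens) == 9:      # first row of a slot: "1-RP0 0 2.59 ..."
            candidate_slot, fields = tokens[0], tokens[1:]
        elif len(tokens) == 8:    # continuation row: "14 78.70 20.80 ..."
            candidate_slot, fields = slot, tokens
        else:
            continue
        try:
            core = int(fields[0])
            idle = float(fields[4])  # fields: CPU User System Nice Idle ...
        except ValueError:
            continue                 # header or memory row, not a core row
        slot = candidate_slot
        if idle <= max_idle:
            hits.append((slot, core, idle))
    return hits

sample = """\
Slot CPU User System Nice Idle IRQ SIRQ IOwait
1-RP0 0 2.59 1.09 0.00 96.30 0.00 0.00 0.00
13 3.60 0.50 0.00 95.89 0.00 0.00 0.00
14 78.70 20.80 0.00 0.50 0.00 0.00 0.00
2-RP0 0 2.50 0.20 0.00 97.29 0.00 0.00 0.00
"""

print(pegged_cores(sample))  # [('1-RP0', 14, 0.5)]
```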

Things are quieter at the moment since it's the weekend, only 800 clients. The wncd_x processes are all lower than they were before, but wncmgrd is still high.

9800-Pair1#show processes cpu platform sorted | i wnc|Name
Pid PPid 5Sec 1Min 5Min Status Size Name
19462 19453 99% 99% 98% R 1441128 wncmgrd
20415 20407 6% 5% 6% S 889024 wncd_7
19725 19717 6% 6% 6% S 1035692 wncd_1
20185 20177 4% 4% 4% S 960660 wncd_5
19840 19832 4% 4% 4% S 830784 wncd_2
20301 20292 3% 3% 3% S 944332 wncd_6
20070 20062 2% 2% 2% S 685180 wncd_4
19610 19602 2% 2% 2% S 883836 wncd_0
19955 19947 0% 4% 4% S 846876 wncd_3
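
To make the weekday/weekend comparison explicit, the two snapshots can be compared per process. The sketch below is Python and uses a few rows copied from the outputs in this thread. The helper names `five_min_cpu` and `cpu_delta` are made up for illustration.

```python
def five_min_cpu(cli_output):
    """Map process name -> 5Min CPU % from 'show processes cpu platform sorted' rows."""
    usage = {}
    for line in cli_output.splitlines():
        tokens = line.split()
        # Data rows have 8 columns: Pid PPid 5Sec 1Min 5Min Status Size Name
        if len(tokens) == 8 and tokens[0].isdigit() and tokens[4].endswith("%"):
            usage[tokens[7]] = int(tokens[4].rstrip("%"))
    return usage

def cpu_delta(before, after):
    """Return name -> (before %, after %, change) for processes in both snapshots."""
    a, b = five_min_cpu(before), five_min_cpu(after)
    return {name: (a[name], b[name], b[name] - a[name]) for name in a if name in b}

weekday = """\
19462 19453 99% 99% 99% R 1441524 wncmgrd
20415 20407 18% 19% 19% S 888876 wncd_7
19725 19717 17% 16% 16% S 1034624 wncd_1
"""
weekend = """\
19462 19453 99% 99% 98% R 1441128 wncmgrd
20415 20407 6% 5% 6% S 889024 wncd_7
19725 19717 6% 6% 6% S 1035692 wncd_1
"""

for name, (was, now, change) in cpu_delta(weekday, weekend).items():
    print(f"{name}: {was}% -> {now}% ({change:+d})")
```

In this sample the wncd_x values drop with the lower weekend load while wncmgrd barely moves, which suggests that whatever is driving wncmgrd is not tied to client load.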

Leo Laohoo · ‎09-09-2023

Even though I disagree, a Cisco staff member commented here in the forums that it is normal for the "wncd" process to hit 100% CPU utilization.

eglinsky2012 · ‎09-11-2023

I saw that, too, but I'm not sure if that applied to wncmgrd specifically.

I corrected the filtering issue and now all but ~50 APs in small one-off buildings with little foot traffic are being filtered to site tags. That process is still at 99% CPU. I'm opening a TAC case now.

eglinsky2012 · ‎12-22-2023

Update - TAC has correlated these two bugs:

CSCwe83994: UI Radio/Client page does not load data as Websocket IDs leaked (or) not cleaned properly (https://bst.cloudapps.cisco.com/bugsearch/bug/CSCwe83994).

CSCwf66661: sm_device_count_list takes too long to populate leading to websocket termination (https://bst.cloudapps.cisco.com/bugsearch/bug/CSCwf66661).

Focusing on the first one, even though it says 17.9.5 is affected, TAC says it's actually fixed in later revisions of 17.9.5. I was offered the 17.9.5 EFT (beta) version, but I've opted to just keep patching 17.9.4/a for now and avoid using Firefox after the reload. Part of that decision was that future 17.9.5 SMUs/APSPs cannot be installed on 17.9.5 EFT.

If I remember, someday after I've upgraded to 17.9.5 or later (or if an SMU comes out for this), I'll follow up. For now, I'm not going to mark this as the solution since I haven't verified if it is.

eglinsky2012 · ‎02-08-2024

TAC updated the documentation on CSCwe83994. Bad news, 17.9.5 is affected. Good news, it's fixed in 17.9.5! Haha. Since TAC previously told me that CSCwe83994 was going to be fixed in a later revision of 17.9.5, I'll believe the issue is actually fixed once 17.9.5 is released. In the meantime, seeing the same release listed as both affected and fixed is entertaining.

I will say that the wncmgrd high CPU issue went away for a few weeks after upgrading the controllers, and it stayed away for a week or so after students returned for the spring semester. It returned when I was working in the GUI and it suddenly became unresponsive, first on one WLC and later on the other. The GUI otherwise works fine and there don't seem to be any connectivity issues; it's just an occasional glitch that seems to trigger the CPU usage. Just waiting for 17.9.5 at this point.

Leo Laohoo · ‎02-08-2024

The takeaway from all this confirms my initial opinion about platforms running IOS-XE: reboot the platform/appliance every 6 to 12 months.

IOS-XE is a memory hog and leaks memory like a broken fire hydrant. For example, there are leaking processes that are commonly attributed to DNAC/PI (aka "telemetry" in Bug IDs) and/or DNA Spaces (aka nmspd).

Another thing is the AP "load" on the controller. We've been told (by the WNBU) not to "overload" the 9800-80 controller with more than 5,000 APs.

eglinsky2012 · ‎02-08-2024

We do code upgrades at least every 6 months, anyway, so they reboot then. We're eternally chasing the holy grail version that just works, but it never comes! The uptime on the controller was only about a month before wncmgrd spiked, and it happened several days after students returned. And we aren't nearly maxing these 9800-80s out; if I recall, they only have 1,500 APs, 10,000 clients, and 3Gbps throughput max each at this point.