10-08-2014 12:31 PM - edited 03-07-2019 09:02 PM
We are getting high CPU on at least 2 of our 3850 stacks. I spoke to TAC, and he noted:
CSCuo14511 fed and stack-mgr causing High CPU on 3850
He suggested upgrading to 03.03.04SE.
Has anyone seen this and solved it without an upgrade?
Thanks,
Tom
ST3-Stack1-3850#sho proc cpu sort | e 0.00
Core 0: CPU utilization for five seconds: 94%; one minute: 92%; five minutes: 90%
Core 1: CPU utilization for five seconds: 94%; one minute: 93%; five minutes: 94%
Core 2: CPU utilization for five seconds: 98%; one minute: 96%; five minutes: 93%
Core 3: CPU utilization for five seconds: 80%; one minute: 83%; five minutes: 89%
PID Runtime(ms) Invoked uSecs 5Sec 1Min 5Min TTY Process
5719 3399767 25647721 923 49.61 49.34 49.07 0 stack-mgr
5717 2287217 89195731 434 27.47 27.21 26.97 0 fed
10243 733637 16770188 371 13.71 13.54 13.70 34816 iosd
6250 40832 82186167 665 0.78 0.60 0.59 0 pdsd
10239 1299808 36806083 15 0.24 0.12 0.11 0 wcm
6261 692231 40238409 871 0.10 0.06 0.05 0 cpumemd
19 974400 53966893 33 0.05 0.03 0.04 0 sirq-net-rx/1
43 2452848 54120184 28 0.05 0.04 0.05 0 sirq-net-rx/3
5718 115760 32235590 20 0.05 0.09 0.10 0 platform_mgr
10240 1899130 59100638 3 0.05 0.04 0.02 0 table_mgr
ST3-Stack1-3850#sho ver
Cisco IOS Software, IOS-XE Software, Catalyst L3 Switch Software (CAT3K_CAA-UNIVERSALK9-M), Version 03.03.01SE RELEASE SOFTWARE (fc1)
Switch Ports Model SW Version SW Image Mode
------ ----- ----- ---------- ---------- ----
* 1 56 WS-C3850-48P 03.03.01SE cat3k_caa-universalk9 INSTALL
2 56 WS-C3850-48P 03.03.01SE cat3k_caa-universalk9 INSTALL
3 56 WS-C3850-48P 03.03.01SE cat3k_caa-universalk9 INSTALL
4 56 WS-C3850-48P 03.03.01SE cat3k_caa-universalk9 INSTALL
5 56 WS-C3850-48P 03.03.01SE cat3k_caa-universalk9 INSTALL
6 56 WS-C3850-48P 03.03.01SE cat3k_caa-universalk9 INSTALL
7 56 WS-C3850-48P 03.03.01SE cat3k_caa-universalk9 INSTALL
8 56 WS-C3850-48P 03.03.01SE cat3k_caa-universalk9 INSTALL
9 56 WS-C3850-48P 03.03.01SE cat3k_caa-universalk9 INSTALL
10-09-2014 11:44 PM
Hi,
That DDTS is a little misleading, and it is now closed. Could you share the SR number you have for this issue?
The stack-mgr process typically comes into play when you are syncing information between stacked members. However, we noted that in all the customer scenarios where this process was high, there was an underlying, genuine problem keeping stack-mgr busy - for example, constant MAC flushes.
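If you suspect constant MAC flushes, one quick (and admittedly crude) check is to sample the MAC table count a few times in a row; a total that keeps dropping and re-learning points at something flushing the table (generic prompt shown):
Switch#show mac address-table count
! repeat the command a few times and watch whether the total address count keeps dropping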
Furthermore, another DDTS was raised as a result of troubleshooting this issue: the CPU may show as high in the output of show process cpu when, in actuality, the CPU within the kernel itself is not high.
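To cross-check what the kernel itself sees, the per-core figures are available directly from the kernel; if these are low while show process cpu looks high, you are likely hitting that second DDTS (generic prompt shown):
Switch#show platform software status control-processor brief
! the per-core CPU and load-average figures here come from the kernel, not from process accounting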
Regards,
Aninda
10-14-2014 12:00 PM
Aninda,
Thank you for responding to this. I'm still very interested in working on this with you, as we've not resolved the problem. Here is some additional info:
SR: 632150191
SUMMARY: Cisco WS-C3850-48PW-S / high cpu
SEVERITY: 2
STB-Stack1-3850#sh sw stack-ports sum
Sw#/Port# Port Status Neighbor Cable Length Link OK Link Active Sync OK #Changes to LinkOK In Loopback
-------------------------------------------------------------------------------------------------------------------
1/1 OK 3 50cm Yes Yes Yes 2 No
1/2 OK 2 50cm Yes Yes Yes 2 No
2/1 OK 4 50cm Yes Yes Yes 3 No
2/2 OK 1 50cm Yes Yes Yes 1 No
3/1 OK 1 50cm Yes Yes Yes 1 No
3/2 OK 4 50cm Yes Yes Yes 2 No
4/1 OK 2 50cm Yes Yes Yes 1 No
4/2 OK 3 50cm Yes Yes Yes 1 No
show processes cpu sort | exclude 0.0
Core 0: CPU utilization for five seconds: 81%; one minute: 89%; five minutes: 89%
Core 1: CPU utilization for five seconds: 97%; one minute: 93%; five minutes: 94%
Core 2: CPU utilization for five seconds: 97%; one minute: 98%; five minutes: 96%
Core 3: CPU utilization for five seconds: 75%; one minute: 83%; five minutes: 84%
PID Runtime(ms) Invoked uSecs 5Sec 1Min 5Min TTY Process
5719 3166870 26072392 931 49.51 49.51 49.40 0 stack-mgr
10243 3499257 17273647 371 9.80 13.40 13.36 34816 iosd
6250 1596262 85203392 663 0.63 0.64 0.62 0 pdsd
10239 1908448 37138149 15 0.19 0.15 0.14 0 wcm
show processes cpu detailed process stack-mgr sorted | ex 0.0
Core 0: CPU utilization for five seconds: 97%; one minute: 88%; five minutes: 89%
Core 1: CPU utilization for five seconds: 98%; one minute: 97%; five minutes: 95%
Core 2: CPU utilization for five seconds: 86%; one minute: 91%; five minutes: 91%
Core 3: CPU utilization for five seconds: 90%; one minute: 86%; five minutes: 88%
PID T C TID Runtime(ms) Invoked uSecs 5Sec 1Min 5Min TTY Process
(%) (%) (%)
5719 L 2794632 2608303 931 48.90 49.41 49.39 0 stack-mgr
5719 L 1 6176 374555 1344232 0 24.52 24.47 24.50 0 Replenish OOB
5719 L 3 6177 60081 1584971 0 23.84 24.18 24.16 0 OOBnd RX
5719 L 0 6170 3353445 2096346 0 0.49 0.53 0.51 0 IntrDrv
show platform punt statistics port-asic 0 cpuq -1 direction rx
RX (ASIC2CPU) Stats (asic 0 qn 12 lqn 12):
RXQ 12: CPU_Q_BROADCAST
----------------------------------------
Packets received from ASIC : 1374620888
Send to IOSd total attempts : 1374620888
Send to IOSd failed count : 69500195
RX suspend count : 69500195
RX unsuspend count : 69500195
RX unsuspend send count : 69503516
RX unsuspend send failed count : 3321
RX dropped count : 0
RX conversion failure dropped : 0
RX pkt_hdr allocation failure : 0
RX INTACK count : 911308956
RX packets dq'd after intack : 4945892
Active RxQ event : 998246927
RX spurious interrupt : 107703495
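Those counters show CPU_Q_BROADCAST has taken over 1.3 billion packets, so the broadcast traffic itself looks like the underlying trigger. A crude way to find the ingress port is to clear the counters and watch which interface's broadcast counter climbs fastest (the filter shown is just one approach):
STB-Stack1-3850#clear counters
STB-Stack1-3850#show interfaces | include is up|broadcasts
! the interface whose broadcast count grows fastest is the likely source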
11-16-2015 09:10 AM
I was wondering if this issue has ever been resolved. We currently have several 3850 switches with this issue on 03.03.05SE.
Switch Ports Model SW Version SW Image Mode
------ ----- ----- ---------- ---------- ----
* 1 56 WS-C3850-48P 03.03.05SE cat3k_caa-universalk9 INSTALL
Liberty_IDF6#sh proc cpu | exc 0.00
Core 0: CPU utilization for five seconds: 83%; one minute: 80%; five minutes: 82%
Core 1: CPU utilization for five seconds: 18%; one minute: 36%; five minutes: 41%
Core 2: CPU utilization for five seconds: 59%; one minute: 25%; five minutes: 25%
PID Runtime(ms) Invoked uSecs 5Sec 1Min 5Min TTY Process
5724 1912954 385705783 558 25.5 25.5 25.6 1088 fed
5725 478337 406884338 26 0.14 0.15 0.15 0 platform_mgr
6254 1987650 2335622 851 0.05 0.01 0.05 0 oom_poll.sh
6262 2134176 153098207 59 23.7 23.2 23.0 0 pdsd
6273 2215234 49612008 823 0.05 0.04 0.05 0 cpumemd
8593 3310393 227047993 408 2.36 2.51 2.55 0 iosd
Liberty_IDF6#show process cpu detail process fed sorted | ex 0.0
Core 0: CPU utilization for five seconds: 91%; one minute: 82%; five minutes: 82%
Core 1: CPU utilization for five seconds: 44%; one minute: 37%; five minutes: 42%
Core 2: CPU utilization for five seconds: 9%; one minute: 13%; five minutes: 20%
Core 3: CPU utilization for five seconds: 61%; one minute: 77%; five minutes: 66%
PID T C TID Runtime(ms) Invoked uSecs 5Sec 1Min 5Min TTY Process
(%) (%) (%)
5724 L 1933744 3857072 558 25.91 25.59 25.62 1088 fed
5724 L 3 9201 3369654 3374801 0 24.35 24.22 24.24 0 PunjectRx
5724 L 1 9202 4233753 1715787 0 0.39 0.25 0.24 0 PunjectTx
5724 L 1 9139 741198 3708236 0 0.19 0.19 0.18 0 Xcvr
5724 L 2 6162 2584100 4882187 0 0.15 0.22 0.24 0 fed-ots-main
Liberty_IDF6#show process cpu detail process pdsd sorted | ex 0.0
Core 0: CPU utilization for five seconds: 84%; one minute: 81%; five minutes: 82%
Core 1: CPU utilization for five seconds: 77%; one minute: 39%; five minutes: 42%
Core 2: CPU utilization for five seconds: 94%; one minute: 21%; five minutes: 22%
Core 3: CPU utilization for five seconds: 57%; one minute: 78%; five minutes: 67%
PID T C TID Runtime(ms) Invoked uSecs 5Sec 1Min 5Min TTY Process
(%) (%) (%)
6262 L 2166826 1530987 59 22.79 22.94 23.01 0 pdsd
09-01-2016 11:00 AM
We had the same issue on many stacks and ended up upgrading the IOS to at least Version 03.06.04.E. The catch is that the CPU only seems to reach 99% months after a reboot/reload, so it's hard to troubleshoot then and there. But I can tell you we have stacks that have been running 03.06.04 for at least 5 months and are fine so far. If it does ramp up, I'll be sure to post an update for everyone.
02-05-2017 09:35 AM
We have this problem on IOS-XE 03.06.04.E. The upgrade did not resolve the issue.
07-19-2017 07:43 AM
Tried cat3k_caa-universalk9.SPA.03.07.05.E.152-3.E5.bin, and the problem still persists. Going back to the latest approved release, cat3k_caa-universalk9.SPA.03.06.06.E.152-2.E6.bin, to see if that helps.
10-27-2014 02:20 PM
Maybe it's worth mentioning that we are seeing this on the 3650 as well... We have 3 of them, with 2 in a stack, and another that has 4 in a stack. There haven't been any issues with performance, but manageability has been difficult, and when trying to automate tasks it at times even reports no response. We updated from 3.3 to 3.4 as suggested by TAC, but that hasn't resolved the issue. It appeared fine for a week or so...
It appears that there is a lot of traffic being punted to the CPU.
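For anyone following along, the per-queue punt counters shown earlier in the thread can be filtered down to just the queue names and receive counts to see which queue is taking the hit (hostname here is an example):
3650-Stack#show platform punt statistics port-asic 0 cpuq -1 direction rx | include RXQ|received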
10-28-2014 06:16 AM
Craig,
Thanks for the additional information. I hope TAC gets the message that an upgrade does not solve this issue.
08-27-2015 10:07 AM
Was this case ever resolved? We are seeing this in a stack of 3850s running version 03.03.05SE.
10-01-2018 11:28 AM
All,
I started this discussion 4 years ago. Because of some 802.1x incompatibilities, we were forced to move from 03.03.01SE to 16.3.6. Here are the results:
ST3-3850#sho proc cpu platform
CPU utilization for five seconds: 33%, one minute: 38%, five minutes: 36%
Core 0: CPU utilization for five seconds: 31%, one minute: 38%, five minutes: 36%
Core 1: CPU utilization for five seconds: 29%, one minute: 37%, five minutes: 36%
Core 2: CPU utilization for five seconds: 31%, one minute: 38%, five minutes: 36%
Core 3: CPU utilization for five seconds: 30%, one minute: 39%, five minutes: 36%
We were able to do the upgrades with a single reboot; the preloading and installation can be done during production time. The reboot in all cases took from 12 minutes for a 2-switch stack to 13 1/2 minutes for a 9-switch stack.
We have been upgrading for 2 weeks now and have done 9 stacks. In 2 cases we needed to reboot or power off 1 switch in the stack. Most of the stacks had been up for 4 years without a code change or reboot... so 1 or 2 switch restarts out of 60 is to be expected.
We followed this YouTube video, which uses Cisco-recommended commands:
https://www.youtube.com/watch?v=bX006rPu4pA
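For reference, on the 3.x trains in INSTALL mode the preload/upgrade is driven by the software install command; the image name below is only an example, and depending on your starting release an intermediate upgrade may be required, so check the release notes first:
ST3-3850#software install file flash:cat3k_caa-universalk9.16.03.06.SPA.bin
! preloads the packages across the stack; the reload is what actually activates the new code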
Thank you all for following this discussion
02-09-2016 09:03 AM
We have a few dozen 3850s out there now, all in stacks, and they all exhibit the same high CPU caused by stack-mgr. We have tried different IOS versions, and on all of them the CPU seems to creep up until it is very high - in some cases all cores showing 99%.
There IS a big bug open about this; you should check it out. It looks like they attributed the issue to 3 underlying problems/misconfigurations: two are MAC-related, and the other is spanning-tree recalculations. We just found this today, so we are going to investigate all 3 conditions. We believe we have a lot of STP recalculations, so we'll start there (see the commands below the bug details).
https://tools.cisco.com/bugsearch/bug/CSCuo14511
Known Affected Releases: (1)
Known Fixed Releases: (0)
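For the STP condition, a quick way to see how many topology changes each VLAN has had and which port the last one arrived from (generic prompt shown):
Switch#show spanning-tree detail | include ieee|occurr|from|is exec
! per VLAN, shows the topology change count, when the last change occurred, and the port it came from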
09-20-2017 02:54 PM - edited 09-20-2017 02:57 PM
A standalone 3850 with 03.06.03E had the issue, with the following errors:
Sep 15 20:39:49.834: %SYS-3-CPUHOG: Task is running for (2590)msecs, more than (2000)msecs (40/40),process = SNMP ENGINE.
-Traceback= 1#b078619f1ef66f14ec3b7b78d242667b :54C0E000+3F31FC4 snmp_db:3174D000+22F44 :54C0E000+151C040 :54C0E000+151916C :54C0E000+3BC7300 :54C0E000+3BC7650 :54C0E000+3BC5214 :54C0E000+3B8D594 :54C0E000+3B7A25C :54C0E000+3BAF7CC :54C0E000+3F07C3C
*Sep 15 20:39:50.875: %SYS-3-CPUHOG: Task is running for (3630)msecs, more than (2000)msecs (40/40),process = SNMP ENGINE.
-Traceback= 1#b078619f1ef66f14ec3b7b78d242667b :54C0E000+3F52D84 :54C0E000+3F2C5A4 :54C0E000+3F36848 :54C0E000+151B4A4 :54C0E000+151B6F4 :54C0E000+151C1D8 :54C0E000+1519008 :54C0E000+3BC7300 :54C0E000+3BC7650 :54C0E000+3BC5214 :54C0E000+3B8D594
*Sep 15 20:39:53.321: %SNMP-3-CPUHOG: Processing GetBulk of bsnMobileStationPortNumber
*Sep 15 20:39:53.910: %SYS-3-CPUHOG: Task is running for (2710)msecs, more than (2000)msecs (0/0),process = SNMP ENGINE.
-Traceback= 1#b078619f1ef66f14ec3b7b78d242667b :54C0E000+3F52AB8 :54C0E000+3F2C5A4 :54C0E000+3F36848 :54C0E000+151B4F4 :54C0E000+151B6F4 :54C0E000+151C1D8 :54C0E000+1519008 :54C0E000+3BC7300 :54C0E000+3BC7650 :54C0E000+3BC5214 :54C0E000+3B8D594
*Sep 15 20:39:54.914: %SYS-3-CPUHOG: Task is running for (3710)msecs, more than (2000)msecs (0/0),process = SNMP ENGINE.
-Traceback= 1#b078619f1ef66f14ec3b7b78d242667b pthread:31AC9000+8DEC
*Sep 15 20:44:51.414: %SNMP-3-CPUHOG: Processing GetBulk of bsnMobileStationPortNumber
#########
Cisco has this bug listed under Cisco 5700 Series Wireless LAN Controllers: Cisco bug ID is CSCuy15293.
#########
Added the following to our config for the standalone:
>>> snmp-server view CSCuy15293 iso included
>>> snmp-server view CSCuy15293 enterprises.14179.2.1.4.1.19 excluded
>>> snmp-server community test view CSCuy15293 RW
Note - the view *must* be applied to all available communities in order to fully prevent the OID from being polled.
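As a sketch of that note, if a switch also had, say, a read-only community called public, it would need the same view attached (community names here are examples):
>>> snmp-server community public view CSCuy15293 RO
>>> snmp-server community test view CSCuy15293 RW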
I tried the commands above for our community and it did not work.
#########
Killed the SNMP engine on the switch, and the CPUs are now fine. Changed Orion NPM to ICMP only until the issue is resolved.
#no snmp-server
NOTE: None of our 03.06.06E 3850s are having issues, and all are being monitored via SNMP through Orion NPM. Thinking an upgrade would remedy this issue; if it does not, it is definitely some sort of hardware issue on this switch.