
Cisco 3850x Very High CPU - Stack-mgr Process

TOM FRANCHINA
Level 1

We are seeing high CPU on at least 2 of our 3850X stacks. We spoke to TAC, who noted:

CSCuo14511    fed and stack-mgr causing High CPU on 3850

They suggested upgrading to 03.03.04SE.

Has anyone seen this and solved it without an upgrade?

Thanks,

Tom

 

ST3-Stack1-3850#sho proc cpu sort | e 0.00
Core 0: CPU utilization for five seconds: 94%; one minute: 92%;  five minutes: 90%
Core 1: CPU utilization for five seconds: 94%; one minute: 93%;  five minutes: 94%
Core 2: CPU utilization for five seconds: 98%; one minute: 96%;  five minutes: 93%
Core 3: CPU utilization for five seconds: 80%; one minute: 83%;  five minutes: 89%
PID    Runtime(ms) Invoked  uSecs  5Sec     1Min     5Min     TTY   Process
5719   3399767     25647721 923    49.61    49.34    49.07    0     stack-mgr
5717   2287217     89195731 434    27.47    27.21    26.97    0     fed
10243  733637      16770188 371    13.71    13.54    13.70    34816 iosd
6250   40832       82186167 665    0.78     0.60     0.59     0     pdsd
10239  1299808     36806083 15     0.24     0.12     0.11     0     wcm
6261   692231      40238409 871    0.10     0.06     0.05     0     cpumemd
19     974400      53966893 33     0.05     0.03     0.04     0     sirq-net-rx/1
43     2452848     54120184 28     0.05     0.04     0.05     0     sirq-net-rx/3
5718   115760      32235590 20     0.05     0.09     0.10     0     platform_mgr
10240  1899130     59100638 3      0.05     0.04     0.02     0     table_mgr

ST3-Stack1-3850#sho ver
Cisco IOS Software, IOS-XE Software, Catalyst L3 Switch Software (CAT3K_CAA-UNIVERSALK9-M), Version 03.03.01SE RELEASE SOFTWARE (fc1)

Switch Ports Model              SW Version        SW Image              Mode
------ ----- -----              ----------        ----------            ----
*    1 56    WS-C3850-48P       03.03.01SE        cat3k_caa-universalk9 INSTALL
     2 56    WS-C3850-48P       03.03.01SE        cat3k_caa-universalk9 INSTALL
     3 56    WS-C3850-48P       03.03.01SE        cat3k_caa-universalk9 INSTALL
     4 56    WS-C3850-48P       03.03.01SE        cat3k_caa-universalk9 INSTALL
     5 56    WS-C3850-48P       03.03.01SE        cat3k_caa-universalk9 INSTALL
     6 56    WS-C3850-48P       03.03.01SE        cat3k_caa-universalk9 INSTALL
     7 56    WS-C3850-48P       03.03.01SE        cat3k_caa-universalk9 INSTALL
     8 56    WS-C3850-48P       03.03.01SE        cat3k_caa-universalk9 INSTALL
     9 56    WS-C3850-48P       03.03.01SE        cat3k_caa-universalk9 INSTALL
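
For anyone comparing several stacks, a minimal Python sketch like the one below can pull the top consumers out of saved `show processes cpu sorted` output. The function name `top_processes` and the 5% threshold are assumptions for illustration, and the sample rows are copied from the output above. It only handles the plain (non-detailed) row layout shown here.

```python
import re

# Matches rows such as:
# 5719   3399767     25647721 923    49.61    49.34    49.07    0     stack-mgr
ROW = re.compile(
    r"^\s*(\d+)\s+(\d+)\s+(\d+)\s+(\d+)\s+"       # PID, Runtime(ms), Invoked, uSecs
    r"([\d.]+)\s+([\d.]+)\s+([\d.]+)\s+(\d+)\s+"   # 5Sec, 1Min, 5Min, TTY
    r"(\S+)\s*$"                                   # Process name (no spaces)
)


def top_processes(cli_output, threshold=5.0):
    """Return (process, five_min_pct) pairs at or above threshold, highest first."""
    hits = []
    for line in cli_output.splitlines():
        m = ROW.match(line)
        if m and float(m.group(7)) >= threshold:
            hits.append((m.group(9), float(m.group(7))))
    return sorted(hits, key=lambda pair: pair[1], reverse=True)


# Sample rows copied from the output posted above.
sample = """\
PID    Runtime(ms) Invoked  uSecs  5Sec     1Min     5Min     TTY   Process
5719   3399767     25647721 923    49.61    49.34    49.07    0     stack-mgr
5717   2287217     89195731 434    27.47    27.21    26.97    0     fed
10243  733637      16770188 371    13.71    13.54    13.70    34816 iosd
6250   40832       82186167 665    0.78     0.60     0.59     0     pdsd
"""

for name, pct in top_processes(sample):
    print(f"{name}: {pct:.2f}% (5 min)")
```

Header and per-core lines are skipped because they do not start with a numeric PID.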

 


Aninda Chatterjee
Cisco Employee

Hi,

That DDTS is a little misleading. It is now closed. May I know the SR number that you have for this issue?

The stack-mgr process typically comes into play when you are syncing information between stack members. However, we noted that in all the customer scenarios where this was high, there was an underlying, genuine problem that kept stack-mgr high; constant MAC flushes are one example.

Furthermore, another DDTS was raised as a result of our troubleshooting this issue: the CPU may appear high in the output of show process cpu even though, within the kernel itself, CPU utilization is not actually high.

Regards,

Aninda

Aninda,

Thank you for responding to this. I'm still very interested in working on this with you, as we have not resolved the problem. Here is some additional info:

SR: 632150191
SUMMARY: Cisco WS-C3850-48PW-S / high cpu
SEVERITY: 2

 

 

STB-Stack1-3850#sh sw stack-ports sum

Sw#/Port#  Port Status  Neighbor  Cable Length   Link OK   Link Active   Sync OK   #Changes to LinkOK  In Loopback
-------------------------------------------------------------------------------------------------------------------
1/1        OK           3         50cm           Yes       Yes           Yes       2                   No
1/2        OK           2         50cm           Yes       Yes           Yes       2                   No
2/1        OK           4         50cm           Yes       Yes           Yes       3                   No
2/2        OK           1         50cm           Yes       Yes           Yes       1                   No
3/1        OK           1         50cm           Yes       Yes           Yes       1                   No
3/2        OK           4         50cm           Yes       Yes           Yes       2                   No
4/1        OK           2         50cm           Yes       Yes           Yes       1                   No
4/2        OK           3         50cm           Yes       Yes           Yes       1                   No

show processes cpu sort | exclude 0.0

Core 0: CPU utilization for five seconds: 81%; one minute: 89%;  five minutes: 89%
Core 1: CPU utilization for five seconds: 97%; one minute: 93%;  five minutes: 94%
Core 2: CPU utilization for five seconds: 97%; one minute: 98%;  five minutes: 96%
Core 3: CPU utilization for five seconds: 75%; one minute: 83%;  five minutes: 84%
PID    Runtime(ms) Invoked  uSecs  5Sec     1Min     5Min     TTY   Process
5719   3166870     26072392 931    49.51    49.51    49.40    0     stack-mgr
10243  3499257     17273647 371    9.80     13.40    13.36    34816 iosd
6250   1596262     85203392 663    0.63     0.64     0.62     0     pdsd
10239  1908448     37138149 15     0.19     0.15     0.14     0     wcm

show processes cpu detailed process stack-mgr sorted | ex 0.0

Core 0: CPU utilization for five seconds: 97%; one minute: 88%; five minutes: 89%
Core 1: CPU utilization for five seconds: 98%; one minute: 97%; five minutes: 95%
Core 2: CPU utilization for five seconds: 86%; one minute: 91%; five minutes: 91%
Core 3: CPU utilization for five seconds: 90%; one minute: 86%; five minutes: 88%
PID    T C  TID    Runtime(ms) Invoked uSecs  5Sec      1Min     5Min     TTY   Process
                                               (%)       (%)      (%)
5719   L           2794632     2608303 931    48.90     49.41   49.39   0     stack-mgr
5719   L 1  6176   374555      1344232 0      24.52     24.47   24.50   0     Replenish OOB
5719   L 3  6177   60081       1584971 0      23.84     24.18   24.16   0     OOBnd RX
5719   L 0  6170   3353445     2096346 0      0.49      0.53    0.51    0     IntrDrv

show platform punt statistics port-asic 0 cpuq -1 direction rx

RX (ASIC2CPU) Stats (asic 0 qn 12 lqn 12):
RXQ 12: CPU_Q_BROADCAST
----------------------------------------
Packets received from ASIC     : 1374620888
Send to IOSd total attempts    : 1374620888
Send to IOSd failed count      : 69500195
RX suspend count               : 69500195
RX unsuspend count             : 69500195
RX unsuspend send count        : 69503516
RX unsuspend send failed count : 3321
RX dropped count               : 0
RX conversion failure dropped  : 0
RX pkt_hdr allocation failure  : 0
RX INTACK count                : 911308956
RX packets dq'd after intack   : 4945892
Active RxQ event               : 998246927
RX spurious interrupt          : 107703495
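
To put the counters above in proportion, here is a rough Python sketch that parses the `name : value` lines and computes a couple of ratios. The helper names `parse_punt_counters` and `ratio` are made up for this example, and reading "failed count" over "total attempts" as a failure rate is my own arithmetic, not a Cisco-documented metric.

```python
import re


def parse_punt_counters(cli_output):
    """Map each 'name : value' counter line to an int, ignoring other lines."""
    counters = {}
    for line in cli_output.splitlines():
        m = re.match(r"^\s*(.+?)\s*:\s*(\d+)\s*$", line)
        if m:
            counters[m.group(1)] = int(m.group(2))
    return counters


def ratio(counters, numerator, denominator):
    """Return numerator/denominator as a float, or None if it cannot be computed."""
    if not counters.get(denominator):
        return None
    return counters.get(numerator, 0) / counters[denominator]


# Counter lines copied from the CPU_Q_BROADCAST output posted above.
sample = """\
RXQ 12: CPU_Q_BROADCAST
Packets received from ASIC     : 1374620888
Send to IOSd total attempts    : 1374620888
Send to IOSd failed count      : 69500195
Active RxQ event               : 998246927
RX spurious interrupt          : 107703495
"""

counters = parse_punt_counters(sample)
failed = ratio(counters, "Send to IOSd failed count", "Send to IOSd total attempts")
spurious = ratio(counters, "RX spurious interrupt", "Active RxQ event")
print(f"Send to IOSd failures: {failed:.2%} of attempts")
print(f"Spurious interrupts:   {spurious:.2%} of active RxQ events")
```

On these numbers, roughly 5% of the sends to IOSd on the broadcast queue failed. Comparing the same ratios across two captures taken some minutes apart would show whether the counters are still climbing.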

I was wondering if this issue has been resolved. We currently have several 3850 switches with this issue, running version 03.03.05SE.

Switch Ports Model              SW Version        SW Image              Mode
------ ----- -----              ----------        ----------            ----
*    1 56    WS-C3850-48P       03.03.05SE        cat3k_caa-universalk9 INSTALL

Liberty_IDF6#sh proc cpu | exc 0.00
Core 0: CPU utilization for five seconds: 83%; one minute: 80%; five minutes: 82%
Core 1: CPU utilization for five seconds: 18%; one minute: 36%; five minutes: 41%
Core 2: CPU utilization for five seconds: 59%; one minute: 25%; five minutes: 25%
PID    Runtime(ms) Invoked   uSecs  5Sec     1Min     5Min     TTY   Process
5724   1912954     385705783 558    25.5     25.5     25.6     1088  fed
5725   478337      406884338 26     0.14     0.15     0.15     0     platform_mgr
6254   1987650     2335622   851    0.05     0.01     0.05     0     oom_poll.sh
6262   2134176     153098207 59     23.7     23.2     23.0     0     pdsd
6273   2215234     49612008  823    0.05     0.04     0.05     0     cpumemd
8593   3310393     227047993 408    2.36     2.51     2.55     0     iosd

Liberty_IDF6#show process cpu detail process fed sorted | ex 0.0
Core 0: CPU utilization for five seconds: 91%; one minute: 82%; five minutes: 82%
Core 1: CPU utilization for five seconds: 44%; one minute: 37%; five minutes: 42%
Core 2: CPU utilization for five seconds: 9%; one minute: 13%; five minutes: 20%
Core 3: CPU utilization for five seconds: 61%; one minute: 77%; five minutes: 66%
PID    T C  TID    Runtime(ms) Invoked uSecs  5Sec      1Min     5Min     TTY   Process
                                              (%)       (%)      (%)
5724   L           1933744     3857072 558    25.91     25.59    25.62    1088  fed
5724   L 3  9201   3369654     3374801 0      24.35     24.22    24.24    0     PunjectRx
5724   L 1  9202   4233753     1715787 0      0.39      0.25     0.24     0     PunjectTx
5724   L 1  9139   741198      3708236 0      0.19      0.19     0.18     0     Xcvr
5724   L 2  6162   2584100     4882187 0      0.15      0.22     0.24     0     fed-ots-main

Liberty_IDF6#show process cpu detail process pdsd sorted | ex 0.0
Core 0: CPU utilization for five seconds: 84%; one minute: 81%; five minutes: 82%
Core 1: CPU utilization for five seconds: 77%; one minute: 39%; five minutes: 42%
Core 2: CPU utilization for five seconds: 94%; one minute: 21%; five minutes: 22%
Core 3: CPU utilization for five seconds: 57%; one minute: 78%; five minutes: 67%
PID    T C  TID    Runtime(ms) Invoked uSecs  5Sec      1Min     5Min     TTY   Process
                                              (%)       (%)      (%)
6262   L           2166826     1530987 59     22.79     22.94    23.01    0     pdsd

We had the same issue on many stacks and ended up upgrading IOS-XE to at least version 03.06.04.E. The difficulty is that the CPU only seems to reach 99% months after a reload, so it is hard to troubleshoot when it happens. That said, we have stacks that have been running 03.06.04 for at least 5 months and they are fine so far. If CPU does ramp up, I'll be sure to post an update.

We have this problem on IOS-XE 03.06.04.E. The upgrade did not resolve the issue.

Tried cat3k_caa-universalk9.SPA.03.07.05.E.152-3.E5.bin; the problem still persists. Going back to the latest approved image, cat3k_caa-universalk9.SPA.03.06.06.E.152-2.E6.bin, to see if that helps.

craigbr
Level 1

Maybe it's worth mentioning that we are seeing this on the 3650x as well... We have 3 with 2 in a stack and another that has 4 in a stack. There haven't been any performance issues, but manageability has been difficult, and automated tasks sometimes fail with a report that there was no response. We upgraded from 3.3 to 3.4 as suggested by TAC, but that hasn't resolved the issue. It appeared fine for a week or so.

It appears that there is a lot of punting TX to the CPU.

Craig,

Thanks for the additional information. I hope TAC gets the message that an upgrade does not solve this issue. 

David Sell
Level 1

Was this case ever resolved? We are seeing this in a stack of 3850s running version 03.03.05SE.

All,

 

I started this discussion 4 years ago. Because of some 802.1x incompatibilities, we were forced to move from 03.03.01SE to 16.3.6. Here are the results:

 

ST3-3850#sho proc cpu platform
CPU utilization for five seconds: 33%, one minute: 38%, five minutes: 36%
Core 0: CPU utilization for five seconds: 31%, one minute: 38%, five minutes: 36%
Core 1: CPU utilization for five seconds: 29%, one minute: 37%, five minutes: 36%
Core 2: CPU utilization for five seconds: 31%, one minute: 38%, five minutes: 36%
Core 3: CPU utilization for five seconds: 30%, one minute: 39%, five minutes: 36%

 

We were able to do the upgrades with a single reboot. The preloading and installation can be done during production hours. Reboots took from 12 minutes for a 2-switch stack to 13.5 minutes for a 9-switch stack.

 

We have been upgrading for 2 weeks now and have done 9 stacks. In 2 cases we needed to reboot or power off 1 switch in the stack. Most of the stacks had been up for 4 years without a code change or reboot, so 1 or 2 switch restarts out of 60 is to be expected.

 

We followed this YouTube video, which uses Cisco-recommended commands:

 

https://www.youtube.com/watch?v=bX006rPu4pA

 

Thank you all for following this discussion  

d.lachapelle
Level 1

We have a few dozen 3850s out there now, all in stacks, and they all exhibit the same high CPU caused by stack-mgr. We have tried different IOS versions, and on every one the CPU creeps up until it is very high, in some cases with all cores showing 99%.

There IS a big bug open about this; you should check it out. It looks like they attributed the issue to 3 underlying problems/misconfigurations: two are MAC related and the third is spanning-tree recalculations. We just found this today, so we are going to investigate all 3 conditions. We believe we have a lot of STP recalculations, so we'll start there.

 https://tools.cisco.com/bugsearch/bug/CSCuo14511

fed and stack-mgr causing High CPU on 3850
CSCuo14511
Description
Symptom:
'stack-mgr' process shows high (>75%) CPU utilization. No packet forwarding impact observed in the switch

Conditions:
Observed conditions that were true for this to occur are:
Frequent mac flapping
Aggressive mac-aging timer configuration - less than or equal to 15 seconds
Topology Change Notification due to frequent Spanning-Tree changes or spanning-tree misconfiguration in the network

Workaround:
Eliminate or fix the configuration errors/events triggering the conditions mentioned above.

Further Problem Description:
Details
Last Modified: Feb 8, 2016
Status: Terminated
Severity: 2 Severe
Product: Cisco Catalyst 3850 Series Switches
Support Cases: 72
Known Affected Releases: (1) 15.0(1)EZ
Known Fixed Releases: (0) No release planned to fix this bug
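
Of the three conditions in the bug, the aggressive aging timer is the easiest to check offline. Below is a minimal Python sketch that scans a saved running-config for `mac address-table aging-time` values at or below 15 seconds. The function name `aggressive_aging` and the sample config (including the hostname) are hypothetical, and the sketch assumes a value of 0 means aging is disabled rather than aggressive.

```python
import re

# Accepts both "mac address-table" and the older "mac-address-table" spelling,
# with an optional per-VLAN scope.
AGING = re.compile(r"^\s*mac[- ]address-table aging-time (\d+)(?: vlan (\d+))?\s*$")


def aggressive_aging(config_text, limit=15):
    """Return (scope, seconds) for every aging-time at or below limit seconds."""
    findings = []
    for line in config_text.splitlines():
        m = AGING.match(line)
        if not m:
            continue
        seconds = int(m.group(1))
        # Assumption: 0 disables aging, so it is not treated as aggressive.
        if 0 < seconds <= limit:
            scope = f"vlan {m.group(2)}" if m.group(2) else "global"
            findings.append((scope, seconds))
    return findings


# Hypothetical config fragment for illustration only.
sample_config = """\
hostname example-3850
mac address-table aging-time 10 vlan 20
mac address-table aging-time 300
"""

print(aggressive_aging(sample_config))
```

MAC flapping and topology change notifications are runtime events rather than configuration, so they would need to be checked in the logs and spanning-tree output instead.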

A standalone 3850 running 03.06.03E had the issue, with the following errors logged:

 

Sep 15 20:39:49.834: %SYS-3-CPUHOG: Task is running for (2590)msecs, more than (2000)msecs (40/40),process = SNMP ENGINE.
-Traceback= 1#b078619f1ef66f14ec3b7b78d242667b :54C0E000+3F31FC4 snmp_db:3174D000+22F44 :54C0E000+151C040 :54C0E000+151916C :54C0E000+3BC7300 :54C0E000+3BC7650 :54C0E000+3BC5214 :54C0E000+3B8D594 :54C0E000+3B7A25C :54C0E000+3BAF7CC :54C0E000+3F07C3C
*Sep 15 20:39:50.875: %SYS-3-CPUHOG: Task is running for (3630)msecs, more than (2000)msecs (40/40),process = SNMP ENGINE.
-Traceback= 1#b078619f1ef66f14ec3b7b78d242667b :54C0E000+3F52D84 :54C0E000+3F2C5A4 :54C0E000+3F36848 :54C0E000+151B4A4 :54C0E000+151B6F4 :54C0E000+151C1D8 :54C0E000+1519008 :54C0E000+3BC7300 :54C0E000+3BC7650 :54C0E000+3BC5214 :54C0E000+3B8D594
*Sep 15 20:39:53.321: %SNMP-3-CPUHOG: Processing GetBulk of bsnMobileStationPortNumber
*Sep 15 20:39:53.910: %SYS-3-CPUHOG: Task is running for (2710)msecs, more than (2000)msecs (0/0),process = SNMP ENGINE.
-Traceback= 1#b078619f1ef66f14ec3b7b78d242667b :54C0E000+3F52AB8 :54C0E000+3F2C5A4 :54C0E000+3F36848 :54C0E000+151B4F4 :54C0E000+151B6F4 :54C0E000+151C1D8 :54C0E000+1519008 :54C0E000+3BC7300 :54C0E000+3BC7650 :54C0E000+3BC5214 :54C0E000+3B8D594
*Sep 15 20:39:54.914: %SYS-3-CPUHOG: Task is running for (3710)msecs, more than (2000)msecs (0/0),process = SNMP ENGINE.
-Traceback= 1#b078619f1ef66f14ec3b7b78d242667b pthread:31AC9000+8DEC
*Sep 15 20:44:51.414: %SNMP-3-CPUHOG: Processing GetBulk of bsnMobileStationPortNumber

 

#########

 

Cisco lists this bug under Cisco 5700 Series Wireless LAN Controllers; the Cisco bug ID is CSCuy15293.

 

#########

 

Added the following to our config for the standalone:

 

>>> snmp-server view CSCuy15293 iso included
>>> snmp-server view CSCuy15293 enterprises.14179.2.1.4.1.19 excluded
>>> snmp-server community test view CSCuy15293 RW

 

Note - the view *must* be applied to all available communities in order to fully prevent the OID from being polled.
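
Because the view has to be attached to every community, it may help to generate the lines rather than type them by hand. This is only a sketch in Python: the function name `snmp_view_commands` and the second community name `monitor` are assumptions, while the view name, OID, and command forms are taken from the lines above.

```python
def snmp_view_commands(view_name, excluded_oid, communities):
    """Build the view definition plus one community line per (name, access) pair."""
    lines = [
        f"snmp-server view {view_name} iso included",
        f"snmp-server view {view_name} {excluded_oid} excluded",
    ]
    for name, access in communities:
        if access not in ("RO", "RW"):
            raise ValueError(f"unexpected access level for {name}: {access}")
        lines.append(f"snmp-server community {name} view {view_name} {access}")
    return lines


# "test" comes from the example above; "monitor" is a hypothetical second community.
commands = snmp_view_commands(
    "CSCuy15293",
    "enterprises.14179.2.1.4.1.19",
    [("test", "RW"), ("monitor", "RO")],
)
print("\n".join(commands))
```

The output is plain config text to review before pasting; the sketch does not connect to the switch or discover which communities actually exist.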

 

I tried the commands above for our community and it did not work.

 

#########

 

Killed the SNMP engine on the switch and the CPUs are now fine. Changed Orion NPM to ICMP-only monitoring until the issue is resolved.

 

#no snmp-server

 

 

NOTE: None of our 03.06.06E 3850s are having issues, and all of them are monitored via SNMP through Orion NPM. We think an upgrade would remedy this issue. If it does not, it is definitely some sort of hardware issue on this switch.
