
Cisco 3850x Very High CPU - Stack-mgr Process

TOM FRANCHINA

We are seeing high CPU on at least two of our 3850x stacks. I spoke to TAC, who noted:

CSCuo14511    fed and stack-mgr causing High CPU on 3850

They suggested upgrading to 03.03.04SE.

Has anyone seen this and solved it without an upgrade?

Thanks,

Tom

 

ST3-Stack1-3850#sho proc cpu sort | e 0.00
Core 0: CPU utilization for five seconds: 94%; one minute: 92%;  five minutes: 90%
Core 1: CPU utilization for five seconds: 94%; one minute: 93%;  five minutes: 94%
Core 2: CPU utilization for five seconds: 98%; one minute: 96%;  five minutes: 93%
Core 3: CPU utilization for five seconds: 80%; one minute: 83%;  five minutes: 89%
PID    Runtime(ms) Invoked  uSecs  5Sec     1Min     5Min     TTY   Process
5719   3399767     25647721 923    49.61    49.34    49.07    0     stack-mgr
5717   2287217     89195731 434    27.47    27.21    26.97    0     fed
10243  733637      16770188 371    13.71    13.54    13.70    34816 iosd
6250   40832       82186167 665    0.78     0.60     0.59     0     pdsd
10239  1299808     36806083 15     0.24     0.12     0.11     0     wcm
6261   692231      40238409 871    0.10     0.06     0.05     0     cpumemd
19     974400      53966893 33     0.05     0.03     0.04     0     sirq-net-rx/1
43     2452848     54120184 28     0.05     0.04     0.05     0     sirq-net-rx/3
5718   115760      32235590 20     0.05     0.09     0.10     0     platform_mgr
10240  1899130     59100638 3      0.05     0.04     0.02     0     table_mgr

ST3-Stack1-3850#sho ver
Cisco IOS Software, IOS-XE Software, Catalyst L3 Switch Software (CAT3K_CAA-UNIVERSALK9-M), Version 03.03.01SE RELEASE SOFTWARE (fc1)

Switch Ports Model              SW Version        SW Image              Mode
------ ----- -----              ----------        ----------            ----
*    1 56    WS-C3850-48P       03.03.01SE        cat3k_caa-universalk9 INSTALL
     2 56    WS-C3850-48P       03.03.01SE        cat3k_caa-universalk9 INSTALL
     3 56    WS-C3850-48P       03.03.01SE        cat3k_caa-universalk9 INSTALL
     4 56    WS-C3850-48P       03.03.01SE        cat3k_caa-universalk9 INSTALL
     5 56    WS-C3850-48P       03.03.01SE        cat3k_caa-universalk9 INSTALL
     6 56    WS-C3850-48P       03.03.01SE        cat3k_caa-universalk9 INSTALL
     7 56    WS-C3850-48P       03.03.01SE        cat3k_caa-universalk9 INSTALL
     8 56    WS-C3850-48P       03.03.01SE        cat3k_caa-universalk9 INSTALL
     9 56    WS-C3850-48P       03.03.01SE        cat3k_caa-universalk9 INSTALL
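
For anyone collecting this output from several stacks, here is a minimal Python sketch that picks the busiest processes out of pasted `show proc cpu sort` text like the output above. The function names (`parse_process_rows`, `top_processes`) are made up for illustration, and the regex assumes the nine-column layout shown in this post; other IOS-XE releases may format the table differently.

```python
import re

# One row of the process table: PID, Runtime(ms), Invoked, uSecs,
# 5Sec, 1Min, 5Min, TTY, Process (column layout assumed from the post).
ROW_RE = re.compile(
    r"^\s*(\d+)\s+(\d+)\s+(\d+)\s+(\d+)\s+"
    r"(\d+\.\d+)\s+(\d+\.\d+)\s+(\d+\.\d+)\s+(\d+)\s+(\S.*?)\s*$"
)

def parse_process_rows(text):
    """Return a list of dicts, one per process row found in the text."""
    rows = []
    for line in text.splitlines():
        m = ROW_RE.match(line)
        if not m:
            continue  # skips the Core lines, the header, and prompts
        rows.append({
            "pid": int(m.group(1)),
            "five_sec": float(m.group(5)),
            "one_min": float(m.group(6)),
            "five_min": float(m.group(7)),
            "name": m.group(9),
        })
    return rows

def top_processes(rows, n=3, key="five_min"):
    """Return the n rows with the highest value for the given window."""
    return sorted(rows, key=lambda r: r[key], reverse=True)[:n]

SAMPLE = """\
Core 0: CPU utilization for five seconds: 94%; one minute: 92%;  five minutes: 90%
PID    Runtime(ms) Invoked  uSecs  5Sec     1Min     5Min     TTY   Process
5719   3399767     25647721 923    49.61    49.34    49.07    0     stack-mgr
5717   2287217     89195731 434    27.47    27.21    26.97    0     fed
10243  733637      16770188 371    13.71    13.54    13.70    34816 iosd
6250   40832       82186167 665    0.78     0.60     0.59     0     pdsd
"""

if __name__ == "__main__":
    for row in top_processes(parse_process_rows(SAMPLE)):
        print(row["name"], row["five_min"])
```

Sorting on the five-minute column rather than the five-second one is a deliberate choice here, since the complaint in this thread is sustained load rather than short spikes.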

 


Hi

Well, our experience is that whatever IOS-XE version is running, after 2-3 months the CPU goes up to 31% again.

When it starts hitting 85%, we reload the stacks during service windows.

That's our fix...
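
A reload threshold like the 85% mentioned above can be checked from saved CLI output. The following is only a rough Python sketch: `parse_core_utilization` and `cores_over_threshold` are hypothetical helper names, the default threshold simply mirrors the 85% figure in this reply, and the regex assumes the "Core N: CPU utilization ..." line format pasted in this thread.

```python
import re

# Matches lines such as:
# Core 0: CPU utilization for five seconds: 94%; one minute: 92%;  five minutes: 90%
CORE_RE = re.compile(
    r"Core\s+(\d+):\s*CPU utilization for five seconds:\s*(\d+)%;"
    r"\s*one minute:\s*(\d+)%;\s*five minutes:\s*(\d+)%"
)

def parse_core_utilization(text):
    """Map core number -> {'5s': int, '1m': int, '5m': int}."""
    cores = {}
    for m in CORE_RE.finditer(text):
        core, five_sec, one_min, five_min = (int(g) for g in m.groups())
        cores[core] = {"5s": five_sec, "1m": one_min, "5m": five_min}
    return cores

def cores_over_threshold(cores, threshold=85, window="5m"):
    """Return the sorted core numbers at or above the threshold."""
    return sorted(c for c, v in cores.items() if v[window] >= threshold)

SAMPLE = """\
Core 0: CPU utilization for five seconds: 94%; one minute: 92%;  five minutes: 90%
Core 1: CPU utilization for five seconds: 94%; one minute: 93%;  five minutes: 94%
Core 2: CPU utilization for five seconds: 98%; one minute: 96%;  five minutes: 93%
Core 3: CPU utilization for five seconds: 80%; one minute: 83%;  five minutes: 89%
"""

if __name__ == "__main__":
    cores = parse_core_utilization(SAMPLE)
    hot = cores_over_threshold(cores)
    if hot:
        print("Cores at or above threshold:", hot)
```

Given that others in this thread report the readings may be cosmetic, a check like this only says what the switch reports, not whether forwarding is actually affected.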

Hi Ton,

This is not a proper solution. Have you checked with Cisco TAC?

Oh yeah. 

They claimed it was a station generating excessive broadcast traffic.

So I brought down all ports, consoled into the stack, and ran the debug commands again.

They said it was still a station generating excessive broadcast traffic (with all ports down, you know?).

I have really given up on this platform. Too many issues over the last 2 years.

We've even found ISL issues in the logging, while dot1Q is now the default...

Hey,

I have the same problem. Any news? I tried the "ipv6 mld snooping" but that didn't work.

Sw#show process cpu detail process fed sorted | ex 0.0
Core 0: CPU utilization for five seconds: 99%; one minute: 98%; five minutes: 97%
Core 1: CPU utilization for five seconds: 99%; one minute: 95%; five minutes: 96%
Core 2: CPU utilization for five seconds: 99%; one minute: 89%; five minutes: 95%
Core 3: CPU utilization for five seconds: 88%; one minute: 82%; five minutes: 93%
PID    T C  TID    Runtime(ms) Invoked  uSecs  5Sec    1Min    5Min    TTY   Process
                                               (%)     (%)     (%)
5712   L           1858252     4132141  497    26.55   26.90   26.57   1088  fed
5712   L 3  12349  3193139     1511072  0      24.21   24.66   24.62   0     PunjectRx
5712   L 0  6152   2835105     6395961  0      0.59    0.85    0.75    0     fed-ots-main
5712   L 0  12350  2855645     3551507  0      0.29    0.25    0.22    0     PunjectTx
5712   L 0  10690  386394      2256693  0      0.24    0.13    0.10    0     Xcvr
5712   L 0  6147   238130      6694802  0      0.20    0.21    0.17    1088  CMI default xdm

Cheers,
Vasco

Hi,

I have the same problem as everyone else.

I wonder if someone has a proper solution at the moment?

We are using 03.03.05SE. Any idea whether upgrading to 3.06.05 as suggested will resolve the problem?

Hello,

Have been running 3.06.05 since August and haven't seen the issue since. Running Lan Base if that helps.

Cisco TAC have advised that this is a cosmetic bug which will be fixed in the yet-to-be-released versions 3.6.6E and 3.7.5E. For now, a reload of the stack is a temporary fix, as the CPU will creep up again after a few months.

The cosmetic bug ID is CSCuz57493 - High CPU observed in punjectrx fed-ots-main thread. This will be modified soon to include stack-mgr - replenish OOB/OOBnd RX.

Thanks

Since I upgraded all my 3850s to 03.06.03E, I have an uptime of 8 weeks, 6 days, 9 hours, 9 minutes, and my CPU looks like this now.

This stack has mild usage on it, but was previously pegged at 98-99%:

Core 0: CPU utilization for five seconds: 3%; one minute: 5%; five minutes: 5%
Core 1: CPU utilization for five seconds: 1%; one minute: 0%; five minutes: 0%
Core 2: CPU utilization for five seconds: 0%; one minute: 1%; five minutes: 1%
Core 3: CPU utilization for five seconds: 1%; one minute: 2%; five minutes: 1%

This is a stack with heavier usage on it, which was also running at 98-99% before:

Core 0: CPU utilization for five seconds: 26%; one minute: 18%; five minutes: 17%
Core 1: CPU utilization for five seconds: 14%; one minute: 11%; five minutes: 11%
Core 2: CPU utilization for five seconds: 8%; one minute: 8%; five minutes: 7%
Core 3: CPU utilization for five seconds: 9%; one minute: 6%; five minutes: 6%

Hi Guys, 

 

I am having a similar issue on one of my switches. The version is:

 03.07.03E

 

I did a reset, which fixed the problem, but it is building up again. Any help, please?

Hi!

 

I have the same issue with the same version. Did you find a solution?

 

Best regards.

If anyone faces the issue with this version (3.7.3.E), it is a cosmetic bug, as it has no impact on performance.

 

https://bst.cloudapps.cisco.com/bugsearch/bug/CSCuz57493/?reffering_site=dumpcr

 

 

The workaround is to reload the stack, but the final solution is to perform an upgrade to 3.7.5.E

 

Best regards.

I am working with TAC, and they are suggesting an upgrade to 3.07.02.E. I have upgraded two 3850s. So far, after 2 weeks, they are running well. I will wait at least a month to verify, however. I have seen this same situation before, where the issue does not recur for several weeks.

- The CPU broadcast queue is congested, and ARP is around 60% of the capture.

- ARP traffic still hits the CPU regardless of whether the SVI is still configured, because this platform by design allows 200 pps (packets per second) of this kind of traffic.

- There are a few duplicate ARP packets due to software bug CSCur30273, "3850 duplicates pass-through ARP packets".

Bruno... No new news on this.

Aninda... do you have any interest in resolving this?

We downloaded 3.6 but got errors during the upgrade. Downloaded it again with the same results. It did boot up... but I am not going to even try it until I get a clean upgrade.

All my 3850s are stacks... some 9 deep. Still at 90+.

Thanks,

Tom

 

 

 

 

Hey Tom,

 

It is going to be very difficult troubleshooting this issue over a medium like this. I'd suggest that everyone on here (facing this problem) open a (or another in case of Tom) TAC case for live troubleshooting.

 

As I had stated earlier, there may be genuine, underlying network issues that can cause stack-mgr to stay up. If that is not the case, and you're not seeing any impact from your cores running very high, then you could be hitting an internal defect where the show process cpu output is misreporting the CPU values.

 

You can post your SRs here and I can take a look at them as well.

 

Additionally, may I know what errors you encountered while trying to upgrade to 3.6.0?

 

Regards,

Aninda

Aninda,

We got the error during the copy from the USB port to flash. The copy did finish. We can try again if you need the exact error. I believe it was a checksum error.

Tom

 
