04-24-2014 04:56 AM - edited 03-07-2019 07:12 PM
I've got a 3850 that seems to be misbehaving.
I'm running version 3.3.1 and, short of scheduling a reload of the stack, I'm not sure what else I can do.
Right now I have three problems:
- NTP says it's synced, but the time is almost a minute off (I also saw a log message saying the 2nd switch in the stack had time running backwards). Other switches using the same NTP server seem to be fine.
- CPU is at 40% (caused by the fed process) and there doesn't seem to be a good reason why. It started about 2 months ago in the middle of the night. I've had a case open with TAC for a while, but we can't seem to find the cause (maybe this is the problem: CSCuo14511?).
- A bit of unicast flooding because the switch is not learning MAC addresses properly (this was the subject of another TAC case a few months back that I closed because the problem went away). This time the problem was probably triggered by a loop on another switch in the network, but after 2 weeks I don't see why the 3850 still isn't working properly. I found a somewhat related bug, CSCuj51372, but I don't think it's quite the same as my issue since, according to my network monitoring graphs, all ports got the spike in traffic. The workaround seems to be "switchport block unicast". I see the traffic go way down on the port I tried it on. I'm not sure if that messes up anything else, though.
If anyone has ideas on how to fix these issues, I'm open to suggestions; otherwise I'll have to reload and cross my fingers.
Thanks
04-24-2014 07:32 PM
Hey,
1. NTP - I think you are hitting CSCug75425 (the fix will be in the 3.6 release, May 2014). *We may have a fix in 3.3.3, which is due out next week; however, I cannot confirm this.
2. High CPU is tough to troubleshoot without seeing it; however, a colleague of mine just released a great doc on 3850 high CPU troubleshooting. CLICK HERE
With regard to CSCuo14511 - do you also see "Stack-mgr" running hot?
3. It's not CSCuj51372, since you are already running a fixed version. To be honest, I'm not sure what it is from the data listed. Sorry.
Luke
04-25-2014 10:50 AM
Hi Luke,
1) That NTP bug could be it even though I'm off by more than a few seconds. I did try adding/removing NTP settings and it didn't help so I'll keep my fingers crossed.
2) For the high CPU, the person working on my case did a bunch of that trace stuff and, other than saying it was coming from another switch, didn't find a specific cause. Fed is by far the highest consumer, with Iosd coming in 2nd. Everything else is not even at 1%. Below is detailed output from fed and iosd as an example.
PID    T  C  TID    Runtime(ms)  Invoked  uSecs  5Sec   1Min   5Min   TTY    Process
                                                 (%)    (%)    (%)
5711   L            3802720      4025757  599    30.72  31.01  31.08  1088   fed
5711   L  3  10687  1352269      2098074  0      23.92  23.77  23.83  0      PunjectRx
5711   L  0  6150   2568190      6030774  0      2.30   2.65   2.65   0      fed-ots-main
5711   L  2  10688  768022       4880154  0      2.25   2.22   2.25   0      PunjectTx
5711   L  1  6176   1321212      2834510  0      0.62   0.65   0.62   0      IntrDrv
5711   L  2  6153   2344817      8712912  0      0.48   0.49   0.48   0      fed-ots-nfl
5711   L  2  6146   3131259      5385475  0      0.43   0.58   0.57   1088   CMI default xdm
5711   L  0  6144   3102202      4169519  0      0.38   0.30   0.30   1088   fed

PID    T  C  TID    Runtime(ms)  Invoked  uSecs  5Sec   1Min   5Min   TTY    Process
                                                 (%)    (%)    (%)
10184  L            538870       1783225  470    9.79   9.39   9.25   34816  iosd
10184  L  1  10184  3327658      3568216  0      6.98   6.65   6.55   34816  iosd
10184  L  2  10849  2440189      2270519  0      2.67   2.48   2.43   0      iosd.fastpath
10184  L  3  10850  3175047      2305408  0      0.14   0.25   0.27   0      CMI Thread
10184  L  1  10851  183800       8898815  0      0.00   0.00   0.00   0      iosd.monitor
10184  L  1  10852  2110         17163    0      0.00   0.00   0.00   0      iosd.aux
241    I            2528248      3826105  0      9.88   9.22   9.11   0      Spanning Tree
30     I            3637774      1229465  0      5.88   5.22   4.66   0      ARP Input
132    I            2101996      1678705  0      2.66   2.99   2.99   0      NGWC Learning Proce
22     I            1806614      2053026  0      1.77   1.00   1.00   0      CMI IOSd task
152    I            3482189      1143618  0      0.77   1.00   0.99   0      ARP HA
271    I            3252864      3998120  0      0.77   1.22   1.44   0      IGMPSN
126    I            3031858      1533070  0      0.66   0.33   0.33   0      cpf_process_tpQ
138    I            572956       1769025  0      0.33   0.33   0.33   0      IPC Bootstrap
204    I            577370       8349982  0      0.22   0.11   0.22   0      Tunnel IOSd shim DB
162    I            453340       6041985  0      0.22   0.00   0.00   0      ngpm main process
31     I            364600       9222499  0      0.22   0.00   0.00   0      ARP Background
227    I            229013       4540678  0      0.22   0.88   0.88   0      IP Input
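For anyone comparing similar output, here is a minimal Python sketch that ranks the fed threads by their 5-second CPU figure. It only works on the fed rows pasted above; the helper name `parse_row` and the way rows are told apart (the summary row lacks the C and TID columns, so its first decimal value sits at a different position) are my own assumptions about this text layout, not anything provided by the switch.

```python
# Rows copied from the fed output above (header lines left out).
FED_OUTPUT = """\
5711 L 3802720 4025757 599 30.72 31.01 31.08 1088 fed
5711 L 3 10687 1352269 2098074 0 23.92 23.77 23.83 0 PunjectRx
5711 L 0 6150 2568190 6030774 0 2.30 2.65 2.65 0 fed-ots-main
5711 L 2 10688 768022 4880154 0 2.25 2.22 2.25 0 PunjectTx
5711 L 1 6176 1321212 2834510 0 0.62 0.65 0.62 0 IntrDrv
5711 L 2 6153 2344817 8712912 0 0.48 0.49 0.48 0 fed-ots-nfl
5711 L 2 6146 3131259 5385475 0 0.43 0.58 0.57 1088 CMI default xdm
5711 L 0 6144 3102202 4169519 0 0.38 0.30 0.30 1088 fed
"""


def parse_row(line):
    """Hypothetical helper: split one row into the fields we care about."""
    tokens = line.split()
    # The first token containing a decimal point is the 5Sec percentage.
    i = next(k for k, tok in enumerate(tokens) if "." in tok)
    return {
        # Thread rows carry C and TID, which pushes 5Sec to index 7.
        "is_thread": i == 7,
        "five_sec": float(tokens[i]),
        "one_min": float(tokens[i + 1]),
        "five_min": float(tokens[i + 2]),
        # Process names may contain spaces, e.g. "CMI default xdm".
        "name": " ".join(tokens[i + 4:]),
    }


rows = [parse_row(line) for line in FED_OUTPUT.splitlines() if line.strip()]
total = next(r for r in rows if not r["is_thread"])
threads = sorted(
    (r for r in rows if r["is_thread"]),
    key=lambda r: r["five_sec"],
    reverse=True,
)
top = threads[0]
share = top["five_sec"] / total["five_sec"]

for r in threads:
    print(f'{r["name"]:<16} {r["five_sec"]:6.2f}%')
print(f'{top["name"]} accounts for {share:.0%} of the fed total')
```

On the numbers above, PunjectRx comes out on top at about 78% of the fed total. Going by the thread name, that looks like packets being punted to the CPU, which would fit with Spanning Tree and ARP Input being the busiest entries on the iosd side, but that reading is a guess rather than something confirmed in this thread.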
As I said in my original post, short of rebooting, I'm not sure what to do next.
08-12-2014 07:07 AM
I have a similar problem, also running version 3.3.1.
In my case it's the "stack-mgr" process running at 49% CPU.
It started on July 23rd when the CPU went from a normal 7-10% average up to 33% around 12:00 and then on the 25th (2 days later) it went from 33% to 57% at 22:00.
Since it's my server stack (4 units), rebooting is not an option.
08-12-2014 07:15 AM
Hi,
The best thing to do is upgrade to v3.3.3.
We've had similar issues. TAC suggested upgrading to v3.3.3, which solved the lot.
No more high CPU, and MAC learning works properly now.