
CrackedJack1 · ‎04-24-2014

I've got a 3850 that seems to be misbehaving.

I'm running version 3.3.1 and short of trying to schedule a reload of the stack, I'm not sure what else I can do.

Right now I have three problems:

- NTP says it’s synced, but the clock is almost a minute off (I also saw a log message saying the 2nd switch in the stack had time running backwards). Other switches using the same NTP server seem to be fine.

- CPU is at 40% (caused by the fed process) and there doesn’t seem to be a good reason why. It started about 2 months ago in the middle of the night. I have had a case open with TAC for a while, but we can't seem to find the cause (maybe this is the problem: CSCuo14511?).

- A bit of unicast flooding because the switch is not learning MAC addresses properly (this was the subject of another TAC case a few months back that I closed because the problem went away). This time the problem was probably triggered by a loop on another switch in the network, but after 2 weeks I don’t see why the 3850 still isn’t working properly. I found a somewhat related bug, CSCuj51372, but I don't think it's quite the same as my issue since, according to my network monitoring graphs, all ports got the spike in traffic. The workaround seems to be "switchport block unicast". I see the traffic go way down on the port I tried it on. I'm not sure whether that messes up anything else, though.

If anyone has ideas on how to fix these issues, I'm open to suggestions. Otherwise I'll have to reload and cross my fingers.

Thanks

Richard Primm · ‎04-24-2014

Hey,

1. NTP - I think you are hitting CSCug75425 (the fix will be in the 3.6 release, May 2014). We may have a fix in 3.3.3, which is due out next week, but I cannot confirm this.

2. High CPU is tough to troubleshoot without seeing the switch; however, a colleague of mine just released a great doc on 3850 high CPU troubleshooting. CLICK HERE

Regarding CSCuo14511: do you also see "Stack-mgr" running hot?

3. It's not CSCuj51372, since you are already running a fixed version. To be honest, I'm not sure what it is from the data listed. Sorry.

Luke

CrackedJack1 · ‎04-25-2014

Hi Luke,

1) That NTP bug could be it, even though I'm off by more than a few seconds. I did try adding/removing NTP settings and it didn't help, so I'll keep my fingers crossed.

2) For the high CPU, the person working on my case did a bunch of that trace stuff and, other than saying the traffic was coming from another switch, didn't find a specific cause. Fed is by far the highest consumer, with Iosd coming in 2nd. Everything else is not even at 1%. Below is detailed output from fed and iosd as an example.

PID    T  C  TID    Runtime(ms)  Invoked  uSecs  5Sec(%)  1Min(%)  5Min(%)  TTY    Process
5711   L            3802720      4025757  599    30.72    31.01    31.08    1088   fed
5711   L  3  10687  1352269      2098074  0      23.92    23.77    23.83    0      PunjectRx
5711   L  0  6150   2568190      6030774  0      2.30     2.65     2.65     0      fed-ots-main
5711   L  2  10688  768022       4880154  0      2.25     2.22     2.25     0      PunjectTx
5711   L  1  6176   1321212      2834510  0      0.62     0.65     0.62     0      IntrDrv
5711   L  2  6153   2344817      8712912  0      0.48     0.49     0.48     0      fed-ots-nfl
5711   L  2  6146   3131259      5385475  0      0.43     0.58     0.57     1088   CMI default xdm
5711   L  0  6144   3102202      4169519  0      0.38     0.30     0.30     1088   fed

PID    T  C  TID    Runtime(ms)  Invoked  uSecs  5Sec(%)  1Min(%)  5Min(%)  TTY    Process
10184  L            538870       1783225  470    9.79     9.39     9.25     34816  iosd
10184  L  1  10184  3327658      3568216  0      6.98     6.65     6.55     34816  iosd
10184  L  2  10849  2440189      2270519  0      2.67     2.48     2.43     0      iosd.fastpath
10184  L  3  10850  3175047      2305408  0      0.14     0.25     0.27     0      CMI Thread
10184  L  1  10851  183800       8898815  0      0.00     0.00     0.00     0      iosd.monitor
10184  L  1  10852  2110         17163    0      0.00     0.00     0.00     0      iosd.aux
241    I            2528248      3826105  0      9.88     9.22     9.11     0      Spanning Tree
30     I            3637774      1229465  0      5.88     5.22     4.66     0      ARP Input
132    I            2101996      1678705  0      2.66     2.99     2.99     0      NGWC Learning Proce
22     I            1806614      2053026  0      1.77     1.00     1.00     0      CMI IOSd task
152    I            3482189      1143618  0      0.77     1.00     0.99     0      ARP HA
271    I            3252864      3998120  0      0.77     1.22     1.44     0      IGMPSN
126    I            3031858      1533070  0      0.66     0.33     0.33     0      cpf_process_tpQ
138    I            572956       1769025  0      0.33     0.33     0.33     0      IPC Bootstrap
204    I            577370       8349982  0      0.22     0.11     0.22     0      Tunnel IOSd shim DB
162    I            453340       6041985  0      0.22     0.00     0.00     0      ngpm main process
31     I            364600       9222499  0      0.22     0.00     0.00     0      ARP Background
227    I            229013       4540678  0      0.22     0.88     0.88     0      IP Input

As I said in my original post, short of rebooting, I'm not sure what to do next.
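
For anyone comparing captures of the detailed CPU output above over time, a small script can rank the threads for you. This is only a rough Python sketch, assuming the column order shown in the pasted output (process-level rows have no C and TID columns, thread-level rows do). The names `parse_cpu_rows`, `top_threads` and `SAMPLE` are made up for illustration and are not part of any Cisco tool.

```python
from typing import List, NamedTuple, Optional


class CpuRow(NamedTuple):
    pid: int
    kind: str            # "L" or "I", as shown in the T column
    tid: Optional[int]   # None for process-level rows (no C/TID columns)
    five_sec: float
    one_min: float
    five_min: float
    name: str


def parse_cpu_rows(text: str) -> List[CpuRow]:
    """Parse rows of pasted CPU output; headers and blank lines are skipped."""
    rows = []
    for line in text.splitlines():
        tokens = line.split()
        if len(tokens) < 10 or not tokens[0].isdigit():
            continue
        # The first token containing "." is the 5Sec percentage:
        # index 5 on process-level rows, index 7 on thread-level rows.
        first_pct = next((i for i, t in enumerate(tokens) if "." in t), None)
        if first_pct not in (5, 7):
            continue
        tid = int(tokens[3]) if first_pct == 7 else None
        five_sec, one_min, five_min = (
            float(t) for t in tokens[first_pct:first_pct + 3]
        )
        # Skip the TTY column; the rest is the (possibly multi-word) name.
        name = " ".join(tokens[first_pct + 4:])
        rows.append(
            CpuRow(int(tokens[0]), tokens[1], tid, five_sec, one_min, five_min, name)
        )
    return rows


def top_threads(rows: List[CpuRow], count: int = 3) -> List[CpuRow]:
    """Return the busiest thread-level rows by 5-second CPU, highest first."""
    threads = [r for r in rows if r.tid is not None]
    return sorted(threads, key=lambda r: r.five_sec, reverse=True)[:count]


# A few rows copied from the fed output in this post.
SAMPLE = """\
PID    T C TID    Runtime(ms) Invoked uSecs 5Sec      1Min     5Min     TTY   Process
5711   L           3802720     4025757 599    30.72     31.01   31.08   1088 fed
5711   L 3 10687 1352269     2098074 0      23.92     23.77   23.83   0     PunjectRx
5711   L 0 6150   2568190     6030774 0      2.30      2.65    2.65    0     fed-ots-main
5711   L 2 10688 768022      4880154 0      2.25      2.22    2.25    0     PunjectTx
5711   L 2 6146   3131259     5385475 0      0.43      0.58    0.57    1088 CMI default xdm
"""

if __name__ == "__main__":
    for row in top_threads(parse_cpu_rows(SAMPLE)):
        print(f"{row.name:<20} {row.five_sec:6.2f}%")
```

On the fed rows above this puts PunjectRx well ahead of every other thread, which matches what the table shows by eye. The script only sorts the numbers and does not explain why the punt path is busy.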

Stipriaan · ‎08-12-2014

I have a similar problem, also running version 3.3.1.

In my case it's the "stack-mgr" process running at 49% CPU.

It started on July 23rd, when the CPU went from a normal 7-10% average up to 33% around 12:00. Then on the 25th (2 days later) it went from 33% to 57% at 22:00.

Since it's my server stack (4 units), rebooting is not an option.

Ton V Engelen · ‎08-12-2014

Hi,

The best thing to do is upgrade to v3.3.3.

We've had similar issues. TAC suggested upgrading to v3.3.3, which solved the lot.

No more high CPU, and MAC learning works properly now.

problems with 3850