06-05-2015 05:46 AM - edited 03-08-2019 12:25 AM
Hi,
While trying to find a fix for our problem described here https://supportforums.cisco.com/discussion/12521766/high-cpu-utilization-3650
I had the idea to null-route all the unused IPs that were attracting the ARP requests we had identified as the cause of the high CPU utilization on our 3650s.
So I did just that on our 3850, and it jumped the CPU utilization from 32% to 57%. I immediately removed the null routes, but the high CPU utilization remained at 56-57%, as if I hadn't removed anything.
PS. The routes were nulled like this: ip route x.x.x.x 255.255.255.255 null 0
so all were added with a /32 and not with some /25 or /24 prefix.
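Just to illustrate (the addresses below are made up), the routes looked like this, one /32 per unused IP, rather than a single summary route for a whole unused block:
ip route 192.0.2.17 255.255.255.255 null 0
ip route 192.0.2.23 255.255.255.255 null 0
! ...and so on, one line per unused host IP
!
! what we did NOT do: one summary route covering a whole unused /25
! ip route 192.0.2.128 255.255.255.128 null 0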
So now we have both core and edge under high CPU load.
Question #1: Why do ARP requests generate such a high load? (we have 6 /24 ranges)
Question #2: If an action causes high CPU utilization, shouldn't cancelling that action also cancel the effect?
Question #3: How do we deal with ARP requests to non-existent IPs that generate high CPU utilization (which never comes back down even after the requests stop)?
Below is the usual info that the good folks here need to identify this:
Switch Ports Model SW Version SW Image Mode
------ ----- ----- ---------- ---------- ----
* 1 32 WS-C3850-24T 03.03.05SE cat3k_caa-universalk9 INSTALL
#show proc cpu detailed | ex 0.0
Core 0: CPU utilization for five seconds: 84%; one minute: 87%; five minutes: 86%
Core 1: CPU utilization for five seconds: 57%; one minute: 64%; five minutes: 56%
Core 2: CPU utilization for five seconds: 94%; one minute: 39%; five minutes: 37%
Core 3: CPU utilization for five seconds: 37%; one minute: 41%; five minutes: 48%
PID T C TID Runtime(ms) Invoked uSecs 5Sec 1Min 5Min TTY Process
(%) (%) (%)
5703 L 2791053 166558658 618 35.0 32.3 32.2 1088 fed
5703 L 1 6135 1580340 187915994 0 1.25 0.82 0.86 1088 fed
5703 L 1 6141 3098676 636145198 0 0.14 0.25 0.25 0 fed-ots-main
6239 L 3086455 391018952 124 19.2 20.2 20.5 0 pdsd
6239 L 2 8468 2778170 354544604 0 19.5 20.1 20.5 0 pdsd
6241 L 3 7347 11860 92079 0 0.83 0.83 0.83 0 CMI default xdm
6247 L 1298531 179872884 373 1.30 0.82 0.69 0 ffm
6247 L 3 6247 1993712 375943583 0 0.19 0.45 0.38 0 ffm
6247 L 3 8219 290493 110730035 0 0.39 0.32 0.28 0 CMI default xdm
8566 L 3434365 279734906 24 0.14 0.14 0.14 0 wcm
8570 L 3 8570 816883 101914872 0 2.49 2.29 2.23 0 iosd
7 I 433269 4113676 0 2.55 0.33 0.22 0 Check heaps
30 I 2607528 233487409 0 0.11 0.22 0.22 0 ARP Input
203 I 2912930 160532751 0 0.77 0.44 0.44 0 Tunnel IOSd shim
225 I 2542144 403872286 0 0.77 1.33 1.44 0 IP ARP Retry Ager
226 I 1670844 481494139 0 0.77 0.99 0.88 0 IP Input
314 I 3734041 161380536 0 0.77 0.66 0.77 0 IP SLAs XOS Event
319 I 3371874 112187598 0 0.66 0.22 0.11 0 ADJ resolve proce
394 I 1756890 292726188 0 0.11 0.22 0.11 0 PDU DISPATCHER
399 I 3373374 587531428 0 0.33 0.66 0.66 0 IP SNMP
Just by executing this command, the CPU jumped another 5%...
Any ideas?
Many thanks!
Lefteris
04-23-2016 12:38 AM
If anyone is interested...
After about 4 months of running with high CPU load, the switch decided to go back to 16% WITHOUT any action on our part. It just went down and it is still there...
Hope this helps anyone with the same issue
07-22-2016 11:22 AM
Hi Lefteri,
I have the same issue on my 3650. It is exactly as you describe it. The funny thing is that my 3650 is actually a stack of two switches and only one of them has the problem (it's the master of the stack).
Have you opened a case? Have you come to any other conclusions about the issue? Did you have the chance to try another IOS?
Thanx in advance,
spiros
07-23-2016 02:49 AM
Hi Spiro,
Unfortunately I have received no response on this issue. We have the same problem (but not as severe) on another device (a 3650) with the same IOS.
On our 3850 it has now gotten worse. We have 80% CPU load...
As far as I can tell, the problem is related to ARP requests for IPs that are not live.
On the 3850 we have 8 /24 ranges and about 40% of the IPs are active. The higher the number of live IPs, the lower the CPU load... I have no solution to this problem yet. I am hoping an IOS upgrade will fix the issue, but I'm not holding my breath.
Strangely, we do NOT have this issue on routers (3825 etc.).
Any similarities on your end? Please share!
Please note that we are running the IP Advanced Services image and running BGP on both devices. On two 3650s that are used as core switches and are not doing L3, we have absolutely no issues!
Looking forward to your input :)
Thx
Lefteris
07-24-2016 12:18 AM
Hi Lefteri,
Well, my case is a stack of:
Switch Ports Model SW Version SW Image Mode
------ ----- ----- ---------- ---------- ----
3 52 WS-C3650-48TD 03.03.04SE cat3k_caa-universalk9 INSTALL
* 4 52 WS-C3650-48TD 03.03.04SE cat3k_caa-universalk9 INSTALL
uptime : 1 year, 42 weeks, 3 days, 4 hours, 54 minutes
- I have about 30 SVIs (VLAN L3 interfaces)
- about 50 lines of ACL statements applied on SVI interfaces
- OSPF, static routes
- about 240 entries in the ARP table, a few of them in an unknown state
- I have "ip device tracking maximum 0" on all interfaces in order to disable device tracking (in a later IOS you can disable it globally); a config sketch follows after this list
- various policing/shaping on interfaces
- my CPU right now is:
#sh proc cpu
Core 0: CPU utilization for five seconds: 95%; one minute: 95%; five minutes: 95%
Core 1: CPU utilization for five seconds: 92%; one minute: 94%; five minutes: 95%
Core 2: CPU utilization for five seconds: 93%; one minute: 96%; five minutes: 95%
Core 3: CPU utilization for five seconds: 97%; one minute: 94%; five minutes: 94%
- The CPU load has increased in big steps of about 20% over roughly two months. I have also once seen the CPU load fall by 20%. When there is an increase/decrease of CPU load (a step), it stays there for weeks, even months. So I would say it is not going up and down the way a normal CPU load should.
- FED is the problem:
#show processes cpu detailed process fed sorted | ex 0.0
PID T C TID Runtime(ms) Invoked uSecs 5Sec 1Min 5Min TTY Process
(%) (%) (%)
5665 L 229196 1441256 466 72.75 73.16 73.61 1088 fed
5665 L 1 6100 1810519 1257976 0 24.12 24.19 24.30 0 fed-ots-main
5665 L 2 11007 1280626 2181769 0 14.24 12.63 12.92 0 XcvrPoll
- The CPU load is supposedly caused by packets punted to the CPU.
#show platform punt client
tag buffer jumbo fallback packets received failures
alloc free bytes conv buf
27 0/1024/2048 0/5 0/5 0 0 0 0 0
65536 0/1024/1600 0/0 0/512 1012528394 1225732032 3754655604 0 0
65537 0/ 512/1600 0/0 0/512 28843424 28843424 3106482325 0 0
65538 0/ 5/5 0/0 0/5 0 0 0 0 0
65539 1/2048/1600 0/16 0/512 351496405 351662096 1015006362 0 0
65540 0/ 128/1600 0/8 0/0 11983417 11983417 2407384532 0 0
65541 0/ 128/1600 0/16 0/32 120981077 120981077 1710593502 0 0
65542 0/ 768/1600 0/4 0/0 19806318 162558624 1585792442 0 0
65544 0/ 96/1600 0/4 0/0 0 0 0 0 0
65545 0/ 96/1600 0/8 0/32 0 0 0 0 0
65546 0/ 512/1600 0/32 0/512 1516512715 1519497732 1064396594 0 0
65547 0/ 96/1600 0/8 0/32 0 0 0 0 0
65548 0/ 512/1600 0/32 0/256 2249570784 2249568845 1396835480 0 2
65551 0/ 512/1600 0/0 0/256 7 7 420 0 0
65556 0/ 16/1600 0/4 0/0 0 0 0 0 0
65557 0/ 16/1600 0/4 0/0 0 0 0 0 0
65558 0/ 16/1600 0/4 0/0 2610521 2610521 180203510 0 282
65559 0/ 16/1600 0/4 0/0 45229293 45229293 3911166554 0 3
65560 0/ 16/1600 0/4 0/0 1554691 1554691 124004004 0 4711
65561 0/ 512/1600 0/0 0/128 407124369 439136899 2988194120 0 7
65562 0/ 512/1600 0/0 0/256 0 0 0 0 0
65563 0/ 512/1600 0/0 0/256 0 0 0 0 0
65565 0/ 512/1600 0/16 0/256 0 0 0 0 0
65566 0/ 512/1600 0/16 0/256 0 0 0 0 0
65567 0/ 512/1600 0/16 0/256 0 0 0 0 0
65568 0/ 512/1600 0/16 0/256 0 0 0 0 0
65583 0/ 1/1 0/0 0/0 0 0 0 0 0
131071 0/ 96/1600 0/4 0/0 0 0 0 0 0
fallback pool: 0/1500/1600
jumbo pool: 0/128/9300
- My TCAM usage is well under the limits.
- Since the CPU load jumps upwards in big steps at specific moments in time, I queried my logs (from all the devices, not only the 3650) to find out whether some event happened at those times and triggered the CPU load. I found nothing.
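As a side note on the device tracking bullet above, the config is roughly the following (the interface ranges are just an example; it is applied the same way on every access port):
interface range GigabitEthernet3/0/1 - 48
 ip device tracking maximum 0
!
interface range GigabitEthernet4/0/1 - 48
 ip device tracking maximum 0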
What I believe:
It is a BUG... It can't be the ARP requests. Even your 8 x /24 is not that much to keep the switch under constant load for days. As you can see in the output for CPU-punted packets, I have many packets that go to the CPU. The switch is under constant load but it is very responsive. It could even be a cosmetic bug that falsely reports high CPU load.
We should diff the 'punted to CPU packets' counters in order to derive rates, and capture some packets, to see if we can come to a conclusion.
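One rough way to do that (just a sketch; assuming EEM and the "| append" redirect are available on this release, and the file name is arbitrary) would be an EEM applet that dumps the punt counters to flash every 5 minutes, so we can diff consecutive snapshots later:
event manager applet PUNT-SNAPSHOT
 event timer watchdog time 300
 action 1.0 cli command "enable"
 action 2.0 cli command "show clock | append flash:punt-history.txt"
 action 3.0 cli command "show platform punt client | append flash:punt-history.txt"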
Have you opened a Cisco TAC case?
Cheers,
Sp
07-25-2016 11:42 PM
Hi Spiro :)
I don't have TAC access, so no support case.
You are right, it IS a BUG! I am not sure about the ARP; in fact I am not sure about anything. It could very well be cosmetic (the CPU load figure), as my switches are also very responsive.
My team here has dug really deep and captured a lot of packets (hence our idea about the ARP), but nothing was conclusive... In the lab, we simulated a scenario with a lot of IPs and a lot of ARP requests to non-active IPs, and the result was the same.
We both have the same IOS, 03.03.05SE, so let's hope for an upgrade which could fix this bug!
As a friend of mine says, this code was written with someone's feet :D
Good luck and keep me posted if you find anything!
BR,
Lefteris