06-05-2015 05:46 AM - edited 03-08-2019 12:25 AM
Hi,
While trying to find a fix for our problem described here https://supportforums.cisco.com/discussion/12521766/high-cpu-utilization-3650
I had the idea to null-route all the unused IPs that were attracting the ARP requests we had identified as the cause of the high CPU utilization on our 3650s.
So I did just that on our 3850, and it jumped the CPU utilization from 32% to 57%. I immediately removed the null routes, but the high CPU utilization remained at 56-57%, as if I hadn't removed anything.
PS. The routes were nulled like this: ip route x.x.x.x 255.255.255.255 null 0
so all were added with a /32 and not with some /25 or /24 prefix.
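Just to illustrate (the addresses below are made up), the routes looked like this, one /32 per unused IP, rather than a single summary route for a whole unused block:
ip route 192.0.2.17 255.255.255.255 null 0
ip route 192.0.2.23 255.255.255.255 null 0
! ...and so on, one line per unused host IP
!
! what we did NOT do: one summary route covering a whole unused /25
! ip route 192.0.2.128 255.255.255.128 null 0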
So now we have both core and edge under high CPU load.
Question #1: Why do ARP requests generate such a high load? (we have 6 /24 ranges)
Question #2: If an action causes high CPU utilization, shouldn't cancelling that action also cancel the effect?
Question #3: How do we deal with ARP requests to non-existent IPs that generate high CPU utilization (which never comes back down even after the requests stop)?
Below is the usual info that the good folks here need to identify this:
Switch Ports Model SW Version SW Image Mode
------ ----- ----- ---------- ---------- ----
* 1 32 WS-C3850-24T 03.03.05SE cat3k_caa-universalk9 INSTALL
#show proc cpu detailed | ex 0.0
Core 0: CPU utilization for five seconds: 84%; one minute: 87%; five minutes: 86%
Core 1: CPU utilization for five seconds: 57%; one minute: 64%; five minutes: 56%
Core 2: CPU utilization for five seconds: 94%; one minute: 39%; five minutes: 37%
Core 3: CPU utilization for five seconds: 37%; one minute: 41%; five minutes: 48%
PID T C TID Runtime(ms) Invoked uSecs 5Sec 1Min 5Min TTY Process
(%) (%) (%)
5703 L 2791053 166558658 618 35.0 32.3 32.2 1088 fed
5703 L 1 6135 1580340 187915994 0 1.25 0.82 0.86 1088 fed
5703 L 1 6141 3098676 636145198 0 0.14 0.25 0.25 0 fed-ots-main
6239 L 3086455 391018952 124 19.2 20.2 20.5 0 pdsd
6239 L 2 8468 2778170 354544604 0 19.5 20.1 20.5 0 pdsd
6241 L 3 7347 11860 92079 0 0.83 0.83 0.83 0 CMI default xdm
6247 L 1298531 179872884 373 1.30 0.82 0.69 0 ffm
6247 L 3 6247 1993712 375943583 0 0.19 0.45 0.38 0 ffm
6247 L 3 8219 290493 110730035 0 0.39 0.32 0.28 0 CMI default xdm
8566 L 3434365 279734906 24 0.14 0.14 0.14 0 wcm
8570 L 3 8570 816883 101914872 0 2.49 2.29 2.23 0 iosd
7 I 433269 4113676 0 2.55 0.33 0.22 0 Check heaps
30 I 2607528 233487409 0 0.11 0.22 0.22 0 ARP Input
203 I 2912930 160532751 0 0.77 0.44 0.44 0 Tunnel IOSd shim
225 I 2542144 403872286 0 0.77 1.33 1.44 0 IP ARP Retry Ager
226 I 1670844 481494139 0 0.77 0.99 0.88 0 IP Input
314 I 3734041 161380536 0 0.77 0.66 0.77 0 IP SLAs XOS Event
319 I 3371874 112187598 0 0.66 0.22 0.11 0 ADJ resolve proce
394 I 1756890 292726188 0 0.11 0.22 0.11 0 PDU DISPATCHER
399 I 3373374 587531428 0 0.33 0.66 0.66 0 IP SNMP
Just by executing this command, the CPU jumped another 5%...
Any ideas?
Many thanks!
Lefteris
04-23-2016 12:38 AM
If anyone is interested...
After about 4 months of running with high CPU load, the switch decided to go back to 16% WITHOUT any action on our part. It just went down and it is still there...
Hope this helps anyone with the same issue
07-22-2016 11:22 AM
Hi Lefteri,
I have the same issue on my 3650. It is exactly as you describe it. The funny thing is that my 3650 is actually a stack of two switches and only one of them has the problem (it's the master of the stack).
Have you opened a case? Have you come to any other conclusions about the issue? Did you have the chance to try another IOS?
Thanx in advance,
spiros
07-23-2016 02:49 AM
Hi Spiro,
Unfortunately I have received no response on this issue. We have the same problem (but not as severe) on another device (a 3650) with the same IOS.
On our 3850 it has now gotten worse. We have 80% CPU load...
As far as I can tell, the problem is related to ARP requests for IPs that are not live.
On the 3850 we have 8 /24 ranges and about 40% of the IPs are active. The higher the number of live IPs, the lower the CPU load... I have no solution to this problem yet. I am hoping an IOS upgrade will fix the issue, but I'm not holding my breath.
Strangely, we do NOT have this issue on routers (3825 etc.).
Any similarities on your end? Please share!
Please note that we are running the IP Advanced Services image and running BGP on both devices. On two 3650s that are used as core switches and are not doing L3, we have absolutely no issues!
Looking forward to your input :)
Thx
Lefteris
07-24-2016 12:18 AM
Hi Lefteri,
Well, my case is a stack of:
Switch Ports Model SW Version SW Image Mode
------ ----- ----- ---------- ---------- ----
3 52 WS-C3650-48TD 03.03.04SE cat3k_caa-universalk9 INSTALL
* 4 52 WS-C3650-48TD 03.03.04SE cat3k_caa-universalk9 INSTALL
uptime : 1 year, 42 weeks, 3 days, 4 hours, 54 minutes
- I have about 30 SVIs (VLAN L3 interfaces)
- about 50 lines of ACL statements applied on SVI interfaces
- OSPF, static routes
- about 240 entries in the ARP table, a few of them in an unknown state
- I have "ip device tracking maximum 0" on all interfaces in order to disable device tracking (in a later IOS you can disable it globally); a config sketch follows after this list
- various policing/shaping on interfaces
- my CPU right now is:
#sh proc cpu
Core 0: CPU utilization for five seconds: 95%; one minute: 95%; five minutes: 95%
Core 1: CPU utilization for five seconds: 92%; one minute: 94%; five minutes: 95%
Core 2: CPU utilization for five seconds: 93%; one minute: 96%; five minutes: 95%
Core 3: CPU utilization for five seconds: 97%; one minute: 94%; five minutes: 94%
- The CPU load has increased in big steps of about 20% over roughly two months. I have also once seen the CPU load fall by 20%. When there is an increase/decrease of CPU load (a step), it stays there for weeks, even months. So I would say it is not going up and down the way a normal CPU load should.
- FED is the problem:
#show processes cpu detailed process fed sorted | ex 0.0
PID T C TID Runtime(ms) Invoked uSecs 5Sec 1Min 5Min TTY Process
(%) (%) (%)
5665 L 229196 1441256 466 72.75 73.16 73.61 1088 fed
5665 L 1 6100 1810519 1257976 0 24.12 24.19 24.30 0 fed-ots-main
5665 L 2 11007 1280626 2181769 0 14.24 12.63 12.92 0 XcvrPoll
- The CPU load is supposedly caused by packets punted to the CPU.
#show platform punt client
tag buffer jumbo fallback packets received failures
alloc free bytes conv buf
27 0/1024/2048 0/5 0/5 0 0 0 0 0
65536 0/1024/1600 0/0 0/512 1012528394 1225732032 3754655604 0 0
65537 0/ 512/1600 0/0 0/512 28843424 28843424 3106482325 0 0
65538 0/ 5/5 0/0 0/5 0 0 0 0 0
65539 1/2048/1600 0/16 0/512 351496405 351662096 1015006362 0 0
65540 0/ 128/1600 0/8 0/0 11983417 11983417 2407384532 0 0
65541 0/ 128/1600 0/16 0/32 120981077 120981077 1710593502 0 0
65542 0/ 768/1600 0/4 0/0 19806318 162558624 1585792442 0 0
65544 0/ 96/1600 0/4 0/0 0 0 0 0 0
65545 0/ 96/1600 0/8 0/32 0 0 0 0 0
65546 0/ 512/1600 0/32 0/512 1516512715 1519497732 1064396594 0 0
65547 0/ 96/1600 0/8 0/32 0 0 0 0 0
65548 0/ 512/1600 0/32 0/256 2249570784 2249568845 1396835480 0 2
65551 0/ 512/1600 0/0 0/256 7 7 420 0 0
65556 0/ 16/1600 0/4 0/0 0 0 0 0 0
65557 0/ 16/1600 0/4 0/0 0 0 0 0 0
65558 0/ 16/1600 0/4 0/0 2610521 2610521 180203510 0 282
65559 0/ 16/1600 0/4 0/0 45229293 45229293 3911166554 0 3
65560 0/ 16/1600 0/4 0/0 1554691 1554691 124004004 0 4711
65561 0/ 512/1600 0/0 0/128 407124369 439136899 2988194120 0 7
65562 0/ 512/1600 0/0 0/256 0 0 0 0 0
65563 0/ 512/1600 0/0 0/256 0 0 0 0 0
65565 0/ 512/1600 0/16 0/256 0 0 0 0 0
65566 0/ 512/1600 0/16 0/256 0 0 0 0 0
65567 0/ 512/1600 0/16 0/256 0 0 0 0 0
65568 0/ 512/1600 0/16 0/256 0 0 0 0 0
65583 0/ 1/1 0/0 0/0 0 0 0 0 0
131071 0/ 96/1600 0/4 0/0 0 0 0 0 0
fallback pool: 0/1500/1600
jumbo pool: 0/128/9300
- My TCAM usage is well under the limits.
- Since the CPU load jumps upwards in big steps at specific moments in time, I queried my logs (from all the devices, not only the 3650) to find out whether some event happened at those times and triggered the CPU load. I found nothing.
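As a side note on the device tracking bullet above, the config is roughly the following (the interface ranges are just an example; it is applied the same way on every access port):
interface range GigabitEthernet3/0/1 - 48
 ip device tracking maximum 0
!
interface range GigabitEthernet4/0/1 - 48
 ip device tracking maximum 0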
What I believe:
It is a BUG... It can't be the ARP requests. Even your 8 x /24 is not that much to keep the switch under constant load for days. As you can see in the output for CPU-punted packets, I have many packets that go to the CPU. The switch is under constant load but it is very responsive. It could even be a cosmetic bug that falsely reports high CPU load.
We should diff the 'punted to CPU packets' counters in order to derive rates, and capture some packets, to see if we can come to a conclusion.
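One rough way to do that (just a sketch; assuming EEM and the "| append" redirect are available on this release, and the file name is arbitrary) would be an EEM applet that dumps the punt counters to flash every 5 minutes, so we can diff consecutive snapshots later:
event manager applet PUNT-SNAPSHOT
 event timer watchdog time 300
 action 1.0 cli command "enable"
 action 2.0 cli command "show clock | append flash:punt-history.txt"
 action 3.0 cli command "show platform punt client | append flash:punt-history.txt"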
Have you opened a Cisco TAC case?
Cheers,
Sp
07-25-2016 11:42 PM
Hi Spiro :)
I don't have TAC access, so no support case.
You are right, it IS a BUG! I am not sure about the ARP; in fact I am not sure about anything. It could very well be cosmetic (the CPU load figure), as my switches are also very responsive.
My team here has dug really deep and captured a lot of packets (hence our idea about the ARP), but nothing was conclusive... In the lab, we simulated a scenario with a lot of IPs and a lot of ARP requests to non-active IPs, and the result was the same.
We both have the same IOS, 03.03.05SE, so let's hope for an upgrade which could fix this bug!
As a friend of mine says, this code was written with someone's feet :D
Good luck and keep me posted if you find anything!
BR,
Lefteris