11-25-2014 06:59 AM - edited 03-05-2019 12:14 AM
Hi,
I know this has been discussed many times before, but maybe I can get some fresh inspiration to solve a problem here.
We have some c6509s with SUP720-3BXL. One of the routers has a WS-X6704-10G module with DFC; this is the only card installed besides the SUP720. We've run into some heavy CPU problems on the RP:
c6k-05#sh proc cpu sorted
CPU utilization for five seconds: 71%/12%; one minute: 77%; five minutes: 78%
 PID Runtime(ms)   Invoked      uSecs   5Sec   1Min   5Min TTY Process
 252    70001864  72821359        961 37.88% 34.97% 33.83%   0 Earl NDE Task
  12    13535512   7478495       1809  9.67%  7.18%  5.88%   0 ARP Input
 354    13418508    655279      20477  7.27%  6.52%  6.47%   0 CEF: IPv4 proces
 275     5628156  14433722        389  2.79%  2.47%  2.46%   0 ADJ resolve proc
 273     5720936   8782567        651  0.31%  0.56%  2.06%   0 IP Input
The SP looks normal:
c6k-05#remote command switch sh proc cpu sorted
CPU utilization for five seconds: 24%/0%; one minute: 24%; five minutes: 25%
 PID Runtime(ms)   Invoked      uSecs   5Sec   1Min   5Min TTY Process
 316     5711876    422724      13512  6.55%  3.36%  3.18%   0 Hardware API bac
 109    10629720  20123627        528  6.31%  6.87%  6.98%   0 slcp process
   3    10847468    988165      10977  5.27%  6.11%  6.13%   0 CEF: IPv4 proces
 253     3055016    108213      28231  1.67%  1.70%  1.70%   0 Vlan Statistics
c6k-05.nc#sh proc cpu history
[CPU history graph, flattened in the original post: per-second values over the last 60 seconds range roughly 66-99%, per-minute values over the last 60 minutes peak at 88-99% with averages around 80%, and per-hour values over the last 72 hours peak at 99% with averages in the 80-90% range.]
Our NetFlow statistics clearly consume a lot of the CPU time.
ip flow-cache timeout inactive 10
ip flow-cache timeout active 1
...
mls aging fast time 14
mls aging long 64
mls aging normal 45
mls netflow interface
mls netflow usage notify 90 21600
mls flow ip interface-full
mls flow ipv6 interface-full
mls nde sender version 5
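For context, this is roughly how we check how loaded the hardware NetFlow table and the NDE export are (standard show commands on this platform as far as I know; please correct me if the syntax differs on SXI):

show mls nde                                 <- NDE export status and exported flow/packet counters
show mls netflow ip count                    <- number of IPv4 entries currently in the NetFlow table
show mls netflow table-contention detailed   <- hash/overflow pressure on the NetFlow table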
Anything else to look for? This box has a lot of L3 VLANs and many, many /24 subnets as secondary addresses on the SVIs. QoS and CoPP are enabled and heavily used. The box receives one full table (~508k prefixes), one partial table (~130k), and two iBGP feeds with ~1k prefixes.
c6k-05#sh int te1/1 | i 5 minute
  5 minute input rate 917021000 bits/sec, 576857 packets/sec
  5 minute output rate 1021949000 bits/sec, 376221 packets/sec
c6k-05.nc#sh int te1/3 | i 5 minute
  5 minute input rate 1266207000 bits/sec, 450679 packets/sec
  5 minute output rate 2140145000 bits/sec, 1342052 packets/sec
c6k-05#sh ver | i IOS
Cisco IOS Software, s72033_rp Software (s72033_rp-ADVENTERPRISEK9_WAN-M), Version 12.2(33)SXI14, RELEASE SOFTWARE (fc2)
I don't know whether this is CPU related (the message appears every 2-3 minutes in the log):
Nov 25 15:51:45.558 CET: %SYS-3-CPUHOG: Task is running for (204)msecs, more than (200)msecs (14/7),process = BGP Scheduler. -Traceback= 4035A898 4036D818 4036D9AC 4166696C 4036F228 4036F310 4166BB70 4166BB5C
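If I read that right, the 200 ms threshold in the CPUHOG message is not a platform default but comes from scheduler max-task-time being set in our config, i.e. something like the line below (an assumption on my side, I have not double-checked):

scheduler max-task-time 200   ! assumed to be configured; makes IOS log CPUHOG for any process holding the CPU longer than 200 ms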
Any suggestions? Thanks in advance.
Thomas
11-26-2014 01:52 AM
Hey Thomas,
NetFlow is definitely taking its toll on the switch, as seen in the show proc cpu output:
252 70001864 72821359 961 37.88% 34.97% 33.83% 0 Earl NDE Task
I suggest opening a TAC case for deeper investigation, as the switch is also logging CPUHOG tracebacks related to the BGP process.
HTH.
Regards,
RS.
11-26-2014 02:03 AM
Hi Rajeev,
Thanks for your suggestion. Unfortunately we have no TAC access :-(
Thanks,
Thomas
11-26-2014 04:25 AM
Hey Thomas,
Do you see any BGP flaps?
Regards,
RS.
11-26-2014 04:51 AM
Hi Rajeev,
No, all BGP sessions are up and running. But three days ago we had an unexpected (spontaneous) reboot. The crash_info file was not informative about the cause of the reboot.
Kind regards,
Thomas
11-26-2014 10:30 AM
Hey Thomas,
Try removing NetFlow, if possible, and check whether the situation improves.
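If you just want to take NDE out of the picture for a test, something along these lines should do it (commands from memory, please verify them on your 12.2(33)SXI release before applying):

no mls nde sender
no mls flow ip
no mls flow ipv6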
Regards,
RS.
11-26-2014 11:28 AM
Hi Rajeev,
We have temporarily disabled NetFlow after we ran into a flapping BGP session problem an hour ago. The router starved the BGP-related processes because there was no free CPU time.
After disabling NetFlow, the RP load dropped dramatically from around 80% to ~30%. But we can't disable NetFlow permanently because we need the flows as input for our DDoS protection.
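One middle ground we are looking at is sampled NetFlow, so the DFC still exports flows for the DDoS analysis but the Earl NDE Task only has to handle a fraction of the records. Roughly like this (1:64 sampling as an example; rate and exact syntax still to be verified for our SXI release):

mls sampling packet-based 64
!
interface TenGigabitEthernet1/1
 mls netflow sampling
!
interface TenGigabitEthernet1/3
 mls netflow sampling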
We have now increased the CPU resources available to processes (as I understand it, scheduler allocate takes the maximum interrupt-context time and the minimum process-level time, both in microseconds, so this should give processes like BGP more room):
no scheduler max-task-time
scheduler allocate 10000 4000
CPU load is (very) high again:
c6k-05.nc#sh proc cpu sorted
CPU utilization for five seconds: 90%/21%; one minute: 85%; five minutes: 86%
 PID Runtime(ms)   Invoked      uSecs   5Sec   1Min   5Min TTY Process
 252   107378140 111133354        966 23.43% 25.49% 23.13%   0 Earl NDE Task
 563     7056500   8738616        807 17.75%  2.72%  2.12%   0 SNMP ENGINE
 354    21164144    876799      24138  5.75%  6.37%  6.35%   0 CEF: IPv4 proces
  12    22983692  11691250       1965  5.43%  6.08%  5.60%   0 ARP Input
 273    10269700  12841900        799  4.95%  9.15% 11.49%   0 IP Input
 329      889568     11066      80387  3.35%  0.41%  0.27%   0 IP Background
 275     8864220  22191938        399  2.63%  2.48%  2.50%   0 ADJ resolve proc
 342     3048616    225163      13539  1.51%  0.73%  0.72%   0 IPC LC Message H
 514     3188912    929923       3429  1.27%  0.84%  0.89%   0 BGP Router
  52      319360      7229      44177  1.19%  0.14%  0.07%   0 Per-minute Jobs
 561      634184   4164207        152  0.63%  0.18%  0.16%   0 IP SNMP
It is all still unsatisfactory :-(
Kind regards,
Thomas
11-29-2014 07:39 PM
Hey Thomas,
I also see the SNMP process using ~17%. Is it possible to lower the SNMP polling on the device?
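If the pollers are walking the big tables (routes, ARP), another option is to point the community at an SNMP view that excludes those subtrees, roughly like this (view and community names are just placeholders, and which objects you can exclude depends on what your NMS actually needs):

snmp-server view CUTDOWN iso included
snmp-server view CUTDOWN ipRouteTable excluded
snmp-server view CUTDOWN ipNetToMediaTable excluded
snmp-server view CUTDOWN at excluded
snmp-server community <your-community> view CUTDOWN RO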
Regards,
RS.