11-25-2014 06:59 AM - edited 03-05-2019 12:14 AM
Hi,
I know this has been discussed many times before, but maybe I can get some fresh inspiration to solve a problem here.
We have some c6509s with SUP720-3BXL. One of the routers has a WS-X6704-10G module with DFC; this is the only card installed besides the SUP720. We've run into some heavy CPU problems on the RP:
c6k-05#sh proc cpu sorted
CPU utilization for five seconds: 71%/12%; one minute: 77%; five minutes: 78%
 PID Runtime(ms)   Invoked      uSecs   5Sec   1Min   5Min TTY Process
 252    70001864  72821359        961 37.88% 34.97% 33.83%   0 Earl NDE Task
  12    13535512   7478495       1809  9.67%  7.18%  5.88%   0 ARP Input
 354    13418508    655279      20477  7.27%  6.52%  6.47%   0 CEF: IPv4 proces
 275     5628156  14433722        389  2.79%  2.47%  2.46%   0 ADJ resolve proc
 273     5720936   8782567        651  0.31%  0.56%  2.06%   0 IP Input
The SP looks normal:
c6k-05#remote command switch sh proc cpu sorted
CPU utilization for five seconds: 24%/0%; one minute: 24%; five minutes: 25%
 PID Runtime(ms)   Invoked      uSecs   5Sec   1Min   5Min TTY Process
 316     5711876    422724      13512  6.55%  3.36%  3.18%   0 Hardware API bac
 109    10629720  20123627        528  6.31%  6.87%  6.98%   0 slcp process
   3    10847468    988165      10977  5.27%  6.11%  6.13%   0 CEF: IPv4 proces
 253     3055016    108213      28231  1.67%  1.70%  1.70%   0 Vlan Statistics
c6k-05.nc#sh proc cpu history
[CPU history graph, flattened in the original post: per-second values over the last 60 seconds range roughly 66-99%, per-minute values over the last 60 minutes peak at 88-99% with averages around 80%, and per-hour values over the last 72 hours peak at 99% with averages in the 80-90% range.]
Our NetFlow statistics clearly consume a lot of the CPU time.
ip flow-cache timeout inactive 10
ip flow-cache timeout active 1
...
mls aging fast time 14
mls aging long 64
mls aging normal 45
mls netflow interface
mls netflow usage notify 90 21600
mls flow ip interface-full
mls flow ipv6 interface-full
mls nde sender version 5
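For context, this is roughly how we check how loaded the hardware NetFlow table and the NDE export are (standard show commands on this platform as far as I know; please correct me if the syntax differs on SXI):

show mls nde                                 <- NDE export status and exported flow/packet counters
show mls netflow ip count                    <- number of IPv4 entries currently in the NetFlow table
show mls netflow table-contention detailed   <- hash/overflow pressure on the NetFlow table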
Anything else to look for? This box has a lot of L3 VLANs and many, many /24 subnets as secondary addresses on the SVIs. QoS and CoPP are enabled and heavily used. The box receives one full table (~508k prefixes), one partial table (~130k), and two iBGP feeds with ~1k prefixes.
c6k-05#sh int te1/1 | i 5 minute
  5 minute input rate 917021000 bits/sec, 576857 packets/sec
  5 minute output rate 1021949000 bits/sec, 376221 packets/sec
c6k-05.nc#sh int te1/3 | i 5 minute
  5 minute input rate 1266207000 bits/sec, 450679 packets/sec
  5 minute output rate 2140145000 bits/sec, 1342052 packets/sec
c6k-05#sh ver | i IOS
Cisco IOS Software, s72033_rp Software (s72033_rp-ADVENTERPRISEK9_WAN-M), Version 12.2(33)SXI14, RELEASE SOFTWARE (fc2)
I don't know whether this is CPU related (the message appears every 2-3 minutes in the log):
Nov 25 15:51:45.558 CET: %SYS-3-CPUHOG: Task is running for (204)msecs, more than (200)msecs (14/7),process = BGP Scheduler. -Traceback= 4035A898 4036D818 4036D9AC 4166696C 4036F228 4036F310 4166BB70 4166BB5C
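If I read that right, the 200 ms threshold in the CPUHOG message is not a platform default but comes from scheduler max-task-time being set in our config, i.e. something like the line below (an assumption on my side, I have not double-checked):

scheduler max-task-time 200   ! assumed to be configured; makes IOS log CPUHOG for any process holding the CPU longer than 200 ms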
Any suggestions? Thanks in advance.
Thomas
11-26-2014 01:52 AM
Hey Thomas,
NetFlow is definitely taking its toll on the switch, as seen in the show proc cpu output:
252 70001864 72821359 961 37.88% 34.97% 33.83% 0 Earl NDE Task
I suggest opening a TAC case for deeper investigation, as the switch is also logging CPUHOG tracebacks related to the BGP process.
HTH.
Regards,
RS.
11-26-2014 02:03 AM
Hi Rajeev,
Thanks for your suggestion. Unfortunately we have no TAC access :-(
Thanks,
Thomas
11-26-2014 04:25 AM
Hey Thomas,
Do you see any BGP flaps?
Regards,
RS.
11-26-2014 04:51 AM
Hi Rajeev,
No, all BGP sessions are up and running. But three days ago we had an unexpected (spontaneous) reboot. The crash_info file was not informative about the cause of the reboot.
Kind regards,
Thomas
11-26-2014 10:30 AM
Hey Thomas,
Try removing NetFlow, if possible, and check whether the situation improves.
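If you just want to take NDE out of the picture for a test, something along these lines should do it (commands from memory, please verify them on your 12.2(33)SXI release before applying):

no mls nde sender
no mls flow ip
no mls flow ipv6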
Regards,
RS.
11-26-2014 11:28 AM
Hi Rajeev,
We have temporarily disabled NetFlow after we ran into a flapping BGP session problem an hour ago. The router starved the BGP-related processes because there was no free CPU time.
After disabling NetFlow, the RP load dropped dramatically from around 80% to ~30%. But we can't disable NetFlow permanently because we need the flows as input for our DDoS protection.
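One middle ground we are looking at is sampled NetFlow, so the DFC still exports flows for the DDoS analysis but the Earl NDE Task only has to handle a fraction of the records. Roughly like this (1:64 sampling as an example; rate and exact syntax still to be verified for our SXI release):

mls sampling packet-based 64
!
interface TenGigabitEthernet1/1
 mls netflow sampling
!
interface TenGigabitEthernet1/3
 mls netflow sampling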
We have now increased the CPU resources available to processes (as I understand it, scheduler allocate takes the maximum interrupt-context time and the minimum process-level time, both in microseconds, so this should give processes like BGP more room):
no scheduler max-task-time
scheduler allocate 10000 4000
CPU load is (very) high again:
c6k-05.nc#sh proc cpu sorted
CPU utilization for five seconds: 90%/21%; one minute: 85%; five minutes: 86%
 PID Runtime(ms)   Invoked      uSecs   5Sec   1Min   5Min TTY Process
 252   107378140 111133354        966 23.43% 25.49% 23.13%   0 Earl NDE Task
 563     7056500   8738616        807 17.75%  2.72%  2.12%   0 SNMP ENGINE
 354    21164144    876799      24138  5.75%  6.37%  6.35%   0 CEF: IPv4 proces
  12    22983692  11691250       1965  5.43%  6.08%  5.60%   0 ARP Input
 273    10269700  12841900        799  4.95%  9.15% 11.49%   0 IP Input
 329      889568     11066      80387  3.35%  0.41%  0.27%   0 IP Background
 275     8864220  22191938        399  2.63%  2.48%  2.50%   0 ADJ resolve proc
 342     3048616    225163      13539  1.51%  0.73%  0.72%   0 IPC LC Message H
 514     3188912    929923       3429  1.27%  0.84%  0.89%   0 BGP Router
  52      319360      7229      44177  1.19%  0.14%  0.07%   0 Per-minute Jobs
 561      634184   4164207        152  0.63%  0.18%  0.16%   0 IP SNMP
It is all still unsatisfactory :-(
Kind regards,
Thomas
11-29-2014 07:39 PM
Hey Thomas,
I also see the SNMP process using ~17%. Is it possible to lower the SNMP polling on the device?
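If the pollers are walking the big tables (routes, ARP), another option is to point the community at an SNMP view that excludes those subtrees, roughly like this (view and community names are just placeholders, and which objects you can exclude depends on what your NMS actually needs):

snmp-server view CUTDOWN iso included
snmp-server view CUTDOWN ipRouteTable excluded
snmp-server view CUTDOWN ipNetToMediaTable excluded
snmp-server view CUTDOWN at excluded
snmp-server community <your-community> view CUTDOWN RO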
Regards,
RS.