High CPU event detection methods on Cisco routers and switches

krunal_shah · ‎11-12-2010

This document describes four ways to detect high CPU spikes on Cisco routers and switches and to act on them using EEM (Embedded Event Manager). Later sections list some commands that can help in troubleshooting high CPU. This document is a draft version; I will try to add to it and edit it on a weekly basis to improve its quality.

High CPU detection using CLI:

This is the traditional method of checking CPU utilization. It is not useful when the CPU spikes are seen on the management station in the middle of the night, when nobody is at the CLI to run the command.

Router#sh processes cpu sorted ?

1min Sort based on 1 minute utilization
5min Sort based on 5 minutes utilization
5sec Sort based on 5 seconds utilization
| Output modifiers
<cr>

Router#sh processes cpu sorted 5min
CPU utilization for five seconds: 2%/0%; one minute: 2%; five minutes: 1%
PID Runtime(ms)     Invoked      uSecs   5Sec   1Min   5Min TTY Process
179         168    28275159          0 0.79% 0.76% 0.77%   0 HQF Shaper Backg
307         160     3529216          0 0.23% 0.18% 0.16%   0 PPP manager
   5      141912       13430      10566 0.00% 0.10% 0.11%   0 Check heaps
308         248     3529216          0 0.15% 0.09% 0.08%   0 PPP Events
122       26424        5141       5139 0.39% 0.24% 0.06%   0 Exec
   2          36       22595          1 0.07% 0.04% 0.05%   0 Load Meter
180          44     1129318          0 0.00% 0.03% 0.02%   0 RBSCP Background
312          16     1132266          0 0.00% 0.03% 0.02%   0 FR Broadcast Out
42          64      113310          0 0.07% 0.02% 0.00%   0 Per-Second Jobs
65          28      451871          0 0.00% 0.01% 0.00%   0 Netclock Backgro
273        1508       37676         40 0.00% 0.01% 0.00%   0 OSPF-1 Hello


Total CPU utilization is made up of process and interrupt percentages. These values are found on the first line of the output:
 CPU utilization for five seconds: x%/y%; one minute: a%; five minutes: b%
  Total CPU Utilization: x%
  Process Utilization: (x - y)%
  Interrupt Utilization: y%
Process utilization is the difference between the total and interrupt values (x - y).
The one-minute and five-minute utilizations are exponentially decayed averages (rather than arithmetic averages), so recent values have more influence on the calculated average.
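The arithmetic above can be sketched off-box in a few lines of Python. This is only an illustration: parse_cpu_header and decayed_average are hypothetical helper names, and the smoothing factor alpha is an assumed value chosen for the example, not the constant IOS actually uses.

```python
import re

def parse_cpu_header(line):
    """Split the first line of 'show processes cpu' into its components."""
    m = re.search(
        r"five seconds: (\d+)%/(\d+)%; one minute: (\d+)%; five minutes: (\d+)%",
        line,
    )
    if m is None:
        raise ValueError("not a CPU utilization header line")
    total, interrupt, one_min, five_min = (int(g) for g in m.groups())
    return {
        "total_5sec": total,                  # x
        "interrupt_5sec": interrupt,          # y
        "process_5sec": total - interrupt,    # x - y
        "one_min": one_min,                   # a
        "five_min": five_min,                 # b
    }

def decayed_average(samples, alpha=0.2):
    """Exponentially decayed average; alpha is an illustrative value only."""
    avg = samples[0]
    for sample in samples[1:]:
        # Each new sample pulls the average toward itself by a fraction alpha,
        # so older samples fade out gradually instead of counting equally.
        avg += alpha * (sample - avg)
    return avg

header = "CPU utilization for five seconds: 2%/0%; one minute: 2%; five minutes: 1%"
print(parse_cpu_header(header))

# The same four samples in a different order: a recent spike weighs more.
print(decayed_average([0, 0, 0, 100], alpha=0.5))
print(decayed_average([100, 0, 0, 0], alpha=0.5))
```

Both sample lists have the same arithmetic mean (25), but the decayed average is much higher when the spike is the most recent sample.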

High CPU detection using Embedded Resource Manager (ERM):

/**************************************************************************************/
resource policy
policy HIGHCPU global
   system
    cpu interrupt
     critical rising 90 interval 10
     major rising 70 interval 10
     minor rising 40 interval 10
    !
    cpu process
     critical rising 80 interval 10
     major rising 60 interval 10
     minor rising 40 interval 10
    !

event manager applet HIGHCPU-ERM
event resource policy "HIGHCPU"
action 1.0 cli command "enable"
action 2.0 cli command "show proc cpu sorted 5min"
action 3.0 mail server "198.2.5.10" to "tac@cisco.com" from "NOC@mycompany.com" subject "CPU Alert 5min" body "$_cli_result"

/************************************************************************************/

You can set rising and falling values for the critical, major, and minor threshold levels. When the resource utilization exceeds the rising threshold level, an Up notification is sent. When the resource utilization falls below the falling threshold level, a Down notification is sent. This method is more granular because CPU usage by interrupts and CPU usage by processes can be monitored separately.
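The Up/Down behaviour of a rising and falling threshold pair can be illustrated with a short Python sketch. It is a simplification made for illustration: threshold_events is a hypothetical name, the sample values are made up, and the sketch ignores the interval argument that ERM also takes.

```python
def threshold_events(samples, rising, falling):
    """Return (index, 'Up' or 'Down') notifications for a threshold pair."""
    above = False
    events = []
    for i, value in enumerate(samples):
        if not above and value > rising:
            # Utilization exceeded the rising threshold: send Up once.
            above = True
            events.append((i, "Up"))
        elif above and value < falling:
            # Utilization fell below the falling threshold: send Down once.
            above = False
            events.append((i, "Down"))
    return events

# Made-up CPU samples in percent, checked against rising 80 / falling 40.
cpu = [30, 85, 90, 70, 35, 82]
print(threshold_events(cpu, rising=80, falling=40))
```

Note that the sample of 70 produces no notification: once the rising threshold has been crossed, nothing is sent until the value drops below the falling threshold, which keeps a value hovering around one threshold from generating a stream of alerts.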

The EEM applet sends an email to tac@cisco.com with the output of show proc cpu sorted 5min.

This action can be triggered only with Embedded Event Manager 2.2 or later.

Available in the following IOS releases or later:

12.4(2)T
12.2(31)SB3
12.2(33)SRB

High CPU detection using RMON:

/***********************************************************************************/

rmon event 1 log description "CPU has crossed rising threshold"
rmon alarm 12 cpmCPUTotalTable.1.8.1 60 absolute rising-threshold 80 1 falling-threshold 40

!!! polling interval is 60 seconds, rising threshold is 80 percent CPU utilization, falling threshold is 40 percent
!!! %RMON-5-RISINGTRAP: Rising trap is generated because the value of cpmCPUTotalTable.1.8.1 has exceeded the rising-threshold value 80

event manager applet HIGHCPU-RMON
event syslog pattern ".*%RMON-5-RISINGTRAP.*"

action 1.0 echo " Do what ever you want about it"

/**********************************************************************************/

This approach is useful only when the switch has a single RMON event configured, because the applet uses the syslog event detector to match the RMON syslog pattern and so fires on any RMON rising trap. For platform-specific RMON support, check the following URL.

http://www.cisco.com/en/US/docs/ios/netmgmt/configuration/guide/netmgmt_rmon_supp_roadmap.html

High CPU detection using SNMP:

!
snmp-server enable traps cpu threshold
snmp-server host 192.168.2.1 traps version 2c public cpu
process cpu threshold type total rising 80 interval 60 falling 40 interval 60
process cpu statistics limit entry-percentage 70 size 300
!

The above configuration detects high CPU usage in a way similar to the RMON method: it sends a trap to the management station. When configuring the threshold type, you can also make it granular at the process and interrupt level. For more information about the command, refer to the 12.4T configuration guide.

http://www.cisco.com/en/US/docs/ios/netmgmt/configuration/guide/nm_cpu_thresh_notif_ps6441_TSD_Products_Configuration_Guide_Chapter.html

You can also use the SNMP event type to configure an EEM applet. The following applet stores the output of relevant show commands in flash:high_cpu.txt. This applet was written on ISR routers; on 65XX, 45XX and 76XX platforms, use bootflash:high_cpu.txt when redirecting the output to a file. The applet removes itself from the configuration after completion.

It requires SNMP to be enabled and EEM v2.1. The event statement has to be used with care, because sudden spikes in CPU usage might sometimes cause the actions not to run. Choose the poll interval carefully: the more commands you add to the actions, the longer they take to run, and if that duration exceeds the poll interval, the event will be detected again and the applet will overwrite the high_cpu.txt file.
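As a rough planning aid, the poll-interval check described above can be done on paper or with a few lines of Python. The per-command run times below are hypothetical placeholders, not measurements; time the commands on your own device before relying on the result. check_poll_interval is a hypothetical helper name.

```python
def check_poll_interval(command_seconds, poll_interval):
    """Return (total runtime, True if the actions finish before the next poll)."""
    total = sum(command_seconds.values())
    return total, total < poll_interval

# Hypothetical run times in seconds for the commands the applet captures.
estimated = {
    "show process cpu sorted 5min": 2,
    "show buffer": 3,
    "show ip traffic": 4,
    "show cef not": 8,
}

total, fits = check_poll_interval(estimated, poll_interval=15)
print(f"actions take about {total}s; fits in poll interval: {fits}")
```

If the total does not fit, either raise the poll interval or capture fewer commands.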

event manager scheduler script thread class default number 1

event manager applet High_CPU_SNMP

! Cisco process MIB Object name: cpmCPUTotal1min
! event snmp oid 1.3.6.1.4.1.9.9.109.1.1.1.1.4 get-type next entry-op gt entry-val 80 poll-interval 15
! Cisco process MIB Object name: cpmCPUTotal5minRev
event snmp oid 1.3.6.1.4.1.9.9.109.1.1.1.1.8 get-type next entry-op gt entry-val 80 poll-interval 15

action 0.0 syslog msg "High CPU DETECTED. Please wait - logging Information to flash:high_cpu.txt"

action 0.1 cli command "enable"

action 0.2 cli command "term exec prompt timestamp"

action 0.3 cli command "term len 0"

! redirects the command output to flash:/bootflash:/disk0: etc

action 1.2 cli command "show process cpu sorted 5min | redirect flash:high_cpu.txt"
! action 1.2 cli command "show process cpu sorted 1min | redirect flash:high_cpu.txt"
! later commands use "append" so that they do not overwrite the output already saved
! (the target file system must support append)
action 1.3 cli command "show buffer input-interface GigabitEthernet0/1 dump | append flash:high_cpu.txt"
action 1.4 cli command "show cef not | append flash:high_cpu.txt"
action 1.5 cli command "show buffer | append flash:high_cpu.txt"
action 1.6 cli command "show ip traffic | append flash:high_cpu.txt"

! Here you can add any command you want to capture

! In the following section the applet is removed from the configuration to avoid repeated execution of

! the actions in case you have many spikes in a short period

action 5.1 syslog msg "Finished logging information to flash:high_cpu.txt..."
action 5.2 syslog msg "Self-removing applet from configuration..."

action 9.1 cli command "configure terminal"
action 9.2 cli command "no event manager applet High_CPU_SNMP"
action 9.3 cli command "end"

! End of script

Packet capture using built-in tools:

On 6500 platforms with a Sup 720 (PFC and MSFC) running IOS 12.2(33)SXH or SXI, you can run net driver (netdr) captures, which capture the packets that hit the CPU for processing instead of being hardware switched.

Switch#debug netdr capture rx

Switch#show netdr captured-packets

On 4500 platforms you can capture CPU-bound packets using the following commands:

Switch#debug platform packet all receive buffer 
platform packet debugging is on 
Switch#show platform cpu packet buffered

CPU profiling for ISR and 7200 routers:

CPU profiling is a low-overhead way of determining where the CPU spends its time. The system works by sampling the processor location every four milliseconds and incrementing the count for that location in memory. The resulting counts help determine the root cause of the CPU utilization.
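The sampling idea can be illustrated with a small Python sketch. It is a conceptual model only, not how IOS implements profiling: the sampled addresses are made up, and profile_counts, start, and granularity are hypothetical names used here for the address range and bin size.

```python
from collections import Counter

def profile_counts(samples, start, granularity):
    """Count sampled processor locations per address bin."""
    counts = Counter()
    for pc in samples:
        # Round the sampled address down to the start of its bin.
        bin_start = start + ((pc - start) // granularity) * granularity
        counts[bin_start] += 1
    return counts

# Made-up processor locations, one per sampling tick.
sampled = [0x1000, 0x1004, 0x1010, 0x1014, 0x1018]
counts = profile_counts(sampled, start=0x1000, granularity=0x10)

# The bin with the most hits is where the CPU spent most of its time.
for address, hits in counts.most_common():
    print(f"{address:#x}: {hits} samples")
```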

Router#profile task ?
<0-4294967295> list of specific process ids (pids) <---look up the process ID in the first column of the "show proc cpu sorted" output
all profile all processes
interrupt includes interrupt related data in profile
<cr>

Router# show profile terse

CPU profiling can also be useful for finding a bug related to a process. For more information about CPU profiling for interrupts, check reference #2.

Additional references:

1. Troubleshooting High CPU Utilization on Cisco Routers
   http://www.cisco.com/en/US/products/hw/routers/ps133/products_tech_note09186a00800a70f2.shtml
2. Troubleshooting High CPU Utilization Due to Interrupts
   http://www.cisco.com/en/US/products/hw/routers/ps359/products_tech_note09186a00801c2af0.shtml
3. Troubleshooting High CPU Utilization due to Processes
   http://www.cisco.com/en/US/products/hw/routers/ps359/products_tech_note09186a00801c2af6.shtml
4. High CPU Utilization on Cisco IOS Software-Based Catalyst 4500 Switches
   http://www.cisco.com/en/US/products/hw/switches/ps663/products_tech_note09186a00804cef15.shtml
5. Catalyst 6500/6000 Switch High CPU Utilization
   http://www.cisco.com/en/US/products/hw/switches/ps708/products_tech_note09186a00804916e0.shtml
6. Catalyst 3750 Series Switches High CPU Utilization Troubleshooting
   http://www.cisco.com/en/US/products/hw/switches/ps5023/products_tech_note09186a00807213f5.shtml
7. High CPU Utilization on Catalyst 2900XL/3500XL Switches
   http://www.cisco.com/en/US/products/hw/switches/ps607/products_tech_note09186a0080094e78.shtml
8. Troubleshooting High CPU Utilization in IP Input Process
   http://www.cisco.com/en/US/products/hw/routers/ps359/products_tech_note09186a00801c2af3.shtml