02-25-2013 12:51 AM
Hi all,
I have a client that experiences high CPU usage on a pair of C4510Rs at random times. The issue causes their HSRP to flap which in turn causes network outages. It only seems to occur for around 10 to 15 minutes and by the time I am called to look into the problem it has passed. I suspect it is some type of traffic that is being CPU punted and is overloading the CPU and I would like to collect some CPU stats and CPU packet captures while the problem is occuring to further troubleshoot.
I used a previous C4510R EEM script (https://supportforums.cisco.com/thread/2065177) as a template and created the following script:
event manager applet cpu_stats
event snmp oid 1.3.6.1.4.1.9.9.109.1.1.1.1.3.1 get-type exact entry-op ge entry-val "70" exit-op lt exit-val "70" poll-interval 5 maxrun 300
action 1.01 syslog msg "------HIGH CPU DETECTED----, CPU: $_snmp_oid_val %"
action 1.02 cli command "enable"
action 1.03 cli command "show clock | append bootflash:cpu_stats"
action 1.04 cli command "show proc cpu sort | append bootflash:cpu_stats"
action 1.05 cli command "show platform health | append bootflash:cpu_stats"
action 1.06 cli command "debug platform packet all count"
action 1.07 cli command "show platform cpu packet statistics | append bootflash:cpu_stats"
action 1.08 cli command "show proc cpu history | append bootflash:cpu_stats"
action 1.09 cli command "debug platform packet all receive buffer"
action 1.10 cli command "show platform cpu packet buffered | append bootflash:cpu_stats"
action 1.11 cli command "show proc cpu sort | append bootflash:cpu_stats"
action 1.12 cli command "show platform cpu packet statistics | append bootflash:cpu_stats"
action 1.13 cli command "Show proc cpu history | append bootflash:cpu_stats"
action 1.14 cli command "show platform cpu packet buffered | append bootflash:cpu_stats"
action 1.15 cli command "show log | append bootflash:cpu_stats"
action 1.16 cli command "undebug all"
!
end
However when looking at the CPU history after the problem occured again it appears that the SUP7E has one physical CPU that has two cores and one or two of the cores hits 100% when the issue occurs while the system CPU stays low as can be seen below:
C4510#show proc cpu hist
History information for system:
...
output omitted
...
111111111115511111111111111111111111111111111111111111111111111111111111
600105000112231001211131111017221011110001112010212016101010121012112112
100
90
80
70
60
50 **
40 **
30 #*
20 * * ## * *
10 ########################################################################
0....5....1....1....2....2....3....3....4....4....5....5....6....6....7.
0 5 0 5 0 5 0 5 0 5 0 5 0
CPU% per hour (last 72 hours)
* = maximum CPU% # = average CPU%
History information for core 0:
...
output omitted
...
991
888787888789968888787877787787888788787888879898888898888788
100 **
90 **
80 **
70 **
60 **
50 **
40 **
30 *#
20 ##*
10 ***********##***********************************************************
0....5....1....1....2....2....3....3....4....4....5....5....6....6....7.
0 5 0 5 0 5 0 5 0 5 0 5 0
CPU% per hour (last 72 hours)
* = maximum CPU% # = average CPU%
History information for core 1:
...
output omitted
...
211112111119921111111111111113111111111111111111111112111111
855554544659906476955796556651995456554446767454847468565565
100 **
90 **
80 **
70 **
60 **
50 **
40 **
30 * #* * *
20 ******* **##** ***************** **** **** * * * ********************
10 ########################################################################
0....5....1....1....2....2....3....3....4....4....5....5....6....6....7.
0 5 0 5 0 5 0 5 0 5 0 5 0
CPU% per hour (last 72 hours)
* = maximum CPU% # = average CPU%
So I walked through the SNMP OIDs for the two cores without luck (also came across this article
https://supportforums.cisco.com/thread/2179888 to create SNMP OIDs for the cores by parsing the output of "show proc cpu" with a cron job). An SNMP walk of the CPU MIBs comes back with the following:
.1.3.6.1.4.1.9.9.109.1.1.1.1.3 - cpmCPUTotal5sec.5000
.1.3.6.1.4.1.9.9.109.1.1.1.1.4 - cpmCPUTotal1min.5000
.1.3.6.1.4.1.9.9.109.1.1.1.1.5 - cpmCPUTotal5min.5000
.1.3.6.1.4.1.9.9.109.1.1.1.1.6 - cpmCPUTotal5secRev.5000
.1.3.6.1.4.1.9.9.109.1.1.1.1.7 - cpmCPUTotal1minRev.5000
.1.3.6.1.4.1.9.9.109.1.1.1.1.8 - cpmCPUTotal5minRev.5000
Each of which seems to correspond with the system CPU. So in order to continue without creating an overly complex EEM scripts that parse the show proc cpu output and pass it to another EEM script that checks the value I decided to try a lower entry and exit value of 40%. The next time the problem occured the system CPU went over 40% but did not seem to trigger the script. What is important to note is the first time I did a full SNMP walk of the OID I was able to get the CPU over 50% and the script triggered.
So I moved down to 20% for the trigger value, tried with different SNMP OIDs like .1.3.6.1.4.1.9.9.109.1.1.1.1.6.5000 and doing an SNMP walk of the Cisco MIB without success in getting the script triggered (see attached output). Is there a problem with the script or the SNMP MIBs I am using?
Thanks in advance!
David.
02-25-2013 11:58 AM
Your attachment apparently contained a virus and was stripped. However, you may want to read
https://supportforums.cisco.com/thread/2179888?tstart=0 . We recently had a discussion about how to monitor these multiple CPUs. The script in that thread could be adapted to do what you want. Yes, it's complex, but it should be able to extract all of the CPU details from the multiple cores.
EDIT: Whoops, missed you looked at that thread. I still think that may be the recourse for you, but I haven't been able to see your attachment.
02-25-2013 12:14 PM
Thats funny was just a plain text file Ì attached. I have re-attached the file, hopefully it works now.
So what you are suggesting is to create a first script to parse the output of show proc cpu every say 30 seconds to a minute, place the output into user defined OIDs and then get a second script to trigger on these OIDs?
Or is there a way I can get the script to trigger based the output of "show proc cpu"?
Thanks for your help.
02-25-2013 01:03 PM
The attachment came through now. You would need to use .5000 instead of .1 on your OID first of all. If that works for you, then you should be set. If not, then yes, you could adapt the previous Tcl script to collect the required data instead of creating the OIDs.
03-01-2013 06:36 AM
Hi Joseph
I have written the following set of scripts to do the job (good to just post them here if anyone else is having the same problem!), I have tested individual components and they all seem to function as expected, so just waiting now for the high cpu event to occur again:
The first script is just to set the prev_cpu_state to low so the main script can detect the first transition to high and does not spit log messages out every minute because the context variable is not initialised:
event manager applet set_cpu_state
event none
action 1.01 handle-error type ignore
action 1.02 context retrieve key CPU_STATE variable prev_cpu_state
action 1.03 handle-error type exit
action 1.04 set prev_cpu_state "low"
action 1.05 context save key CPU_STATE variable prev_cpu_state
end
The second and main script checks the CPU level of both cores over 5 seconds and takes the highest of the two, if this is over 50% and the cpu state was previously low then it will trigger the third script.
event manager applet dualcore
event timer cron cron-entry "* * * * 0-6"
action 1.00 cli command "en"
action 1.01 cli command "show process cpu | include five"
action 1.02 regexp "Core 0: CPU utilization for five seconds: ([0-9]+)%" $_cli_result result c0cpu5sec
action 1.03 if $_regexp_result eq 1
action 1.04 end
action 1.05 regexp "Core 1: CPU utilization for five seconds: ([0-9]+)%" $_cli_result result c1cpu5sec
action 1.06 if $_regexp_result eq 1
action 1.07 end
action 1.08 handle-error type ignore
action 1.09 context retrieve key CPU_STATE variable prev_cpu_state
action 1.11 handle-error type exit
action 1.12 if $c0cpu5sec ge $c1cpu5sec
action 1.13 set highest_core $c0cpu5sec
action 1.14 else
action 1.15 set highest_core $c1cpu5sec
action 1.16 end
action 1.17 if $highest_core ge 50
action 1.18 if $prev_cpu_state eq low
action 1.19 set prev_cpu_state "high"
action 1.20 syslog msg "----HIGH CPU DETECTED----, CPU: $highest_core %"
action 1.21 cli command "event manager run collect_stats"
action 1.22 end
action 1.23 else
action 1.24 if $prev_cpu_state eq high
action 1.25 set prev_cpu_state "low"
action 1.26 syslog msg "----CPU BACK TO NORMAL----, CPU: $highest_core %"
action 1.27 end
action 1.28 end
action 1.29 context save key CPU_STATE variable prev_cpu_state
The third script just runs all of the require show commands for debugging the issue and appends the output to a file called cpu_stats on the bootflash:
event manager applet collect_stats
event none
action 1.02 cli command "enable"
action 1.03 cli command "show clock | append bootflash:cpu_stats"
action 1.04 cli command "show proc cpu sort | append bootflash:cpu_stats"
action 1.05 cli command "show platform health | append bootflash:cpu_stats"
action 1.06 cli command "debug platform packet all count"
action 1.07 cli command "show platform cpu packet statistics | append bootflash:cpu_stats"
action 1.08 cli command "show proc cpu history | append bootflash:cpu_stats"
action 1.09 cli command "debug platform packet all receive buffer"
action 1.10 cli command "show platform cpu packet buffered | append bootflash:cpu_stats"
action 1.11 cli command "show proc cpu sort | append bootflash:cpu_stats"
action 1.12 cli command "show platform cpu packet statistics | append bootflash:cpu_stats"
action 1.13 cli command "Show proc cpu history | append bootflash:cpu_stats"
action 1.14 cli command "show platform cpu packet buffered | append bootflash:cpu_stats"
action 1.15 cli command "show log | append bootflash:cpu_stats"
action 1.16 cli command "undebug all"
!
end
Regards,
David.
04-29-2015 10:33 AM
Hi,
What whas de OID you used in the end?
Thanks in advance.
04-29-2015 01:44 PM
Hi Miguel,
There was no corresponding OID for the individual cores so I modified the script to run the show proc cpu command and parse the output for the 5 second load for each core as per my previous post.
Below is the last iteration of the scripts I made which helped us find the packets that were causing the high CPU. Note you will need to apply these scripts to your configuration, then run the "set_cpu_state" script to initialise the variables (event manager run set_cpu_state) before running the "dualcore" script (event manager run dualcore).
After the issue has occured you can take a look at the bootflash:cpu_stats file (more bootflash:cpu_stats) to see the output from the show commands when the high cpu event was occuring. Don't forget to stop the script and clean up the old cpu_stats file as it could potentially fill your bootflash if there are lots of high_cpu state changes and the script is left running over a long period.
event manager applet set_cpu_state
event none
action 1.01 handle-error type ignore
action 1.02 context retrieve key CPU_STATE variable prev_cpu_state
action 1.03 handle-error type exit
action 1.04 set prev_cpu_state "low"
action 1.05 context save key CPU_STATE variable prev_cpu_state
!
event manager applet dualcore
event timer cron cron-entry "* * * * 0-6" maxrun 60
action 1.00 cli command "en"
action 1.01 cli command "show process cpu | include five"
action 1.02 regexp "Core 0: CPU utilization for five seconds: ([0-9]+)%" "$_cli_result" result c0cpu5sec
action 1.03 if $_regexp_result eq "1"
action 1.04 end
action 1.05 regexp "Core 1: CPU utilization for five seconds: ([0-9]+)%" "$_cli_result" result c1cpu5sec
action 1.06 if $_regexp_result eq "1"
action 1.07 end
action 1.08 handle-error type ignore
action 1.09 context retrieve key CPU_STATE variable "prev_cpu_state"
action 1.10 handle-error type exit
action 1.12 if $c0cpu5sec ge "$c1cpu5sec"
action 1.13 set highest_core "$c0cpu5sec"
action 1.14 else
action 1.15 set highest_core "$c1cpu5sec"
action 1.16 end
action 1.17 if $highest_core ge "50"
action 1.18 if $prev_cpu_state eq "low"
action 1.19 set prev_cpu_state "high"
action 1.20 syslog msg "----HIGH CPU DETECTED----, CPU: $highest_core %"
action 1.21 context save key CPU_STATE variable "prev_cpu_state"
action 1.22 cli command "event manager run collect_stats_high"
action 1.23 else
action 1.24 context save key CPU_STATE variable "prev_cpu_state"
action 1.25 end
action 1.26 else
action 1.27 if $prev_cpu_state eq "high"
action 1.28 set prev_cpu_state "low"
action 1.29 syslog msg "----CPU BACK TO NORMAL----, CPU: $highest_core %"
action 1.30 context save key CPU_STATE variable "prev_cpu_state"
action 1.31 cli command "event manager run collect_stats_low"
action 1.32 else
action 1.33 context save key CPU_STATE variable "prev_cpu_state"
action 1.34 end
action 1.35 end
!
event manager applet collect_stats_high
event none maxrun 60
action 1.02 cli command "enable"
action 1.03 cli command "show clock | append bootflash:cpu_stats"
action 1.04 cli command "show proc cpu sort | append bootflash:cpu_stats"
action 1.05 cli command "show platform health | append bootflash:cpu_stats"
action 1.06 cli command "debug platform packet all count"
action 1.07 cli command "show platform cpu packet statistics | append bootflash:cpu_stats"
action 1.08 cli command "show proc cpu history | append bootflash:cpu_stats"
action 1.09 cli command "debug platform packet all receive buffer"
action 1.10 cli command "show platform cpu packet buffered | append bootflash:cpu_stats"
action 1.11 cli command "show proc cpu sort | append bootflash:cpu_stats"
action 1.12 cli command "show platform cpu packet statistics | append bootflash:cpu_stats"
action 1.13 cli command "Show proc cpu history | append bootflash:cpu_stats"
action 1.14 cli command "show platform cpu packet buffered | append bootflash:cpu_stats"
action 1.15 cli command "show log | append bootflash:cpu_stats"
action 1.16 cli command "undebug platform packet all receive buffer"
event manager applet collect_stats_low
event none maxrun 60
action 1.02 cli command "enable"
action 1.03 cli command "show clock | append bootflash:cpu_stats"
action 1.07 cli command "show platform cpu packet statistics | append bootflash:cpu_stats"
action 1.16 cli command "undebug all"
Hope this helps!
/David
Discover and save your favorite ideas. Come back to expert answers, step-by-step guides, recent topics, and more.
New here? Get started with these tips. How to use Community New member guide