
EEM for high CPU C4510R with SUP7E

dcgl
Level 1

Hi all,

I have a client that experiences high CPU usage on a pair of C4510Rs at random times. The issue causes their HSRP to flap, which in turn causes network outages. It only seems to occur for around 10 to 15 minutes, and by the time I am called to look into the problem it has passed. I suspect some type of traffic is being punted to the CPU and overloading it, and I would like to collect CPU stats and CPU packet captures while the problem is occurring to troubleshoot further.

I used a previous C4510R EEM script (https://supportforums.cisco.com/thread/2065177) as a template and created the following script:

event manager applet cpu_stats
 event snmp oid 1.3.6.1.4.1.9.9.109.1.1.1.1.3.1 get-type exact entry-op ge entry-val "70" exit-op lt exit-val "70" poll-interval 5 maxrun 300
 action 1.01 syslog msg "------HIGH CPU DETECTED----, CPU: $_snmp_oid_val %"
 action 1.02 cli command "enable"
 action 1.03 cli command "show clock | append bootflash:cpu_stats"
 action 1.04 cli command "show proc cpu sort | append bootflash:cpu_stats"
 action 1.05 cli command "show platform health | append bootflash:cpu_stats"
 action 1.06 cli command "debug platform packet all count"
 action 1.07 cli command "show platform cpu packet statistics | append bootflash:cpu_stats"
 action 1.08 cli command "show proc cpu history | append bootflash:cpu_stats"
 action 1.09 cli command "debug platform packet all receive buffer"
 action 1.10 cli command "show platform cpu packet buffered | append bootflash:cpu_stats"
 action 1.11 cli command "show proc cpu sort | append bootflash:cpu_stats"
 action 1.12 cli command "show platform cpu packet statistics | append bootflash:cpu_stats"
 action 1.13 cli command "show proc cpu history | append bootflash:cpu_stats"
 action 1.14 cli command "show platform cpu packet buffered | append bootflash:cpu_stats"
 action 1.15 cli command "show log | append bootflash:cpu_stats"
 action 1.16 cli command "undebug all"
!
end

However, when looking at the CPU history after the problem occurred again, it appears that the SUP7E has one physical CPU with two cores, and one or both of the cores hit 100% when the issue occurs while the system CPU stays low, as can be seen below:

C4510#show proc cpu hist

History information for system:

...
output omitted
...

    111111111115511111111111111111111111111111111111111111111111111111111111
    600105000112231001211131111017221011110001112010212016101010121012112112
100
90
80
70
60
50            **
40            **
30            #*
20 *    *     ##                *                       *
10 ########################################################################
   0....5....1....1....2....2....3....3....4....4....5....5....6....6....7.
             0    5    0    5    0    5    0    5    0    5    0    5    0
                       CPU% per hour (last 72 hours)
                    * = maximum CPU%   # = average CPU%

History information for core 0:

...
output omitted
...

               991
    888787888789968888787877787787888788787888879898888898888788
100            **
90            **
80            **
70            **
60            **
50            **
40            **
30            *#
20            ##*
10 ***********##***********************************************************
   0....5....1....1....2....2....3....3....4....4....5....5....6....6....7.
             0    5    0    5    0    5    0    5    0    5    0    5    0
                       CPU% per hour (last 72 hours)
                    * = maximum CPU%   # = average CPU%

History information for core 1:

...
output omitted
...

    211112111119921111111111111113111111111111111111111112111111
    855554544659906476955796556651995456554446767454847468565565
100            **
90            **
80            **
70            **
60            **
50            **
40            **
30 *          #*                *                       *
20 *******  **##** ***************** ****   **** * * * ********************
10 ########################################################################
   0....5....1....1....2....2....3....3....4....4....5....5....6....6....7.
             0    5    0    5    0    5    0    5    0    5    0    5    0
                       CPU% per hour (last 72 hours)
                    * = maximum CPU%   # = average CPU%

So I walked the SNMP OIDs for the two cores without luck (I also came across this article, https://supportforums.cisco.com/thread/2179888, which creates SNMP OIDs for the cores by parsing the output of "show proc cpu" with a cron job). An SNMP walk of the CPU MIBs comes back with the following:

.1.3.6.1.4.1.9.9.109.1.1.1.1.3        - cpmCPUTotal5sec.5000
.1.3.6.1.4.1.9.9.109.1.1.1.1.4        - cpmCPUTotal1min.5000
.1.3.6.1.4.1.9.9.109.1.1.1.1.5        - cpmCPUTotal5min.5000
.1.3.6.1.4.1.9.9.109.1.1.1.1.6        - cpmCPUTotal5secRev.5000
.1.3.6.1.4.1.9.9.109.1.1.1.1.7        - cpmCPUTotal1minRev.5000
.1.3.6.1.4.1.9.9.109.1.1.1.1.8        - cpmCPUTotal5minRev.5000

Each of these seems to correspond to the system CPU. So, to continue without creating an overly complex pair of EEM scripts that parse the "show proc cpu" output and pass the values to a second script that checks them, I decided to try a lower entry and exit value of 40%. The next time the problem occurred the system CPU went over 40% but did not seem to trigger the script. It is worth noting that the first time around, when I did a full SNMP walk of the OID, I was able to get the CPU over 50% and the script triggered.

So I moved down to 20% for the trigger value, tried different SNMP OIDs such as .1.3.6.1.4.1.9.9.109.1.1.1.1.6.5000, and tried doing an SNMP walk of the Cisco MIB, all without success in getting the script to trigger (see attached output). Is there a problem with the script or with the SNMP OIDs I am using?

Thanks in advance!

David.

6 Replies

Joe Clarke
Cisco Employee

Your attachment apparently contained a virus and was stripped. However, you may want to read https://supportforums.cisco.com/thread/2179888?tstart=0 . We recently had a discussion about how to monitor these multiple CPUs. The script in that thread could be adapted to do what you want. Yes, it's complex, but it should be able to extract all of the CPU details from the multiple cores.

EDIT: Whoops, I missed that you had already looked at that thread. I still think that may be the best recourse for you, but I haven't been able to see your attachment.
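For anyone following along, a rough sketch of the kind of EEM Tcl policy that approach implies is below. The policy name, threshold, and cron schedule are illustrative only and are not taken from the linked thread; it simply parses the per-core five-second figures out of "show process cpu" and logs when either core is busy. It would need to be saved as a .tcl file in the user policy directory and registered with "event manager policy".

::cisco::eem::event_register_timer cron name core_cpu_check cron_entry "* * * * *" maxrun 60

namespace import ::cisco::eem::*
namespace import ::cisco::lib::*

# Open a CLI session and grab the per-core CPU lines
if {[catch {cli_open} result]} {
    error $result $errorInfo
}
array set cli $result
cli_exec $cli(fd) "enable"
set output [cli_exec $cli(fd) "show process cpu | include five"]
cli_close $cli(fd) $cli(tty_id)

# Pull the five-second utilisation for each core and keep the highest
set highest 0
foreach {match pct} [regexp -all -inline {Core [01]: CPU utilization for five seconds: ([0-9]+)%} $output] {
    if {$pct > $highest} { set highest $pct }
}

# Illustrative threshold of 50%; adjust to taste
if {$highest >= 50} {
    action_syslog msg "HIGH CPU DETECTED on a core: $highest%"
}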

That's funny, it was just a plain text file I attached. I have re-attached the file; hopefully it works now.

So what you are suggesting is to create a first script that parses the output of "show proc cpu" every, say, 30 seconds to a minute and places the values into user-defined OIDs, and then a second script that triggers on those OIDs?

Or is there a way I can get the script to trigger based on the output of "show proc cpu"?

Thanks for your help.

The attachment came through now.  You would need to use .5000 instead of .1 on your OID first of all.  If that works for you, then you should be set.  If not, then yes, you could adapt the previous Tcl script to collect the required data instead of creating the OIDs.
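In other words, with the instance index shown in the walk output above, the trigger line of the original applet would become something like this (the thresholds here are unchanged from the first post):

event manager applet cpu_stats
 event snmp oid 1.3.6.1.4.1.9.9.109.1.1.1.1.3.5000 get-type exact entry-op ge entry-val "70" exit-op lt exit-val "70" poll-interval 5 maxrun 300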

Hi Joseph,

I have written the following set of scripts to do the job (posting them here in case anyone else runs into the same problem!). I have tested the individual components and they all seem to function as expected, so I am now just waiting for the high CPU event to occur again:

The first script just sets prev_cpu_state to "low" so that the main script can detect the first transition to high; otherwise, with the context variable uninitialised, it would spit out log messages every minute:

event manager applet set_cpu_state
 event none
 action 1.01 handle-error type ignore
 action 1.02 context retrieve key CPU_STATE variable prev_cpu_state
 action 1.03 handle-error type exit
 action 1.04 set prev_cpu_state "low"
 action 1.05 context save key CPU_STATE variable prev_cpu_state
end

The second and main script checks the five-second CPU level of both cores and takes the higher of the two. If this is over 50% and the CPU state was previously low, it triggers the third script.

event manager applet dualcore
 event timer cron cron-entry "* * * * 0-6"
 action 1.00 cli command "en"
 action 1.01 cli command "show process cpu | include five"
 action 1.02 regexp "Core 0: CPU utilization for five seconds: ([0-9]+)%" $_cli_result result c0cpu5sec
 action 1.03 if $_regexp_result eq 1
 action 1.04 end
 action 1.05 regexp "Core 1: CPU utilization for five seconds: ([0-9]+)%" $_cli_result result c1cpu5sec
 action 1.06 if $_regexp_result eq 1
 action 1.07 end
 action 1.08 handle-error type ignore
 action 1.09 context retrieve key CPU_STATE variable prev_cpu_state
 action 1.11 handle-error type exit
 action 1.12 if $c0cpu5sec ge $c1cpu5sec
 action 1.13  set highest_core $c0cpu5sec
 action 1.14 else
 action 1.15  set highest_core $c1cpu5sec
 action 1.16 end
 action 1.17 if $highest_core ge 50
 action 1.18  if $prev_cpu_state eq low
 action 1.19   set prev_cpu_state "high"
 action 1.20   syslog msg "----HIGH CPU DETECTED----, CPU: $highest_core %"
 action 1.21   cli command "event manager run collect_stats"
 action 1.22  end
 action 1.23 else
 action 1.24  if $prev_cpu_state eq high
 action 1.25   set prev_cpu_state "low"
 action 1.26   syslog msg "----CPU BACK TO NORMAL----, CPU: $highest_core %"
 action 1.27  end
 action 1.28 end
 action 1.29 context save key CPU_STATE variable prev_cpu_state

The third script just runs all of the required show commands for debugging the issue and appends the output to a file called cpu_stats on the bootflash:

event manager applet collect_stats
 event none
 action 1.02 cli command "enable"
 action 1.03 cli command "show clock | append bootflash:cpu_stats"
 action 1.04 cli command "show proc cpu sort | append bootflash:cpu_stats"
 action 1.05 cli command "show platform health | append bootflash:cpu_stats"
 action 1.06 cli command "debug platform packet all count"
 action 1.07 cli command "show platform cpu packet statistics | append bootflash:cpu_stats"
 action 1.08 cli command "show proc cpu history | append bootflash:cpu_stats"
 action 1.09 cli command "debug platform packet all receive buffer"
 action 1.10 cli command "show platform cpu packet buffered | append bootflash:cpu_stats"
 action 1.11 cli command "show proc cpu sort | append bootflash:cpu_stats"
 action 1.12 cli command "show platform cpu packet statistics | append bootflash:cpu_stats"
 action 1.13 cli command "show proc cpu history | append bootflash:cpu_stats"
 action 1.14 cli command "show platform cpu packet buffered | append bootflash:cpu_stats"
 action 1.15 cli command "show log | append bootflash:cpu_stats"
 action 1.16 cli command "undebug all"
!
end

Regards,

David.

Hi,

What was the OID you used in the end?

 

Thanks in advance.

Hi Miguel,

 

There was no corresponding OID for the individual cores, so I modified the script to run the "show proc cpu" command and parse the output for the 5-second load of each core, as per my previous post.
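For reference, the per-core lines that the regexp in the script keys on look roughly like this on the SUP7E (the utilisation figures below are made up purely for illustration):

C4510#show process cpu | include five
Core 0: CPU utilization for five seconds: 12%; one minute: 10%; five minutes: 9%
Core 1: CPU utilization for five seconds: 87%; one minute: 45%; five minutes: 38%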

Below is the last iteration of the scripts I made, which helped us find the packets that were causing the high CPU. Note that you will need to apply these scripts to your configuration and then run the "set_cpu_state" script to initialise the variables (event manager run set_cpu_state); the "dualcore" script will then start polling on its own via its cron timer.

After the issue has occurred you can take a look at the bootflash:cpu_stats file (more bootflash:cpu_stats) to see the output from the show commands captured while the high CPU event was occurring. Don't forget to stop the script and clean up the old cpu_stats file afterwards, as it could potentially fill your bootflash if there are lots of high CPU state changes and the script is left running over a long period.
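As a rough session sketch of that workflow (the cleanup commands at the end are only a suggestion and are not part of the scripts themselves):

C4510#event manager run set_cpu_state
C4510#show event manager policy registered
  ... wait for the next high CPU event ...
C4510#more bootflash:cpu_stats
C4510#configure terminal
C4510(config)#no event manager applet dualcore
C4510(config)#end
C4510#delete /force bootflash:cpu_stats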

event manager applet set_cpu_state
 event none
 action 1.01 handle-error type ignore
 action 1.02 context retrieve key CPU_STATE variable prev_cpu_state
 action 1.03 handle-error type exit
 action 1.04 set prev_cpu_state "low"
 action 1.05 context save key CPU_STATE variable prev_cpu_state
!

event manager applet dualcore
 event timer cron cron-entry "* * * * 0-6" maxrun 60
 action 1.00 cli command "en"
 action 1.01 cli command "show process cpu | include five"
 action 1.02 regexp "Core 0: CPU utilization for five seconds: ([0-9]+)%" "$_cli_result" result c0cpu5sec
 action 1.03 if $_regexp_result eq "1"
 action 1.04 end
 action 1.05 regexp "Core 1: CPU utilization for five seconds: ([0-9]+)%" "$_cli_result" result c1cpu5sec
 action 1.06 if $_regexp_result eq "1"
 action 1.07 end
 action 1.08 handle-error type ignore
 action 1.09 context retrieve key CPU_STATE variable "prev_cpu_state"
 action 1.10 handle-error type exit
 action 1.12 if $c0cpu5sec ge "$c1cpu5sec"
 action 1.13  set highest_core "$c0cpu5sec"
 action 1.14 else
 action 1.15  set highest_core "$c1cpu5sec"
 action 1.16 end
 action 1.17 if $highest_core ge "50"
 action 1.18  if $prev_cpu_state eq "low"
 action 1.19   set prev_cpu_state "high"
 action 1.20   syslog msg "----HIGH CPU DETECTED----, CPU: $highest_core %"
 action 1.21   context save key CPU_STATE variable "prev_cpu_state"
 action 1.22   cli command "event manager run collect_stats_high"
 action 1.23  else
 action 1.24   context save key CPU_STATE variable "prev_cpu_state"
 action 1.25  end
 action 1.26 else
 action 1.27  if $prev_cpu_state eq "high"
 action 1.28   set prev_cpu_state "low"
 action 1.29   syslog msg "----CPU BACK TO NORMAL----, CPU: $highest_core %"
 action 1.30   context save key CPU_STATE variable "prev_cpu_state"
 action 1.31   cli command "event manager run collect_stats_low"
 action 1.32  else
 action 1.33   context save key CPU_STATE variable "prev_cpu_state"
 action 1.34  end
 action 1.35 end
!

event manager applet collect_stats_high
 event none maxrun 60
 action 1.02 cli command "enable"
 action 1.03 cli command "show clock | append bootflash:cpu_stats"
 action 1.04 cli command "show proc cpu sort | append bootflash:cpu_stats"
 action 1.05 cli command "show platform health | append bootflash:cpu_stats"
 action 1.06 cli command "debug platform packet all count"
 action 1.07 cli command "show platform cpu packet statistics | append bootflash:cpu_stats"
 action 1.08 cli command "show proc cpu history | append bootflash:cpu_stats"
 action 1.09 cli command "debug platform packet all receive buffer"
 action 1.10 cli command "show platform cpu packet buffered | append bootflash:cpu_stats"
 action 1.11 cli command "show proc cpu sort | append bootflash:cpu_stats"
 action 1.12 cli command "show platform cpu packet statistics | append bootflash:cpu_stats"
 action 1.13 cli command "show proc cpu history | append bootflash:cpu_stats"
 action 1.14 cli command "show platform cpu packet buffered | append bootflash:cpu_stats"
 action 1.15 cli command "show log | append bootflash:cpu_stats"
 action 1.16 cli command "undebug platform packet all receive buffer"

event manager applet collect_stats_low
 event none maxrun 60
 action 1.02 cli command "enable"
 action 1.03 cli command "show clock | append bootflash:cpu_stats"
 action 1.07 cli command "show platform cpu packet statistics | append bootflash:cpu_stats"
 action 1.16 cli command "undebug all"

 

Hope this helps!

/David