Thanks @dkok and @Joe Clarke

ffacilities · 07-14-2017

Hi all,

We are doing T.37 fax on 2811 and 2911 routers, with calls coming in over two T1 links. Occasionally the unit hits a bug where a call gets stuck and the CPU gradually rises to nearly 100% over about 45 minutes. It then stays close to 100%, almost all of it in the DocMSP process, and the unit rejects all incoming calls until something tears down the stuck call, at which point everything returns to normal.

Cisco TAC have been unable to identify or fix the bug, so we have implemented an EEM script to detect the high CPU and bounce the two T1 links. Here is the script, triggered on the call rejection logs:

event manager applet high_cpu_recovery
event syslog pattern "IVR-3-LOW_CPU_RESOURCE"
action 1.0 syslog msg "----HIGH CPU DETECTED, BOUNCING T1s----"
action 2.0 cli command "enable"
action 3.0 cli command "show clock | append flash:high_cpu_recovery.txt"
action 4.0 cli command "show call active fax brief | append flash:high_cpu_recovery.txt"
action 5.0 cli command "config t"
action 5.1 cli command "controller t1 1/0"
action 5.2 cli command "shutdown"
action 5.3 cli command "controller t1 1/1"
action 5.4 cli command "shutdown"
action 5.5 cli command "controller t1 1/0"
action 5.6 cli command "no shutdown"
action 5.7 cli command "controller t1 1/1"
action 5.8 cli command "no shutdown"
action 5.9 cli command "end"

The script works fine functionally (tested by having it trigger off a user-defined log event instead of the high CPU event). However, when the CPU is very high the script is definitely triggered but often does not appear to run: 30 minutes or an hour later, it still hasn't bounced the T1 links.

We have the following config line attempting to give more priority to the EEM script, but it doesn't seem to be helping much:

scheduler allocate 40000 5000

I have also seen mention of a 'scheduler interval' command to allow time for low-priority processes, but that doesn't seem to be available on this platform.

Any suggestions for other ways to give more priority to the EEM script, or better values for the 'scheduler allocate' command?

Thanks,

Ollie

Joe Clarke · 07-14-2017

It could be that the script is hitting its maxrun timer when the router is very heavily loaded. Try adding "maxrun 60" to the end of the event specification line.
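As a rough sketch of that change, only the event line of the applet above is modified. The value of 60 seconds is a starting point rather than a tested figure; a router pinned near 100% CPU may need more.

```
event manager applet high_cpu_recovery
 ! maxrun raises the policy's run-time limit from the 20-second default
 event syslog pattern "IVR-3-LOW_CPU_RESOURCE" maxrun 60
 ! ... existing action lines unchanged ...
```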

dkok · 07-19-2017

How about triggering your applet another way?

event manager applet high_cpu_recovery
event ioswdsysmon sub1 cpu-proc taskname "DocMSP" op gt val 50 is-percent true period 60
action 1.0 syslog msg "----HIGH CPU DETECTED, BOUNCING T1s----"
... and so on ...

The difference from your script is that this triggers on IOS system monitor counters rather than on a syslog message. The theory is that the system monitor counters let you watch the CPU utilization of the DocMSP process and run your script before the CPU reaches 100%, so there is still some CPU left to run it. I don't know whether 50% ("val 50" above) is the right threshold; given your long experience with this issue, you know which values aren't sane for DocMSP CPU utilization.

My syntax above may not be 100% correct; if not, it's documented here:

http://www.cisco.com/c/en/us/td/docs/ios-xml/ios/eem/command/eem-cr-book/eem-cr-e1.html
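One way to pick a threshold is to sample what DocMSP normally uses during healthy operation. A minimal sketch, assuming the process name appears as "DocMSP" in the process list on your image:

```
! Show the overall CPU summary line plus the DocMSP process entries
show processes cpu sorted | include CPU|DocMSP
! Show CPU utilization graphs for the last 60 seconds, 60 minutes and 72 hours
show processes cpu history
```

A value comfortably above the normal DocMSP figures, but well below 100%, would be a reasonable candidate for "val".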

I'm the SE for TDS by the way. Gio just brought this issue to my attention yesterday. Thank you for your hard work on this to date.

Joe Clarke · 07-20-2017

This will not help if, as I suspect, the maxrun time is being hit. When the CPU is high, and especially if AAA command authorization is in use, each command can take a long time to execute, pushing the policy toward its default 20-second maxrun time. I would look at maxrun first, especially if "show logging" shows that the syslog message is being generated.
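To narrow this down after the next occurrence, something like the following may help. This is a sketch; the exact output format varies by IOS and EEM version.

```
! Confirm the trigger message was actually logged
show logging | include LOW_CPU_RESOURCE
! Check the registered applet, including the maxrun value on its event line
show event manager policy registered
! Review recent policy runs and how each one ended
show event manager history events
```

If the history shows the applet starting but not completing normally, that points at maxrun rather than at the trigger.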

ffacilities · 08-02-2017

Thanks dkok and jclarke for your replies. It'll take a while before we can tell if either or both changes do the trick; fingers crossed.

EEM script fails to run when CPU is very high