Problem:
EEM with TCL lets us automate many actions on Cisco routers, and it offers several trigger types: syslog pattern, OID, CPU threshold, and so on. This works on the IOS platform without any issue. On the XR platform, however, each LC/RSP raises its own alarms, so we may have special requirements, e.g.:
Some alarms happen frequently. I want to restart the process (based on its PID) if the alarm happens 3 times within 5 minutes on a given LC. How can this be done?
0/3/cpu0: alarm report "C", Pid = zzz
0/1/cpu0: alarm report "A", Pid = xxx
0/2/cpu0: alarm report "B", pid = yyy
0/3/cpu0: alarm report "C", pid = zzz
0/1/cpu0: alarm report "A", pid = xxx
0/1/cpu0: alarm report "A", pid = xxx
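To make the requirement concrete, here is a minimal sketch that tallies the sample alarms above per LC and PID. It is written in Python purely for illustration (the real policy would be a TCL script under EEM), and the regular expression and the names ALARM_RE and count_per_lc are assumptions, since the alarm text above is only a placeholder format.

```python
import re
from collections import Counter

# Sample alarm lines copied from the example above (placeholder message format).
ALARMS = [
    '0/3/cpu0: alarm report "C", Pid = zzz',
    '0/1/cpu0: alarm report "A", Pid = xxx',
    '0/2/cpu0: alarm report "B", pid = yyy',
    '0/3/cpu0: alarm report "C", pid = zzz',
    '0/1/cpu0: alarm report "A", pid = xxx',
    '0/1/cpu0: alarm report "A", pid = xxx',
]

# Hypothetical pattern: the location up to the first colon, then the pid field.
ALARM_RE = re.compile(r'^(?P<lc>\S+?):.*\bpid\s*=\s*(?P<pid>\S+)', re.IGNORECASE)

def count_per_lc(lines):
    """Count alarms per (location, pid) pair; unmatched lines are ignored."""
    counts = Counter()
    for line in lines:
        m = ALARM_RE.match(line)
        if m:
            counts[(m.group("lc").upper(), m.group("pid"))] += 1
    return counts

counts = count_per_lc(ALARMS)
for (lc, pid), n in sorted(counts.items()):
    print(lc, pid, n)
```

Only 0/1/CPU0 reaches three alarms in this sample, so it is the only location where the restart should fire. A single global counter would not be able to tell that apart from six unrelated alarms.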
Solution:
We can write an interactive script using TCL file I/O: create a file on harddisk/disk that holds the syslog history/count for each LC. The script reads this file whenever the syslog is observed and, based on the number of occurrences recorded, takes the required action.
Please check the attachment and the script flow chart for the detailed script. In my example I only dump the arp process for testing, so please change the script based on your requirements. To test the script, you can add a flag message, e.g. "action_syslog priority info msg "a"". The steps are as follows:
- Create a file in harddisk/disk which contains the count of syslog and the LC where the syslog is seen
- Run the EEM script whenever the event happens
- Check the file in harddisk/disk for the number of times the issue is seen
- Take the required action in case the count exceeds X times on LC/RSP Y
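The steps above can be sketched as a small state machine over the history file. This is not the attached TCL script; it is a Python illustration of the same idea, using the "LC=... T=... FLAG=... PID=..." record layout seen in the test output. The window of 600 seconds, the threshold of 3, the function names, and the reset rules are assumptions inferred from the test results, and the timestamp of the second 0/0/CPU0 event is approximate because the file does not record it.

```python
import re

WINDOW_SECONDS = 600  # assumption: 10-minute window, as used in the tests
THRESHOLD = 3         # assumption: act on the 3rd occurrence inside the window

RECORD_RE = re.compile(r"LC=(\S+) T=(\d+) FLAG=(\d+) PID=(\d+)")

def parse_records(text):
    """Parse history lines into {lc: [first_seen, count, pid]}."""
    records = {}
    for line in text.splitlines():
        m = RECORD_RE.search(line)
        if m:
            lc, t, flag, pid = m.groups()
            records[lc] = [int(t), int(flag), int(pid)]
    return records

def format_records(records):
    """Serialize the history back into the one-line-per-LC file format."""
    return "\n".join(
        f"LC={lc} T={t} FLAG={flag} PID={pid}"
        for lc, (t, flag, pid) in records.items()
    )

def record_event(records, lc, pid, now):
    """Update the history for one alarm; return True when the action should run."""
    if lc not in records:
        records[lc] = [now, 1, pid]      # first alarm seen on this LC
        return False
    t, flag, _ = records[lc]
    if now - t > WINDOW_SECONDS:
        records[lc] = [now, 1, pid]      # window expired: restart the count
        return False
    if flag + 1 >= THRESHOLD:
        records[lc] = [now, 1, pid]      # threshold reached: act, then reset
        return True
    records[lc] = [t, flag + 1, pid]     # inside the window: bump the count only
    return False

# Replay of the events from the tests (4th timestamp is approximate).
events = [
    ("0/RSP0/CPU0", 573646, 1390921477),
    ("0/0/CPU0", 516231, 1390921603),
    ("0/4/CPU0", 520331, 1390921616),
    ("0/0/CPU0", 516231, 1390921769),
    ("0/0/CPU0", 516231, 1390921957),
    ("0/RSP0/CPU0", 573646, 1390924599),
]
records = parse_records("")
fired = [record_event(records, lc, pid, now) for lc, pid, now in events]
print(fired)
print(format_records(records))
```

In a real policy the action branch would run the required command (for example, restarting the process by PID) and the updated records would be written back to the file after every event.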
Script flow chart:
Script test:
Test1: Dump happened only once on each LC
RP/0/RSP0/CPU0:ASR9010-1#more test.txt
Tue Jan 28 15:05:09.295 UTC
LC=0/RSP0/CPU0 T=1390921477 FLAG=1 PID=573646
RP/0/RSP0/CPU0:ASR9010-1#dumpcore running arp location 0/0/cpu0
Tue Jan 28 15:06:41.570 UTC
RP/0/RSP0/CPU0:ASR9010-1#dumpcore running arp location 0/4/cpu0
Tue Jan 28 15:06:55.280 UTC
RP/0/RSP0/CPU0:ASR9010-1#more test.txt
Tue Jan 28 15:07:06.257 UTC
LC=0/RSP0/CPU0 T=1390921477 FLAG=1 PID=573646
LC=0/0/CPU0 T=1390921603 FLAG=1 PID=516231
LC=0/4/CPU0 T=1390921616 FLAG=1 PID=520331
Test2: Dump happened again for LC 0/0
RP/0/RSP0/CPU0:ASR9010-1#dumpcore running arp location 0/0/cpu0
Tue Jan 28 15:09:27.878 UTC
RP/0/RSP0/CPU0:ASR9010-1#
RP/0/RSP0/CPU0:ASR9010-1#more test.txt
Tue Jan 28 15:09:39.310 UTC
LC=0/RSP0/CPU0 T=1390921477 FLAG=1 PID=573646
LC=0/0/CPU0 T=1390921603 FLAG=2 PID=516231 <<< flag changed to 2, time unchanged
LC=0/4/CPU0 T=1390921616 FLAG=1 PID=520331
Test3: Dump happened 3 times for LC 0/0 in 10 min
RP/0/RSP0/CPU0:ASR9010-1#dumpcore running arp location 0/0/cpu0
Tue Jan 28 15:12:36.086 UTC
RP/0/RSP0/CPU0:ASR9010-1#more test.txt
Tue Jan 28 15:12:49.300 UTC
LC=0/RSP0/CPU0 T=1390921477 FLAG=1 PID=573646
LC=0/0/CPU0 T=1390921957 FLAG=1 PID=516231 << both flag and time are reset to initial values
LC=0/4/CPU0 T=1390921616 FLAG=1 PID=520331
You will also find the action log below; you can change the action to whatever you need.
RP/0/RSP0/CPU0:Jan 28 15:12:38.659 : tclsh[65872]: %HA-HA_EEM-6-ACTION_SYSLOG_LOG_INFO : test1.tcl: show process location
Test4: Dump happened again after 10 min for 0/RSP0/CPU0
RP/0/RSP0/CPU0:ASR9010-1#dumpcore running arp
Tue Jan 28 15:56:37.982 UTC
RP/0/RSP0/CPU0:ASR9010-1#more test.txt
Tue Jan 28 15:56:43.942 UTC
LC=0/RSP0/CPU0 T=1390924599 FLAG=1 PID=573646 << time was reset
LC=0/0/CPU0 T=1390921957 FLAG=1 PID=516231
LC=0/4/CPU0 T=1390921616 FLAG=1 PID=520331
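When reading the file, it helps to check the T values against the CLI timestamps. The sketch below (Python, for illustration only) assumes T is a Unix epoch time in seconds, which is consistent with the "Tue Jan 28" timestamps in the output above; the helper name to_utc is hypothetical.

```python
from datetime import datetime, timezone

def to_utc(epoch_seconds):
    """Render an epoch value from the history file as a UTC timestamp."""
    return datetime.fromtimestamp(epoch_seconds, tz=timezone.utc).strftime(
        "%Y-%m-%d %H:%M:%S"
    )

# First 0/0/CPU0 record, written just after the dumpcore at 15:06:41 UTC.
print(to_utc(1390921603))

# Test3: gap between the first and third dump on 0/0/CPU0, in seconds.
gap_test3 = 1390921957 - 1390921603
print(gap_test3)

# Test4: gap between the two dumps on 0/RSP0/CPU0, in seconds.
gap_test4 = 1390924599 - 1390921477
print(gap_test4)
```

The Test3 gap is under 600 seconds, so the third dump lands inside the 10-minute window and triggers the action. The Test4 gap is well over 600 seconds, so only the time is reset and the flag stays at 1.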