Using Embedded Event Manager (EEM) in IOS-XR for the ASR9000 to simulate ECMP "min-links"

xthuijs · 03-02-2011

Introduction

This document provides a sample configuration for using EEM with the IOS-XR releases for the ASR9000. Although EEM is also available on other XR platforms, this example was tested and demonstrated specifically on the ASR9000.

Core Issue

With LAG (Link Aggregation, which IOS calls EtherChannel) you can use the "min-links" feature. It lets the system shut down the whole bundle when too few members are available, and hence too little aggregated bandwidth, to support the traffic.

There are cases where you cannot use or don't want to use LAG, but rather use ECMP, Equal Cost Multipath, to support more aggregated bandwidth.

In that case, when a certain number of paths disappear, there is no way to automatically shut the remaining paths and force failover to a redundant ECMP path.

This document will show how to use EEM to detect an ECMP member going down and automatically shut the other members of that path.

Resolution

a) Preparation

b) XR configuration

c) Setting up interfaces to Monitor

d) How to determine if the script was invoked

e) Sample Operation

a) Preparation:

1) Set up the interfaces that are to be shut when this script kicks in. Edit the minlinks_ecmp.tcl file with a text editor such as Notepad. A section at the top of the script provides these variables. Write the interface names as you would when configuring them yourself.

#Interfaces of the ECMP bundle that need to be shut

set _if_1 "interface g0/0/0/29"

set _if_2 "interface g0/0/0/28"

set _if_3 "interface g0/0/0/27"

set _if_4 "interface g0/0/0/26"

2) Copy the saved and modified Tcl file to disk0:

RP/0/RSP0/CPU0:Viking-Top#copy tftp://3.0.0.1/minlinks_ecmp.tcl disk0:
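
The body of minlinks_ecmp.tcl is not reproduced in this document. As a rough sketch only, assuming the script simply walks the `_if_1` to `_if_4` variables from step 1, the configuration lines it would have to send to the CLI could be assembled as follows. The proc name `build_shutdown_cmds` is hypothetical and not taken from the real script.

```tcl
# Interfaces of the ECMP bundle that need to be shut (values from step 1)
set _if_1 "interface g0/0/0/29"
set _if_2 "interface g0/0/0/28"
set _if_3 "interface g0/0/0/27"
set _if_4 "interface g0/0/0/26"

# Hypothetical helper (not from the original script): turn a list of
# "interface ..." strings into the config lines that shut each interface
# and then commit the change.
proc build_shutdown_cmds {if_list} {
    set cmds [list "configure"]
    foreach ifname $if_list {
        lappend cmds $ifname "shutdown"
    }
    lappend cmds "commit" "end"
    return $cmds
}

set cmds [build_shutdown_cmds [list $_if_1 $_if_2 $_if_3 $_if_4]]
puts [join $cmds "\n"]
```

On the router each of these lines would be passed to the EEM CLI library rather than printed; the printout here only makes the resulting command sequence visible.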

b) XR configuration:

(exec cfg)

event manager environment _syslog_pattern .*(GigabitEthernet)(0/0/0/0|0/0/0/20).*(changed state to Down)

event manager directory user policy disk0:

event manager policy minlinks_ecmp.tcl username eem type user

aaa authorization eventmanager default local

(admin cfg)

username eem

group root-system

group cisco-support

!

c) Setting up the interfaces to monitor

The syslog pattern is currently set up to monitor Gigabit Ethernet interfaces 0/0/0/0 and 0/0/0/20.

If you’d like to monitor TenG interfaces TenGigE0/3/0/0 and TenGigE0/7/0/1 then change the line to:

event manager environment _syslog_pattern .*(TenGigE)(0/3/0/0|0/7/0/1).*(changed state to Down)

Note that the pattern is case- and space-sensitive.
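
To get a feel for what the pattern does and does not match, it can be tried in a plain tclsh before it goes on the router. This is only a sketch: it uses the standard Tcl `regexp` command, and the matching done by EEM on the router is assumed, not verified, to behave the same way. The sample lines are shortened; the second one is a made-up variant with a lowercase "down".

```tcl
# Pattern exactly as configured in _syslog_pattern
set pattern {.*(GigabitEthernet)(0/0/0/0|0/0/0/20).*(changed state to Down)}

# Shortened sample syslog lines (the lowercase "down" variant is invented)
set samples [list \
    {ifmgr[171]: %PKT_INFRA-LINK-3-UPDOWN : Interface GigabitEthernet0/0/0/0, changed state to Down} \
    {ifmgr[171]: %PKT_INFRA-LINK-3-UPDOWN : Interface GigabitEthernet0/0/0/0, changed state to down} \
    {ifmgr[171]: %PKT_INFRA-LINK-5-CHANGED : Interface GigabitEthernet0/0/0/26, changed state to Administratively Down} \
]

# Collect 1 (match) or 0 (no match) for every sample line
set results {}
foreach line $samples {
    lappend results [regexp -- $pattern $line]
}
puts $results
```

Only the first line matches: the second fails on case, and the third is neither a monitored interface nor a plain "changed state to Down" message, so the script's own shutdowns do not retrigger it.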

d) How to determine if the script was invoked

A syslog message is printed every time the script is invoked:

RP/0/RSP0/CPU0:Jan 25 09:57:41.248 : minlinks_ecmp.tcl[65814]: ECMP DISTRIBTION SHUTDOWN ON THIS ROUTER!

e) Sample operation :

We receive a down event for an interface that we are monitoring:

LC/0/0/CPU0:Jan 25 10:07:31.742 : ifmgr[171]: %PKT_INFRA-LINK-3-UPDOWN : Interface GigabitEthernet0/0/0/0, changed state to Down

LC/0/0/CPU0:Jan 25 10:07:31.742 : ifmgr[171]: %PKT_INFRA-LINEPROTO-5-UPDOWN : Line protocol on Interface GigabitEthernet0/0/0/0, changed state to Down

RP/0/RSP0/CPU0:Jan 25 10:07:31.758 : ospf[314]: %ROUTING-OSPF-5-ADJCHG : Process CORE, Nbr 172.16.1.1 on GigabitEthernet0/0/0/0 in area 0 from FULL to DOWN, Neighbor Down: interface down or detached,vrf default vrfid 0x60000000

RP/0/RSP0/CPU0:Viking-Top#

!EEM kicks in to admin shut the predefined interfaces:

RP/0/RSP0/CPU0:Viking-Top#

LC/0/0/CPU0:Jan 25 10:07:37.931 : ifmgr[171]: %PKT_INFRA-LINK-5-CHANGED : Interface GigabitEthernet0/0/0/26, changed state to Administratively Down

LC/0/0/CPU0:Jan 25 10:07:37.931 : ifmgr[171]: %PKT_INFRA-LINK-5-CHANGED : Interface GigabitEthernet0/0/0/27, changed state to Administratively Down

LC/0/0/CPU0:Jan 25 10:07:37.931 : ifmgr[171]: %PKT_INFRA-LINK-5-CHANGED : Interface GigabitEthernet0/0/0/28, changed state to Administratively Down

LC/0/0/CPU0:Jan 25 10:07:37.932 : ifmgr[171]: %PKT_INFRA-LINK-5-CHANGED : Interface GigabitEthernet0/0/0/29, changed state to Administratively Down

RP/0/RSP0/CPU0:Jan 25 10:07:38.238 : config[65815]: %MGBL-CONFIG-6-DB_COMMIT : Configuration committed by user 'eem'. Use 'show configuration commit changes 1000000656' to view the changes.

** Note that the config change is committed by the username set in the configuration (eem)

RP/0/RSP0/CPU0:Jan 25 10:07:38.739 : config[65815]: %MGBL-SYS-5-CONFIG_I : Configured from console by console on vty100 (0.0.0.0)

** The syslog message from the script is printed

RP/0/RSP0/CPU0:Jan 25 10:07:39.007 : minlinks_ecmp.tcl[65812]: ECMP DISTRIBTION SHUTDOWN ON THIS ROUTER!

RP/0/RSP0/CPU0:Viking-Top#

Related Information

XR EEM standard documentation.

Additional note: XR 4.0 and XR 4.0.1 have a bug, to be fixed in 4.1, that prevents proper invocation of Tcl upon syslog events. Please consult TAC for the DDTS and a resolution prior to 4.1.

Xander Thuijs, CCIE #6775

Sr Tech Lead ASR9000

Andreas Martensson · 04-05-2011

Hello

Sorry for hijacking this post.

We are trying to do almost the same thing with an EEM script. We want to shut down lots of BVI interfaces, but the scripting engine is parsing the script way too slowly. So I tested with this attached script (to see if we had done bad scripting).

I noticed the timestamps on the output above (from shutting down the first interface to committing the changes). I downloaded this script, disabled syslog triggering and ran the script (with changed interfaces). I added 30 BVI interfaces to the list. The script takes 1 min 30 sec to complete. I think this is way too slow for shutting down 30 interfaces. Is this an ASR9000 issue? Is there no interface-range in XR?

We are running ASR9k with 4.0.1

xthuijs · 04-05-2011

XR4.0.2 has an improvement to increase performance of EEM config actions.

1 minute does indeed seem long. Do you happen to have TACACS command authorization for the eem user invoking the script/commands? That slows things down a lot too.

Andreas Martensson · 04-05-2011

No, no TACACS command authorization. Do you know the release date of 4.0.2? I can't find it in the ASR9k roadmap.

dimikhai · 04-29-2011

Hi, 4.0.2 won't be released on CCO, you have to wait for 4.0.3 planned in May. I've tested on a pre-4.0.2 build and see a net performance increase, from 1m30 down to 30 sec on the same EEM script.

You can also use the following execution part of the script to speed things up further:

namespace import ::cisco::eem::*
namespace import ::cisco::lib::*

# Execute the shutdown of the BVI Interfaces
if {[catch {cli_open} result]} {
    error $result $errorInfo
} else {
    array set cli1 $result
}

if {[catch {cli_exec $cli1(fd) "copy disk0:<name of your config file, see below> running-config"} result]} {
    error $result $errorInfo
}

if {[catch {cli_close $cli1(fd) $cli1(tty_id)} result]} {
    error $result $errorInfo
}

Then simply create a config file and save it on disk0:

interface bviN
shut

interface bviN+1
shut

...

Once you launch the EEM script, it will just merge this config into the running config and commit the change. Doing so I got a 15 sec execution time on 4.0.1. Hope this helps.
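
With many BVIs, the config file itself can be generated rather than typed. A minimal sketch in plain Tcl, run off-box in tclsh; the proc name `build_bvi_shutdown_config`, the BVI numbers and the output file name are all made up for illustration:

```tcl
# Hypothetical helper: build the text of a config file that shuts
# every BVI from $first to $last (inclusive).
proc build_bvi_shutdown_config {first last} {
    set text ""
    for {set n $first} {$n <= $last} {incr n} {
        append text "interface BVI$n\n shutdown\n!\n"
    }
    return $text
}

# Example with invented numbers: BVI100 through BVI102
set cfg [build_bvi_shutdown_config 100 102]
puts -nonewline $cfg

# To save the result locally before copying it to disk0:
# (the file name is an assumption)
# set fh [open "bvi_shutdown.cfg" w]
# puts -nonewline $fh $cfg
# close $fh
```

The generated file is then what the `copy disk0:... running-config` step in the script merges and commits in one go.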

/Dimitri

Yigal Dekalo · 01-28-2014

If you'd like to test it, you can do the following:

RP/0/RP0/CPU0:CRS#run logger <the syslog pattern>

That should trigger the script.

Pok LY · 07-21-2015

Dear Xander,

Could you advise on the EEM commands to check our interface load every 30 sec and, if the load is >= 200/255, have the router send an email to the administrator?

Regards, Pok.

xthuijs · 07-22-2015

Here is a Tcl snippet that checks a show vlan command and peels out the number of VLANs, so you can modify this piece to do the same for the output of:

RP/0/RSP0/CPU0:A9K-BNG#show int g 0/0/0/0 | i load
Wed Jul 22 04:51:26.029 EDT
reliability 255/255, txload 0/255, rxload 0/255

to get the rx or tx load and use that as a variable to compare.

Then an action can be set to send a mail. I would recommend sending a syslog instead, however; I don't think you want core routers to use any form of SMTP, but that's just me :)

    global vlancnt
   if {[catch {cli_open} result]} {;                   # Open VTY
       return -code error $result
   } else {;                               # set pointer to VTY
       array set cli1 $result
   }
   if {[catch {cli_exec $cli1(fd) "show vlan summary"} capture]} {;   # Get Data
       return -code error $capture
   } else {
       foreach line [split $capture \n ] {;               # Search through the output
           if {[regexp -nocase {^[ \t]*Number of existing VLANs[ \t]*:[ \t]*([0-9]+)} $line _ nvs]} {
               break
           }
       }
   }
   if {[catch {cli_close $cli1(fd) $cli1(tty_id)} result]} {;       # Close TTY, done running commands
       return -code error $result
   }
   if {[info exists nvs] && [info exists vlancnt]} {;           # I have the new and old value
       if {$nvs>$vlancnt} {;                       # vlan count increased
           debug "Info% Vlans Increased from $vlancnt to $nvs"
           set vlancnt $nvs
       }
   }
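
Adapting the same idea to the load line could look roughly like the sketch below. It is plain Tcl that runs in tclsh against the sample output quoted above; the proc name `parse_load` and the threshold handling are assumptions for illustration. On the router, the output would come from `cli_exec` and the `puts` would be replaced by a syslog action.

```tcl
# Sample line as shown in the "show int ... | i load" output above
set capture "  reliability 255/255, txload 0/255, rxload 0/255"

# Hypothetical helper: return a two-element list {txload rxload} parsed
# from show-interface output, or an empty list if no load line is found.
proc parse_load {output} {
    foreach line [split $output "\n"] {
        if {[regexp {txload ([0-9]+)/255, rxload ([0-9]+)/255} $line _ tx rx]} {
            return [list $tx $rx]
        }
    }
    return {}
}

set threshold 200 ;# out of 255, as asked in the question
set loads [parse_load $capture]
if {[llength $loads] == 2} {
    set tx [lindex $loads 0]
    set rx [lindex $loads 1]
    if {$tx >= $threshold || $rx >= $threshold} {
        puts "load threshold exceeded: txload $tx/255 rxload $rx/255"
    } else {
        puts "load below threshold: txload $tx/255 rxload $rx/255"
    }
}
```

The 30-second interval itself would be a matter of how the policy is registered (for example with a timer event) and is not covered by this sketch.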

cheers!

xander

yon.guezuraga · 02-01-2017

Hi,

thanks for your nice collaboration.

We have deployed on our ASR9K a Tcl script that sends a syslog and an email when a TenGig interface goes down.

The Tcl script is activated and correctly sends the syslog, but the function "smtp_send_email" doesn't work correctly; we get a domain lookup error:

RP/0/RSP0/CPU0:Feb 1 13:20:09.122 : tclsh[65918]: %HA-HA_EEM-6-ACTION_SYSLOG_LOG_INFO : interface_down_giga_v2.tcl: Syslogs Test4
RP/0/RSP0/CPU0:Feb 1 13:20:09.185 : syslog_dev[93]: noscan PID-448184702: Domain name is not configured
RP/0/RSP0/CPU0:Feb 1 13:20:09.185 : syslog_dev[93]: noscan PID-448184702: while executing
RP/0/RSP0/CPU0:Feb 1 13:20:09.185 : syslog_dev[93]: noscan PID-448184702: "smtp_send_email $mail_msg"
RP/0/RSP0/CPU0:Feb 1 13:20:09.185 : syslog_dev[93]: noscan PID-448184702: invoked from within
RP/0/RSP0/CPU0:Feb 1 13:20:09.185 : syslog_dev[93]: noscan PID-448184702: "if [catch {smtp_send_email $mail_msg} result] {
RP/0/RSP0/CPU0:Feb 1 13:20:09.185 : syslog_dev[93]: noscan PID-448184702: error $result $errorInfo
RP/0/RSP0/CPU0:Feb 1 13:20:09.185 : syslog_dev[93]: noscan PID-448184702: }"

This is the relevant part of the Tcl script:

::cisco::eem::event_register_syslog occurs 1 pattern .*PKT_INFRA-LINK-5-CHANGED*.*MgmtEth0.* maxrun 90

namespace import ::cisco::eem::*
namespace import ::cisco::lib::*

set mail_pre "Mailservername: $_email_server\n"
append mail_pre "From: $_email_from\n"
append mail_pre "To: $_email_to\n"
append mail_pre "Cc: $_email_cc\n"
append mail_pre "Subject: Caida del puertox\n\n"

action_syslog msg "Syslogs Test3"

append mail_pre "MIME-Version: 1.0\n"
append mail_pre "Content-type: multipart/mixed; boundary=\"caida del puerto x del router ASR9K \"\n"
append mail_pre "\n--EEM_email_boundary\n\n"
append mail_pre "\n--EEM_email_boundary\n"
append mail_pre "Content-Type: application/octet-stream\n"
append mail_pre "Content-Transfer-Encoding: Base64\n"
action_syslog msg "Syslogs Test4"

set mail_msg [uplevel #0 [list subst -nobackslashes -nocommands $mail_pre]]
action_syslog msg "Syslogs Test5"
if [catch {smtp_send_email $mail_msg} result] {
error $result $errorInfo
}

xthuijs · 02-01-2017

in XR did you configure a

domain name testme.com

or something similar? Give that a try. To be honest, I don't know if we implemented an SMTP client for EEM, so it may not work even after that.

cheers

xander

yon.guezuraga · 02-01-2017

Hi xthuijs,

thanks for your support.

Yes, on the ASR9K we added the domain config:

RP/0/RSP0/CPU0:x.x.x.x#show running-config domain

domain name x.com
domain name-server x.x.x.x
domain lookup source-interface x.x.x.x

The problem is seen in this part of the Tcl script:

if [catch {smtp_send_email $mail_msg} result] {
error $result $errorInfo
}

All the previous email commands (append mail_pre ....) in the Tcl script are OK.

I don't know if it is necessary to add some additional lib, or to create one manually for this smtp_send_email command.

xthuijs · 02-01-2017

hi yon,

if you activated the script before the domain name was configured, the registration of eem may not have picked up on that domain name, so you'd need to unconfigure the script registration and re-register it.

Another thing: I see now that domainname also appears to be a variable inside EEM.

So you can also try setting the environment variable _domainname to something.

(Script re-registration is possibly required.)

few things to try :)

xander

yon.guezuraga · 02-01-2017

Hi xthuijs,

I just tried adding this variable (event manager environment _domainname x.x) and registered the script again, and it worked.

Thanks a lot for your collaboration.

Best regards.

xthuijs · 02-01-2017

Sweet awesome!! thanks for testing and confirming!! :)

cheers!

xander

Marcoslh01 · 03-21-2017

Hi Xander! Thank you very much for the post. There isn't much about EEM on IOS XR, and I based my Tcl script mainly on this info.

I got it to work, but I'm having issues when the script tries to 'commit'. It is surely a user issue, but I don't know what it could be. This is my config for it:

event manager environment _syslog_pattern .*%PKT_INFRA-LINEPROTO-5-UPDOWN : Line protocol on Interface TenGigE0/0/0/[5,6]
event manager directory user policy disk0:

event manager policy if_updown_INT.tcl username eem type user
!
!
tacacs source-interface Loopback0 vrf default
tacacs-server host xxx port 49
!
tacacs-server host xxx port 49
!
tacacs-server key 7 0.....
tacacs-server timeout 1
username eem
group root-system
group cisco-support
secret 5 $1$yg....
!
aaa accounting exec default start-stop group tacacs+
aaa accounting commands default start-stop group tacacs+
aaa authorization exec default group tacacs+
aaa authorization eventmanager default local
aaa authentication login default group tacacs+
cdp
configuration commit auto-save filename disk0:config
line console
authorization exec default
login authentication default
!

On the logs I'm seeing this:

LC/0/0/CPU0:Mar 21 02:52:22 CET: ifmgr[209]: %PKT_INFRA-LINEPROTO-5-UPDOWN : Line protocol on Interface TenGigE0/0/0/5, changed state to Down
RP/0/RSP0/CPU0:Mar 21 02:52:22 CET: BM-DISTRIB[1155]: %L2-BM-6-ACTIVE : TenGigE0/0/0/5 is no longer Active as part of Bundle-Ether5 (Link is down)
RP/0/RSP0/CPU0:Mar 21 02:52:22 CET: l2vpn_mgr[1169]: %L2-L2VPN_ICCP_SM-4-LOCAL_ACCESS_MAIN_PORT_FAILURE : ICCP-SM has detected a local Access Main Port failure: Attempting to failover Bundle-Ether5 ports in redundancy group 1
LC/0/0/CPU0:Mar 21 02:52:23 CET: vic_0[366]: %PLATFORM-VIC-4-RX_LOS : Interface TenGigE0/0/0/5, Detected Rx Loss of Signal
RP/0/RSP0/CPU0:Mar 21 02:52:27 CET: syslog_dev[92]: noscan PID-71455058: if_updown_INT: if1:down if2:up ACTION: interface TenGigE0/0/0/6 -> shutdown
RP/0/RSP0/CPU0:Mar 21 02:52:28 CET: syslog_dev[92]: noscan PID-71455058: To commit
LC/0/0/CPU0:Mar 21 02:52:40 CET: pfm_node_lc[294]: %PLATFORM-SFP-2-LOW_RX_POWER_ALARM : Set|envmon_lc[172120]|0x1029005|Port_0/05
RP/0/RSP0/CPU0:Mar 21 02:53:52 CET: eem_server[203]: %HA-HA_EM-6-FMS_POLICY_TIMEOUT : Policy 'if_updown_IBM2.tcl' has hit its maximum execution time of 90.000000000 seconds, and so has been halted

As can be seen, I added some 'puts' to the script in order to show the action to take; in this case it should be: interface TenGigE0/0/0/6 -> shutdown

and below is the 'puts' that shows the step prior to commit (after doing the shutdown):

noscan PID-71455058: To commit

and then I should see another 'puts' showing 'commited' right after it commits, but I never get there. I suspect there is an issue with the user eem, which for some reason is not being allowed to perform that command.

Any clue? Any missing statement in my config?

Shouldn't this line
aaa authorization eventmanager default local

allow the eem user to work?

Many thanks in advance!

Marcos.

xthuijs · 03-21-2017

Ah, the script is taking too long: it exceeds the 90-second maximum run time, so it gets halted!

If you do a shut manually and commit, how long does that take?

Also put a syslog message/puts at the very start of your script so we can see where it is spending its time. Check debug event manager and show event manager trace too.
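
One simple way to see where the time goes is to print the elapsed time at each step. The sketch below is plain Tcl and runs in tclsh; the proc name `stamp` is made up, and the `after 50` only stands in for the real CLI steps. It uses `clock clicks -milliseconds` rather than newer clock subcommands on the assumption that the router's Tcl version may be an older one.

```tcl
# Reference time taken at the very start of the script
set t0 [clock clicks -milliseconds]

# Hypothetical helper: print a label with the milliseconds elapsed
# since script start, and return that elapsed value.
proc stamp {label} {
    global t0
    set elapsed [expr {[clock clicks -milliseconds] - $t0}]
    puts "$label: +$elapsed ms"
    return $elapsed
}

stamp "script start"
after 50 ;# stand-in for cli_open and the config commands
stamp "before commit"
after 50 ;# stand-in for the commit itself
stamp "after commit"
```

If the last stamp never shows up, as in the logs above, the step right before it is where the script hangs.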

Some common improvements, if not used already :), are to use

cli_write $cli(fd) "copy ..."

cli_read_pattern $cli(fd) "confirm"

cli_write $cli(fd) "\r"

Also, what's the anatomy of your script? Could you copy and paste it?

Finally, the XR version may matter too, as there have been a few bugs here and there that got fixed :)

xander