NCS540 - Issues with failover time for EVPN

sigbjorn
Level 1

In our lab, we are testing EVPN multihoming in active/standby mode. Both Cisco NCS540s are connected to the same C9300 on the LAN side. A reroute in the WAN or loss of the LAN link gives failover within milliseconds. But a reload, or even worse a power loss, brings the failover time up to 40 seconds.

Please see the scenarios below and the attached file with config output. Any suggestions for reducing the downtime during an uncontrolled power loss?

### Different failover scenarios ###

EVPN - Multihome - active/standby

--- Failover times ---

#Bringing down active interface Te0/0/0/14 on Router #2

Failover time around 30ms

#Bringing back interface Te0/0/0/14 on Router #2

Failover time around 20ms


#Reroute in the WAN part of the topology.

Failover time around 125ms

#Reload active router #2 without manually failing over traffic to router #1.

6-7 seconds outage or failover time

#Router 02 has rebooted and is taking over the traffic again.

Another 6-7 seconds outage or failover time

#Loss of power to router 02.

Around 40 seconds of downtime from when the device is lost until CE-01 takes over the traffic.

Message below appears when CE-01 takes over.

RP/0/RP0/CPU0:May 30 14:03:40.655 CEST: l2vpn_mgr[1309]: %L2-EVPN-6-MASS_WITHDRAW_RECEIVED : EVPN: Received EAD/ES mass withdraw for Ethernet Segment 0011.1211.1211.1211.1211 IP address 10.100.200.2

# Router 02 comes back.

New outage, similar to controlled reload.

5 Replies

nkarpysh
Cisco Employee

Hey There,

It can be multiple things, but it is best to check them starting from the most important. Do you have NTP enabled? That is highly desirable for such cases and assists in DF election on primary node failure:

https://www.cisco.com/c/en/us/td/docs/iosxr/ncs5500/vpn/76x/b-l2vpn-cg-ncs5500-76x/evpn-features.html#Cisco_Concept.dita_ed289e73-a208-445d-9a15-23987786fe5f

Once NTP is configured, the recovery delay can be tuned further with the following timers (there is no gold standard here, as the values may need to be adjusted for your topology/setup/gear):

https://www.cisco.com/c/en/us/td/docs/iosxr/ncs5500/vpn/76x/b-l2vpn-cg-ncs5500-76x/evpn-features.html#id_123097
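
As a rough sketch of where such timers are configured: this assumes the linked section refers to the EVPN peering and recovery timers, which the thread itself does not name. The values are placeholders for illustration, not recommendations, and the supported ranges should be checked for the release in use.

```
evpn
 ! Global EVPN timers, in seconds (example values only)
 timers
  peering 3
  recovery 30
 !
 ! Per-interface override on the Ethernet segment used in this lab
 interface TenGigE0/0/0/14
  timers
   recovery 20
  !
 !
!
```

The posted configs already set `startup-cost-in 300` under `evpn`, so any change here would sit alongside that setting.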

There could also be a software or config issue causing this behavior, which may need TAC support to identify. As this is a lab, my recommendation would be to run version 7.7.21 for NCS540, or preferably 7.9.21.

Niko


Hi,

Thanks for your response, Niko.

Yes, NTP is configured on all nodes in the lab. I have noticed around 3 seconds of difference in failover time if NTP is not synchronized. My real concern is the time it takes when the primary node loses power, around 40 seconds. This will be the most common failure in real life. As you can see in the original post, this message appears when the backup node takes over.

RP/0/RP0/CPU0:May 30 14:03:40.655 CEST: l2vpn_mgr[1309]: %L2-EVPN-6-MASS_WITHDRAW_RECEIVED : EVPN: Received EAD/ES mass withdraw for Ethernet Segment 0011.1211.1211.1211.1211 IP address 10.100.200.2

All nodes are running 7.7.21. Do you recommend upgrading to 7.9.21? 

S

Can I see your EVPN config? If you use BGP auto-discovery, I need to see that as well.

MHM

Hi, 

I thought I had added the config as an attachment, but it looks like something went wrong. Adding the EVPN, BGP and OSPF config directly to the message text.

### Config output ###

LAB-router #1


interface TenGigE0/0/0/14.4011 l2transport
 description L2VPN-EVPN-TREX-SA
 encapsulation dot1q 4011
!
evpn
 startup-cost-in 300
 evi 4011
  advertise-mac
 !
 interface TenGigE0/0/0/14
  ethernet-segment
   identifier type 0 12.12.12.12.12.12.12.12.12
   service-carving manual
    primary 5001-65000 secondary 1-5000
   !
   convergence
    reroute
   !
  !
 !
!
l2vpn
 bridge group BG-L2VPN-EVPN-TREX
  bridge-domain BD-L2VPN-EVPN-TREX
   interface TenGigE0/0/0/14.4011
   !
   evi 4011

LAB-router #2

interface TenGigE0/0/0/14.4011 l2transport
 description L2VPN-EVPN-TREX-SA
 encapsulation dot1q 4011
!
evpn
 startup-cost-in 300
 evi 4011
  advertise-mac
 !
 interface TenGigE0/0/0/14
  ethernet-segment
   identifier type 0 12.12.12.12.12.12.12.12.12
   service-carving manual
    primary 1-5000 secondary 5001-65000
   !
   convergence
    reroute
   !
  !
 !
!
l2vpn
 bridge group BG-L2VPN-EVPN-TREX
  bridge-domain BD-L2VPN-EVPN-TREX
   interface TenGigE0/0/0/14.4011
   !
   evi 4011

Edge-switch

interface TenGigabitEthernet1/1/5
 description Link to LAB #1 Te0/0/0/14
 switchport trunk allowed vlan 100,200,300,900,910,4002,4011
 switchport mode trunk
 spanning-tree portfast trunk
end

LAB-SC-SA-101#sh run int te1/1/6
Building configuration...

Current configuration : 191 bytes
!
interface TenGigabitEthernet1/1/6
 description Link to LAB #2 Te0/0/0/14
 switchport trunk allowed vlan 100,300,900,910,4011
 switchport mode trunk
 spanning-tree portfast trunk
end

### BGP ###

router bgp 65030
router bgp 65030 nsr
router bgp 65030 bfd minimum-interval 15
router bgp 65030 bfd multiplier 3
router bgp 65030 bgp router-id 10.100.200.1
router bgp 65030 update limit 1024
router bgp 65030 update out logging
router bgp 65030 bgp graceful-restart
router bgp 65030 bgp as-path-loopcheck
router bgp 65030 bgp log neighbor changes detail
router bgp 65030 update in error-handling extended ibgp
router bgp 65030 address-family vpnv4 unicast
router bgp 65030 address-family vpnv4 unicast retain route-target all
router bgp 65030 address-family l2vpn evpn
router bgp 65030 neighbor-group IBGP-VPNV4-PEER
router bgp 65030 neighbor-group IBGP-VPNV4-PEER remote-as 65030
router bgp 65030 neighbor-group IBGP-VPNV4-PEER password encrypted
router bgp 65030 neighbor-group IBGP-VPNV4-PEER description ### VPNv4 iBGP PEER GROUP ###
router bgp 65030 neighbor-group IBGP-VPNV4-PEER update-source Loopback0
router bgp 65030 neighbor-group IBGP-VPNV4-PEER address-family vpnv4 unicast
router bgp 65030 neighbor-group IBGP-VPNV4-PEER address-family vpnv4 unicast next-hop-self
router bgp 65030 neighbor-group IBGP-VPNV4-PEER address-family l2vpn evpn
router bgp 65030 neighbor 10.100.200.5
router bgp 65030 neighbor 10.100.200.5 use neighbor-group IBGP-VPNV4-PEER
router bgp 65030 neighbor 10.100.200.5 description LAB-SH-RR-01

### OSPF ###

router ospf 1 router-id 10.100.200.1
router ospf 1 segment-routing mpls
router ospf 1 area 0
router ospf 1 area 0 segment-routing mpls
router ospf 1 area 0 interface Loopback0
router ospf 1 area 0 interface Loopback0 passive enable
router ospf 1 area 0 interface Loopback0 prefix-sid index 2001
router ospf 1 area 0 interface GigabitEthernet0/0/0/23
router ospf 1 area 0 interface GigabitEthernet0/0/0/23 network point-to-point
router ospf 1 area 0 interface GigabitEthernet0/0/0/24
router ospf 1 area 0 interface GigabitEthernet0/0/0/24 network point-to-point
router ospf 1 area 0 interface TenGigE0/0/0/0
router ospf 1 area 0 interface TenGigE0/0/0/0 network point-to-point

Hey,

EVPN failover will be BGP dependent here. With a power loss, the failed router cannot signal anything, so BGP waits for the hold timer to expire (no keepalives arrive) before it reacts and triggers EVPN convergence. I recommend experimenting with shorter BGP timers, at least between PE and RR (if you use an RR). Alternatively, use BFD for fast BGP down detection. IMHO it may help a lot for this failure case.
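
A minimal sketch of both options, written in the same formal style as the BGP output posted above and reusing its AS number and neighbor-group name. It assumes the posted output is complete: it sets the global `bfd minimum-interval` and `bfd multiplier` but shows no `bfd fast-detect` on the neighbor or neighbor-group, and without that line BFD is not enabled for the session. The timer values are illustrative only, and the matching configuration is needed on the RR (10.100.200.5) as well.

```
! Option 1: enable BFD on the iBGP sessions that use this neighbor-group.
! The global interval (15 ms) and multiplier (3) are already in the posted config.
router bgp 65030 neighbor-group IBGP-VPNV4-PEER bfd fast-detect
!
! Option 2: shorter keepalive / hold timers in seconds (example values).
router bgp 65030 neighbor-group IBGP-VPNV4-PEER timers 10 30
```

Because these sessions run between loopbacks, the BFD session would be multihop; whether the platform supports a 15 ms interval for multihop sessions, and what extra configuration that needs, depends on the release and should be checked before relying on it. `show bfd session` and `show bgp l2vpn evpn summary` can confirm that the BFD session is up and tied to the BGP neighbor.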

 

Niko
