
Table of Contents

1 Glossary
2 Converting Single chassis ASR9K to nV Edge
2.1 Supported hardware and caveats
2.2 Booting with different images on each chassis
2.3 Configuring the Management Ethernet network for nV edge
3 nV Edge Control Plane
3.1 High Redundancy wiring (Recommended)
3.2 RSP in each chassis (NOT recommended)
3.3 Control Plane UDLD
3.4 Control Link status CLI
3.5 Control Link shut/no shut CLIs
3.6 Miscellaneous control link CLIs
4 nV Inter Rack Link (IRL) connections
4.1 UDLD on IRL links
4.2 What are the IRL links used for?
4.3 nV IRL “threshold monitor”
4.3.1 Backup-rack-interfaces config
4.3.2 Specific-rack-interfaces config
4.3.3 selected-interfaces config
4.3.4 What is the default config
4.4 Default QoS on IRL links
4.5 Configurable QoS on IRL interfaces
4.6 IRL packet encapsulation and overhead
4.7 IRL load balancing
4.7.1 Ingress IP packet
4.7.2 Ingress MPLS packet
4.7.3 L2 Unicast
4.7.4 L2 Flood
4.7.5 L3 Multicast
5 nV Edge Redundancy model
5.1 Redundancy switchover: Control Ethernet readiness
5.2 RSP/Chassis failure Detection in ASR9k nV Edge
6 Split Node
6.1 All IRL links go away
6.2 All Control links go away
6.3 All Control AND IRL links go away
7 Feature configuration caveats
7.1 Virtual Interfaces (Link bundle / BVI) mac-address
7.2 Link Bundle “switchover suppress-flap” : Rack / Chassis reload
7.3 IGP protocols and LFA-FRR
7.4 Multicast convergence during RACK reload or OIR
8 Feature Gaps
9 Convergence numbers
10 “Debugging mode” CLIs – cisco support only
11 nV Edge MIBs
11.1 Redundancy related MIBs
11.1.1 Node Redundancy MIBs
11.1.2 Process Redundancy MIBs
11.1.3 Inventory Management
11.2 IRL monitoring MIBs
11.3 Control Ethernet monitoring MIBs
11.4 Control Ethernet Syslog / error messages
11.5 Data Link Syslog / error messages
12 Debugs and Traces
13 Cluster Rack-By-Rack Upgrade
13.1 Overview
13.2 Prerequisites
13.3 Upgrade Instructions (Scripted Method)
13.3.1 Script Setup
13.3.2 Script execution
13.3.3 Verification
13.4 Upgrade Instructions (Manual Method)
13.5 Install Abort Procedure
13.6 Converting a nV Edge Cluster to single chassis system

•1 Glossary

nV – Network Virtualization

nV Edge – Network Virtualization on Edge routers

IRL – Inter Rack Links (for data forwarding)

Control Plane – the hardware and software infrastructure that deals with messaging / message passing across processes on the same or different nodes (RSPs or LCs).

Data Plane – the hardware and software infrastructure that deals with forwarding, generating and terminating data packets.

DSC – Designated Shelf Controller (the Primary RSP for the nV edge system)

Backup-DSC – Backup Designated Shelf Controller

UDLD – Uni Directional Link Detection protocol. An industry standard protocol used in Ethernet networks for monitoring link forwarding health.

FPD – Field Programmable Device (FPGAs etc. that can be upgraded in the field).

•2 Converting Single chassis ASR9K to nV Edge

This section assumes that the single chassis boxes are running 4.2.1 or later images, with the latest FPD versions. Check and correct this using the following commands on both chassis:

admin show hw-module fpd location all

admin upgrade hw-module fpd all location all

  • Find the serial numbers of each chassis. The serial number is found in the “SN:” field in the example below (the FOX.. values); the serial number is also printed on the chassis itself.

(admin)#show inventory chassis

NAME: "chassis ASR-9006-AC", DESCR: "ASR-9006 AC Chassis"

PID: ASR-9006-AC, VID: V01, SN: FOX1435GV1C

NAME: "chassis ASR-9006-AC", DESCR: "ASR-9006 AC Chassis"

PID: ASR-9006-AC, VID: V01, SN: FOX1429GJSV

  • One of the chassis will end up being called “Rack0”, the other will be called “Rack1” – there are only two rack numbers possible.

  • Choose either chassis as “Rack0”. On Rack0 only, enter the config below in admin config mode; this is an example using the serial numbers above.
  • (admin config) # nv edge control serial FOX1435GV1C rack 0
    (admin config) # nv edge control serial FOX1429GJSV rack 1
    (admin config) # commit

The above configuration builds a “database” on Rack0 of the chassis serial numbers and their assigned rack numbers. One purpose of this is to verify, as a security mechanism, whether a chassis that tries to join this nV Edge system is actually allowed to be part of it.

  • Wire up the control plane connections between the chassis (explained in detail in Section 3) and reload the chassis designated as “Rack1”.

[Figure: nV Edge control plane wiring – on each rack (Rack0 and Rack1), SFP+ ports 0 and 1 on RSP0 and RSP1 connect to the SFP+ ports of the RSPs in the other rack.]

NOTE: The Control Ethernet cabling should be done only after all the previous steps have been executed and both chassis are ready to “join” an nV Edge system. The control plane network should not be connected before the nV configuration is completed.

  • The Rack1 chassis will reboot, and Rack0 will “add” the Rack1 chassis to the nV Edge system after verifying its serial number; on booting up, Rack1 will communicate with Rack0. A software versioning check is then carried out and Rack1 will be requested to use the same XR software and/or SMUs that Rack0 has, initiating a reboot if required to achieve this consistency. After this reboot completes, Rack1 becomes part of the nV Edge system.

  • At this point the nV Edge system is fully up. Any further reboots of either or both chassis need no user intervention; the chassis will come up and both will “join” the nV Edge system.

NOTE: ALL the interfaces on the chassis hosting the backup-DSC RSP will be in SHUTDOWN state until at least one Inter-Rack Data Link is in forwarding state. Please refer to Section 4.3 for more details.

At any time in the nV Edge system, one of the RSPs (in either Rack0 or Rack1) will be the “master” for the entire nV edge system. Another RSP in the system (again in Rack0 or Rack1) will be the “backup” for the entire nV edge system. The “master” is called a primary-DSC, using CRS Multi chassis terminology. The “backup” is called a backup-DSC. The primary-DSC will run all the primary protocol stacks (OSPF, BGP etc..) and the backup-DSC will run all the backup protocol stacks.

To find out which RSP is primary-DSC and which is backup-DSC, use the below command in admin exec mode.

RP/0/RSP0/CPU0:ios(admin)#show dsc

---------------------------------------------------------

Node ( Seq#) Role Serial# State

---------------------------------------------------------

0/RSP0/CPU0 ( 0) ACTIVE FOX1432GU2Z BACKUP-DSC

0/RSP1/CPU0 ( 1223769) STANDBY FOX1432GU2Z NON-DSC

1/RSP0/CPU0 ( 1279475) ACTIVE FOX1441GPND PRIMARY-DSC

1/RSP1/CPU0 ( 1279584) STANDBY FOX1441GPND NON-DSC

As can be seen above, the Rack1 RSP0 (1/RSP0/CPU0) is the primary-DSC and Rack0 RSP0 (0/RSP0/CPU0) is the backup-DSC. The Primary and Backup DSCs do not have any “affinity” towards any one chassis or any one RSP. Whichever chassis in the nV edge system boots up first will likely select one of its RSPs as the primary-DSC.

Another point to note is that the “Active” / “Standby” states of the RSPs, which are familiar concepts in the single chassis mode of operation, are superseded by the primary-DSC / backup-DSC functionality in an nV Edge system. For example, in a single chassis system, protocol stacks run on the Active and Standby RSPs as primary/backup protocol stacks. As discussed in the preceding paragraph, that is no longer the case in an nV Edge system – in nV Edge, the primary-DSC and backup-DSC run the primary and backup protocol stacks.

•2.1 Supported hardware and caveats

  • •1. Only Enhanced Ethernet Linecards (Typhoon) and SIP-700 (Thor) line cards are supported in the chassis. Older Ethernet linecards (Trident) will not work.
  • •2. Only Enhanced Ethernet Linecards (Typhoon) can be used for IRL. Support for 100G IRL will come in the second half of CY2013.
  • •3. Only chassis of the same type can be connected to form an nV Edge system.
  • •4. The ASR9001 is also supported as an nV Edge system. In terms of High Availability functionality, the 9001 chassis will support full nV Edge HA in a later release, so expect an outage of about 30 seconds during a chassis shutdown or failover. Also, some of the show commands used in this document might not appear on the 9001 until the 5.1.0 release.
  • •5. nV Edge on the 9922 chassis is available in 4.3.1.
  • •6. Only Cisco-supported SFPs are allowed for all IRL connections.
  • •7. The RSP front panel control plane SFPs HAVE TO BE 1Gig SFPs. 10Gig SFPs are NOT supported.

•2.2 Booting with different images on each chassis

In an nV Edge system, if for any reason the two chassis end up with non-identical XR software and/or SMUs installed (this can happen, for example, if a system is forced to boot a particular image by a ROMMON setting), then the chassis that boots up later will report its version details to the dSC chassis (normally Rack0), and the dSC chassis will “reject” that version if it does not match.

•2.3 Configuring the Management Ethernet network for nV edge

As with a single chassis, one can configure the Management Ethernet interfaces on the nV Edge cluster; the question is often which subnet to put the four interfaces in and what the available options are. Three options are available (a hedged configuration sketch of the first option follows the list):

  • •- Flat management: Put all interfaces in one subnet, and use one virtual IP address to access the nV cluster.
  • •- Per Chassis Management (global and VRF): Put each chassis/rack in its own network, with one virtual address in the global table and another virtual address in a VRF for the second chassis.
  • •- Per Chassis Management (both VRF): Put each chassis in its own VRF, with a unique virtual address in each VRF.
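As a minimal sketch of the flat-management option – the 10.1.1.0/24 subnet, interface addresses, and virtual address below are arbitrary placeholders, and only two of the four management interfaces are shown (the others are analogous):

interface MgmtEth0/RSP0/CPU0/0
 ipv4 address 10.1.1.1 255.255.255.0
!
interface MgmtEth1/RSP0/CPU0/0
 ipv4 address 10.1.1.2 255.255.255.0
!
ipv4 virtual address 10.1.1.100 255.255.255.0
ipv4 virtual address use-as-src-addr
!

The intent is that management sessions always use the single virtual address rather than the per-RSP physical addresses.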

•3 nV Edge Control Plane

The nV Edge control plane provides software and hardware extensions to create a “unified” control plane for all the RSPs and line cards on both nV Edge chassis. The control plane packets are forwarded from chassis to chassis in hardware, as described in the sections below. Control plane multicast etc. is also done in hardware for both chassis – so there is no control plane performance impact from having two chassis instead of one.

•3.1 High Redundancy wiring (Recommended)

The nV Edge control plane links have to be direct L1 connections; no network or intermediate routing / switching devices are allowed in between. Some details of the control plane connections are provided below to explain the reasoning behind our recommendations. The control Ethernet links (front panel SFP+ ports) are configured in 1Gig mode of operation.

[Figure: Recommended control Ethernet wiring – each of the four RSPs (Rack0 RSP0/RSP1 and Rack1 RSP0/RSP1) uses its two SFP+ ports (0 and 1) to connect to both RSPs in the remote rack.]

As seen in the diagram above, each RSP in each chassis has an Ethernet switch to which all the CPUs in the system (line card CPUs, RSP CPUs, and any other CPUs) connect. So each CPU connects to two switches – one on each RSP. At any point in time, only one of the switches will be “active” and switching the control plane packets; the other will be “inactive” (regardless of whether the system is nV Edge or single chassis). The “active” switch can be on either of the RSPs in the chassis – whichever switch can ensure the best connectivity across all the CPUs in the system.

The two SFP+ front panel ports on the RSP-440 are just direct ports plugging into the switch on the RSP. So, as shown in the diagram, in an nV Edge system the simple goal is to connect each RSP (the switch inside the RSP) to each RSP switch on the remote chassis. With this wiring, if any one link goes down there are three possible backup links. Also, at any point in time only one of the links is used for forwarding control plane data; the other three links are in “standby” state.

Connecting the two chassis with just two EOBC links (i.e., RSP0 to RSP0 and RSP1 to RSP1) is NOT recommended and is discouraged, as it does not provide the required resilience.

The control Ethernet is the heart of the system – if there is anything wrong with it, it can seriously degrade the nV edge system. So it is HIGHLY recommended to use all four control Ethernet links.

Here is a view of the RSP440 and 9001 EOBC ports. These ports cannot be used for anything other than EOBC; they cannot be used or configured as L2 or L3 data ports.

•3.2 RSP in each chassis (NOT recommended)

In the case of a single RSP-per-chassis nV Edge topology, the below will be the wiring model. But again, this is not recommended because of resiliency reasons. If the only RSP in a chassis goes down, the entire chassis and all the line cards in the chassis also go down.

[Figure: Single-RSP wiring – Rack0 RSP0 SFP+ ports 0 and 1 connect directly to Rack1 RSP0 SFP+ ports 0 and 1.]

•3.3 Control Plane UDLD

UDLD runs on the control plane links to ensure bi-directional forwarding health of the links. UDLD runs at a 200 msec interval with a multiplier of 5, i.e. an expiry interval of 1 second. This means that if a control link is uni-directional for 1 second, the RSPs will take action to switch the control plane link to one of the three standby links.

Note that the one second detection is only for unidirectional failures – for a physical link fault (like fiber cut), there will be interrupts triggered with the fault and the link switchover to the standby links will happen in milliseconds.

•3.4 Control Link status CLI

The front panel SFP+ ports are referred to as ports “0” and “1” in the show command below. So each RSP has two of these ports, and the command below shows which port on which RSP is connected to which other port on which other RSP.

In the example below:

Port “0” on 0/RSP0 is connected to port “0” on 1/RSP0.

Port “1” on 0/RSP0 is connected to port “1” on 1/RSP1

Port “0” on 0/RSP1 is connected to port “0” on 1/RSP1

Port “1” on 0/RSP1 is connected to port “1” on 1/RSP0

Also, the “port pair” that is “active” and used for forwarding control Ethernet data is the link between port “0” on 0/RSP0 and port “0” on 1/RSP0, shown in the Forwarding state below. All other links are just backup links.

The “CLM Table version” is also a useful number to note. If this number keeps changing, it means the control link UDLD is flapping; in a good, stable condition the number should not change.

RP/0/RSP0/CPU0:ios# show nv edge control control-link-protocols location 0/RSP0/CPU0

Priority lPort Remote_lPort UDLD STP

======== ===== ============ ==== ========

0 0/RSP0/CPU0/0 1/RSP0/CPU0/0 UP Forwarding

1 0/RSP0/CPU0/1 1/RSP1/CPU0/1 UP Blocking

2 0/RSP1/CPU0/0 1/RSP1/CPU0/0 UP On Partner RSP

3 0/RSP1/CPU0/1 1/RSP0/CPU0/1 UP On Partner RSP

Active Priority is 0

Active switch is RSP0

CLM Table version is 2

•3.5 Control Link shut/no shut CLIs

Each RSP has two front panel EOBC links, numbered 0 and 1. The CLI to shut the links is shown below:

RP/1/RSP0/CPU0:A9K-Cluster-IPE(admin-config)#nv edge control control-link disable <0-1 > location <>

On shutting a control port, the CLI will also set a rommon variable on that RSP like “CLUSTER_0_DISABLE = 1” if port 0 is disabled and “CLUSTER_1_DISABLE = 1” if port 1 is disabled. As long as this rommon variable is set, neither rommon nor IOS-XR will ever enable that port.

The behavior when ALL the control links are shut is, obviously, that both chassis become DSC. But if the IRL links are active, then one of the chassis will reload, and as soon as the IRL links come back up it will reboot again.

Currently the recommended procedure, if all the control links have been shut down, is as follows:

  • •1. Shut down the IRL links from one of the chassis (whichever chassis doesn’t reboot – remember, one chassis keeps coming up and rebooting). This will get both chassis to stay UP.
  • •2. Reload one chassis, keep BOTH RSPs in rommon, and unset the rommon variables as below on BOTH RSPs:
  • •a. rommon> unset CLUSTER_0_DISABLE
  • •b. rommon> unset CLUSTER_1_DISABLE
  • •c. rommon> sync
  • •d. rommon> reset
  • •3. On the other chassis, which is still in XR, go to admin config and enter “no nv edge control control-link disable <port> <location>” for each port and location where the port was shut down.
  • •4. On the RSPs in rommon, enter the following:
  • •a. rommon> boot mbi:

NOTE: The above is indeed a cumbersome and lengthy procedure (needed only if ALL control links were shut). From 4.2.3 onwards the procedure to unshut is very simple – on whichever chassis did not reboot, go to admin config mode and just enter “no nv edge control control-link disable <port> <location>”; that will automatically take care of syncing it with the other chassis as well (an illustrative example follows).
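For illustration only – assuming port 0 on 0/RSP0/CPU0 was the port that had been disabled (the port and location values here are placeholders) – the re-enable step would look like this:

RP/0/RSP0/CPU0:ios(admin-config)# no nv edge control control-link disable 0 location 0/RSP0/CPU0
RP/0/RSP0/CPU0:ios(admin-config)# commit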

•3.6 Miscellaneous control link CLIs

  • •1. show nv edge control control-link-port-counters – this CLI displays the Rx/Tx packet statistics through the EOBC front panel ports (0 or 1)

  • •2. show nv edge control control-link-sfp – this CLI dumps the SFP EEPROM that’s plugged into the front panel port. In addition it provides the data below

SFP Plugged in : 0x00000001 (1)

SFP Rx LOS : 0x00000000 (0)

SFP Tx Fault : 0x00000000 (0)

SFP Tx Enabled : 0x00000001 (1)

The “SFP Plugged in” should be value 1 if there is an SFP present. The “SFP Rx LOS” should be 0 or else there is Rx Loss of Signal (an error !). The “SFP Tx Fault” should be 0 or else there is an SFP Fault (an error !). The “SFP Tx Enabled” should be 1 or else the SFP is not enabled from the control Ethernet driver (also an error !).

Supported EOBC SFPs.

In 4.2.1

SFP-GE-S=

1000BASE-SX SFP (DOM), MMF, 550/220m

In 4.3.0

SFP-GE-S=

1000BASE-SX SFP (DOM), MMF, 550/220m

GLC-SX-MMD=

1000BASE-SX SFP, MMF, 850nm, 550m/220m, DOM

  • •3. show nv edge control control-link-debug-counts – this is mostly for Cisco engineering support debugging. Values that might be of interest are as below

Admin UP : 0x00000001 (1)

SFP supported cached : 0x00000001 (1)

PHY status register : 0x00000070 (112)

An “Admin UP” value of 0 means that the “nv edge control control-link disable <port> <location>” CLI has been configured; without that config, the value should be 1, which is the default. “SFP supported cached” indicates whether the plugged-in SFP is Cisco supported – 1 means the SFP is supported, 0 means it is not. If the control link has an SFP plugged in, a cable connected to the remote end, the remote end up and the laser and link healthy, then the “PHY status register” should have a value of 0x70; it is an internal PHY register indicating that the link is good. If there is no cable, no SFP, a bad cable, a bad link, etc., it will not read 0x70 – this can sometimes be useful to Cisco support during debugging.

•4 nV Inter Rack Link (IRL) connections

The IRL connections carry traffic that enters one chassis and is forwarded out of an interface on the other chassis of the nV Edge system. The IRL links must be 10 Gig links and must be direct L1 connections – no routed/switched devices are allowed in between. There can be a maximum of 16 such links between the chassis. A minimum of 2 links is recommended for resiliency (Section 4.7 discusses load balancing across links), and the two links should be on two separate line cards, again for resiliency in case one line card goes down due to a fault. The number of IRL links needs to be planned based on the number of cards in the system and the expected traffic over the IRL during a failure.

The configuration of an interface as IRL is simple, as shown below:

interface tenGigE 0/1/1/1

nv

edge

interface

!

Add this config to the IRL interfaces on both chassis, of course. UDLD runs over these links to monitor bi-directional forwarding health. Only when UDLD reports that the echo and echo response are fine (the standard UDLD state machine) is the interface placed into “Forwarding” state; until then the interface remains in “Configured” state. So an IRL interface might be “Configured” but not “Forwarding”; once it is both, it will be used for forwarding data across the chassis.

RP/0/RSP0/CPU0:ios#show nv edge data forwarding location 0/RSP0/CPU0

nV Edge Data interfaces in forwarding state: 1

tenGigE 0_1_1_1 <--> tenGigE 1_1_0_1

nV Edge Data interfaces in configured state: 2

tenGigE 1_1_0_1

tenGigE 0_1_1_1

The above CLI output shows two IRLs in “Configured” state – one on each rack. It also shows one “pair” of IRLs in “Forwarding” state; the pair consists of one interface from each rack. The UDLD protocol automatically detects which interface is connected to which and forms the “pair”.

So if you have configured IRLs but do not see the line “nV Edge Data interfaces in forwarding state:” in your CLI output, something is wrong. We recommend going through the standard interface checklist below (a few illustrative commands are sketched after the checklist):

-> Are the cables and SFPs all good ?

-> Are the interfaces unshut and Up/Up ?

-> Are there interface drops or errors ?

-> If you are conversant with the packet path, are there any other packet path drops ?
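A minimal set of commands for working through this checklist might look like the following; the interface and location values are placeholders taken from the earlier example, not a prescribed sequence:

show interfaces tenGigE 0/1/1/1
show controllers np counters all location 0/1/CPU0
show nv edge data forwarding location 0/RSP0/CPU0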

•4.1 UDLD on IRL links

The UDLD timers on the IRL links are set to 40 milliseconds times 5 hellos, ie around 200 msecs as the expiry timeout. That means that any uni-directional problem with the IRL links will be detected & corrected in around 250 msecs (200 msecs + delta for processing overheads).

If you want to see the UDLD state machine on the line card hosting these links, the CLI below can be used. The Interface [0x…] value in the output is what we call the “ifhandle”. The interface name corresponding to it can be displayed using the CLI “show im database ifhandle <ifhandle> location <line card>”.

In the example below, the UDLD state is Bidirectional, which is the desired correct state when things are working fine.

RP/0/RSP0/CPU0:ios#show nv edge data protocol all location 0/1/cPU0

Interface [0x60002c0][769]

---

Port enable administrative configuration setting: Enabled

Port enable operational state: Enabled

Current bidirectional state: Bidirectional

Current operational state: Advertisement - Single neighbor detected

Message interval: 20 msec

Time out interval: 10000 msec

Entry 1

---

Expiration time: 140 msec

Device ID: 1

Current neighbor state: Bidirectional

Device name: CLUSTER_RACK_01

Port ID: [0x46000100][769]

Neighbor echo 1 device: CLUSTER_RACK_00

Neighbor echo 1 port: [0x60002c0][769]

Message interval: 20 msec

Time out interval: 100 msec

CDP Device name: ASR9K CPU

•4.2 What are the IRL links used for?

The IRL links are used for forwarding packets whose ingress and egress interfaces are on separate racks. They are also used for all protocol Punt packets and protocol Inject packets. As explained in Section 2, the protocol stack “Primary” runs on the primary-DSC RSP in one of the chassis. So if a protocol punt packet comes in on an interface in another chassis, it has to be punted to the primary-DSC RSP in the remote chassis. This punt is done via the IRL. Similarly if the protocol stack on the primary-DSC wants to send a packet out of an interface on another chassis, that is also done via the IRL interfaces.

•4.3 nV IRL “threshold monitor”

If the number of IRL links available for forwarding goes below a certain threshold, that might mean that the remaining IRLs will get congested and more and more inter-rack traffic will get dropped. So the IRL-monitor provides a way of shutting down other ports on the chassis if the number of IRL links goes below a threshold. The commands available are below

RP/0/RSP0/CPU0:ios(admin-config)#nv edge data minimum <minimum threshold> ?

backup-rack-interfaces Disable ALL interfaces on backup-DSC rack

selected-interfaces Disable only interfaces with nv edge min-disable config

specific-rack-interfaces Disable ALL interfaces on a specific rack

There are three modes of configuration possible.

•4.3.1 Backup-rack-interfaces config

With this configuration, if the number of IRLs goes below the configured <minimum threshold>, ALL interfaces on whichever chassis is hosting the backup-DSC RSP will be shut down. Again, note that the backup-DSC RSP can be on either of the chassis (an example is sketched below).
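As a sketch, the following admin configuration (the threshold value of 2 is chosen arbitrarily here) would shut down all interfaces on the backup-DSC rack if fewer than 2 IRLs remain in forwarding state:

RP/0/RSP0/CPU0:ios(admin-config)# nv edge data minimum 2 backup-rack-interfaces
RP/0/RSP0/CPU0:ios(admin-config)# commit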

•4.3.2 Specific-rack-interfaces config

With this configuration, if the number of IRLs goes below the configured <minimum threshold>, ALL interfaces on the specified rack (0 or 1) will be shut down.

•4.3.3 selected-interfaces config

With this configuration, if the number of IRLs goes below the configured <minimum threshold>, the interfaces (on any rack) that are explicitly configured to be brought down will be shut down. How do we “explicitly” configure an interface to respond to IRL threshold events?

RP/0/RSP0/CPU0:ios(config)#interface gigabitEthernet 0/1/1/0

RP/0/RSP0/CPU0:ios(config-if)#nv edge min-disable

RP/0/RSP0/CPU0:ios(config-if)#commit

So in the above example, if the number of IRLs goes below the configured minimum threshold, interface Gig0/1/1/0 will be shut down.

•4.3.4 What is the default config

The default config (if the customer does not explicitly configure any of the above) is equivalent to “nv edge data minimum 1 backup-rack-interfaces”. This means that if the number of IRLs in forwarding state drops below 1 (i.e., not even one forwarding IRL remains), ALL the interfaces on whichever rack hosts the backup-DSC will be shut down, and all traffic on that rack will stop being forwarded.

This might suit some customers and not others. The behavior can be effectively turned off with the CLI “nv edge data minimum 0 backup-rack-interfaces” – this says that only if the number of IRLs in forwarding state goes below 0 (which can never happen) should any interface on any rack be shut down.

•4.4 Default QoS on IRL links

When an interface is configured as an IRL link, we install 5 absolute priority queues on the port in both the ingress and egress directions. The priorities are below

  • •1. All protocol punt / inject packets like protocol Hellos etc..
  • •2. Multicast traffic
  • •3. Fabric priority 0 traffic
  • •4. Fabric priority 1 traffic
  • •5. Fabric priority 2 traffic

The IRL links do not allow “user configurable” MQC policies on the IRL interfaces themselves. The classification of “punt / inject” and “multicast” is done internally in microcode – that is, other than being a punt/inject or multicast packet, there is no way to influence or force a packet into the first two queues.

Which packets get into the last three queues can be influenced – simply by having QoS ingress policies that mark packets to a cos value of 0, 1 or 2. There is no other way to influence what gets into these queues. The queue id selected on the ingress chassis’s IRL links is carried across in the VLAN COS bits; the egress chassis’s IRL that receives the packet uses this queue id encoded in the VLAN COS to select the queues it uses on ingress (when it receives the packets from the remote chassis).

The CLI to display the nV Edge QoS queues is shown below, using an IRL interface with the configuration shown. The subslot number 0 in the example is the “subslot” in which the MPA (the pluggable adaptor) sits on a MOD-80/160 line card in the ASR9K. If the line card is not of a type that supports pluggable adaptors, just use 0 for the subslot. The port number 1 used in the example is simply the last number in the 1/1/0/1 notation.

The drops (if any) in these queues are aggregated and reflected in the “show interface” drops also. The standard interface MIBs can be used for monitoring these drops. Note that the individual queue drops are not exported to MIBs, only the aggregate drops are exported as the interface drops. Also the IRL links are just regular interfaces, so the regular interface MIBs will all work on IRLs also.

RP/0/RSP0/CPU0:ios#sh running-config interface gigabitEthernet 1/1/0/1

interface GigabitEthernet1/1/0/1

nv

edge

interface

!

RP/0/RSP0/CPU0:ios#show qoshal cluster subslot 0 port 1 location 1/1/cPU0

Cluster Interface Queues : Subslot 0, Port 1

===============================================================

Port 1 NP 0 TM Port 17

Ingress: QID 0xa8 Entity: 0/0/0/4/21/0 Priority: Priority 1 Qdepth: 0

StatIDs: commit/fast_commit/drop: 0x5f0348/0x0/0x5f0349

Statistics(Pkts/Bytes):

Tx_To_TM 681762/140538069

Total Xmt 681762/140538069 Dropped 0/0

Ingress: QID 0xa9 Entity: 0/0/0/4/21/1 Priority: Priority 2 Qdepth: 0

StatIDs: commit/fast_commit/drop: 0x5f034d/0x0/0x5f034e

Statistics(Pkts/Bytes):

Tx_To_TM 0/0

Total Xmt 0/0 Dropped 0/0

Ingress: QID 0xab Entity: 0/0/0/4/21/3 Priority: Priority 3 Qdepth: 0

StatIDs: commit/fast_commit/drop: 0x5f0357/0x0/0x5f0358

Statistics(Pkts/Bytes):

Tx_To_TM 0/0

Total Xmt 0/0 Dropped 0/0

Ingress: QID 0xaa Entity: 0/0/0/4/21/2 Priority: Priority Normal Qdepth: 0

StatIDs: commit/fast_commit/drop: 0x5f0352/0x0/0x5f0353

Statistics(Pkts/Bytes):

Tx_To_TM 0/0

Total Xmt 0/0 Dropped 0/0

Ingress: QID 0xac Entity: 0/0/0/4/21/4 Priority: Priority Normal Qdepth: 0

StatIDs: commit/fast_commit/drop: 0x5f035c/0x0/0x5f035d

Statistics(Pkts/Bytes):

Tx_To_TM 0/0

Total Xmt 0/0 Dropped 0/0

Egress: QID 0xc8 Entity: 0/0/0/4/25/0 Priority: Priority 1 Qdepth: 0

StatIDs: commit/fast_commit/drop: 0x5f03e8/0x0/0x5f03e9

Statistics(Pkts/Bytes):

Tx_To_TM 3372382/697778537

Total Xmt 3372382/697778537 Dropped 0/0

Egress: QID 0xc9 Entity: 0/0/0/4/25/1 Priority: Priority 2 Qdepth: 0

StatIDs: commit/fast_commit/drop: 0x5f03ed/0x0/0x5f03ee

Statistics(Pkts/Bytes):

Tx_To_TM 0/0

Total Xmt 0/0 Dropped 0/0

Egress: QID 0xcb Entity: 0/0/0/4/25/3 Priority: Priority 3 Qdepth: 0

StatIDs: commit/fast_commit/drop: 0x5f03f7/0x0/0x5f03f8

Statistics(Pkts/Bytes):

Tx_To_TM 0/0

Total Xmt 0/0 Dropped 0/0

Egress: QID 0xca Entity: 0/0/0/4/25/2 Priority: Priority Normal Qdepth: 0

StatIDs: commit/fast_commit/drop: 0x5f03f2/0x0/0x5f03f3

Statistics(Pkts/Bytes):

Tx_To_TM 0/0

Total Xmt 0/0 Dropped 0/0

Egress: QID 0xcc Entity: 0/0/0/4/25/4 Priority: Priority Normal Qdepth: 0

StatIDs: commit/fast_commit/drop: 0x5f03fc/0x0/0x5f03fd

Statistics(Pkts/Bytes):

Tx_To_TM 0/0

Total Xmt 0/0 Dropped 0/0

RP/0/RSP0/CPU0:ios#

•4.5 Configurable QoS on IRL interfaces

To support more flexible QoS options for customers who want more than the default QoS mentioned in Section 4.4, we provide an option for configuring regular MQC policies in the EGRESS direction (there is no ingress support), with some limitations. The limitation, in one simple sentence, is that an MQC policy configured on an IRL does not have the ability to access the packet contents – that is, there is no way of figuring out whether the packet that goes out on the IRL is IPv4 or IPv6, etc. So none of the MQC features that need to look into the packet will work. So how exactly is it used?

The typical use case is that the customer configures an ingress MQC policy-map on any regular (non-IRL) ingress interface. That ingress policy can parse the packet and set a “qos-group” for it. The egress IRL policy-map can then match on this qos-group and apply features like queuing and shaping. Random detect can also be applied (though not based on dscp – remember, that needs access to packet contents), and of course marking is not supported. A hedged configuration sketch follows.
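The following is a minimal sketch of this model; the class/policy names, the qos-group value, the bandwidth figure, and the access interface are arbitrary placeholders, and TenGigE0/1/1/1 is the IRL interface used earlier in this section:

class-map match-any CLUSTER-HI
 match qos-group 1
 end-class-map
!
policy-map ACCESS-IN
 class class-default
  set qos-group 1
 !
 end-policy-map
!
policy-map IRL-OUT
 class CLUSTER-HI
  bandwidth percent 40
 !
 class class-default
 !
 end-policy-map
!
interface GigabitEthernet0/1/1/0
 service-policy input ACCESS-IN
!
interface TenGigE0/1/1/1
 service-policy output IRL-OUT
!

The ingress policy classifies and tags traffic with a qos-group on a regular interface; the egress policy on the IRL then only matches that qos-group (it cannot look into the packet) and applies queuing actions such as bandwidth or shaping.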

The user is not prevented from applying any MQC policy on the IRL, regardless of whether that policy contains features unsupported on the IRL. There is no config-level rejection of policies on the IRL interface yet (this might be enforced in later releases), so the user has to take care to configure only supported features or the behavior is unpredictable. For example, if the user configures an egress MQC policy on the IRL that does marking, then the packet going out of the IRL will have its contents changed in some random location, and that might cause those packets to be dropped in the node or at the host!

The configuration of MQC on IRL and the show commands etc.. are exactly the same as MQC on a regular interface (remember IRL is just a regular interface !).

•4.6 IRL packet encapsulation and overhead

The packet that goes out on the IRL will have a VLAN encapsulation with the VLAN hard-coded to vlan-id 1. The vlan-id really doesn’t matter; we just use the VLAN COS bits to carry over the packet priority as mentioned in Section 4.4. That is 18 bytes of overhead. In addition there is around 24 bytes of internal overhead, which depends very much on the kind of packet (L3 / L2 / mcast etc.) being transported. So on average there is around 42 bytes of overhead.

•4.7 IRL load balancing

IRL load balances packets based on flow. How a “flow” is defined varies from feature to feature. In general, for any given feature, the answer to “how does this feature’s packet get load balanced across link bundle members” also applies to load balancing across IRLs. In other words, IRL load balancing obeys exactly the same principles as link bundle member load balancing: a 32-bit hash value is calculated for each packet/feature, and that hash (with some bit flips etc. to avoid polarization) is used for IRL load balancing just as it is for link bundles.

Let us examine the different kinds of features briefly. This is by no means meant to be an exhaustive documentation of all the load balancing algorithms on the router, rather just to give an overview of the major classes of load balancing.

•4.7.1 Ingress IP packet

The standard tuple is used for hash calculation, as for load balancing across link bundle members – source IP, destination IP, source port, destination port, and protocol type. It does not matter whether the egress is IP or MPLS; the ingress is all that matters.

•4.7.2 Ingress MPLS packet

If the incoming packet is MPLS, the forwarding engine looks deeper to see if the underlying packet is IP. If it is IP, then the standard IP hash tuple is used for calculating the hash. If the underlying packet is not IP, then just the labels from the label stack are used for calculating the hash. The label allocation mode (per CE or per VRF) has no impact on the hash.

•4.7.3 L2 Unicast

Here load balancing is done based on src/dst MAC addresses. Again, as explained initially, this is not an exhaustive answer – there are scenarios (VPLS, for example) where the VC label hash is used instead.

•4.7.4 L2 Flood

For L2 flood traffic over link bundles, there are multiple elaborate modes of load balancing; for exhaustive documentation, refer to the L2 link bundle documentation. In general, there are two modes of load balancing, tied to the flooding mode in L2.

•4.7.4.1 Flood optimized mode

In this mode, to restrict L2 floods from reaching too many line cards, the hash is “statically” chosen based on bridge group. So some bridge groups will be “tied” to one IRL and others to another IRL – the same behaviour as chosen for L2 over link bundles.

•4.7.4.2 Convergence / Resiliency mode

In this mode, the L2 flood is hashed in ucode based on the src/dst mac addresses.

•4.7.5 L3 Multicast

L3 Multicast hashes multicast flows based on (S,G) and uses that hash to distribute packets across the IRLs – again the same technique used for distributing multicast packets across link bundle members.

•5 nV Edge Redundancy model

There are four very simple rules that can always help in determining the primary-DSC and backup-DSC RSPs in an nV edge system.

  • •1. Primary-DSC and backup-DSC both are always the “Active” RSP in each chassis. The “Active” here refers to the “Active” we know in the context of a single chassis ASR9K – where one RSP is “Active” and another is “Standby”

  • •2. Primary-DSC and backup-DSC will always be on RSPs in different chassis.

  • •3. If a Primary-DSC goes down, then the backup-DSC becomes primary-DSC. The chassis which hosts the Primary-DSC is the DSC chassis.

  • •4. If any RSP other than the primary-DSC or backup-DSC goes down, there is no change in the state of the primary-DSC or backup-DSC.

With these four rules in place, in any given scenario we can figure out what happens if any of the RSPs in either chassis goes down.

•5.1 Redundancy switchover: Control Ethernet readiness

Before issuing a redundancy switchover, it is good practice to check the control links in the system and confirm that there is at least one backup link available that can take over. For example, in the output below, if we decide to issue “redundancy switchover” on 0/RSP0/CPU0, we have three more links (shown as “Blocking” or “On Partner RSP”) and one of them can take over as the link connecting the control planes of both chassis (see Section 3.1 for details).

Sometimes, because of a fault (say a fiber cut or a bad SFP), a few links may be down, in which case you will not see those links (neither as “Blocking” nor as “On Partner RSP”). So if there is no backup link and we issue a switchover, the only link that is “Forwarding” will go away and there will be no more control plane connectivity between the chassis.

NOTE: We are enhancing the “redundancy switchover” CLI to automatically check this condition and disallow the cli to go through if there are no backup links. Until this enhancement is implemented, it is recommended to do this manual procedure.

RP/0/RSP0/CPU0:ios# show nv edge control control-link-protocols location 0/RSP0/CPU0

Priority lPort Remote_lPort UDLD STP

======== ===== ============ ==== ========

0 0/RSP0/CPU0/12 1/RSP0/CPU0/12 UP Forwarding

1 0/RSP0/CPU0/13 1/RSP1/CPU0/13 UP Blocking

2 0/RSP1/CPU0/12 1/RSP1/CPU0/12 UP On Partner RSP

3 0/RSP1/CPU0/13 1/RSP0/CPU0/13 UP On Partner RSP

•5.2 RSP/Chassis failure Detection in ASR9k nV Edge

In an ASR-9k nV Edge system, on failure of the Primary DSC node the RSP in the Backup DSC role becomes Primary, with the duties of being the system “master” RSP and hosting the active set of control plane processes. In the normal case for nV Edge, the Primary and Backup DSC nodes are hosted on separate racks. This means that the failure detection for the Primary DSC occurs via communication between racks.

The following mechanisms are used to detect RSP failures across rack boundaries:

  • •1) FPGA state information detected by the Peer RSP in the same chassis is broadcast over the control links. This is sent if any state change occurs, and periodically every 200ms.
  • •2) The UDLD state of the inter-chassis control links to the remote rack, with failures detected at 500ms
  • •3) The UDLD state of the inter-chassis data links to the remote rack, with failures detected at 500ms
  • •4) A keep-alive message sent between RSP cards via the inter-chassis control links, with a failure detection time of 10 seconds.

Additionally, messages are sent between racks for the purpose of Split Node avoidance / detection. These occur at 200ms intervals across the inter-chassis data links, and can optionally be configured redundantly across the RSP Management LAN interfaces. Refer to Section 6.3 below.

Example HA Scenarios:

  • •1. Single RSP Failure of the Primary DSC node

The Standby RSP within the same chassis initially detects the failure via the backplane FPGA. On failure detection this RSP will transition to the active state and notify the Backup DSC node of the failure via the inter-chassis control link messaging.

  • •2. Failure of Primary DSC node and its Standby peer RSP.

There are multiple ways this can occur, such as a power-cycle of the Primary DSC rack or a simultaneous soft reset of both RSP cards within the Primary rack.

The remote rack failure will initially be detected by a UDLD failure on the inter-chassis control link. The Backup DSC node then checks the state of UDLD on the inter-chassis data link. If the rack failure is confirmed by failure of the data link as well, the Backup DSC node becomes active.

UDLD failure detection occurs in 500ms, however the time between control link and data link failure can vary since these are independent failures detected by the RSP and LC cards. A windowing period of up to 2 seconds is needed to correlate the control and data link failures, and to allow for split-brain detection messages to be received.

The keep-alive messaging between RSPs acts as a redundant detection mechanism, should UDLD fail to detect a stuck or reset RSP card.

  • •3. Failure of Inter-Chassis control links (Split Node)

Failure is initially detected by the UDLD protocol on the inter-chassis control links. Unlike the rack reload scenario above, the Backup DSC will continue receiving UDLD and keep-alive messages via the inter-chassis data link. Similar to the rack reload case, a 2 second windowing period is allowed to correlate the control/data link failures. If after 2 seconds the data link has not failed, or Split Node packets are being received across the Management LAN, then the Backup DSC rack will reload to avoid the Split Node condition.

•6 Split Node

There are primarily two sets of links connecting the chassis in the nV edge system.

  • •1. Control links (recommended four of them)
  • •2. IRL links (minimum one)

So the two sets of links together add up to at least FIVE wires. Let us see what happens when there is a fault and a complete set of control links, a complete set of IRL links, or both go away (become faulty).

[Figure: Chassis0 and Chassis1 connected by FOUR control plane links and at least one IRL link.]

•6.1 All IRL links go away

In this case, refer to Section 4.3 – both chassis will be up and functioning, but the interfaces on one of the chassis “might” get shut down depending on what config is present on the box (or whether it is just the default config). Again, refer to Section 4.3 to understand which config is appropriate for you.

•6.2 All Control links go away

The two chassis in the nV edge system cannot function as “one entity” without control links. We have beacons that each chassis periodically exchanges over the IRL links. So if control links go down, then each chassis will know via the IRL beacons that the other chassis is UP, and one of the chassis has to just take itself down and go back to rommon.

Which chassis should go back to rommon? The logical choice is that the chassis hosting the Primary DSC RSP stays up and the non-primary rack resets, because the chassis hosting the primary-DSC runs all the “primary” protocol stacks and we want to avoid disturbing the protocols as much as possible. So the non-primary rack is taken down to rommon and it tries to boot and join the nV Edge system again – if at some point one or more control links become healthy again, that chassis will boot up and rejoin the nV Edge system.

Since IOS-XR cannot stabilize with the control links severed in this way, the non-primary rack will continue to boot up, detect that the control links are down, and reset, until the connectivity issue is resolved.

The CLI command “show nv edge control control-link-protocols” can be used to assess the current status of the control links in the event of a problem.

•6.3 All Control AND IRL links go away

In this scenario, we can “potentially” enter what is called a “Split Brain” – where each chassis thinks that the other chassis has gone down and each of them declares itself as the master. So protocols like OSPF will start having two instances each with the same router-id etc.. and that can be a problem for the network.

So to try and mitigate this scenario, we provide one more set of “last gasp” paths via the management LAN network. On EACH RSP in the system, we should connect one of the two management LAN interfaces (any one of them) to an L2 network so that all four of those interfaces (from each RSP) can send L2 packets to each other. Then we can enter the below configuration on each of those management LAN interfaces.

interface MgmtEth0/RSP0/CPU0/1

nv

edge

split-brain

!

With this configuration, each RSP sends high frequency beacons on these interfaces at 200 millisecond intervals. If both chassis are functional, each will receive beacons from the other; if both chassis then discover that they are operating independently, they know it is a problematic scenario and one of them will take itself down. The chassis that resets is the one that has been in the primary state for the least amount of time. A full four-interface configuration sketch is shown below.
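As a sketch, configuring one management interface per RSP would look like the following; the specific MgmtEth port numbers are placeholders – use whichever of the two management ports on each RSP is actually cabled to the shared L2 network:

interface MgmtEth0/RSP0/CPU0/1
 nv
  edge
   split-brain
!
interface MgmtEth0/RSP1/CPU0/1
 nv
  edge
   split-brain
!
interface MgmtEth1/RSP0/CPU0/1
 nv
  edge
   split-brain
!
interface MgmtEth1/RSP1/CPU0/1
 nv
  edge
   split-brain
!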

So this “Split Node” management lan path provides yet another alternate path to provide additional resiliency to try and avoid a nasty “Split Node” scenario.

But if the control links AND the IRL links AND the split-brain management LAN links ALL go away, there is no way to exchange any beacons between the chassis, and we enter the split-brain scenario where both chassis start functioning independently. If the management networks of the two chassis are not in the same subnet or the same location, an L2 connection should be provided between them to carry this last-gasp traffic.

NOTE: The Split Node interface messages are “best effort” messages; currently we do not monitor the “health” of those links. Those links are regular Management Ethernet interfaces and will have all the usual UP/DOWN traps, etc. But if, for example, there are intermittent drops of the monitoring messages on those links, we do not raise any alarm or complaint. We might enhance this in the future to include monitoring of packet drops (if any) on these links to alert the user.

•7 Feature configuration caveats

•7.1 Virtual Interfaces (Link bundle / BVI) mac-address

The link bundle / BVI configuration on nV Edge requires manual configuration of a mac-address under the interface. An example for a link bundle is shown below:

interface Bundle-Ether15

mac-address 26.51c5.e602 <== A MAC address like this needs to be configured explicitly

For link bundles, the following global LACP configuration is also required:

lacp system mac 0201.debf.0000

This caveat / requirement will be fixed in a later release; until then this configuration is needed for link bundles, BVIs, and any other virtual interfaces to work on an nV Edge system. A hedged BVI example follows.
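For a BVI the same idea applies. As a sketch (the BVI number and MAC address here are arbitrary placeholders):

interface BVI10
 mac-address 0026.51c5.e610
!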

•7.2 Link Bundle “switchover suppress-flap” : Rack / Chassis reload

interface Bundle-Ether15

lacp switchover suppress-flaps 15000

The “bundle manager” is a process that runs on the primary (DSC) and backup (backup-DSC) RSPs and is responsible for the configuration and state maintenance of the link bundle interfaces. When the primary (DSC) chassis in an nV Edge system is reloaded, the bundle manager on the backup-DSC needs to “go active” and open connections to external processes that provide other services (ICCP, as an example). A chassis reload is a much “heavier” operation than a regular RSP switchover, because it involves the restart of all RSPs and all line cards on that chassis, and this causes far more control plane churn than a regular RSP switchover, where only one node (one RSP) goes away. For example, the basic infrastructure processes that handle IPC (Inter Process Communication) in the system have to do a lot of cleanup: they have to clean up data structures corresponding to all the nodes that went away, flush packets from/to those nodes, etc. The routing protocols / RIB have to process many interface-down notifications and start NSF / GR, etc. Owing to this additional control plane load, when the bundle manager asks to connect to external “services”, those services take more time to respond because they are already busy processing node-down events.

Hence, the bundle-manager process might be “blocked” for a longer period of time than in a regular switchover scenario. During this “blocked” period, the remote end might time out and declare the bundle down. To prevent this, we have the “lacp switchover suppress-flaps <value>” command. This needs to be configured on the nV Edge system AND on the remote boxes (if the remote end is not an IOS-XR box, use whatever the equivalent of that config is on that box). This basically tells the link bundle to tolerate more control packet losses during this period.

In the example here, we have configured a 15 second tolerance – note that this DOES NOT mean there will be a 15 second packet drop. The bundle manager will update the data plane to use a newly active link as soon as it gets the event that decides who is active (a notification from the peer in the case of MC-LAG), and data can start flowing. All this does is prevent the bundle from going down while the rest of the bundle manager control plane is busy doing other work (like connecting to services) and the peer is expecting control packets to be received/transmitted.

•7.3 IGP protocols and LFA-FRR

ASR9K nV Edge High Availability is unique in that it is probably the only High Availability model where we “expect” topology changes during a backup-to-primary switchover, such as during a Rack / Chassis reload. If the Primary (DSC) chassis is reloaded, and that chassis had IGP interface(s) on its line card(s), then when the Backup-DSC takes over as Primary-DSC it has to do switchover processing AND, at the same time, process topology changes due to the loss of those interfaces.

As we know, to handle switchover cases gracefully, customers normally configure Non Stop Forwarding (NSF) under IGP protocols like ISIS and OSPF. So when the DSC chassis is reloaded, the new DSC (old backup-DSC) will immediately start NSF on the IGP (say ISIS), and, as with regular NSF, it can take many seconds (default 90 seconds, changeable via the nsf lifetime CLI) for NSF to complete; the RIB is informed about topology changes only AFTER NSF is complete. (A hedged NSF configuration sketch is shown below.)
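For reference, a minimal sketch of the NSF configuration mentioned above under ISIS might look like the following; the process name matches the LFA-FRR example below, while the nsf lifetime value of 30 seconds is arbitrary and should be validated against the release documentation:

router isis Cluster-L3VPN
 nsf cisco
 nsf lifetime 30
!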

So during this time frame, the new DSC chassis will have stale routes pointing to interfaces that no longer exist (those on the chassis that was reloaded), and this can lead to a long period of traffic loss. So what is the solution? If we think through this problem, what we are asking for is for CEF / FIB to change the forwarding tables even though the routing protocols / RIB have not yet asked it to do so. This exactly fits the bill for the LFA-FRR feature. Without LFA-FRR, the convergence time during a chassis reload in an nV Edge system will be poor. LFA-FRR is a simple configuration; a basic example is below. Note that LFA-FRR can work with ECMP paths – one path in the ECMP list can back up the other path in the list.

router isis Cluster-L3VPN

<snip>

interface Loopback0

address-family ipv4 unicast

!

!

interface TenGigE0/1/0/5

address-family ipv4 unicast

fast-reroute per-link

•7.4 Multicast convergence during RACK reload or OIR

When you do a rack OIR/reload, PIM on the old standby / new active rack starts fresh (PIM is not hot standby). It triggers NSF for the first 3 minutes. By the time NSF ends, it downloads the routes to MFIB and further to the PD (platform-dependent) layer. Until this time, the A flag is not set on the RPF interface and packets are dropped.

The difference in the rack OIR case is that the LC also goes through a restart, which results in a topology change. However, since the new change cannot be downloaded to the PD layer, the update does not happen and packets are dropped. Compare this with a regular switchover, where only the RP node goes through a reload: since the LC remains unaffected, even though MRIB is within its NSF window, packets continue to be switched using the old route.

To mitigate this, configure link bundles on all interfaces that carry multicast flows, and make sure each bundle has member links in both racks. This allows a rack OIR without changing the state of the bundle interfaces (a hedged example follows).
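As a sketch, a bundle with one member link on each rack might look like this; the bundle number, the IP address, and the member interfaces are arbitrary placeholders:

interface Bundle-Ether20
 ipv4 address 192.0.2.1 255.255.255.252
!
interface TenGigE0/2/0/0
 bundle id 20 mode active
!
interface TenGigE1/2/0/0
 bundle id 20 mode active
!

The first member is on Rack0 (0/2/0/0) and the second on Rack1 (1/2/0/0), so the bundle can stay up across a reload or OIR of either rack.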

•8 Feature Gaps

BFD Multihop is one feature that is supported on a single chassis, but not on the nV Edge system.

The nV Edge system also doesn’t support clock / syncing features like syncE.

nV Edge is only recommended with dual RSPs in each chassis due to the EOBC redundancy design. The EoBC of the ASR9001 is designed without RSP redundancy in mind, so it’s not exactly the same as chassis that support dual RSP.

•9 Convergence numbers

After applying all the required caveats mentioned in Section 7, at the time of writing (the 4.2.3 24I early-image timeframe), the convergence number for an L3VPN profile with an access-facing link bundle (one member from each chassis) and core-facing ECMP (two IGP links, one from each chassis), with 3K eBGP sessions and one million routes, is around 8 seconds for a chassis reload (either chassis) in the nV Edge system. The number will certainly differ for other profiles; each profile needs separate measurement and qualification / tuning. The obvious question is how much lower it can get. The natural comparison is with an RSP failover. The factors that are (very) different between an RSP failover and a chassis reload are:

  • •1. Chassis reload is a “software detected” event. A regular RSP switchover in an ASR9K system is a “hardware detected” event because both RSPs are in the same chassis and one going down triggers an interrupt for the other, whereas a chassis going away is detected by loss of keep-alive packets from one chassis to the other. How fast we detect a failure is a fine balance between speed and stability: if we detect keep-alive timeouts too fast, the margin for errors / packet losses in the system is narrow and we might have false triggers; if we detect too slowly, convergence suffers.

  • •2. Chassis reload involves a heavy amount of control plane churn – line cards go away, hence interfaces go away, so the control plane protocols and control plane infrastructure (like IPC – Inter Process Communication) have to update this state and make sure they clean up data structures related to entities that went away. Imagine if the chassis that went away had 128K interfaces! That will trigger quite some control plane activity.

  • •3. Chassis reload involves updating data plane on the surviving chassis whereas RSP failover does not touch the data plane. And based on scale, this can be a time consuming activity also.

  • •4. Chassis reload can involve topology change and updates triggered by the neighboring boxes whereas RSP switchover is practically unknown to the peers (especially if NSR is enabled for all protocols).

For all these reasons, it is almost impossible to achieve anything better than, say, 3 to 4 seconds (currently 8 seconds) for the L3VPN profile mentioned at the beginning of this section, and even that remaining delta of roughly 5 seconds would come only with quite a high engineering investment.

•10 “Debugging mode” CLIs – cisco support only

These CLIs are visible only to cisco-support users. There are many more CLIs than the ones explained below; many of them are purely related to tuning the internal control port error-retry logic inside the driver and are unlikely to be of use to anyone other than the engineers. The ones explained below are quite “generic”, related to the UDLD protocol etc., and hence documented here.

  • •1. nv edge control control-link udldpriority – this CLI sets the thread priority of the process handling the UDLD packets to higher / lower value. Maximum is 56 and minimum is 10. We sometimes try tweaking this to higher values when we find that the CPU is being loaded by some other high priority activity and hence UDLD flaps. We also tweak it sometimes to be lower in case we find that the UDLD thread itself is hogging too much CPU.

  • •2. nv edge control control-link udldttltomsg – this is a multiplier that affects the UDLD timeout. For some reason (say high CPU utilization or too many link errors etc..) if we want to make UDLD run slower, then this value can be set to a larger value. The UDLD timeout will be 50msecs times this multiplier

  • •3. nv edge control control-link allowunsupsfp – we allow only Cisco supported 1Gig SFPs in the front panel control ports, this CLI allows any SFP (that the PHY on the board supports) to be plugged in.

  • •4. nv edge control control-link noretry – by default if the front panel control ports have some error, a retry algorithm kicks in a backoff timer mode to bring the port up again. If we don’t want a retry, this CLI disables the retry algorithm.

  • •5. nv edge data allowunsup – by default only 10Gig interfaces are allowed as IRLs. If some other interface type (such as 1Gig) has to be enabled as an IRL for debugging / testing, this CLI has to be configured before the IRL config will be accepted under the unsupported interface.

  • •6. nv edge data stopudld – if the UDLD protocol has to be stopped on the IRLs for debugging, this CLI can be used. Any **configured** IRL interface will then be declared available for forwarding regardless of its interface state (UP or DOWN), so be careful while using this CLI.

  • •7. nv edge data udldpriority – sets the thread priority of the process handling the UDLD packets (on the line card hosting the IRL) to a higher / lower value. The maximum is 56 and the minimum is 10. As with the control-link equivalent, we raise it when some other high-priority activity is loading the CPU and causing UDLD flaps, and lower it when the UDLD thread itself is hogging too much CPU.

  • •8. nv edge data udldttltomsg – a multiplier that affects the IRL UDLD timeout. If for some reason (say high CPU utilization on the LC hosting the IRL, or too many link errors on the IRL) we want UDLD to run slower, this value can be set larger. The UDLD timeout is 20 msec times this multiplier.
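As a hedged illustration only (these are hidden cisco-support CLIs; whether they are entered in global configuration mode as shown, and the exact argument syntax, may differ by release), raising the control-link UDLD thread priority and slowing down the IRL UDLD timeout might look something like this. The values 40 and 10 are arbitrary examples within the ranges described above (priority 10–56; multiplier 10 giving a 200 msec IRL UDLD timeout):

RP/0/RSP0/CPU0:ios(config)# nv edge control control-link udldpriority 40
RP/0/RSP0/CPU0:ios(config)# nv edge data udldttltomsg 10
RP/0/RSP0/CPU0:ios(config)# commit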

•11 nV Edge MIBs

The SNMP agent and MIB-specific configuration are no different for the nV Edge scenario.

•11.1 Redundancy related MIBs

With up to four RSPs in an nV Edge system, each chassis having an “Active / Standby” pair of RSPs, and the nV Edge system as a whole having a “primary-DSC / backup-DSC” pair, multiple redundancy elements come into the picture. There is “node redundancy”, which says which node in a given chassis is “Active” and which is “Standby”. There is node-group redundancy, which says which node in the nV Edge system is the “primary-DSC” and which is the “backup-DSC”. And there are “process groups” which have their own redundancy characteristics – for example, protocol stacks (say OSPF) have redundancy across the primary-DSC / backup-DSC pair, whereas some other “system” software elements have redundancy across the “Active / Standby” RSPs within each chassis. This relationship is what we call “process group” redundancy. The MIBs involved are summarised below.

  • CISCO-RF-MIB – currently provides the DSC chassis active/standby node pair info; in the nV Edge scenario it should provide the DSC primary/backup RP info. Provides switchover notification.

  • ENTITY-STATE-MIB – status only; no relationships. Provides redundancy state info for each node; no relationships are indicated.

  • CISCO-ENTITY-STATE-EXT-MIB – extension to ENTITY-STATE-MIB which defines notifications (traps) on redundancy status changes.

  • CISCO-ENTITY-REDUNDANCY-MIB – both status and relationships: process group redundancy relationships and node status. It defines two redundancy group types: (1) a node redundancy group type and (2) a process group redundancy type. Node redundancy pairs are shown as groups of the node redundancy group type; for each process group, the group members are the nodes on which its primary and backup processes are placed.

•11.1.1 Node Redundancy MIBs

CISCO-RF-MIB is currently used to monitor the node redundancy of the DSC chassis’ active/standby RPs. The MIB definition is limited to representing the redundancy relationship, status, and other info of only two nodes.

CISCO-ENTITY-REDUNDANCY-MIB is used to model the redundancy relationships of pairs of nodes. The redundant node pairs are defined as redundancy groups with a group type indicating the group is a redundant node pair. The members of the group would be the nodes within the node-redundant pair.

•11.1.2 Process Redundancy MIBs

The CISCO-ENTITY-REDUNDANCY-MIB is also used to model the redundancy relationships of node pairs pertaining to specific process groups. The redundant process groups are defined as redundancy groups with a group type indicating the group is a redundant process group. The members of the group are the nodes where the primary and backup processes are placed for that process group.

•11.1.3 Inventory Management

The inventory information for each chassis and the respective physical entities will be available just as in the single chassis. The difference for ASR9K nV Edge (as in CRS multi-chassis) is the presence of a top-level entity in the hierarchy which acts as a container of the chassis entities. This entity will have entPhysicalClass value of ‘stack’.

Rack 0 -- index 24555730
    entPhysicalClass = ‘chassis’
    entPhysicalContainedIn = 1
    entPhysicalParentRelPos = 0
    Slot 0/0 -- index 28091685
        entPhysicalClass = ‘container’
        entPhysicalContainedIn = 24555730
        entPhysicalParentRelPos = 0
    ...

Rack 1 -- index 141995845
    entPhysicalClass = ‘chassis’
    entPhysicalContainedIn = 1
    entPhysicalParentRelPos = 1
    Slot 0/0 -- index 139707424
        entPhysicalClass = ‘container’
        entPhysicalContainedIn = 141995845
        entPhysicalParentRelPos = 0
    ...

Rack N -- index 1481742692
    entPhysicalClass = ‘chassis’
    entPhysicalContainedIn = 1
    entPhysicalParentRelPos = N
    Slot 0/0 -- index 1523535239
        entPhysicalClass = ‘container’
        entPhysicalContainedIn = 1481742692
        entPhysicalParentRelPos = 0
    ...

Stack -- index 1
    entPhysicalClass = ‘stack’
    entPhysicalContainedIn = 0
    entPhysicalParentRelPos = -1
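As a hedged illustration only (the community string and management address below are placeholders, and the ENTITY-MIB module must be loaded on the management station), this hierarchy can be inspected with standard net-snmp tools:

snmpwalk -v2c -c <community> <mgmt-addr> ENTITY-MIB::entPhysicalClass
snmpwalk -v2c -c <community> <mgmt-addr> ENTITY-MIB::entPhysicalContainedIn

The entry with entPhysicalClass = ‘stack’ and entPhysicalContainedIn = 0 is the top-level container; each chassis entity points back to it via entPhysicalContainedIn = 1.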

•11.2 IRL monitoring MIBs

IRL interfaces are in all respects regular IOS-XR interfaces. All the standard interface MIBs for reporting errors / alarms / faults on a link apply to the IRL links, as do all the standard MIBs for interface statistics.

One missing MIB is for the “uni-directional” forwarding state of the IRL. For example, if excessive packet loss on an IRL puts it into the UDLD “uni-directional” state, that is a fault scenario and the IRL is removed from all forwarding tables even though the physical state of the interface remains UP. Reporting this event via a MIB would be an enhancement. One approach would be to simply shut the link down on a uni-directional fault so that the standard IF-MIB can trap the event.
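Until such an enhancement exists, the standard IF-MIB objects can still be polled for the IRL interfaces. A minimal hedged sketch (placeholder community, address and ifIndex; TenGigE0/1/1/2 is simply a hypothetical IRL member):

snmpwalk -v2c -c <community> <mgmt-addr> IF-MIB::ifDescr | grep TenGigE0/1/1/2
snmpget -v2c -c <community> <mgmt-addr> IF-MIB::ifOperStatus.<ifIndex> IF-MIB::ifHCInOctets.<ifIndex>

Keep in mind, as noted above, that ifOperStatus can stay up(1) even after UDLD has declared the link uni-directional, so cross-check against the show nv edge data forwarding CLI.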

•11.3 Control Ethernet monitoring MIBs

The CRS multi-chassis system has implemented some MIBs for the Control Ethernet aspects of the system; they are currently not implemented for the nV Edge system. But since the nV Edge control Ethernet is very similar to the CRS multi-chassis control Ethernet, the same MIBs could be implemented for the nV Edge system as well. That would be an enhancement work item.

The Control Ethernet MIB frontend is a collection of MIBs as below.

  • •1. IF-MIB implementation upgraded to support Control Ethernet interfaces
  • •2. CISCO-CONTEXT-MAPPING MIB implementation.
  • •3. Context aware implementation of BRIDGE-MIB
  • •4. Implementation of MAU-MIB
  • •5. Implementation of CISCO-MAU-EXT-MIB, which will distinguish the MAUs associated with Control Ethernet interfaces from those associated with other data-plane interfaces
  • •6. ENTITY-MIB upgraded to support Control Ethernet related entities like Control Ethernet Bridges and associated bridge-ports and all Control Ethernet interfaces


•11.4 Control Ethernet Syslog / error messages

Below we list the most important syslog / error messages that indicate a fault with the control Ethernet module or links.

  • •1. Front panel nV Edge Control Port <port> has unsupported SFP plugged in. Port is disabled, please plug in Cisco support 1Gig SFP for port to be enabled

LOG_INFO message: this message appears if the user inserts an SFP that Cisco does not support in the front panel SFP+ port. The user has to replace the SFP with a Cisco-supported one, and the port will automatically be detected / used again.

  • •2. Front Panel port <port> error disabled because of UDLD uni directional forwarding. There will be automatic retries to try and bring up the port periodically

LOG_CRIT message: this message appears if a particular control Ethernet link has a fault and keeps “flapping” too frequently. When that happens, the port is disabled and will not be used for control link packet forwarding.

  • •3. ce_switch_srv[53]: %PLATFORM-CE_SWITCH-6-UPDN : Interface 12 (SFP+_00_10GE) is up

ce_switch_srv[53]: %PLATFORM-CE_SWITCH-6-UPDN : Interface 12 (SFP+_00_10GE) is down

These messages appear whenever the physical state of a control plane link (the front panel links) changes up/down – much like a regular interface up/down event notification. “Interface 12” and “Interface 13” (the 12 and 13) are just internal numbers for the two front panel ports. These messages will appear any time a remote RSP goes down or boots up, because at those instants the remote end laser goes down/up. During normal operation of the nV Edge system, when there are no RSP reboots and so on, these messages are not expected and indicate a problem with the link / SFP.

•11.5 Data Link Syslog / error messages

Here we describe the syslog / error messages related to the IRL links that can appear in the logs, so that the user is aware of what they mean.

  • •1. Interface <interface handle> has been uni directional for 10 seconds, this might be a transient condition if a card bootup / oir etc.. is happening and will get corrected automatically without any action. If it’s a real error, then the IRL will not be available for forwarding inter-rack data and will be missing in the output of show nv edge data forwarding CLI.

Here the interface name being referred to can be found using “show im database ifhandle <interface handle>”. That particular interface has encountered a uni-directional forwarding scenario and will be removed from the forwarding tables – no more data will be forwarded across that IRL. UDLD will be restarted on the link after 10 seconds to see if it can become bi-directional again, and this retry keeps happening every 10 seconds until the link goes bi-directional or the user removes the “nv edge interface” configuration from that link.

  • •2. <count> Inter Rack Links configured all on one slot. Recommended to spread across at least two slots for better resiliency.

All the IRL links are present on the same line card (slot), which is bad for resiliency: if that line card goes down, all the IRLs go down with it. The message therefore pops up periodically, asking the user to spread the IRLs across at least two slots (see the configuration sketch at the end of this section).

  • •3. Inter Rack Links configured on <count> slots.Recommended to spread across maximum 5 slots for better manageability and troubleshooting.

The total number of IRLs in the system (maximum 16) is recommended to be spread across NO MORE than 5 line cards (slots). This is purely for debuggability: troubleshooting problems across more than 5 IRL line cards becomes complex, hence the recommendation to limit the spread to a maximum of 5 slots.

  • •4. Only one Inter Rack Link is configured. For Inter Rack Link resiliency, recommendation is to have at least two links spread across at least two slots.

We recommend having at least two IRL links for resiliency reasons.
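Putting the recommendations above together, a minimal hedged configuration sketch follows. The TenGigE interface names are hypothetical; the only point being illustrated is two IRLs terminating on different slots in each rack, with each member carrying the “nv edge interface” configuration referenced in message 1:

interface TenGigE0/1/0/0
 nv edge interface
!
interface TenGigE0/2/0/0
 nv edge interface
!
interface TenGigE1/1/0/0
 nv edge interface
!
interface TenGigE1/3/0/0
 nv edge interface
!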

•12 Debugs and Traces

The output of the show tech command below can be redirected to a file or a tftp server. Use it when in doubt as to which module traces to collect.

  • •1. show tech nv edge
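A hedged usage sketch of the redirection (the target filenames are placeholders, and the exact file / tftp redirection syntax can vary by release):

show tech nv edge file harddisk:/showtech-nv-edge
show tech nv edge file tftp://<server>/showtech-nv-edge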

•13 Cluster Rack-By-Rack Upgrade

•13.1 Overview

To be very clear: ISSU is not supported on a cluster. However, for a software upgrade between any two releases, or during installation of a reload SMU, it is highly recommended to follow the steps below to avoid the standard 10 minutes or so of reload time after an upgrade. The method used here upgrades each chassis separately. The assumption is that the network is fully redundant and all links are dual-homed to each of the chassis in the cluster, which translates to continuous connectivity while any one chassis in the cluster is down. The method is scripted, and an off-system server/PC must be used to execute the script.

Rack-by-rack reload is a method of upgrading, or installing disruptive software (i.e. reload SMUs) on the cluster one rack at a time, in order to reduce traffic downtime compared to a full system reload.

At a high level, the upgrade steps are as follows:

  • Rack 1 Shutdown Phase - Rack 1 is isolated from the Cluster and the external network, and made into a standalone node.
    • IRL links are disabled
    • External LC interfaces are disabled
    • Control Link interfaces are disabled
  • Rack 1 Activate Phase - The target software is activated on Rack 1
    • Install Activate occurs on Rack 1 using the parallel reload method.
  • Critical Failover Phase - Traffic is migrated to Rack 1
    • All interfaces on Rack 0 are shut down.
    • All interfaces on Rack 1 are brought into service.
    • Protocols relearn routes from neighboring routers and convergence begins.
  • Rack 0 Activate Phase - The target software is activated on Rack 0
    • Install Activate occurs on Rack 0 using the parallel reload method
  • Cleanup Phase
    • Control links are reactivated
    • IRL Links are reactivated
    • Rack 0 rejoins the cluster as Backup
    • Any external links disabled as part of the upgrade are brought back into service

Due to the complexity of the CLI steps used, it is recommended to use the scripted method below.

•13.2 Prerequisites

  • Rack By Rack Upgrade is not compatible with the Management LAN Split Brain detection feature. This feature should be disabled prior to upgrade.
  • Any Install operations in progress need to complete (install commit) prior to this upgrade.
  • All Active install packages must be committed prior to this upgrade procedure.
  • Support for this method was added in 4.3.1. For 4.2.3 it is part of nV Edge SMU 1 (CSCue14377).
  • The script does only minimal error checking. It is recommended to run "install activate test" on the router prior to script execution to validate the set of images.
  • It is highly recommended to back up your router configuration prior to upgrade.

•13.3 Upgrade Instructions (Scripted Method)

•13.3.1 Script Setup

The upgrade script can be obtained by copying it from the router to a tftp host via the "copy" command. The file is located on the router at /pkg/bin/nv_edge_upgrade.exp (reachable from the IOS-XR shell via the "run" command).

This script must be customized to your particular install. This is done by modification of the variables at the top of the script. The required changes are:

  • The management telnet access for Rack 0 (rack0_addr, rack0_port, rack0_stby_addr, rack0_stby_port)
  • The management telnet addresses for Rack 1 (rack1_addr, rack1_port, rack1_stby_addr, rack1_stby_port)
  • The login credentials for the router (router_username, router_password)
  • The set of images to activate (image_list), space delimited.
  • The set of IRL ports configured (irl_list), TCL list format.

An example of the script configuration variables is below:

set rack0_addr "172.27.152.19"
set rack0_port "2002"
set rack0_stby_addr "172.27.152.19"
set rack0_stby_port "2004"
set rack1_addr "172.27.152.19"
set rack1_port "2005"
set rack1_stby_addr "172.27.152.19"
set rack1_stby_port "2007"
set router_username "root"
set router_password "root"
set image_list "disk0:asr9k-mini-px-4.2.3 \
disk0:asr9k-services-p-px-4.2.3 \
disk0:asr9k-px-4.2.3.CSCuc40191-0.0.2.i"
set irl_list {{Teng 0/1/1/2} {Teng 1/1/0/2}}

In this example, the console ports of all four RSPs of the cluster are reachable through 172.27.152.19 on the telnet ports specified. The router login is root/root, three software packages are to be activated, and the irl_list specifies the single IRL link in use (listing its interface on each rack).

•13.3.2 Script execution

To begin the install activation via the script, exit all consoles completely (exit to the login prompt) and disconnect all serial and telnet connections to the management console of the router. Then execute the script from an external Linux workstation, as below:

sjc-lds-904:> nv_edge_upgrade.exp
########################
This CLI Script performs a software upgrade on
an ASR9k Nv Edge system, using a rack-by-rack
parallel reload method. This script will modify
the configuration of the router, and will incur
traffic loss.
Do you wish to continue [y/n] y
spawn telnet 172.27.152.19 2002
Trying 172.27.152.19...
Connected to 172.27.152.19.
Escape character is '^]'.
RP/0/RSP0/CPU0:ios#

In this example, the script is executed by typing "nv_edge_upgrade.exp". Please ensure that the script has been given execute permission. When prompted whether you wish to continue the software activation, enter "y" to continue.
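A minimal workstation-side sketch, assuming the script has already been copied into the current directory and that the expect interpreter is installed:

chmod +x nv_edge_upgrade.exp
./nv_edge_upgrade.exp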

At various points during the upgrade process the script will enter into a waiting period and display a message as below:

--- WAITING FOR INSTALL ACTIVATE RACK 0 60 SECONDS (~~ to abort / + to add time) ---

CLI commands may be entered at this time to check the router status during the upgrade process. This is intended to allow sufficient time for the various steps of the upgrade to complete, and for the router to achieve a stable state before continuing. It is important that no configuration changes are made while the prompt is available.

The script will run to completion in approximately 45 minutes.

•13.3.3 Verification

Once the script runs to completion, please connect to the router, verify that the platform is in working order, and that routing and traffic have resumed. Loss of topology and some loss of traffic is expected during the upgrade process. Expected traffic loss is between 30 seconds and 4 minutes on "normal scale" systems, and can be as long as 10 minutes in high scale scenarios.
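A hedged set of post-upgrade checks (standard IOS-XR show commands plus the nV Edge CLI referenced earlier in this document; output formats vary by release):

show redundancy summary
show platform
show install active summary
show nv edge data forwarding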

Install commit is included in the script execution. To revert to the prior release after script completion, a separate install operation is needed. Reload of the system will not cause an install revert.

•13.4 Upgrade Instructions (Manual Method)

The upgrade process can be executed by entering the CLI commands directly on the console instead of using the provided script. This is not recommended, as the upgrade is sensitive to the ordering and timing of the various steps; if a CLI command is omitted, or the commands are entered in the wrong order, the effect can be catastrophic.

The script defines a variable named "debug_mode". Set it to "1" and then execute the script from the Linux prompt. This causes the script to output the CLI commands to the terminal window, which can be used as a basis for a manual upgrade.
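A minimal sketch, assuming debug_mode sits alongside the other script variables shown earlier and that you want to capture the emitted CLI commands to a file (the filename is arbitrary). In the script:

set debug_mode "1"

then from the Linux prompt:

./nv_edge_upgrade.exp | tee manual_upgrade_steps.txt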

•13.5 Install Abort Procedure

Aborting the software installation is allowed at, or any time prior to, the following output message:

--- WAITING FOR INSTALL COMMIT 10 SECONDS (~~ to abort / + to add time) ---

The Abort procedure is as follows:

  • Use "ctrl-c" to terminate the script operation. You may be required to enter "~~" to terminate a wait period, and then "ctrl-c" to terminate, depending on the state of the script.
  • Log into the router console connection on rack 1.
    • Enter "admin reload rack 1", and confirm.
    • Halt the RSP bootup for rack 1 (both active and standby).
    • Unset the variables "CLUSTER_0_DISABLE" and "CLUSTER_1_DISABLE" on both RSP cards (see the ROMMON sketch below).
  • Log into the router console connection on rack 0.
    • Configure the nv-edge control links to be enabled.
    • Configure the IRL links to be no-shut.
    • Remove any "nv edge data minimum" link configuration.
  • Boot-up Rack 1.

Rack 1 will automatically sync to the prior software load running on Rack 0.
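As a hedged illustration of the variable-clearing step above, using the ROMMON conventions shown in section 13.6 (the variable names are those referenced in the procedure):

ROMMON> unset CLUSTER_0_DISABLE

ROMMON> unset CLUSTER_1_DISABLE

ROMMON> sync

After this, boot Rack 1 as per the final step of the procedure.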

•13.6 Converting a nV Edge Cluster to single chassis system

It’s possible to change an nV Edge system back to two separate single chassis systems. The steps to do this are fairly simple, though console access is required to all RSPs.

  • •1) Set the config-register to 0x0 on all racks. An alternative is to break into ROMMON on all RSPs during bootup from the console, though this has to be done quickly on all RSPs before they start booting XR.

RP/0/RSP0/CPU0:A9K-PE1(admin)#config-register 0x0

Sat Mar 23 09:21:38.700 UTC

Successfully set config-register to 0x0 on node 0/RSP0/CPU0

Successfully set config-register to 0x0 on node 0/RSP1/CPU0

Successfully set config-register to 0x0 on node 1/RSP0/CPU0

Successfully set config-register to 0x0 on node 1/RSP1/CPU0

RP/0/RSP0/CPU0:A9K-PE1(admin)#

  • •2) Remove the Rack/Serial Number mapping from nV edge configuration from the admin plane on the DSC chassis
  • •3) Break all the IRL, EOBC links in the following order:
  • •a. Shut down the IRL interfaces
  • •b. Followed by "nv edge control control-link disable" for all EOBC links
  • •4) Reload location all from admin mode on both systems, and from the console you should see all the RSPs are now in ROMMON
  • •5) From the ROMMON prompt clear the following variable on all RSPs.

ROMMON> unset CLUSTER_RACK_ID

ROMMON> sync

  • •6) Set the config-register back to 0x102 if it was changed in step 1 (or, if it was not changed, just perform a "reset"), and the RSPs should boot up as separate single-chassis systems.

At this point both chassis are separated. Care needs to be taken, since both chassis will have the same configuration and hence the same router ID, which could lead to protocol instability and duplicate system IDs.
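For example, before bringing the second chassis back into the network, the duplicated identifiers might be adjusted along these lines. This is a hedged sketch only: the OSPF process name, router-id value and IS-IS NET are hypothetical, and the full set of identifiers to change (loopbacks, management addresses, and so on) depends on the deployment:

RP/0/RSP0/CPU0:router(config)# router ospf 1
RP/0/RSP0/CPU0:router(config-ospf)# router-id 10.0.0.2
RP/0/RSP0/CPU0:router(config-ospf)# exit
RP/0/RSP0/CPU0:router(config)# router isis core
RP/0/RSP0/CPU0:router(config-isis)# net 49.0001.0100.0000.0002.00
RP/0/RSP0/CPU0:router(config-isis)# commit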
