In this document we'll show you how to configure two ASR9000 routers (of the same kind) into a cluster setup.
A cluster provides a significant advantage over two separate physical chassis by simplifying management (the two nodes act as a single entity) while maintaining state-of-the-art redundancy.
In a cluster, a device can dual-home into each of the nodes (known as "racks") with, for instance, an Ethernet bundle or EtherChannel. Since the two racks form a single logical entity, there is only one routing peering, so there is no need for ECMP. There is also no need for MC-LAG or other complexities in L2 environments.
nV – Network Virtualization
nV Edge – Network Virtualization on Edge routers
IRL – Inter Rack Links (for data forwarding)
Control Plane – the hardware and software infrastructure that deals with messaging / message passing across processes on the same or different nodes (RSPs or LCs).
Data Plane – the hardware and software infrastructure that deals with forwarding, generating and terminating data packets.
DSC – Designated Shelf Controller (the Primary RSP for the nV edge system)
Backup-DSC – Backup Designated Shelf Controller
UDLD – Unidirectional Link Detection protocol. An industry-standard protocol used in Ethernet networks for monitoring link forwarding health.
FPD – Field Programmable Device (FPGAs and similar devices that can be upgraded in the field).
This section assumes that the single-chassis boxes are running 4.2 or earlier images. If they are already running 4.2.1 or later, the first two steps can likely be skipped. Take note of the general release recommendation, which at the time of writing is XR 4.2.3.
(admin)#show inventory chassis
NAME: "chassis ASR-9006-AC", DESCR: "ASR-9006 AC Chassis"
PID: ASR-9006-AC, VID: V01, SN: FOX1435GV1C
NAME: "chassis ASR-9006-AC", DESCR: "ASR-9006 AC Chassis"
PID: ASR-9006-AC, VID: V01, SN: FOX1429GJSV
Alternatively, from rommon, the command "bpcookie" can be used to get the serial number; look for the "Chassis Serial Number" description in the output of the command.
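The serial numbers then go into the rack "database" in admin config mode. A minimal sketch using the serial numbers from the inventory output above (command form as per the 4.2-era cluster CLI; verify against your release):
(admin-config)#nv edge control serial FOX1435GV1C rack 0
(admin-config)#nv edge control serial FOX1429GJSV rack 1
(admin-config)#commit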
The above configuration just builds a "database" on Rack0 of all the chassis serial numbers and the rack numbers assigned to those serial numbers. One purpose of this is to determine whether a chassis that tries to become part of this nV Edge system is actually "allowed" to be part of it.
NOTE: The Control Ethernet cabling should be done only after all the previous steps have been executed and both chassis are ready to "join" an nV Edge system. If Control Ethernet cables are connected between two functioning, independent single-chassis ASR9K nodes, that will wreak havoc in the system, because the independent chassis' control planes will get "mixed up" before they are ready to "join" an nV Edge system.
NOTE: ALL the interfaces on the chassis holding the backup-DSC RSP will be in SHUTDOWN state until at least one Inter-Rack Data Link is in forwarding state. This is discussed in more detail later in this write-up.
At any time in the nV Edge system, one of the RSPs (in either Rack0 or Rack1) will be the "master" for the entire nV Edge system. Another RSP in the system (again, in either Rack0 or Rack1) will be the "backup" for the entire nV Edge system. The "master" is called the primary-DSC, using CRS multi-chassis terminology. The "backup" is called the backup-DSC. The primary-DSC runs all the primary protocol stacks (OSPF, BGP, etc.) and the backup-DSC runs all the backup protocol stacks.
At any time, to find out which RSP is primary-DSC and which is backup-DSC, use the below command in admin exec mode.
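(The command is presumably "show dsc", the same CLI that appears in the data-collection section at the end of this document.)
(admin)#show dsc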
Node ( Seq#) Role Serial# State
0/RSP0/CPU0 ( 0) ACTIVE FOX1432GU2Z BACKUP-DSC
0/RSP1/CPU0 ( 1223769) STANDBY FOX1432GU2Z NON-DSC
1/RSP0/CPU0 ( 1279475) ACTIVE FOX1441GPND PRIMARY-DSC
1/RSP1/CPU0 ( 1279584) STANDBY FOX1441GPND NON-DSC
As can be seen above, the Rack1 RSP0 (1/RSP0/CPU0) is the primary-DSC and Rack0 RSP0 (0/RSP0/CPU0) is the backup-DSC. The Primary and Backup DSCs do not have any “affinity” towards any one chassis or any one RSP. Whichever chassis in the nV edge system boots up first will likely select one of its RSPs as the primary-DSC.
Another matter to note is that the "Active" / "Standby" states of the RSPs, which we are familiar with from the single-chassis mode of operation, are superseded by the primary-DSC / backup-DSC functionality in an nV Edge system. For example, in a single-chassis system, protocol stacks run on the Active and Standby RSPs as primary/backup protocol stacks. As explained in the preceding paragraph, that is no longer the case in an nV Edge system – in nV Edge, the primary-DSC and backup-DSC are what run the primary/backup protocol stacks.
If, for whatever reason, the two chassis in an nV Edge system end up having dissimilar images installed, the chassis that boots up later will report its version details to the already-booted chassis. The already-booted chassis will "reject" that version and tell the other chassis to drop to rommon and send a boot request to the already-booted chassis to download the image present there.
The nV Edge control plane provides software and hardware extensions to create a "unified" control plane for all the RSPs and line cards on both nV Edge chassis. The control plane packets are forwarded from chassis to chassis entirely "in hardware", as you will see in the sections below. Control plane multicast and similar operations are done in hardware for both chassis – so there is no control plane performance impact from having two chassis instead of one.
The nV Edge control plane links HAVE to be direct L1 connections; no network or intermediate routing/switching devices are allowed in between. Some details of the control plane connections are provided below to give a better understanding of the reasoning behind our recommendations. The control Ethernet links (front panel SFP+ ports) are configured in 1Gig mode of operation. The links numbered 1, 2, 3, 4 (red in colour) are the links that need the wiring; the other links are shown just for further illustration, as can be seen below.
As seen in the diagram above, each RSP in each chassis has an Ethernet switch to which all the CPUs in the system (Line Card CPUs, RSP CPUs, any other CPUs in the system) connect. So each CPU connects to two switches – one on each RSP. At any point in time, only one of the switches will be "active" and switching the control plane packets; the other will be "inactive" (regardless of whether the system is nV Edge or single chassis). The "active" switch can be on either of the RSPs in the chassis – whichever switch can ensure the best connectivity across all the CPUs in the system.
The two SFP+ front panel ports on the RSP are just direct ports plugging into the switch on the RSP. So as shown in the diagram, for an nV Edge system, the simple goal is to connect each RSP (the switch inside the RSP) to each switch on the remote chassis. In the above case, if any of the links goes down, there are three possible backup links. Also, at any point in time, only one of the links will be used for forwarding control plane data; the other three links will be in "standby" state.
The control Ethernet is the heart of the system – anything wrong with it can badly degrade the nV edge system. So it is HIGHLY recommended to use all four control Ethernet links.
The above mode of operation is possible in a "steady state". Even if one link faults (link 1 or 2), there is one more link that can take over. So assume link 1 is faulty and we have only link 2 left. In this scenario, say RSP 1/RSP1 ends up reloading because of some software fault. Then we are left with no control Ethernet links at all between the chassis, and in that mode, the chassis hosting the backup-DSC RSP will take itself down and go to rommon; the chassis hosting the DSC RSP will continue functioning, thus avoiding a Split Node.
In the case of a single-RSP-per-chassis nV Edge topology, the below will be the wiring model. But again, this is not recommended, for resiliency reasons. If the only RSP in a chassis goes down, the entire chassis and all the line cards in the chassis also go down!
We run UDLD on the control plane links to ensure bidirectional forwarding health of the links. UDLD runs at a 200 msec interval x 5 – i.e., an expiry interval of 1 second. This means that if a control link is unidirectional for 1 second, the RSPs will take action to switch the control plane link to one of the three standby links.
Note that the one-second detection is only for unidirectional failures – for a physical link fault (like a fiber cut), interrupts will be triggered with the fault, and the link switchover to the standby links will happen much faster.
The front panel SFP+ ports are referred to as ports “0” and “1” in the show command below. So each RSP has two of these ports, and the command below shows which port on which RSP is connected to which other port on which other RSP.
In the example below:
Port “0” on 0/RSP0 is connected to port “0” on 1/RSP0.
Port “1” on 0/RSP0 is connected to port “1” on 1/RSP1
Port “0” on 0/RSP1 is connected to port “0” on 1/RSP1
Port “1” on 0/RSP1 is connected to port “1” on 1/RSP0
Also, the "port pair" that is "active" and used for forwarding control Ethernet data is the link between port "0" on 0/RSP0 and port "0" on 1/RSP0, shown in the Forwarding state below. All other links are just backup links.
The "CLM table version" is also a useful number to note. If this number changes, it means the control link UDLD is flapping. So in a good "stable" condition, that number should not change.
RP/0/RSP0/CPU0:ios# show nv edge control control-link-protocols location 0/rSP0/CPU0
Priority lPort Remote_lPort UDLD STP
======== ===== ============ ==== ========
0 0/RSP0/CPU0/0 1/RSP0/CPU0/0 UP Forwarding
1 0/RSP0/CPU0/1 1/RSP1/CPU0/1 UP Blocking
2 0/RSP1/CPU0/0 1/RSP1/CPU0/0 UP On Partner RSP
3 0/RSP1/CPU0/1 1/RSP0/CPU0/1 UP On Partner RSP
Active Priority is 0
Active switch is RSP0
CLM Table version is 2
Each RSP has two front panel control link ports, which we number 0 and 1. The CLI to shut the links is as below:
RP/1/RSP0/CPU0:A9K-Cluster-IPE(admin-config)#nv edge control control-link disable <0 or 1> location <the RSP where we want the port to be shut>
The "no nv edge control control-link disable ..." form of the command will unshut the link.
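For example, to shut control port 0 on 0/RSP0 (values taken from the syntax above):
RP/1/RSP0/CPU0:A9K-Cluster-IPE(admin-config)#nv edge control control-link disable 0 location 0/RSP0/CPU0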
On shutting a control port, the CLI will also set a rommon variable on that RSP like “CLUSTER_0_DISABLE = 1” if port 0 is disabled and “CLUSTER_1_DISABLE = 1” if port 1 is disabled. As long as this rommon variable is set, neither rommon nor IOS-XR will ever enable that port.
The behavior when ALL the control links are shut is, obviously, that both chassis become DSC. But if the IRL links are active, then one of the chassis will reload and reboot, and once the IRL links come back up it will reboot again.
So if someone configured ALL control links to be shut, how do we recover from that? The following is the recommended procedure at the time of writing (4.2.3 24I early image).
1. Shut the IRL links from one of the chassis (whichever chassis doesn’t reboot, remember one chassis comes up and reboots). This will get both chassis to stay UP.
2. Reload one chassis and keep BOTH the RSPs in rommon, then unconfigure the rommon variable as below; do this on BOTH the RSPs.
3. On the other chassis, which is still in XR, go to admin config and enter "no nv edge control control-link disable <port> <location>" for each port and location where the port was shut down.
4. On the RSPs in rommon, enter the below.
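A sketch of the likely rommon dialogue, assuming standard rommon variable handling (the variables are the ones named earlier; CLUSTER_1_DISABLE applies only if port 1 was also shut):
rommon> unset CLUSTER_0_DISABLE
rommon> unset CLUSTER_1_DISABLE
rommon> sync
rommon> boot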
NOTE: The above is admittedly a cumbersome and lengthy procedure (but only if we shut ALL control links). An enhancement will be committed in 4.2.3 that makes the unshut procedure very simple – on whichever chassis does not reboot, go to admin config mode and just enter "no nv edge control control-link disable <port> <location>", and that will automatically take care of syncing it with the other chassis as well.
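The SFP and PHY status fields discussed next presumably come from the control switch link detail output; the command (it appears again in the data-collection section) is of the form:
RP/0/RSP0/CPU0:ios#show nv edge control switch links detail location 0/RSP0/CPU0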
SFP Plugged in : 0x00000001 (1)
SFP Rx LOS : 0x00000000 (0)
SFP Tx Fault : 0x00000000 (0)
SFP Tx Enabled : 0x00000001 (1)
The "SFP Plugged in" value should be 1 if an SFP is present. The "SFP Rx LOS" value should be 0, or else there is an Rx Loss of Signal (an error!). The "SFP Tx Fault" value should be 0, or else there is an SFP Tx fault (an error!). The "SFP Tx Enabled" value should be 1, or else the SFP has not been enabled by the control Ethernet driver (also an error!).
Admin UP : 0x00000001 (1)
SFP supported cached : 0x00000001 (1)
PHY status register : 0x00000070 (112)
An "Admin UP" value of 0 means the customer has configured the "nv edge control control-link disable <port> <location>" CLI. Without that config, it should be 1, which is the default. The "SFP supported cached" field indicates whether the user plugged in a Cisco-supported SFP – value 1 means the SFP is supported, 0 means it is not. If the control link has an SFP plugged in, has a cable connected to a remote end, and the remote end is also up with a good laser, good link, etc., then the "PHY status register" should have a value of 0x70 – it is an internal PHY register that says the link is all good. If there is no cable, no SFP, a bad cable, a bad link, etc., it will not be 0x70; this can sometimes be useful during debugging.
The IRL connections are required for forwarding traffic that enters one chassis and leaves via an interface on the other chassis of the nV Edge system. The requirements for the IRL links are that they be 10 Gig links and direct L1 connections – no routed/switched devices are allowed in between. There can be a maximum of 16 such links between the chassis. A minimum of 2 links is recommended for better resiliency, and the two links should be on two separate line cards – again for better resiliency in case one line card goes down due to a fault.
The configuration of an interface as an IRL is simple; it is as below:
interface tenGigE 0/1/1/1
 nv edge interface
Add this config on the IRL interfaces on both chassis, of course! We run UDLD over these links to monitor the bidirectional forwarding health of the links. Only when UDLD reports that the echo and echo response are all fine (the standard UDLD state machine) do we place the interface into "Forwarding" state; till then the interface is in "Configured" state. So an IRL interface might be "Configured" but not "Forwarding"; once it is both, it will be used for forwarding data across chassis.
RP/0/RSP0/CPU0:ios#show nv edge data forwarding location 0/rSP0/CPU0
nV Edge Data interfaces in forwarding state: 1
tenGigE 0_1_1_1 <--> tenGigE 1_1_0_1
nV Edge Data interfaces in configured state: 2
The above CLI output says that there are two IRLs in "Configured" state – of course, one on each rack. It also says that there is one "pair" of IRLs in "Forwarding" state; the "pair" is one interface from each rack. The UDLD protocol automatically detects which interface is connected to which and forms a "pair".
So if you have configured IRLs but you don't see the line "nV Edge Data interfaces in forwarding state:" in your CLI output, something is wrong. We would recommend going through the standard interface checklist; a first pass with the standard interface CLI is sketched after the list.
-> Are the cables and SFPs all good?
-> Are the interfaces unshut and Up/Up?
-> Are there interface drops or errors?
-> If you are conversant with the packet path, are there any other packet path drops?
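For the first three checks, the standard interface CLI is the usual starting point; for example, using the IRL interface from the earlier config:
RP/0/RSP0/CPU0:ios#show interfaces tenGigE 0/1/1/1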
The UDLD timers on the IRL links are set to 40 milliseconds times 5 hellos, i.e., around 200 msecs as the expiry timeout. That means that any unidirectional problem with the IRL links will be detected and corrected in around 250 msecs (200 msecs + a delta for processing overheads).
If you want to see the UDLD state machine on the line card hosting these links, the below CLI can be used. The interface number shown in brackets is what we call the "ifhandle". The interface name corresponding to it can be displayed using the CLI "show im database ifhandle <ifhandle> location <line card>".
In the example below, the UDLD state is Bidirectional, which is the desired correct state when things are working fine.
RP/0/RSP0/CPU0:ios#show nv edge data protocol all location 0/1/cPU0
Port enable administrative configuration setting: Enabled
Port enable operational state: Enabled
Current bidirectional state: Bidirectional
Current operational state: Advertisement - Single neighbor detected
Message interval: 20 msec
Time out interval: 10000 msec
Expiration time: 140 msec
Device ID: 1
Current neighbor state: Bidirectional
Device name: CLUSTER_RACK_01
Port ID: [0x46000100]
Neighbor echo 1 device: CLUSTER_RACK_00
Neighbor echo 1 port: [0x60002c0]
Message interval: 20 msec
Time out interval: 100 msec
CDP Device name: ASR9K CPU
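As an illustration of the ifhandle-to-name mapping mentioned above, feeding the local echo port ID from this output into the IM database CLI (assuming 0x60002c0 is the local ifhandle on line card 0/1):
RP/0/RSP0/CPU0:ios#show im database ifhandle 0x60002c0 location 0/1/CPU0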
The IRL links are used for forwarding packets whose ingress and egress interfaces are on separate racks. They are also used for all protocol Punt packets and protocol Inject packets. As explained in Section 1, the protocol stack “Primary” runs on the primary-DSC RSP in one of the chassis. So if a protocol punt packet comes in on an interface in another chassis, it has to be punted to the primary-DSC RSP in the remote chassis. This punt is done via the IRL. Similarly if the protocol stack on the primary-DSC wants to send a packet out of an interface on another chassis, that is also done via the IRL interfaces.
If the number of IRL links available for forwarding goes below a certain threshold, the remaining IRLs may get congested and more and more inter-rack traffic will get dropped. So the IRL monitor provides a way of shutting down other ports on the chassis if the number of IRL links goes below a threshold. The commands available are below:
RP/0/RSP0/CPU0:ios(admin-config)#nv edge data minimum <minimum threshold> ?
backup-rack-interfaces Disable ALL interfaces on backup-DSC rack
selected-interfaces Disable only interfaces with nv edge min-disable config
specific-rack-interfaces Disable ALL interfaces on a specific rack
There are three modes of configuration possible.
With the backup-rack-interfaces configuration, if the number of IRLs goes below the configured <minimum threshold>, ALL interfaces on whichever chassis is hosting the backup-DSC RSP will be shut down. Again, note that the backup-DSC RSP can be on either of the chassis.
With the specific-rack-interfaces configuration, if the number of IRLs goes below the configured <minimum threshold>, ALL interfaces on the specified rack (0 or 1) will be shut down.
With the selected-interfaces configuration, if the number of IRLs goes below the configured <minimum threshold>, only the interfaces (on any rack) that are explicitly configured to be brought down will be shut down. How do we "explicitly" configure an interface (on any rack) to respond to IRL threshold events?
RP/0/RSP0/CPU0:ios(config)#interface gigabitEthernet 0/1/1/0
RP/0/RSP0/CPU0:ios(config-if)#nv edge min-disable
So in the above example, if the number of IRLs goes below the configured minimum threshold, interface Gig0/1/1/0 will be shut down.
The default config (if the customer does not configure any of the above explicitly) is the equivalent of having configured "nv edge data minimum 1 backup-rack-interfaces". This means that if the number of IRLs in forwarding state goes below 1 (i.e., no forwarding IRL remains), ALL the interfaces on whichever rack has the backup-DSC will get shut down.
This might make some customers happy and some unhappy. The behaviour can be turned off by configuring "nv edge data minimum 0 backup-rack-interfaces" – this effectively says that only if the number of IRLs in forwarding state goes below 0 (which can never happen) should any interface on any rack be shut.
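For example, to shut the interfaces on the backup-DSC rack whenever fewer than two IRLs are forwarding (the threshold value 2 is illustrative):
RP/0/RSP0/CPU0:ios(admin-config)#nv edge data minimum 2 backup-rack-interfaces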
When an interface is configured as an IRL link, we install 5 absolute priority queues on the port in both the ingress and egress directions. The priorities are below
The IRL links do not allow “user configurable” MQC policies on the IRL interface. The classification of “punt / inject” and “multicast” are done “internally” in microcode – that is, other than being a punt/inject or multicast packet, there is no way by which we can “influence/force” a packet to go to the first two queues.
What packet gets into the last three queues can be influenced – just by having QoS ingress policies that mark packets appropriately to be acos 0, 1 or 2. There is no other way by which we can influence what gets into these queues. The queue id selected on the ingress chassis’s IRL links is carried across in the Vlan COS bits, the egress chassis’s IRL that gets this packet will use this queue id encoded in the Vlan COS to select the queues it uses on Ingress (when it receives the packets from the remote chassis).
The CLI to display the nV Edge QoS queues is shown below, using an IRL interface with the configs below as an example. The subslot number 0 in the example is the "subslot" in which the MPA (the pluggable adaptor) sits on the MOD-80/MOD-160 (Viking) line card. If the line card is not of a type that supports pluggable adaptors, just use 0 for the subslot. The port number 1 used in the example is simply the last number in the 1/1/0/1 notation.
The drops (if any) in these queues are aggregated and reflected in the “show interface” drops also. The standard interface MIBs can be used for monitoring these drops. Note that the individual queue drops are not exported to MIBs, only the aggregate drops are exported as the interface drops. Also the IRL links are just regular interfaces, so the regular interface MIBs will all work on IRLs also.
RP/0/RSP0/CPU0:ios#sh running-config interface gigabitEthernet 1/1/0/1
RP/0/RSP0/CPU0:ios#show qoshal cluster subslot 0 port 1 location 1/1/cPU0
Cluster Interface Queues : Subslot 0, Port 1
Port 1 NP 0 TM Port 17
Ingress: QID 0xa8 Entity: 0/0/0/4/21/0 Priority: Priority 1 Qdepth: 0
StatIDs: commit/fast_commit/drop: 0x5f0348/0x0/0x5f0349
Total Xmt 681762/140538069 Dropped 0/0
Ingress: QID 0xa9 Entity: 0/0/0/4/21/1 Priority: Priority 2 Qdepth: 0
StatIDs: commit/fast_commit/drop: 0x5f034d/0x0/0x5f034e
Total Xmt 0/0 Dropped 0/0
Ingress: QID 0xab Entity: 0/0/0/4/21/3 Priority: Priority 3 Qdepth: 0
StatIDs: commit/fast_commit/drop: 0x5f0357/0x0/0x5f0358
Total Xmt 0/0 Dropped 0/0
Ingress: QID 0xaa Entity: 0/0/0/4/21/2 Priority: Priority Normal Qdepth: 0
StatIDs: commit/fast_commit/drop: 0x5f0352/0x0/0x5f0353
Total Xmt 0/0 Dropped 0/0
Ingress: QID 0xac Entity: 0/0/0/4/21/4 Priority: Priority Normal Qdepth: 0
StatIDs: commit/fast_commit/drop: 0x5f035c/0x0/0x5f035d
Total Xmt 0/0 Dropped 0/0
Egress: QID 0xc8 Entity: 0/0/0/4/25/0 Priority: Priority 1 Qdepth: 0
StatIDs: commit/fast_commit/drop: 0x5f03e8/0x0/0x5f03e9
Total Xmt 3372382/697778537 Dropped 0/0
Egress: QID 0xc9 Entity: 0/0/0/4/25/1 Priority: Priority 2 Qdepth: 0
StatIDs: commit/fast_commit/drop: 0x5f03ed/0x0/0x5f03ee
Total Xmt 0/0 Dropped 0/0
Egress: QID 0xcb Entity: 0/0/0/4/25/3 Priority: Priority 3 Qdepth: 0
StatIDs: commit/fast_commit/drop: 0x5f03f7/0x0/0x5f03f8
Total Xmt 0/0 Dropped 0/0
Egress: QID 0xca Entity: 0/0/0/4/25/2 Priority: Priority Normal Qdepth: 0
StatIDs: commit/fast_commit/drop: 0x5f03f2/0x0/0x5f03f3
Total Xmt 0/0 Dropped 0/0
Egress: QID 0xcc Entity: 0/0/0/4/25/4 Priority: Priority Normal Qdepth: 0
StatIDs: commit/fast_commit/drop: 0x5f03fc/0x0/0x5f03fd
Total Xmt 0/0 Dropped 0/0
To support more flexible QoS options for customers who want more than the default QoS mentioned in Section 4.4, we provide an option for configuring regular MQC policies in the EGRESS direction (no ingress support) with some limitations. The limitation, in one simple sentence, is that an MQC policy configured on an IRL does not have the ability to access the packet contents – that is, there is no way of figuring out whether the packet that goes out on the IRL is IPv4 or IPv6, etc. So none of the MQC features that need to look into the packet will work. So how exactly is it used?
The typical use case is that the customer configures an ingress MQC policy map on a regular (non-IRL) ingress interface. That ingress MQC policy can parse the packet and set a "qos-group" for the packet. The egress IRL policy map can then match on this qos-group and apply features like queueing and shaping. Random detect can also be applied (not based on DSCP, though – remember, that needs access to packet contents), and of course no marking either. An illustrative configuration is sketched below.
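A minimal sketch of this pattern in standard IOS-XR MQC (the class/policy names, the DSCP match, the bandwidth percentage, and the interface choices are all illustrative; the interfaces reuse ones from earlier examples):
class-map match-any CM-EF
 match dscp ef
!
policy-map PM-ACCESS-IN
 class CM-EF
  set qos-group 1      <== tag the flow on the regular (non-IRL) ingress interface
!
class-map match-any CM-QG1
 match qos-group 1
!
policy-map PM-IRL-OUT
 class CM-QG1
  bandwidth percent 30 <== queueing on the IRL, matching only on qos-group
!
interface gigabitEthernet 0/1/1/0
 service-policy input PM-ACCESS-IN
!
interface tenGigE 0/1/1/1
 service-policy output PM-IRL-OUT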
The user is not prevented from applying any MQC policy on the IRL, regardless of whether that policy has features unsupported on the IRL. That is, no config-level rejection of policies is done on the IRL interface yet (this might be enforced in later releases), so the user has to take care to configure only supported features, or else the behavior is unpredictable. For example, if the user configures an egress MQC policy on the IRL that does marking, the packet going out of the IRL will have its contents changed in some random location, and that might cause those packets to be dropped!
The configuration of MQC on an IRL and the show commands, etc., are exactly the same as MQC on a regular interface (remember, an IRL is just a regular interface!).
The packet that goes out on the IRL will have a VLAN encap with the VLAN hard-coded to vlan-id 1. The vlan-id really doesn't matter; we just use the VLAN COS bits to carry over the packet priority, as mentioned in Section 4.4. That is 18 bytes of overhead. In addition there is around 24 bytes of overhead, which depends very much on the kind of packet (L3 / L2 / mcast, etc.) being transported. So on average we have around 42 bytes of overhead.
The IRL load balances packets based on flow. How a "flow" is defined varies from feature to feature. In general, for any given feature, if we ask the question "how do this feature's packets get load balanced across link bundle members", the same answer applies to load balancing across IRLs. In other words, IRL load balancing obeys exactly the same principles as link bundle member load balancing: a 32-bit hash value is calculated for each packet/feature, and that 32-bit hash value (with some bit flips, etc., to avoid polarization) gets used for IRL load balancing as well as for link bundles.
Let us examine the different kinds of features in very brief below. This is by no means meant to be an exhaustive documentation of all the load balancing algorithms on the router, rather just to give an overview of the major classes of load balancing.
For IP traffic, the standard tuple used for hash calculation for load balancing across link bundle members is used – source IP, destination IP, source port, destination port, protocol type. It does not matter whether the egress is IP or MPLS; the ingress is all that matters.
If the incoming packet is MPLS, the forwarding engine looks deeper to see if the underlying packet is IP. If it is IP, the standard IP hash tuple is used for calculating the hash. If the underlying packet is not IP, just the labels from the label stack are used for calculating the hash. The label allocation mode (per-CE or per-VRF) has no impact on the hash.
For L2 traffic, load balancing is done based on source/destination MAC addresses. Again, as explained initially, this is not an exhaustive answer, because there are scenarios where the VC label hash is used in VPLS.
For L2 flood traffic over link bundles, there are multiple elaborate modes of load balancing; the exhaustive documentation is best referred to along with the L2 link bundle documentation. But in general, there are two modes of load balancing, tied to the flooding mode in L2.
In the first mode, to restrict the L2 floods from reaching too many line cards, the hash is "statically" chosen based on bridge group. So some bridge groups will be "tied" to one IRL, a few others to another IRL – the same behaviour as chosen for L2 over link bundles.
In the second mode, the L2 flood is hashed in ucode based on the source/destination MAC addresses.
L3 Multicast hashes multicast flows based on (S,G) and uses that hash to distribute packets across the IRLs – again the same technique used for distributing multicast packets across link bundle members.
There are four very simple rules that can always help in determining the primary-DSC and backup-DSC RSPs in an nV edge system.
1. The primary-DSC and backup-DSC are always the "Active" RSP in their respective chassis. "Active" here refers to the "Active" we know in the context of a single-chassis ASR9K – where one RSP is "Active" and the other is "Standby".
2. Primary-DSC and backup-DSC will always be on RSPs in different chassis.
3. If the primary-DSC goes down, the backup-DSC becomes primary-DSC. Then the chassis other than the one hosting the new primary-DSC will select its "Active" RSP as the next backup-DSC (since the old backup just became primary).
4. If any RSP other than the primary-DSC or backup-DSC goes down, there is no change in the state of the primary-DSC or backup-DSC.
With these four rules in place, in any given scenario we can figure out what happens if any of the RSPs in any of the chassis goes down.
Before issuing redundancy switchover, it’s a good practice to check the control links in the system and check that there is at least one backup link available that can take over. For example in the output below, if we decide to issue “redundancy switchover” on 0/RSP0/CPU0, we have three more links (shown as “Blocking” or “On Partner RSP”) and one of them can take over as the link connecting control planes of both chassis (see Section 3.1 for details).
Sometimes it might happen that, because of some fault (say a fiber cut, a bad SFP, etc.), a few links are down, in which case you won't see those links (neither as "Blocking" nor as "On Partner RSP"). So unless there is at least one backup link, if we issue a switchover, the only link that is "Forwarding" will go away and there won't be any more control plane connectivity across the chassis.
NOTE: We are enhancing the "redundancy switchover" CLI to automatically check this condition and disallow the CLI from going through if there are no backup links. Until that enhancement is done, it is recommended to follow this manual procedure.
RP/0/RSP0/CPU0:ios# show nv edge control control-link-protocols location 0/rSP0/CPU0
Priority lPort Remote_lPort UDLD STP
======== ===== ============ ==== ========
0 0/RSP0/CPU0/12 1/RSP0/CPU0/12 UP Forwarding
1 0/RSP0/CPU0/13 1/RSP1/CPU0/13 UP Blocking
2 0/RSP1/CPU0/12 1/RSP1/CPU0/12 UP On Partner RSP
3 0/RSP1/CPU0/13 1/RSP0/CPU0/13 UP On Partner RSP
In an ASR-9k nV Edge system, on failure of the Primary DSC node the RSP in the Backup DSC role becomes Primary, with the duties of being the system “master” RSP and hosting the active set of control plane processes. In the normal case for nV Edge, the Primary and Backup DSC nodes are hosted on separate racks. This means that the failure detection for the Primary DSC occurs via communication between racks.
The following mechanisms are used to detect RSP failures across rack boundaries:
Additionally, messages are sent between racks for the purpose of Split Node avoidance/detection. These occur at 200ms intervals across the inter-chassis data links, and can optionally be configured redundantly across the RSP Management LAN interfaces.
Example HA Scenarios:
1. Single RSP failure of the Primary DSC node
The Standby RSP within the same chassis initially detects the failure via the backplane FPGA. On failure detection this RSP will transition to the active state and notify the Backup DSC node of the failure via the inter-chassis control link messaging.
2. Failure of the Primary DSC node and its Standby peer RSP
There are multiple cases where this can occur, such as a power-cycle of the Primary DSC rack or a simultaneous soft reset of both RSP cards within the Primary rack.
The remote rack failure will initially be detected by UDLD failure on the inter-chassis control link. The Backup DSC node then checks the state of the UDLD on the inter-chassis data link. If the rack failure is confirmed by failure of the data link as well, the Backup DSC node becomes active.
UDLD failure detection occurs in 500ms; however, the time between control link and data link failure can vary, since these are independent failures detected by the RSP and LC cards. A windowing period of up to 2 seconds is needed to correlate the control and data link failures, and to allow for split-brain detection messages to be received.
The keep-alive messaging between RSPs acts as a redundant detection mechanism, should UDLD fail to detect a stuck or reset RSP card.
3. Failure of the Inter-Chassis control links (Split Node)
Failure is initially detected by the UDLD protocol on the inter-chassis control links. Unlike the rack reload scenario above, the Backup DSC will continue receiving UDLD and keep-alive messages via the inter-chassis data link. Similar to the rack reload case, a 2-second windowing period is allowed to correlate the control/data link failures. If after 2 seconds the data link has not failed, or Split Node packets are being received across the Management LAN, then the Backup DSC rack will reload to avoid the Split Node condition.
There are primarily two sets of links connecting the chassis in the nV edge system.
1. Control links (four of them recommended)
2. IRL links (minimum one)
So the two sets of links together comprise at least FIVE wires. Let us see what can happen when there is a fault and a complete set of control links, IRL links, or both becomes faulty.
In this case, refer to Section 4.3 – both chassis will be up and functioning, but the interfaces on one of the chassis "might" get shut down based on what config is present on the box (or whether it's just the default config). Again, refer to Section 4.3 to understand what config is appropriate for you.
The two chassis in the nV Edge system cannot function as "one entity" without control links. We have beacons that each chassis periodically exchanges over the IRL links. So if the control links go down, each chassis will know via the IRL beacons that the other chassis is UP, and one of the chassis just has to take itself down and go back to rommon.
Which chassis should go back to rommon? The logical choice is that the chassis hosting the primary-DSC RSP stays up and the non-primary rack resets. The reason is that the chassis hosting the primary-DSC has all the "primary" protocol stacks, and hence we want to disturb the protocols as little as possible. So we take the non-primary rack down to rommon, and it tries to boot and join the nV Edge system again – if at some point one or more control links become healthy again, that chassis will boot up and join the nV Edge system again.
Since IOS-XR cannot stabilize with the control links severed in this way, the non-primary rack will continue to bootup, detect that the control links are down and reset until the connectivity issue is resolved.
The CLI command “show nv edge control control-link-protocols” can be used to assess the current status of the control links in the event of a problem.
In this scenario, we can "potentially" enter what is called a "Split Node" condition – where each chassis thinks that the other chassis has gone down and each declares itself the master. Protocols like OSPF will then have two instances, each with the same router-id, etc., and that can be a problem for the network.
So to try to mitigate this scenario, we provide one more set of "last gasp" paths via the management LAN network. On EACH RSP in the system, we should connect one of the two management LAN interfaces (either one) to an L2 network, so that all four of those interfaces (one from each RSP) can send L2 packets to each other. Then we can enter the below configuration on each of those management LAN interfaces.
What this does is that each RSP sends high-frequency beacons on these interfaces at 200 millisecond intervals. So if both chassis are functional, each chassis will get beacons from the other. In such a scenario, if both chassis come to know that both of them are working independently, they know it is a problematic scenario and one of them will take itself down. The chassis to reset will be the one that has been in the primary state for the least amount of time.
So this “Split Node” management lan path provides yet another alternate path to provide additional resiliency to try and avoid a nasty “Split Node” scenario.
But if the control links AND the IRL links AND the split-brain management LAN links ALL go away, then there is no way to exchange any beacons across the chassis, and we will enter the split-brain scenario where both chassis start functioning independently. If the management networks on the two chassis are not in the same subnet, or not in the same location, an L2 connection should be facilitated to provide the last gasp.
NOTE: The Split Node interface messages are meant to be "best effort" messages; currently we do not monitor the "health" of those links. Those links are regular Management Ethernet interfaces and will have all the usual UP/DOWN traps, etc. But if, for example, there are intermittent monitoring message drops on those links, we do not raise any alarm or complaint. We might enhance this in the future to include some monitoring of the packet drops (if any) on these links to alert the user.
The link bundle / BVI configuration on nV Edge requires a manual configuration of a mac-address under the interface. An example for a link bundle is shown below.
mac-address 26.51c5.e602 <== A mac like this needs to be configured explicitly
Also, for link bundles, the below lacp global configuration is required:
lacp system mac 0201.debf.0000
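Putting the two requirements together, a minimal sketch (Bundle-Ether 1 is an assumed interface number; the MAC values are the ones quoted above):
interface Bundle-Ether 1
 mac-address 26.51c5.e602
!
lacp system mac 0201.debf.0000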
This caveat / requirement will be fixed in a later release; until then, we need this configuration for link bundles / BVIs / any virtual interfaces to work on an nV Edge system.
lacp switchover suppress-flaps 15000
The "bundle manager" is a process that runs on the primary (DSC) and backup (backup-DSC) RSPs and is responsible for the configuration and state maintenance of the link bundle interfaces. When the primary (DSC) chassis in an nV Edge system is reloaded, the bundle manager on the backup-DSC needs to "go active" and open connections to some external processes that provide other services (ICCP, for example). A chassis reload is a much "heavier" operation than a regular RSP switchover, because a chassis reload involves the going-down of all RSPs and all line cards on that chassis, and this causes quite a lot of control plane churn compared to a regular RSP switchover, where only one node (one RSP) goes away. For example, the basic infrastructure processes that handle IPC (Inter Process Communication) in the system have to do a lot of cleanup: they have to clean up data structures corresponding to all the nodes that went away, flush packets from/to those nodes, etc. The routing protocols / RIB have to process a lot of interface down notifications and start NSF / GR, etc. Owing to this additional control plane load, when the bundle manager asks to connect to external "services", those services will take more time to respond, because they are already busy processing node-down events.
Hence, the bundle manager process might be "blocked" for a longer period of time than in a regular switchover scenario. During this "blocked" time period, the remote end might time out and declare the bundle down. To prevent this, we have the "lacp switchover suppress-flaps <milliseconds>" command. This needs to be configured on the nV Edge system AND on the remote boxes (if the remote is not an IOS-XR box, use whatever is the equivalent of this config on that box). This basically tells the link bundle to tolerate more control packet losses during this period.
In the example here, we have configured a 15-second tolerance – note that this DOES NOT mean that there will be a 15-second packet drop. The bundle manager will update the data plane to use a newly active link as soon as it gets the event that decides who is active (a notification from the peer in the case of MC-LAG), and data can start flowing. All this does is prevent the bundle from going down if the rest of the bundle manager control plane is busy doing other things (like connecting to services) while the peer is expecting control packets to be exchanged.
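For reference, a sketch of where the command sits – it is configured under the bundle interface (Bundle-Ether 1 is again an assumed interface number; 15000 is in milliseconds):
interface Bundle-Ether 1
 lacp switchover suppress-flaps 15000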
NOTE: In the 4.2.3 24I early image, at the time of writing, we are trying to optimize these "connection calls to services" and bring down their time requirement so that the switchover suppress value can also be reduced. There is more than one service involved, and hence we have to optimize several of them to get the suppress time requirement down. But worst case, if some services just cannot be optimized with minimal work, a larger suppress value (like 15 seconds) should not have any other detrimental side effects.
The ASR9K nV Edge High Availability model is unique in that it is probably the only High Availability model where we "expect" topology changes during a backup-to-primary switchover, such as during a rack / chassis reload. If the primary (DSC) chassis is reloaded, and that chassis had IGP interface(s) on its line card(s), then when the backup-DSC takes over as primary-DSC it has to do switchover processing AND at the same time process topology changes due to the loss of interfaces.
But as we know, to handle switchover cases gracefully, it is normal for customers to configure Non-Stop Forwarding (NSF) under IGP protocols like ISIS. So when the DSC chassis is reloaded, the new DSC (the old backup-DSC) will immediately start NSF on the IGP (say ISIS), and as we know about regular NSF, it can take many seconds (default 90 seconds, changeable via the nsf lifetime CLI) for NSF to complete – and the RIB will be informed about topology changes only AFTER NSF is complete. A minimal NSF configuration is sketched below.
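A sketch of that NSF configuration (the 120-second lifetime is illustrative; the default mentioned above is 90 seconds):
router isis Cluster-L3VPN
 nsf cisco
 nsf lifetime 120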
So during this time frame, the new DSC chassis will have stale routes pointing to interfaces that no longer exist (those that were on the chassis that was reloaded). This can lead to a long period of traffic loss. So what is the solution? If we think through this problem, what we are asking for is for CEF / FIB to change the forwarding tables even though the routing protocols / RIB have not asked it to do so. And this exactly fits the bill for the LFA-FRR feature. So without LFA-FRR, the convergence time during a chassis reload in an nV Edge system will be bad. LFA-FRR is a simple configuration; a basic example is below (the interface name and the fast-reroute option shown are illustrative). Note that LFA-FRR can work with ECMP paths – one path in the ECMP list can back up the other path in the ECMP list.
router isis Cluster-L3VPN
 address-family ipv4 unicast
 !
 interface TenGigE0/1/0/1   <== illustrative interface name
  address-family ipv4 unicast
   fast-reroute per-link
BFD Multihop is one feature that is supported on a single chassis, but not on the nV Edge system.
The nV Edge system also doesn’t support clock / syncing features like syncE.
After applying all the required caveats mentioned in Section 7, at the time of writing (4.2.3 24I early image time frame), the convergence number for an L3VPN profile with an access-facing link bundle (one member from each chassis) and core-facing ECMP (two IGP links, one from each chassis), with 3K eBGP sessions and one million routes, is around 8 seconds for a chassis reload (of either chassis) in the nV Edge system. The number will certainly differ for different profiles; each profile needs separate measurement and qualification / tuning. The obvious question is: how much lower can it get? The natural comparison we end up making is with an RSP failover. The factors that are (very) different between an RSP failover and a chassis reload are:
Because of all these reasons, it is almost impossible to achieve anything better than, say, 3 to 4 seconds (currently 8 seconds) for the L3VPN profile mentioned at the beginning of this section. And closing the remaining ~5 seconds would come only after quite a high engineering investment.
These CLIs are visible only to cisco-support users. There are many more CLIs than those explained below; many of them relate purely to tuning the internal control port error-retry logic inside the driver and are unlikely to be of use to anyone other than the engineers. The ones explained below are quite "generic", relating to the UDLD protocol and similar areas.
The SNMP agent and MIB specific configuration have no differences for the nV Edge scenario.
With up to four RSPs in an nV Edge system, each chassis having an "Active / Standby" pair of RSPs and the nV Edge system altogether having a "primary-DSC / backup-DSC" pair, there are multiple redundancy elements that come into the picture. There is "node redundancy", which says which node in a given chassis is "Active" and which is "Standby". There is node-group redundancy, which says which RSP in an nV Edge system is the "primary-DSC" and which is the "backup-DSC". And there are "process groups", which have their own redundancy characteristics – for example, protocol stacks (say OSPF) have redundancy across the primary-DSC/backup-DSC pair, whereas some other "system" software elements have redundancy across the "Active / Standby" RSPs in each chassis. This relationship is called "process group" redundancy. The table below summarises the MIBs.
Currently provides DSC chassis active/standby node pair info. In nV Edge scenario should provide DSC primary/backup RP info. Provides switchover notification.
Status only; no relationships
Provides redundancy state info for each node. No relationships indicated.
Extension to ENTITY-STATE-MIB which defines notifications (traps) on redundancy status changes.
Both status and relationships
Process group redundancy relationships & node status
Define redundancy group types:
Node redundancy pairs would be shown in groups with the node redundancy group type. Primary/backup nodes for each process group placed on them.
CISCO-RF-MIB is currently used to monitor the node redundancy of the DSC chassis' active/standby RPs. The MIB definition is limited to representing redundancy relationships, status, and other info for only 2 nodes.
CISCO-ENTITY-REDUNDANCY-MIB is used to model the redundancy relationships of pairs of nodes. The redundant node pairs are defined as redundancy groups with a group type indicating the group is a redundant node pair. The members of the group would be the nodes within the node-redundant pair.
The CISCO-ENTITY-REDUNDANCY-MIB is also used to model the redundancy relationships of the node pairs pertaining to specific process groups. The redundant process groups are defined as redundancy groups with a group type indicating the group is a redundant process group. The members of the group are the nodes where the primary and backup processes for that process group are placed.
The inventory information for each chassis and the respective physical entities will be available just as in the single chassis. The difference for ASR9K nV Edge (as in CRS multi-chassis) is the presence of a top-level entity in the hierarchy which acts as a container of the chassis entities. This entity will have entPhysicalClass value of ‘stack’.
An IRL interface is in ALL respects just a regular IOS-XR interface. All the standard interface MIBs for reporting errors / alarms / faults on a link apply to the IRL links. All the standard MIBs for interface statistics also apply to these links.
One missing MIB is for the "unidirectional" forwarding state of the IRL. For example, if there is excessive packet loss on an IRL that makes it go into the UDLD "unidirectional" state, that is a fault scenario and that IRL link is removed from all forwarding tables, even though the physical state of the interface remains UP. Getting this event reported via MIB will require an enhancement. One approach would be to simply shut the link down on a unidirectional fault, so that the standard ifmib can trap the event.
The CRS multi-chassis system has implemented some MIBs for the Control Ethernet aspects of the system; they are currently not implemented for the nV Edge system. But since the nV Edge control Ethernet is very similar to the CRS multi-chassis Control Ethernet, we can implement those exact MIBs for the nV Edge system as well. That would be an enhancement work item.
The Control Ethernet MIB frontend is a collection of MIBs as below.
Below we list the most important syslog error messages that indicate a fault with the control Ethernet module or links.
LOG_INFO message: This message pops up if the user inserts a Cisco-unsupported SFP in the front panel SFP+ port. The user has to replace the SFP with a Cisco-supported one, and the port will automatically get detected / used again.
LOG_CRIT message: This message pops up if a particular control Ethernet link has a fault and keeps "flapping" too frequently. If that happens, the port is disabled and will not be used for control link packet forwarding until the user issues the above-mentioned CLI.
ce_switch_srv: %PLATFORM-CE_SWITCH-6-UPDN : Interface 12 (SFP+_00_10GE) is down
These messages pop up whenever the physical state of a Control Plane link (one of the front panel links) changes up/down – much like a regular interface up/down event notification. The "Interface 12" and "Interface 13" (the 12 and 13) are just internal numbers for the two front panel ports. These messages will pop up any time a remote RSP goes down or boots up, because at those instants the remote-end laser goes down/up. But during normal operation of the nV Edge system, when there are no RSP reboots, these messages are not expected and indicate a problem with the link / SFP, etc.
Here we describe the syslog / error messages related to the IRL links that can appear in the logs, so that the user is aware of what those messages mean.
The interface name being referred to here can be found with "show im database ifhandle <interface handle>". That particular interface has encountered a unidirectional forwarding scenario and will be removed from the forwarding tables – no more data will be forwarded across that IRL. We will try restarting UDLD on that link after 10 seconds to see if it can become bidirectional again; this retry keeps happening every 10 seconds until the link goes bidirectional or the user unconfigures "nv edge interface" on that link.
All the IRL links are present on the same line card (slot). This is not good for resiliency: if that line card goes down, all the IRL links go down with it. So this message periodically pops up asking the user to configure the IRLs to be spread across at least two slots.
The total number of IRLs in the system (maximum 16) is recommended to be spread across NO MORE than 5 line cards (slots). This is purely for debuggability reasons: debugging problems across more than 5 IRL line cards becomes a complex affair, hence the recommendation to limit the spread to a maximum of 5 slots.
We recommend having at least two IRL links for resiliency reasons.
The output of all CLIs mentioned below can be redirected to a file / tftp server, etc. When in doubt as to which module traces to collect, it is better to just collect all of the below.
If there are issues with the IRL links, please collect the below information. All CLIs are in regular exec mode.
1. show nv edge data trace all error location all
2. show nv edge data trace all event location all
If there are issues with control plane connectivity, please collect the below information. The below CLIs are in regular exec mode.
1. show nv edge control switch links detail location <each of the four RSPs>
2. show nv edge control control-link-protocols location <each of the four RSPs>
3. show nv edge control clm-trace lib error location <each of the four RSPs>
4. show nv edge control clm-trace lib events location <each of the four RSPs>
5. show nv edge control control-link-debug-counts location <each of the four RSPs>
The below CLI is in admin exec mode.
1. (admin)#show udld trace location <each of the four RSPs>
All the below CLIs are in admin exec mode.
1. show tech dsc <each of the four RSPs>
2. show dsc trace <each of the four RSPs>
3. show dsc <each of the four RSPs>
4. show dsc history <each of the four RSPs>
5. show dsc stats <each of the four RSPs>
Xander Thuijs CCIE #6775
Principal Engineer ASR9000
Content courtesy of the ASR9000 nV-edge team