on 06-11-2013 05:22 PM
Table of Contents
1 Glossary
2 Converting Single Chassis ASR9K to nV Edge
2.1 Supported hardware and caveats
2.2 Booting with different images on each chassis
2.3 Configuring the Management Ethernet network for nV Edge
3 nV Edge Control Plane
3.1 High Redundancy wiring (Recommended)
3.2 RSP in each chassis (NOT recommended)
3.3 Control Plane UDLD
3.4 Control Link status CLI
3.5 Control Link shut/no shut CLIs
3.6 Miscellaneous control link CLIs
4 nV Inter Rack Link (IRL) connections
4.1 UDLD on IRL links
4.2 What are the IRL links used for?
4.3 nV IRL “threshold monitor”
4.3.1 Backup-rack-interfaces config
4.3.2 Specific-rack-interfaces config
4.3.3 Selected-interfaces config
4.3.4 What is the default config?
4.4 Default QoS on IRL links
4.5 Configurable QoS on IRL interfaces
4.6 IRL packet encapsulation and overhead
4.7 IRL load balancing
4.7.1 Ingress IP packet
4.7.2 Ingress MPLS packet
4.7.3 L2 Unicast
4.7.4 L2 Flood
4.7.5 L3 Multicast
5 nV Edge Redundancy model
5.1 Redundancy switchover: Control Ethernet readiness
5.2 RSP/Chassis failure detection in ASR9K nV Edge
6 Split Node
6.1 All IRL links go away
6.2 All Control links go away
6.3 All Control AND IRL links go away
7 Feature configuration caveats
7.1 Virtual Interfaces (Link bundle / BVI) mac-address
7.2 Link Bundle “switchover suppress-flap”: Rack / Chassis reload
7.3 IGP protocols and LFA-FRR
7.4 Multicast convergence during rack reload or OIR
8 Feature Gaps
9 Convergence numbers
10 “Debugging mode” CLIs – Cisco support only
11 nV Edge MIBs
11.1 Redundancy related MIBs
11.1.1 Node Redundancy MIBs
11.1.2 Process Redundancy MIBs
11.1.3 Inventory Management
11.2 IRL monitoring MIBs
11.3 Control Ethernet monitoring MIBs
11.4 Control Ethernet Syslog / error messages
11.5 Data Link Syslog / error messages
12 Debugs and Traces
13 Cluster Rack-By-Rack Upgrade
13.1 Overview
13.2 Prerequisites
13.3 Upgrade Instructions (Scripted Method)
13.3.1 Script Setup
13.3.2 Script execution
13.3.3 Verification
13.4 Upgrade Instructions (Manual Method)
13.5 Install Abort Procedure
13.6 Converting an nV Edge Cluster to a single chassis system
nV – Network Virtualization
nV Edge – Network Virtualization on Edge routers
IRL – Inter Rack Links (for data forwarding)
Control Plane – the hardware and software infrastructure that deals with messaging / message passing across processes on the same or different nodes (RSPs or LCs).
Data Plane – the hardware and software infrastructure that deals with forwarding, generating and terminating data packets.
DSC – Designated Shelf Controller (the Primary RSP for the nV edge system)
Backup-DSC – Backup Designated Shelf Controller
UDLD – Uni Directional Link Detection protocol. An industry standard protocol used in Ethernet networks for monitoring link forwarding health.
FPD – Field Programmable Device (FPGAs and other programmable hardware whose firmware can be upgraded).
This section assumes that the single chassis boxes are running 4.2.1 or later images, with the latest FPD versions. Check, and if necessary upgrade, using the following commands on both chassis:
admin show hw-module fpd location all
admin upgrade hw-module fpd all location all
(admin)#show inventory chassis
NAME: "chassis ASR-9006-AC", DESCR: "ASR-9006 AC Chassis"
PID: ASR-9006-AC, VID: V01, SN: FOX1435GV1C
NAME: "chassis ASR-9006-AC", DESCR: "ASR-9006 AC Chassis"
PID: ASR-9006-AC, VID: V01, SN: FOX1429GJSV
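The inventory output above gives the chassis serial numbers; these are then paired with rack numbers in admin config mode. A minimal sketch of that configuration (serial numbers taken from the inventory output above; command syntax as in 4.2.x releases, verify on your system):

RP/0/RSP0/CPU0:ios(admin-config)#nv edge control serial FOX1435GV1C rack 0
RP/0/RSP0/CPU0:ios(admin-config)#nv edge control serial FOX1429GJSV rack 1
RP/0/RSP0/CPU0:ios(admin-config)#commit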
The above configuration builds a “database” on Rack0 of the chassis serial numbers and their assigned rack numbers. One purpose of this is to verify, as a security mechanism, whether a chassis that tries to join this nV Edge system is allowed to be part of it.
[Diagram: control Ethernet wiring. Each of RSP0 and RSP1 on Rack0 has two front panel SFP+ ports (SFP+0, SFP+1), cross-connected to the SFP+ ports of RSP0 and RSP1 on Rack1 – four links in total.]
NOTE: The Control Ethernet cabling should be done only after all the previous steps have been executed and both chassis are ready to “join” an nV Edge system. The control plane network should not be connected before the nV configuration is completed
NOTE: ALL the interfaces on the chassis having the backup-DSC RSP will be in SHUTDOWN state until at least one Inter-Rack Data Link is in forwarding state. Please refer to Section 4.3 for more details.
At any time in the nV Edge system, one of the RSPs (in either Rack0 or Rack1) will be the “master” for the entire nV Edge system, and another RSP in the system (again in Rack0 or Rack1) will be the “backup”. Using CRS multi-chassis terminology, the “master” is called the primary-DSC and the “backup” is called the backup-DSC. The primary-DSC runs all the primary protocol stacks (OSPF, BGP, etc.) and the backup-DSC runs all the backup protocol stacks.
To find out which RSP is primary-DSC and which is backup-DSC, use the below command in admin exec mode.
RP/0/RSP0/CPU0:ios(admin)#show dsc
---------------------------------------------------------
Node ( Seq#) Role Serial# State
---------------------------------------------------------
0/RSP0/CPU0 ( 0) ACTIVE FOX1432GU2Z BACKUP-DSC
0/RSP1/CPU0 ( 1223769) STANDBY FOX1432GU2Z NON-DSC
1/RSP0/CPU0 ( 1279475) ACTIVE FOX1441GPND PRIMARY-DSC
1/RSP1/CPU0 ( 1279584) STANDBY FOX1441GPND NON-DSC
As can be seen above, the Rack1 RSP0 (1/RSP0/CPU0) is the primary-DSC and Rack0 RSP0 (0/RSP0/CPU0) is the backup-DSC. The Primary and Backup DSCs do not have any “affinity” towards any one chassis or any one RSP. Whichever chassis in the nV edge system boots up first will likely select one of its RSPs as the primary-DSC.
Another point to note is that the “Active” / “Standby” states of the RSPs, familiar concepts from the single chassis mode of operation, are superseded by the primary-DSC / backup-DSC roles in an nV Edge system. For example, in a single chassis system the primary and backup protocol stacks run on the Active and Standby RSPs of that chassis. As discussed in the preceding paragraph, that is no longer the case in an nV Edge system – there, the primary-DSC and backup-DSC run the primary and backup protocol stacks.
In an nV Edge system, if for whatever reason both chassis end up having non-identical XR software and/or SMUs installed (this could happen, for example, if a chassis is forced to boot a particular image by a ROMMON setting), the chassis that boots up later will report its version details to the DSC chassis (normally Rack0), and the DSC chassis will “reject” that chassis if the versions do not match.
As on a single chassis, the Management Ethernet interfaces can be configured on the nV Edge cluster. The question often asked is which subnet to put the four interfaces in, and what the available options are. Three options are available:
The nV Edge control plane provides software and hardware extensions to create a “unified” control plane for all the RSPs and line cards on both nV Edge chassis. Control plane packets are forwarded from chassis to chassis “in hardware”, as you will see in the sections below. Control plane multicast and similar operations are also done in hardware for both chassis – so there is no control plane performance impact from having two chassis instead of one.
The nV Edge control plane links have to be direct L1 connections; no network or intermediate routing/switching devices are allowed in between. Some details of the control plane connections are provided below to give a better understanding of the reasoning behind our recommendations. The control Ethernet links (front panel SFP+ ports) are configured in 1Gig mode of operation.
[Diagram: high redundancy control Ethernet wiring. SFP+0 and SFP+1 of each RSP on Rack0 are cross-connected to the SFP+ ports of both RSPs on Rack1, so that each RSP's switch reaches both switches on the remote chassis – four links in total.]
As seen in the diagram above, each RSP in each chassis has an Ethernet switch to which all the CPUs in the system (line card CPUs, RSP CPUs, any other CPUs in the system) connect. So each CPU connects to two switches – one on each RSP. At any point in time, only one of the switches is “active” and switching the control plane packets; the other is “inactive” (regardless of whether the system is nV Edge or single chassis). The “active” switch can be on either of the RSPs in the chassis – whichever switch can ensure the best connectivity across all the CPUs in the system.
The two SFP+ front panel ports on the RSP-440 are direct ports plugging into the switch on the RSP. So, as shown in the diagram, in an nV Edge system the simple goal is to connect each RSP (the switch inside the RSP) to each switch on the remote chassis. In the above case, if any of the links goes down, there are three possible backup links. Also, at any point in time only one of the links is used for forwarding control plane data; the other three links are in “standby” state.
Connecting the two chassis with just 2 EOBC links (i.e., RSP0 to RSP0 and RSP1 to RSP1) is NOT recommended and is discouraged, as it does not provide the required resilience.
The control Ethernet is the heart of the system – if there is anything wrong with it, it can seriously degrade the nV edge system. So it is HIGHLY recommended to use all four control Ethernet links.
Here is a view of the RSP440 and 9001 EOBC ports. These ports cannot be used for anything other than EOBC; they cannot be used or configured as L2 or L3 data ports.
In the case of a single RSP-per-chassis nV Edge topology, the below will be the wiring model. But again, this is not recommended because of resiliency reasons. If the only RSP in a chassis goes down, the entire chassis and all the line cards in the chassis also go down.
[Diagram: single RSP-per-chassis wiring. SFP+0 and SFP+1 of RSP0 on Rack0 connect to SFP+0 and SFP+1 of RSP0 on Rack1 – two links in total.]
UDLD runs on the control plane links to ensure the bi-directional forwarding health of the links. UDLD runs at a 200 msec interval x 5, i.e., an expiry interval of 1 second. This means that if a control link is uni-directional for 1 second, the RSPs will take action to switch the control plane link to one of the three standby links.
Note that the one second detection is only for unidirectional failures – for a physical link fault (like fiber cut), there will be interrupts triggered with the fault and the link switchover to the standby links will happen in milliseconds.
The front panel SFP+ ports are referred to as ports “0” and “1” in the show command below. So each RSP has two of these ports, and the command below shows which port on which RSP is connected to which other port on which other RSP.
In the example below:
Port “0” on 0/RSP0 is connected to port “0” on 1/RSP0.
Port “1” on 0/RSP0 is connected to port “1” on 1/RSP1
Port “0” on 0/RSP1 is connected to port “0” on 1/RSP1
Port “1” on 0/RSP1 is connected to port “1” on 1/RSP0
Also, the “port pair” that is “active” and used for forwarding control Ethernet data is the link between port “0” on 0/RSP0 and port “0” on 1/RSP0, as shown by the Forwarding state below. All other links are just backup links.
The “CLM table version” is also a useful number to note. If this number changes, it means the control link UDLD is flapping. In a good “stable” condition, that number should not change.
RP/0/RSP0/CPU0:ios# show nv edge control control-link-protocols location 0/RSP0/CPU0
Priority lPort Remote_lPort UDLD STP
======== ===== ============ ==== ========
0 0/RSP0/CPU0/0 1/RSP0/CPU0/0 UP Forwarding
1 0/RSP0/CPU0/1 1/RSP1/CPU0/1 UP Blocking
2 0/RSP1/CPU0/0 1/RSP1/CPU0/0 UP On Partner RSP
3 0/RSP1/CPU0/1 1/RSP0/CPU0/1 UP On Partner RSP
Active Priority is 0
Active switch is RSP0
CLM Table version is 2
Each RSP has two front panel EOBC links, numbered 0 and 1. The CLI to shut the links is as below:
RP/1/RSP0/CPU0:A9K-Cluster-IPE(admin-config)#nv edge control control-link disable <0-1> location <>
On shutting a control port, the CLI will also set a rommon variable on that RSP like “CLUSTER_0_DISABLE = 1” if port 0 is disabled and “CLUSTER_1_DISABLE = 1” if port 1 is disabled. As long as this rommon variable is set, neither rommon nor IOS-XR will ever enable that port.
The behavior when ALL the control links are shut is that both chassis become DSC. If the IRL links are active, however, one of the chassis will reload – and each time it comes back up with the IRL links active, it will reboot again.
Currently this is the recommended procedure if all the control links are shutdown...
NOTE: The above is admittedly a cumbersome and lengthy procedure (but only needed if we shut ALL control links). In 4.2.3 the procedure to unshut becomes very simple – on whichever chassis did not reboot, go to admin config mode and enter “no nv edge control control-link disable <port> <location>”; this will automatically take care of syncing with the other chassis as well.
SFP Plugged in : 0x00000001 (1)
SFP Rx LOS : 0x00000000 (0)
SFP Tx Fault : 0x00000000 (0)
SFP Tx Enabled : 0x00000001 (1)
The “SFP Plugged in” value should be 1 if an SFP is present. “SFP Rx LOS” should be 0; otherwise there is an Rx Loss of Signal (an error!). “SFP Tx Fault” should be 0; otherwise there is an SFP fault (an error!). “SFP Tx Enabled” should be 1; otherwise the SFP has not been enabled by the control Ethernet driver (also an error!).
Supported EOBC SFPs.
In 4.2.1
SFP-GE-S= | 1000BASE-SX SFP (DOM), MMF, 550/220m |
In 4.3.0
SFP-GE-S= | 1000BASE-SX SFP (DOM), MMF, 550/220m |
GLC-SX-MMD= | 1000BASE-SX SFP, MMF, 850nm, 550m/220m, DOM |
Admin UP : 0x00000001 (1)
SFP supported cached : 0x00000001 (1)
PHY status register : 0x00000070 (112)
An “Admin UP” value of 0 means the “nv edge control control-link disable <port> <location>” CLI has been configured; without that config, the value is 1, which is the default. “SFP supported cached” indicates whether a Cisco supported SFP is plugged in – value 1 means the SFP is supported, 0 means it is not. If the control link has an SFP plugged in, a cable connected to a remote end that is also up, and the laser and link are good, then the “PHY status register” should have a value of 0x70 – it is an internal PHY register indicating that the link is all good. If there is no cable, no SFP, a bad cable, or a bad link, it will not be 0x70; this can sometimes be useful for Cisco support during debugging.
The IRL connections are required for traffic that enters one chassis and leaves through an interface on the other chassis of the nV Edge system. The IRL links must be 10 Gig links and direct L1 connections – no routed/switched devices are allowed in between. There can be a maximum of 16 such links between the chassis. A minimum of 2 links is recommended for resiliency (Section 4.7 discusses load balancing across links), and the two links should be on two separate line cards, again for resiliency in case one line card goes down due to a fault. The number of IRL links needs to be sized based on the number of cards in the system and the traffic expected to cross the IRL during a failure.
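The sizing consideration above can be sketched as a simple calculation (illustrative only; the 10G link speed comes from the IRL requirement, while the extra spare link for single-card failure tolerance is an assumption):

```python
import math

def irl_links_needed(peak_inter_rack_gbps: float, link_gbps: float = 10.0,
                     redundancy: int = 1, max_links: int = 16) -> int:
    """Estimate how many IRL links to provision.

    Sizes for the peak traffic expected to cross racks during a failure,
    then adds `redundancy` spare links (e.g. to survive a line card loss).
    """
    needed = math.ceil(peak_inter_rack_gbps / link_gbps) + redundancy
    needed = max(needed, 2)  # the text recommends at least 2 links
    if needed > max_links:
        raise ValueError("exceeds the 16-link IRL maximum")
    return needed

print(irl_links_needed(25.0))  # 25G peak -> 3 links + 1 spare = 4
```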
The configuration of an interface as IRL is simple, as shown below:
interface tenGigE 0/1/1/1
nv
edge
interface
!
Add this config to the IRL interfaces on both chassis, of course! We run UDLD over these links to monitor bi-directional forwarding health. Only when UDLD reports that the echo and echo response are all fine (standard UDLD state machine) do we place the interface into “Forwarding” state; until then the interface is in “Configured” state. So an IRL interface might be “Configured” but not yet “Forwarding”; once it is both, it will be used for forwarding data across chassis.
RP/0/RSP0/CPU0:ios#show nv edge data forwarding location 0/RSP0/CPU0
nV Edge Data interfaces in forwarding state: 1
tenGigE 0_1_1_1 <--> tenGigE 1_1_0_1
nV Edge Data interfaces in configured state: 2
tenGigE 1_1_0_1
tenGigE 0_1_1_1
The above CLI output says that there are two IRLs in “Configured” state – of course, one on each rack. It also says that there is one “pair” of IRLs in “Forwarding” state, the “pair” being one interface from each rack. The UDLD protocol automatically detects which interface is connected to which other interface and forms the “pair”.
So if you have configured IRLs but you don't see the line “nV Edge Data interfaces in forwarding state:” in your CLI output, something is wrong. We recommend going through the standard interface checklist:
-> Are the cables and SFPs all good ?
-> Are the interfaces unshut and Up/Up ?
-> Are there interface drops or errors ?
-> If you are conversant with the packet path, are there any other packet path drops ?
The UDLD timers on the IRL links are set to 40 milliseconds times 5 hellos, i.e., around 200 msecs as the expiry timeout. That means any uni-directional problem with the IRL links is detected and corrected in around 250 msecs (200 msecs + a delta for processing overheads).
If you want to see the UDLD state machine on the line card hosting these links, the below CLI can be used. The first bracketed number on the “Interface” line (0x60002c0 in the example below) is what we call the “ifhandle”. The interface name corresponding to it can be displayed using the CLI “show im database ifhandle <ifhandle> location <line card>”.
In the example below, the UDLD state is Bidirectional, which is the desired correct state when things are working fine.
RP/0/RSP0/CPU0:ios#show nv edge data protocol all location 0/1/CPU0
Interface [0x60002c0][769]
---
Port enable administrative configuration setting: Enabled
Port enable operational state: Enabled
Current bidirectional state: Bidirectional
Current operational state: Advertisement - Single neighbor detected
Message interval: 20 msec
Time out interval: 10000 msec
Entry 1
---
Expiration time: 140 msec
Device ID: 1
Current neighbor state: Bidirectional
Device name: CLUSTER_RACK_01
Port ID: [0x46000100][769]
Neighbor echo 1 device: CLUSTER_RACK_00
Neighbor echo 1 port: [0x60002c0][769]
Message interval: 20 msec
Time out interval: 100 msec
CDP Device name: ASR9K CPU
The IRL links are used for forwarding packets whose ingress and egress interfaces are on separate racks. They are also used for all protocol Punt packets and protocol Inject packets. As explained in Section 2, the protocol stack “Primary” runs on the primary-DSC RSP in one of the chassis. So if a protocol punt packet comes in on an interface in another chassis, it has to be punted to the primary-DSC RSP in the remote chassis. This punt is done via the IRL. Similarly if the protocol stack on the primary-DSC wants to send a packet out of an interface on another chassis, that is also done via the IRL interfaces.
If the number of IRL links available for forwarding goes below a certain threshold, the remaining IRLs may get congested and more and more inter-rack traffic will get dropped. So the IRL monitor provides a way of shutting down other ports on the chassis if the number of IRL links goes below a threshold. The available commands are shown below:
RP/0/RSP0/CPU0:ios(admin-config)#nv edge data minimum <minimum threshold> ?
backup-rack-interfaces Disable ALL interfaces on backup-DSC rack
selected-interfaces Disable only interfaces with nv edge min-disable config
specific-rack-interfaces Disable ALL interfaces on a specific rack
There are three modes of configuration possible.
With this configuration, if the number of IRLs goes below the configured <minimum threshold>, ALL interfaces on whichever chassis is hosting the backup-DSC RSP will be shut down. Again, note that the backup-DSC RSP can be on either of the chassis.
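A minimal sketch of this mode (the threshold value is illustrative):

RP/0/RSP0/CPU0:ios(admin-config)#nv edge data minimum 2 backup-rack-interfaces
RP/0/RSP0/CPU0:ios(admin-config)#commit

With this, if fewer than 2 IRLs are in forwarding state, every interface on whichever rack hosts the backup-DSC gets shut down.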
With this configuration, if the number of IRLs goes below the configured <minimum threshold>, ALL interfaces on the specified rack (0 or 1) will be shut down.
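For example (rack number and threshold are illustrative; verify the exact argument order on your release):

RP/0/RSP0/CPU0:ios(admin-config)#nv edge data minimum 2 specific-rack-interfaces 1
RP/0/RSP0/CPU0:ios(admin-config)#commit

With this, if fewer than 2 IRLs are in forwarding state, every interface on Rack1 gets shut down regardless of where the backup-DSC is.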
With this configuration, if the number of IRLs goes below the configured <minimum threshold>, the interfaces (on any rack) that are explicitly configured to be brought down will be shut down. How do we “explicitly” configure an interface (on any rack) to respond to IRL threshold events?
RP/0/RSP0/CPU0:ios(config)#interface gigabitEthernet 0/1/1/0
RP/0/RSP0/CPU0:ios(config-if)#nv edge min-disable
RP/0/RSP0/CPU0:ios(config-if)#commit
So in the above example, if the number of IRLs goes below the configured minimum threshold, interface Gig0/1/1/0 will be shut down.
The default config (if the customer does not configure any of the above explicitly) is equivalent to having configured “nv edge data minimum 1 backup-rack-interfaces”. That means if the number of IRLs in forwarding state goes below 1 (i.e., not even one forwarding IRL remains), ALL the interfaces on whichever rack has the backup-DSC get shut down – meaning all traffic on that rack stops being forwarded.
This might make some customers happy, some unhappy. The behavior can be turned off or changed through the CLI “nv edge data minimum 0 backup-rack-interfaces” – this effectively says that only if the number of forwarding IRLs goes below 0 (which will never happen) should we bother shutting any interface on any rack.
When an interface is configured as an IRL link, we install 5 absolute priority queues on the port in both the ingress and egress directions. The priorities are below
The IRL links do not allow “user configurable” MQC policies on the IRL interfaces themselves. The classification of “punt / inject” and “multicast” are done “internally” in microcode – that is, other than being a punt/inject or multicast packet, there is no way by which we can “influence/force” a packet to go into the first two queues.
Which packets get into the last three queues can be influenced – by having ingress QoS policies that mark packets appropriately to a CoS value of 0, 1 or 2. There is no other way to influence what gets into these queues. The queue id selected on the ingress chassis's IRL is carried across in the VLAN CoS bits; the egress chassis's IRL that receives the packet uses the queue id encoded in the VLAN CoS to select the queues it uses on ingress (when it receives the packets from the remote chassis).
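The queue-id-in-CoS scheme described above can be illustrated with the general 802.1Q tag layout (a sketch of the standard TCI bit fields, not the actual microcode):

```python
def encode_tci(cos: int, dei: int, vlan_id: int) -> int:
    """Pack an 802.1Q Tag Control Information field:
    3-bit PCP/CoS, 1-bit DEI, 12-bit VLAN ID."""
    assert 0 <= cos <= 7 and dei in (0, 1) and 0 <= vlan_id <= 4095
    return (cos << 13) | (dei << 12) | vlan_id

def decode_cos(tci: int) -> int:
    """Recover the CoS bits the egress chassis uses to pick its queue."""
    return (tci >> 13) & 0x7

tci = encode_tci(cos=2, dei=0, vlan_id=1)  # IRL frames use vlan-id 1
print(hex(tci), decode_cos(tci))
```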
The CLI to display the nV Edge QoS queues is shown below, using an IRL interface with the configuration shown. The subslot number 0 in the example is the subslot in which the MPA (the pluggable adaptor) sits on a MOD-80/160 line card in the ASR9K. If the line card is not of a type that supports pluggable adaptors, just use 0 for the subslot. The port number 1 used in the example is simply the last number in the 1/1/0/1 notation.
The drops (if any) in these queues are aggregated and reflected in the “show interface” drops also. The standard interface MIBs can be used for monitoring these drops. Note that the individual queue drops are not exported to MIBs, only the aggregate drops are exported as the interface drops. Also the IRL links are just regular interfaces, so the regular interface MIBs will all work on IRLs also.
RP/0/RSP0/CPU0:ios#sh running-config interface gigabitEthernet 1/1/0/1
interface GigabitEthernet1/1/0/1
nv
edge
interface
!
RP/0/RSP0/CPU0:ios#show qoshal cluster subslot 0 port 1 location 1/1/CPU0
Cluster Interface Queues : Subslot 0, Port 1
===============================================================
Port 1 NP 0 TM Port 17
Ingress: QID 0xa8 Entity: 0/0/0/4/21/0 Priority: Priority 1 Qdepth: 0
StatIDs: commit/fast_commit/drop: 0x5f0348/0x0/0x5f0349
Statistics(Pkts/Bytes):
Tx_To_TM 681762/140538069
Total Xmt 681762/140538069 Dropped 0/0
Ingress: QID 0xa9 Entity: 0/0/0/4/21/1 Priority: Priority 2 Qdepth: 0
StatIDs: commit/fast_commit/drop: 0x5f034d/0x0/0x5f034e
Statistics(Pkts/Bytes):
Tx_To_TM 0/0
Total Xmt 0/0 Dropped 0/0
Ingress: QID 0xab Entity: 0/0/0/4/21/3 Priority: Priority 3 Qdepth: 0
StatIDs: commit/fast_commit/drop: 0x5f0357/0x0/0x5f0358
Statistics(Pkts/Bytes):
Tx_To_TM 0/0
Total Xmt 0/0 Dropped 0/0
Ingress: QID 0xaa Entity: 0/0/0/4/21/2 Priority: Priority Normal Qdepth: 0
StatIDs: commit/fast_commit/drop: 0x5f0352/0x0/0x5f0353
Statistics(Pkts/Bytes):
Tx_To_TM 0/0
Total Xmt 0/0 Dropped 0/0
Ingress: QID 0xac Entity: 0/0/0/4/21/4 Priority: Priority Normal Qdepth: 0
StatIDs: commit/fast_commit/drop: 0x5f035c/0x0/0x5f035d
Statistics(Pkts/Bytes):
Tx_To_TM 0/0
Total Xmt 0/0 Dropped 0/0
Egress: QID 0xc8 Entity: 0/0/0/4/25/0 Priority: Priority 1 Qdepth: 0
StatIDs: commit/fast_commit/drop: 0x5f03e8/0x0/0x5f03e9
Statistics(Pkts/Bytes):
Tx_To_TM 3372382/697778537
Total Xmt 3372382/697778537 Dropped 0/0
Egress: QID 0xc9 Entity: 0/0/0/4/25/1 Priority: Priority 2 Qdepth: 0
StatIDs: commit/fast_commit/drop: 0x5f03ed/0x0/0x5f03ee
Statistics(Pkts/Bytes):
Tx_To_TM 0/0
Total Xmt 0/0 Dropped 0/0
Egress: QID 0xcb Entity: 0/0/0/4/25/3 Priority: Priority 3 Qdepth: 0
StatIDs: commit/fast_commit/drop: 0x5f03f7/0x0/0x5f03f8
Statistics(Pkts/Bytes):
Tx_To_TM 0/0
Total Xmt 0/0 Dropped 0/0
Egress: QID 0xca Entity: 0/0/0/4/25/2 Priority: Priority Normal Qdepth: 0
StatIDs: commit/fast_commit/drop: 0x5f03f2/0x0/0x5f03f3
Statistics(Pkts/Bytes):
Tx_To_TM 0/0
Total Xmt 0/0 Dropped 0/0
Egress: QID 0xcc Entity: 0/0/0/4/25/4 Priority: Priority Normal Qdepth: 0
StatIDs: commit/fast_commit/drop: 0x5f03fc/0x0/0x5f03fd
Statistics(Pkts/Bytes):
Tx_To_TM 0/0
Total Xmt 0/0 Dropped 0/0
RP/0/RSP0/CPU0:ios#
To support more flexible QoS options for customers who want more than the default QoS mentioned in Section 4.4, we provide an option for configuring regular MQC policies in the EGRESS direction (there is no ingress support), with some limitations. The limitation, in one simple sentence, is that an MQC policy configured on an IRL does not have the ability to access the packet contents – there is no way of figuring out whether the packet going out on the IRL is IPv4 or IPv6, etc. So none of the MQC features that need to look into the packet will work. So how exactly is it used?
The typical use case is that the customer configures an ingress MQC policy map on a regular (non-IRL) ingress interface. That ingress MQC policy can parse the packet and set a “qos-group” for it. The egress IRL policy-map can then match on this qos-group and apply features like queuing and shaping. Random detect can also be applied (not based on DSCP though – remember that needs access to packet contents), and marking is of course not supported.
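A sketch of this pattern (the class and policy names, the qos-group value, and the bandwidth split are all illustrative; the ingress interface and bandwidth-based queuing are assumptions – verify feature support on your release):

class-map match-any IRL_PRIO
 match qos-group 5
!
policy-map MARK_QOS_GROUP
 class class-default
  set qos-group 5
!
policy-map IRL_EGRESS
 class IRL_PRIO
  bandwidth percent 40
 class class-default
!
interface GigabitEthernet0/1/1/0
 service-policy input MARK_QOS_GROUP
!
interface TenGigE0/1/1/1
 service-policy output IRL_EGRESS
!

The ingress policy sets the qos-group on a regular interface; the egress policy on the IRL matches only on that qos-group, never on the packet contents.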
The user is not prevented from applying an MQC policy on the IRL even if that policy has features unsupported on the IRL. There is no config-level rejection of policies on the IRL interface yet (this may be enforced in later releases), so the user has to take care to configure only supported features – otherwise the behavior is unpredictable. For example, if the user configures an egress MQC policy on the IRL that does marking, the packet going out of the IRL will have its contents changed in some random location, and that might cause those packets to be dropped in the node or at the host!
The configuration of MQC on the IRL, the show commands, etc. are exactly the same as MQC on a regular interface (remember, an IRL is just a regular interface!).
The packet that goes out on the IRL carries a VLAN encapsulation with the VLAN hard-coded to vlan-id 1. The vlan-id itself doesn't really matter; we just use the VLAN CoS bits to carry over the packet priority, as mentioned in Section 4.4. That is 18 bytes of overhead. In addition there is around 24 bytes of overhead that depends very much on the kind of packet (L3 / L2 / multicast, etc.) being transported. So on average we have around 42 bytes of overhead.
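As a quick illustration of what that overhead means for usable IRL bandwidth (the 42-byte figure comes from the text; the packet sizes are examples):

```python
def irl_efficiency(payload_bytes: int, overhead_bytes: int = 42) -> float:
    """Fraction of IRL bandwidth carrying the original packet."""
    return payload_bytes / (payload_bytes + overhead_bytes)

# A 1500-byte packet spends ~2.7% of IRL bandwidth on encapsulation,
# while small 64-byte packets lose ~40%.
print(round(irl_efficiency(1500), 3), round(irl_efficiency(64), 3))
```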
IRL load balances packets based on flow; how a “flow” is defined varies from feature to feature. In general, for any given feature, the answer to “how does this feature's packet get load balanced across link bundle members?” also applies to load balancing across IRLs. IRL load balancing obeys exactly the same principles as link bundle member load balancing: a 32-bit hash value is calculated per packet/feature, and that hash (with some bit flips to avoid polarization) is used for IRL load balancing just as for link bundles.
Let us examine the different kinds of features briefly. This is by no means meant to be an exhaustive documentation of all the load balancing algorithms on the router, rather just to give an overview of the major classes of load balancing.
This is the standard tuple used for hash calculation for load balancing across link bundle members – source IP, destination IP, source port, destination port, and protocol type. It does not matter whether the egress is IP or MPLS; the ingress is all that matters.
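The flow-hash principle can be sketched as follows (illustrative only; the actual hash is computed in hardware and is not this function):

```python
import zlib

def select_irl(src_ip: str, dst_ip: str, proto: int,
               src_port: int, dst_port: int, num_irls: int) -> int:
    """Map a flow's 5-tuple to one IRL member, link-bundle style.
    All packets of a flow hash to the same link; different flows spread out."""
    key = f"{src_ip}|{dst_ip}|{proto}|{src_port}|{dst_port}".encode()
    flow_hash = zlib.crc32(key)  # stand-in for the 32-bit hardware hash
    return flow_hash % num_irls

a = select_irl("10.0.0.1", "10.0.0.2", 6, 1024, 80, 4)
b = select_irl("10.0.0.1", "10.0.0.2", 6, 1024, 80, 4)
print(a == b)  # the same flow always picks the same IRL
```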
If the incoming packet is MPLS, the forwarding engine looks deeper to see if the underlying packet is IP. If it is IP, then the standard IP hash tuple is used for calculating the hash. If the underlying packet is not IP, then just the labels from the label stack are used for calculating the hash. The label allocation mode (per CE or per VRF) has no impact on the hash.
Here, load balancing is done based on source/destination MAC addresses. Again, as explained initially, this is not an exhaustive answer – there are VPLS scenarios where the VC label hash is used instead.
For L2 flood traffic over link bundles there are multiple, elaborate modes of load balancing; for exhaustive documentation, refer to the L2 link bundle documentation. In general, there are two modes of load balancing, tied to the flooding mode in L2.
In this mode, to restrict L2 floods from reaching too many line cards, the hash is “statically” chosen based on the bridge group. So some bridge groups will be “tied” to one IRL and a few others to another IRL – the same behaviour as chosen for L2 over link bundles.
In this mode, the L2 flood is hashed in ucode based on the src/dst mac addresses.
L3 Multicast hashes multicast flows based on (S,G) and uses that hash to distribute packets across the IRLs – again the same technique used for distributing multicast packets across link bundle members.
There are four very simple rules that can always help in determining the primary-DSC and backup-DSC RSPs in an nV edge system.
With these four rules in place, in any given scenario we can figure out what happens if any RSP in either chassis goes down.
Before issuing “redundancy switchover”, it is good practice to check the control links in the system and verify that at least one backup link is available to take over. For example, in the output below, if we decide to issue “redundancy switchover” on 0/RSP0/CPU0, we have three more links (shown as “Blocking” or as “On Partner RSP”), and one of them can take over as the link connecting the control planes of both chassis (see Section 3.1 for details).
Sometimes, because of a fault (say a fiber cut or a bad SFP), a few links may be down, in which case you won't see those links (either as “Blocking” or as “On Partner RSP”). Unless there is at least one backup link, issuing a switchover will take away the only link that is “Forwarding”, and there will be no more control plane connectivity across the chassis.
NOTE: We are enhancing the “redundancy switchover” CLI to automatically check this condition and disallow the CLI if there are no backup links. Until this enhancement is implemented, it is recommended to perform this manual check.
RP/0/RSP0/CPU0:ios# show nv edge control control-link-protocols location 0/RSP0/CPU0
Priority lPort Remote_lPort UDLD STP
======== ===== ============ ==== ========
0 0/RSP0/CPU0/12 1/RSP0/CPU0/12 UP Forwarding
1 0/RSP0/CPU0/13 1/RSP1/CPU0/13 UP Blocking
2 0/RSP1/CPU0/12 1/RSP1/CPU0/12 UP On Partner RSP
3 0/RSP1/CPU0/13 1/RSP0/CPU0/13 UP On Partner RSP
In an ASR-9k nV Edge system, on failure of the Primary DSC node the RSP in the Backup DSC role becomes Primary, with the duties of being the system “master” RSP and hosting the active set of control plane processes. In the normal case for nV Edge, the Primary and Backup DSC nodes are hosted on separate racks. This means that the failure detection for the Primary DSC occurs via communication between racks.
The following mechanisms are used to detect RSP failures across rack boundaries:
Additionally messages are sent between racks for the purpose of Split Node avoidance / detection. These occur at 200ms intervals across the inter-chassis data links, and optionally can be configured redundantly across the RSP Management LAN interfaces. Refer to section 6.5 below.
Example HA Scenarios:
The Standby RSP within the same chassis initially detects the failure via the backplane FPGA. On failure detection this RSP will transition to the active state and notify the Backup DSC node of the failure via the inter-chassis control link messaging.
This case can occur in multiple ways, such as a power-cycle of the Primary DSC rack or a simultaneous soft reset of both RSP cards within the Primary rack.
The remote rack failure is initially detected by UDLD failure on the inter-chassis control link. The Backup DSC node then checks the state of UDLD on the inter-chassis data link. If the rack failure is confirmed by failure of the data link as well, the Backup DSC node becomes active.
UDLD failure detection occurs within 500ms; however, the time between control link and data link failure can vary, since these are independent failures detected by the RSP and LC cards. A windowing period of up to 2 seconds is needed to correlate the control and data link failures, and to allow split-brain detection messages to be received.
The keep-alive messaging between RSPs acts as a redundant detection mechanism, should UDLD fail to detect a stuck or reset RSP card.
Failure is initially detected by the UDLD protocol on the Inter-Chassis control links. Unlike the rack reload scenario above, the Backup DSC will continue receiving UDLD and keep-alive messages via the inter-chassis data link. Similar to the rack reload case, a 2 second windowing period is allowed to correlate the control/data link failures. If after 2 seconds the data link has not failed, or Split Node packets are being received across the Management LAN then the Backup DSC rack will reload to avoid the Split Node condition.
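The correlation-window decision described above can be sketched as follows. This is an illustrative reconstruction from the text, not Cisco source code; all names are invented:

```python
# Sketch of the Backup DSC decision after control-link UDLD failure,
# per the ~2 second correlation window described in the text.
# (Hypothetical function and argument names; not an actual IOS-XR API.)
CORRELATION_WINDOW_S = 2.0

def backup_dsc_action(control_link_failed: bool,
                      data_link_failed_within_window: bool,
                      split_node_beacons_on_mgmt_lan: bool) -> str:
    """Decide what the Backup DSC rack does once control-link UDLD fails."""
    if not control_link_failed:
        return "no-action"
    if data_link_failed_within_window and not split_node_beacons_on_mgmt_lan:
        # Control AND data links dead within the window: the remote
        # rack is really gone, so this rack becomes Primary.
        return "become-primary"
    # Data link still alive, or Split Node beacons seen on the Management
    # LAN: the remote rack is up, so reload this rack to avoid Split Node.
    return "self-reload"
```

Under these assumptions, only the combination "control link dead AND data link dead AND no management-LAN beacons" promotes the Backup DSC; anything else resets the backup rack.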
There are primarily two sets of links connecting the chassis in the nV edge system.
So the two sets of links together add up to at least FIVE wires. Let us see what happens when there is a fault and a complete set of control links, IRL links, or both become faulty.
[Figure: Chassis0 and Chassis1 connected by FOUR control plane links and at least one IRL link]
In this case, refer to Section 4.3 – both chassis will be up and functioning, but the interfaces on one of the chassis "might" get shut down based on what config is present on the box (or whether it is just the default config). Again, refer to Section 4.3 to understand which config is appropriate for you.
The two chassis in the nV edge system cannot function as "one entity" without control links. Each chassis periodically exchanges beacons over the IRL links. So if the control links go down, each chassis will know via the IRL beacons that the other chassis is UP, and one of the chassis has to take itself down and go back to rommon.
Which chassis should go back to rommon? The logical choice is that the chassis hosting the Primary DSC RSP stays up and the non-primary rack resets. The reason is that the chassis hosting the primary-DSC has all the "primary" protocol stacks, and hence we want to disturb the protocols as little as possible. So we take the non-primary rack down to rommon; if one or more control links become healthy again, that chassis will boot up and join the nV edge system again.
Since IOS-XR cannot stabilize with the control links severed in this way, the non-primary rack will keep booting up, detecting that the control links are down, and resetting until the connectivity issue is resolved.
The CLI command “show nv edge control control-link-protocols” can be used to assess the current status of the control links in the event of a problem.
In this scenario, we can potentially enter what is called a "Split Brain" – each chassis thinks the other chassis has gone down and declares itself the master. So protocols like OSPF will run two instances, each with the same router-id, and that can be a problem for the network.
So to mitigate this scenario, we provide one more set of "last gasp" paths via the management LAN network. On EACH RSP in the system, connect one of the two management LAN interfaces (either one) to an L2 network so that all four of those interfaces (one per RSP) can send L2 packets to each other. Then enter the below configuration on each of those management LAN interfaces.
interface MgmtEth0/RSP0/CPU0/1
nv
edge
split-brain
!
What this does is: each RSP sends high frequency beacons on these interfaces at 200 millisecond intervals. If both chassis are functional, each chassis gets beacons from the other. If both chassis come to know that they are working independently, they know it is a problematic scenario and one of them will take itself down. The chassis to reset will be the one that has been in the primary state for the least amount of time.
So this "Split Node" management LAN path provides yet another alternate path for additional resiliency, to try to avoid a nasty "Split Node" scenario.
But if the control links AND IRL links AND split-brain management LAN links ALL go away, there is no way to exchange any beacons across the chassis, and we will enter the split-brain scenario where both chassis start functioning independently. If the management networks of the two chassis are not in the same subnet, or not in the same location, an L2 connection should be facilitated to provide this last gasp.
NOTE: The Split Node interface messages are meant to be "best effort"; currently we do not monitor the "health" of those links. Those links are regular Management Ethernet interfaces and will have all the usual UP/DOWN traps etc. But if, for example, there are intermittent monitoring message drops on those links, we do not raise any alarm. We might enhance this in future to include monitoring of packet drops (if any) on these links to alert the user.
The link bundle / BVI configuration on nV Edge requires a manual configuration of mac-address under the interface. An example for a link bundle is shown below.
interface Bundle-Ether15
mac-address 0026.51c5.e602 <== A MAC like this needs to be configured explicitly
For link bundles, the below LACP global configuration is also required:
lacp system mac 0201.debf.0000
This caveat / requirement will be fixed in a later release; until then this configuration is needed for link bundles / BVIs / any virtual interfaces to work on an nV Edge system.
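The same caveat applies to BVIs. A minimal sketch (the BVI number and MAC below are invented for illustration, mirroring the bundle example):

```
interface BVI100
 mac-address 0226.51c5.e603 <== explicitly configured, like the bundle case
!
```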
interface Bundle-Ether15
lacp switchover suppress-flaps 15000
The "bundle manager" is a process that runs on the primary (DSC) and backup (backup-DSC) RSPs and is responsible for the configuration and state maintenance of the link bundle interfaces. When the primary (DSC) chassis in an nV Edge system is reloaded, the bundle-manager on the backup-DSC needs to "go active" and open connections to external processes that provide other services (ICCP as an example). A chassis reload is a much "heavier" operation than a regular RSP switchover because it involves the restart of all RSPs and all line cards on that chassis, and this causes far more control plane churn than a regular RSP switchover, where only one node (one RSP) goes away. For example, the basic infrastructure processes that handle IPC (Inter Process Communication) in the system have to do a lot of cleanup: they must clean up data structures corresponding to all the nodes that went away, flush packets from/to those nodes, etc. The routing protocols / RIB have to process a lot of interface down notifications and start NSF / GR etc. Owing to this additional control plane load, when the bundle-manager asks to connect to external "services", those services take more time to respond because they are already busy processing node down events.
Hence, the bundle-manager process might be "blocked" for a longer period than in a regular switchover scenario. During this "blocked" period, the remote end might time out and declare the bundle down. To prevent this, we have the "lacp switchover suppress-flaps <seconds>" command. This needs to be configured on the nV Edge system AND on the remote boxes (if the remote is not an IOS-XR box, use whatever the equivalent config is on that box). This basically tells the link bundle to tolerate more control packet losses during this period.
In the example here, we have configured a 15 second tolerance – note that this DOES NOT mean there will be a 15 second packet drop. The bundle manager will update the data plane to use a newly active link as soon as it gets the event that decides who is active (a notification from the peer in the case of MC-LAG), and data can start flowing. All this does is prevent the bundle from going down while the rest of the bundle manager control plane is busy doing other work (like connecting to services) and the peer is expecting control packets to be exchanged.
ASR9K nV Edge High Availability is unique in that it is probably the only High Availability model where we "expect" topology changes during a Backup to Primary switchover, such as during a Rack / Chassis reload. If the Primary (DSC) chassis is reloaded, and that chassis had IGP interface(s) on its line card(s), then when the Backup-DSC takes over as Primary-DSC it has to do switchover processing AND at the same time process topology changes due to the loss of interfaces.
For handling switchover cases gracefully, customers normally configure Non Stop Forwarding (NSF) under IGP protocols like ISIS and OSPF. So when the DSC chassis is reloaded, the new DSC (old backup-DSC) will immediately start NSF on the IGP (say ISIS), and as with regular NSF it can take many seconds (default 90 seconds, changeable via the nsf lifetime CLI) for NSF to complete; the RIB will be informed about topology changes only AFTER NSF is complete.
During this time frame, the new DSC chassis will have stale routes pointing to interfaces that no longer exist (those on the chassis that was reloaded), and this can lead to a long period of traffic loss. So what is the solution? What we are asking for is for CEF / FIB to change the forwarding tables even though the routing protocols / RIB have not asked it to do so, and this exactly fits the bill for the LFA-FRR feature. Without LFA-FRR, the convergence time during a chassis reload in an nV Edge system will be poor. LFA-FRR is a simple configuration; a basic example is below. Note that LFA-FRR can work with ECMP paths – one path in the ECMP list can back up the other path in the ECMP list.
router isis Cluster-L3VPN
<snip>
interface Loopback0
address-family ipv4 unicast
!
!
interface TenGigE0/1/0/5
address-family ipv4 unicast
fast-reroute per-link
When you do a rack OIR / reload, PIM in the old standby / new active rack starts fresh (PIM is not hot standby). It triggers NSF for the first 3 minutes. By the time NSF ends, it downloads the routes to MFIB and further to the platform dependent (PD) layer. Until this time, the A flag is not set on the RPF interface and packets are dropped.
The difference in the rack OIR case is that the LC also goes through a restart, which results in a topology change. Since the new change cannot be downloaded to the PD layer, the update does not happen and packets are dropped. Compare this with a regular switchover, where only the RP node undergoes a reload: there the LC remains unaffected, so even though MRIB is within the NSF window, packets continue to be switched using the old route.
To mitigate this, configure link bundles on all interfaces that carry multicast flows, with member links in both racks; this allows a rack OIR without changing the state of the bundle interfaces.
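A minimal sketch of such a bundle follows; the interface names, bundle number, MAC, and address here are invented for illustration, with one member link on each rack:

```
interface TenGigE0/1/0/10 <== member on rack 0
 bundle id 20 mode active
!
interface TenGigE1/1/0/10 <== member on rack 1
 bundle id 20 mode active
!
interface Bundle-Ether20
 mac-address 0226.51c5.e610 <== explicit MAC, per the caveat in Section 7.1
 ipv4 address 10.0.0.1 255.255.255.0
!
```

With one member per rack, the reload of either rack leaves the bundle interface UP on the surviving member, so the multicast RPF interface does not change.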
BFD Multihop is one feature that is supported on a single chassis, but not on the nV Edge system.
The nV Edge system also doesn’t support clock / syncing features like syncE.
nV Edge is only recommended with dual RSPs in each chassis due to the EOBC redundancy design. The EoBC of the ASR9001 is designed without RSP redundancy in mind, so it’s not exactly the same as chassis that support dual RSP.
After applying all the required caveats mentioned in Section 7: at the time of writing (the 4.2.3 24I early image time frame), the convergence number for an L3VPN profile with an access-facing link bundle (one member from each chassis) and core-facing ECMP (two IGP links, one from each chassis), with 3K eBGP sessions and one million routes, is around 8 seconds for a chassis reload (either chassis) in the nV Edge system. The number will certainly differ between profiles; each profile needs separate measurement and qualification / tuning. The obvious question is: how much lower can it get? The natural comparison is with an RSP failover. The factors that are (very) different between an RSP failover and a chassis reload are:
Because of all these reasons, it is almost impossible to achieve anything better than, say, 3 to 4 seconds (versus the current 8 seconds) for the L3VPN profile mentioned at the beginning of this section, and closing that 5 second delta would require quite a large engineering investment.
These CLIs are visible only for cisco-support users. There are many more CLIs than explained below; many of them are purely related to tuning the internal control port error-retry logic inside the driver and are unlikely to be of use to anyone other than the engineers. The ones explained below are quite "generic", related to the UDLD protocol etc., and hence are covered here.
The SNMP agent and MIB specific configuration have no differences for the nV Edge scenario.
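For illustration, a minimal SNMP agent configuration is entered exactly as on a single chassis; the community string and host address below are placeholders:

```
snmp-server community public RO
snmp-server host 192.0.2.10 traps version 2c public
snmp-server traps snmp linkup
snmp-server traps snmp linkdown
```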
With up to four RSPs in an nV Edge system, each chassis having an "Active / Standby" pair of RSPs, and the nV Edge altogether having a "primary-DSC / backup-DSC" pair, multiple redundancy elements come into the picture. There is "node redundancy", which says, within a given chassis, which node is "Active" and which is "Standby". There is node-group redundancy, which says, within the nV Edge system, which node is the "primary-DSC" and which is the "backup-DSC". And there are "process groups", which have their own redundancy characteristics – for example, protocol stacks (say OSPF) have redundancy across the primary-DSC / backup-DSC pair, whereas some other "system" software elements have redundancy across the "Active / Standby" RSPs in each chassis. This relationship is called "process group" redundancy. The table below summarizes the MIBs.
| MIB | Node Redundancy | Process Redundancy | Description |
| --- | --- | --- | --- |
| CISCO-RF-MIB | Currently provides DSC chassis active/standby node pair info. In the nV Edge scenario should provide DSC primary/backup RP info. Provides switchover notification. | | |
| ENTITY-STATE-MIB | Status only; no relationships | | Provides redundancy state info for each node. No relationships indicated. |
| CISCO-ENTITY-STATE-EXT-MIB | | | Extension to ENTITY-STATE-MIB which defines notifications (traps) on redundancy status changes. |
| CISCO-ENTITY-REDUNDANCY-MIB | Both status and relationships | Process group redundancy relationships & node status | Defines redundancy group types. Node redundancy pairs are shown in groups with the node redundancy group type, with the primary/backup nodes for each process group placed on them. |
CISCO-RF-MIB is currently used to monitor the node redundancy of the DSC chassis' active/standby RPs. The MIB definition is limited to representing redundancy relationships, status, and other info of only two nodes.
CISCO-ENTITY-REDUNDANCY-MIB is used to model the redundancy relationships of pairs of nodes. The redundant node pairs are defined as redundancy groups with a group type indicating the group is a redundant node pair. The members of the group would be the nodes within the node-redundant pair.
The CISCO-ENTITY-REDUNDANCY-MIB is also used to model the redundancy relationships of the node pairs pertaining to specific process groups. The redundant process groups are defined as redundancy groups with a group type indicating the group is a redundant process group. The members of the group are the nodes where the primary and backup processes for that process group are placed.
The inventory information for each chassis and the respective physical entities will be available just as in the single chassis. The difference for ASR9K nV Edge (as in CRS multi-chassis) is the presence of a top-level entity in the hierarchy which acts as a container of the chassis entities. This entity will have entPhysicalClass value of ‘stack’.
Stack -- index 1
  entPhysicalClass = 'stack', entPhysicalContainedIn = 0, entPhysicalParentRelPos = -1
  Rack 0 -- index 24555730
    entPhysicalClass = 'chassis', entPhysicalContainedIn = 1, entPhysicalParentRelPos = 0
    Slot 0/0 -- index 28091685
      entPhysicalClass = 'container', entPhysicalContainedIn = 24555730, entPhysicalParentRelPos = 0
    ...
  Rack 1 -- index 141995845
    entPhysicalClass = 'chassis', entPhysicalContainedIn = 1, entPhysicalParentRelPos = 1
    Slot 0/0 -- index 139707424
      entPhysicalClass = 'container', entPhysicalContainedIn = 141995845, entPhysicalParentRelPos = 0
    ...
  Rack N -- index 1481742692
    entPhysicalClass = 'chassis', entPhysicalContainedIn = 1, entPhysicalParentRelPos = N
    Slot 0/0 -- index 1523535239
      entPhysicalClass = 'container', entPhysicalContainedIn = 1481742692, entPhysicalParentRelPos = 0
    ...
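A management application can reconstruct this containment hierarchy purely from the entPhysicalContainedIn values. The sketch below uses the example indices from the table above; the function names are invented for illustration:

```python
# Reconstruct the entity containment tree from entPhysicalContainedIn,
# using the example index values shown in the inventory table above.
# Each entry: index -> (entPhysicalClass, entPhysicalContainedIn)
entities = {
    1:         ("stack",     0),          # Stack: top-level container
    24555730:  ("chassis",   1),          # Rack 0
    28091685:  ("container", 24555730),   # Slot 0/0 of Rack 0
    141995845: ("chassis",   1),          # Rack 1
    139707424: ("container", 141995845),  # Slot 0/0 of Rack 1
}

def children_of(index: int) -> list[int]:
    """Return the indices of all entities directly contained in `index`."""
    return sorted(i for i, (_, parent) in entities.items() if parent == index)

# Both chassis hang directly off the single 'stack' entity,
# which is what distinguishes nV Edge from a single-chassis inventory.
racks = children_of(1)
```

The key point is the single 'stack' root: a walker that expects entPhysicalContainedIn = 0 to point at a 'chassis' (as on a single-chassis system) must be adjusted to handle the extra level.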
IRL interfaces are in ALL respects regular IOS-XR interfaces. All the standard interface MIBs for reporting errors / alarms / faults on a link apply to the IRL links, as do all the standard MIBs for interface statistics.
One missing MIB is for the "uni-directional" forwarding state of the IRL. For example, if excessive packet loss on an IRL puts it into the UDLD "uni-directional" state, that is a fault scenario and the IRL link is removed from all forwarding tables, even though the physical state of the interface remains UP. Getting this event reported via MIB will require an enhancement. One approach would be to simply shut the link down on a uni-directional fault so that the standard IF-MIB can trap the event.
The CRS multi-chassis system has implemented some MIBs for the Control Ethernet aspects of the system; these are currently not implemented for the nV Edge system. But since the nV Edge control Ethernet is very similar to the CRS multi-chassis control Ethernet, the same MIBs could be implemented for the nV Edge system as well. That would be an enhancement work item.
The Control Ethernet MIB frontend is a collection of MIBs as below.
Below we list the most important syslog error messages that indicate a fault with the control Ethernet module or links.
LOG_INFO message: This message pops up if the user inserts a Cisco-unsupported SFP in the front panel SFP+ port. The user has to replace the SFP with a Cisco-supported one, and the port will automatically get detected / used again.
LOG_CRIT message: This message pops up if a particular control Ethernet link has a fault and keeps "flapping" too frequently. If that happens, the port is disabled and will not be used for control link packet forwarding.
ce_switch_srv[53]: %PLATFORM-CE_SWITCH-6-UPDN : Interface 12 (SFP+_00_10GE) is down
These messages pop up whenever the physical state of a control plane link (the front panel links) changes up/down – like a regular interface up/down event notification. "Interface 12" and "Interface 13" (the 12 and 13) are just internal numbers for the two front panel ports. These messages will appear any time a remote RSP goes down or boots up, because at those instants the remote end laser goes down/up. But during normal operation of the nV Edge system, when there are no RSP reboots etc., these messages are not expected and indicate a problem with the link / SFP.
Here we describe the syslog / error messages related to the IRL links that can appear in the logs, so that the user is aware of what those messages mean.
The interface name being referred to can be found with "show im database ifhandle <interface handle>" – that particular interface has encountered a uni-directional forwarding scenario and will be removed from the forwarding tables; no more data will be forwarded across that IRL. UDLD will be restarted on that link after 10 seconds to see if it can become bi-directional again, and this retry repeats every 10 seconds until the link goes bi-directional or the user unconfigures "nv edge interface" on that link.
All the IRL links are present on the same line card (slot). This is not good for resiliency: if that line card goes down, all the IRL links go down with it. So this message periodically pops up asking the user to spread the IRLs across at least two slots.
The total number of IRLs in the system (maximum 16) is recommended to be spread across NO MORE than 5 line cards (slots). This is purely for debuggability: debugging problems across more than 5 IRL LCs becomes complex, hence the recommendation to limit the spread to a maximum of 5 slots.
We recommend having at least two IRL links for resiliency reasons.
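For example, a minimal sketch of two IRLs spread across line cards in different racks; the interface names are hypothetical, and the nested form mirrors how the split-brain configuration is shown earlier in this document:

```
interface TenGigE0/1/1/2
 nv
  edge
   interface
 !
!
interface TenGigE1/1/0/2
 nv
  edge
   interface
 !
!
```

Spreading the IRLs across racks and slots this way means neither a single line card failure nor a single rack reload takes down all inter-rack data links at once.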
The output of the show tech mentioned below can be redirected to a file / tftp server etc. Use it when in doubt as to which module traces to collect.
ISSU is not supported on a cluster; let that be very clear. However, for a software upgrade from any release to any release, or during a SMU installation, it is highly recommended that the following steps are followed to avoid the standard 10 minutes or so of reload time after an upgrade. The method used here upgrades each chassis separately. The assumption is that the network is fully redundant and all links are dual-homed to each of the chassis in the cluster, which translates to continuous connectivity while any one of the chassis in the cluster is down. The method is scripted, and an off-system server/PC must be used to execute the script.
Rack-by-rack reload is a method of upgrading, or installing disruptive software (i.e. reload SMUs) on, the cluster one rack at a time, in order to reduce traffic downtime compared to a full system reload.
At a high level, the upgrade steps are as follows:
Due to the complexity of the CLI steps used, it is recommended to use the scripted method below.
The upgrade script may be obtained by copying it from the router to a tftp host via the "copy" command. The file is located on the router at /pkg/bin/nv_edge_upgrade.exp (it can be run in place with "run /pkg/bin/nv_edge_upgrade.exp").
This script must be customized to your particular install. This is done by modification of the variables at the top of the script. The required changes are:
An example of the script configuration variables is below:
set rack0_addr "172.27.152.19"
set rack0_port "2002"
set rack0_stby_addr "172.27.152.19"
set rack0_stby_port "2004"
set rack1_addr "172.27.152.19"
set rack1_port "2005"
set rack1_stby_addr "172.27.152.19"
set rack1_stby_port "2007"
set router_username "root"
set router_password "root"
set image_list "disk0:asr9k-mini-px-4.2.3 \
disk0:asr9k-services-p-px-4.2.3 \
disk0:asr9k-px-4.2.3.CSCuc40191-0.0.2.i"
set irl_list {{Teng 0/1/1/2} {Teng 1/1/0/2}}
In this example, the console ports of all four RSPs of the cluster are connected to 172.27.152.19, with the ports as specified. The router login is root/root, three software packages are to be activated, and the script expects the two IRL links as specified.
To begin the install activation via the script, exit all consoles completely (exit to the login prompt) and disconnect all serial and telnet connections to the management console of the router. Then execute the script from an external linux workstation as below:
sjc-lds-904:> nv_edge_upgrade.exp
########################
This CLI Script performs a software upgrade on
an ASR9k Nv Edge system, using a rack-by-rack
parallel reload method. This script will modify
the configuration of the router, and will incur
traffic loss.
Do you wish to continue [y/n] y
spawn telnet 172.27.152.19 2002
Trying 172.27.152.19...
Connected to 172.27.152.19.
Escape character is '^]'.
RP/0/RSP0/CPU0:ios#
In the example here, the script is executed by typing "nv_edge_upgrade.exp". Please ensure the script has execute file permissions. When prompted whether you wish to continue the software activation, enter "y" to continue.
At various points during the upgrade process the script will enter into a waiting period and display a message as below:
--- WAITING FOR INSTALL ACTIVATE RACK 0 60 SECONDS (~~ to abort / + to add time) ---
CLI commands may be entered at this time to check the router status during the upgrade process. This is intended to allow sufficient time for the various steps of the upgrade to complete, and for the router to achieve a stable state before continuing. It is important that no configuration changes are made while the prompt is available.
The script will run to completion in approximately 45 minutes.
Once the script runs to completion, please connect to the router, verify that the platform is in working order, and that routing and traffic have resumed. Loss of topology and some loss of traffic is expected during the upgrade process. Expected traffic loss is between 30 seconds and 4 minutes on "normal scale" systems, and can be as long as 10 minutes in high scale scenarios.
Install commit is included in the script execution. To revert to the prior release after script completion, a separate install operation is needed. Reload of the system will not cause an install revert.
The upgrade process can be executed by entering the CLI commands directly on the console instead of using the provided script. This is not recommended, as the upgrade process is sensitive to the ordering and timing of the various steps. If a CLI command is omitted, or commands are entered in the incorrect order, it may have catastrophic effects.
Within the script a variable "debug_mode" is defined. Set it to "1" and then execute the script from the linux prompt. This causes the script to output the CLI commands to the terminal window, which can be used as a basis for a manual upgrade.
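For example, a one-line change near the top of the expect script; only the variable name comes from the text above:

```
# in nv_edge_upgrade.exp
set debug_mode 1
```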
Abort of the software installation is allowed at or anytime prior to the following output message:
--- WAITING FOR INSTALL COMMIT 10 SECONDS (~~ to abort / + to add time) ---
The Abort procedure is as follows:
Rack 1 will automatically sync to the prior software load running on Rack 0.
It’s possible to change an nV Edge system back to two separate single chassis systems. The steps to do this are fairly simple, though console access is required to all RSPs.
RP/0/RSP0/CPU0:A9K-PE1(admin)#config-register 0x0
Sat Mar 23 09:21:38.700 UTC
Successfully set config-register to 0x0 on node 0/RSP0/CPU0
Successfully set config-register to 0x0 on node 0/RSP1/CPU0
Successfully set config-register to 0x0 on node 1/RSP0/CPU0
Successfully set config-register to 0x0 on node 1/RSP1/CPU0
RP/0/RSP0/CPU0:A9K-PE1(admin)#
ROMMON> unset CLUSTER_RACK_ID
ROMMON> sync
At this point both chassis are separated. Care needs to be taken, since both chassis will have the same config, hence the same router-id, which could lead to protocol instability and duplicate system IDs.