xthuijs
Cisco Employee

 

 

Introduction

In this document we'll show you how to configure two ASR9000 routers (of the same chassis type) into a cluster setup.

Cluster provides a significant advantage over 2 separate single physical chassis by simplifying management (the 2 nodes will act as a single entity) while maintaining state of the art redundancy.

In a cluster, a device can dual home into each of the nodes (known as "racks") with, for instance, a bundle ethernet or EtherChannel, and since the 2 racks act as a single logical entity, there is only one routing peering, so no need for ECMP. Also there is no need for MC-LAG or other complexities for L2 environments.

 

 

1       Glossary

 

nV –                  Network Virtualization

nV Edge –         Network Virtualization on Edge routers

IRL –                 Inter Rack Links (for data forwarding)

Control Plane – the hardware and software infrastructure that deals with messaging / message passing across processes on the same or different nodes (RSPs or LCs).

Data Plane –    the hardware and software infrastructure that deals with forwarding, generating and terminating data packets.

DSC –              Designated Shelf Controller (the Primary RSP for the nV edge system)

Backup-DSC – Backup Designated Shelf Controller

UDLD –            Uni Directional Link Detection protocol. An industry standard protocol used in Ethernet networks for monitoring link forwarding health.

FPD –               Field Programmable Device (FPGAs etc., which can be upgraded).

 

2      Converting Single chassis ASR9K to nV Edge

 

This section assumes that the single chassis boxes are running 4.2 or earlier images. If they are already running 4.2.1 or later, the first two steps below can likely be skipped. Take note of the general release recommendation, which at the time of writing is XR 4.2.3.

 

  • Turbo Boot each chassis independently with the 4.2.1 image.
  • Upgrade the FPDs. This step is required because nV Edge requires at least the RSP rommons to correspond to the 4.2.1 version.
  • Find the serial numbers of each chassis. The serial number is found in the “SN:” field in the example below (the FOX.. values)

 

(admin)#show inventory chassis

 

NAME: "chassis ASR-9006-AC", DESCR: "ASR-9006 AC Chassis"

PID: ASR-9006-AC, VID: V01, SN: FOX1435GV1C

 

NAME: "chassis ASR-9006-AC", DESCR: "ASR-9006 AC Chassis"

PID: ASR-9006-AC, VID: V01, SN: FOX1429GJSV

 

Alternatively, from rommon, the command “bpcookie” can be used to get the serial number; look for the “Chassis Serial Number” description in the output of the command.

 

  • One of the chassis will end up being called “Rack0”, the other will be called “Rack1” – there are only two rack numbers possible.
  • Choose any one of the chassis as “Rack0”. ONLY on Rack0, enter the below config in admin config mode      
    • (admin config) # nv edge control serial <rack 0 serial> rack 0
      (admin config) # nv edge control serial <rack 1 serial> rack 1
      (admin config) # commit

 

The above configuration just builds a “database” on Rack0 of all the chassis serial numbers and the rack numbers assigned to those serial numbers. One purpose for this is to decide whether a chassis that tries to become part of this nV Edge system is really “allowed” to be part of it or not.
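
As a rough illustration of that admission idea, the serial-to-rack “database” can be pictured as a simple lookup table. This is only a sketch: the function name `admit_chassis` and the table layout are assumptions made for illustration, not how IOS-XR implements it. The serial numbers are the ones from the inventory example above.

```python
# Hypothetical model of the serial-number "database" kept on Rack0.
# Each entry mirrors one "nv edge control serial <serial> rack <n>" line.
RACK_BY_SERIAL = {
    "FOX1435GV1C": 0,
    "FOX1429GJSV": 1,
}


def admit_chassis(serial, table=RACK_BY_SERIAL):
    """Return the rack number assigned to this serial number, or None
    if the chassis is not listed and so must not join the system."""
    return table.get(serial.strip().upper())


print(admit_chassis("FOX1429GJSV"))   # listed serial: gets its rack number
print(admit_chassis("FOX0000XXXX"))   # unlisted serial: None
```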

 

  • Keep the future Rack1 down by keeping it unpowered or in rommon.
  • Wire up the control plane connections between the chassis (explained in detail in Section 3) and boot up Rack1

 

NOTE: The Control Ethernet cabling should be done only after all the previous steps have been executed and both the chassis are ready to “join” an nV Edge system. If Control Ethernet cables are connected between two functional independent single-chassis ASR9K nodes, that will wreak havoc in the system because the independent chassis’ control planes will get “mixed up” when they are not yet ready to “join” an nV Edge system.

 

  • The Rack1 chassis will send a boot request to Rack0 and get "changed" to Rack1. During this process Rack0 “adds” the Rack1 chassis to the nV Edge system after verifying its serial number, and the Rack1 chassis on booting up will communicate with Rack0 and also become  part of the nV Edge system.

 

  • Now the nV Edge system is booted up perfectly. Any further reboots of either or both of the chassis does not need any further user intervention. The chassis will come up and both of them will “join” the nV Edge system.

 

NOTE: ALL the interfaces on the chassis having the backup-DSC RSP will be in SHUTDOWN state until at least one Inter-Rack Data Link is in forwarding state. This is discussed in more detail later in this write-up.

 

At any time in the nV Edge system, one of the RSPs in the nV edge system (in either Rack0 or Rack1) will be the “master” for the entire nV edge system. Another RSP in the system (again either in Rack0 or Rack1) will be the “backup” for the entire nV edge system. The “master” is called a primary-DSC using CRS Multi chassis terminology. The “backup” is called a backup-DSC. The primary-DSC will run all the primary protocol stacks (OSPF, BGP etc..) and the backup-DSC will run all the backup protocol stacks.

 

At any time, to find out which RSP is primary-DSC and which is backup-DSC, use the below command in admin exec mode.

 

RP/0/RSP0/CPU0:ios(admin)#show dsc

---------------------------------------------------------

           Node  (    Seq#)     Role      Serial# State

---------------------------------------------------------

    0/RSP0/CPU0  (       0)   ACTIVE  FOX1432GU2Z BACKUP-DSC

    0/RSP1/CPU0  ( 1223769)  STANDBY  FOX1432GU2Z NON-DSC

    1/RSP0/CPU0  ( 1279475)   ACTIVE  FOX1441GPND PRIMARY-DSC

    1/RSP1/CPU0  ( 1279584)  STANDBY  FOX1441GPND NON-DSC

 

As can be seen above, the Rack1 RSP0 (1/RSP0/CPU0) is the primary-DSC and Rack0 RSP0 (0/RSP0/CPU0) is the backup-DSC. The Primary and Backup DSCs do not have any “affinity” towards any one chassis or any one RSP. Whichever chassis in the nV edge system boots up first will likely select one of its RSPs as the primary-DSC.
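
If you collect this output with a script, a small parser can pick out the primary-DSC and backup-DSC nodes. The sketch below assumes the column layout shown in the example above; the names `dsc_roles` and `find_node` are made up for illustration.

```python
import re

# Sample rows copied from the "show dsc" example above.
SHOW_DSC = """\
    0/RSP0/CPU0  (       0)   ACTIVE  FOX1432GU2Z BACKUP-DSC
    0/RSP1/CPU0  ( 1223769)  STANDBY  FOX1432GU2Z NON-DSC
    1/RSP0/CPU0  ( 1279475)   ACTIVE  FOX1441GPND PRIMARY-DSC
    1/RSP1/CPU0  ( 1279584)  STANDBY  FOX1441GPND NON-DSC
"""

# node, (seq#), role, serial, state
ROW = re.compile(
    r"^\s*(\d+/RSP\d+/CPU\d+)\s+\(\s*(\d+)\)\s+(\S+)\s+(\S+)\s+(\S+)\s*$"
)


def dsc_roles(text):
    """Map each RSP node to its DSC state (PRIMARY-DSC, BACKUP-DSC, NON-DSC)."""
    roles = {}
    for line in text.splitlines():
        match = ROW.match(line)
        if match:
            node, _seq, _role, _serial, state = match.groups()
            roles[node] = state
    return roles


def find_node(roles, state):
    """Return the first node in the given DSC state, or None."""
    return next((node for node, s in roles.items() if s == state), None)


roles = dsc_roles(SHOW_DSC)
print("primary-DSC:", find_node(roles, "PRIMARY-DSC"))
print("backup-DSC: ", find_node(roles, "BACKUP-DSC"))
```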

 

Another matter to note is that the “Active” / “Standby” states of the RSPs, which we are familiar with from the single chassis mode of operation, are superseded by the primary-DSC / backup-DSC functionality in an nV Edge system. For example, in a single chassis system, the primary/backup protocol stacks run on the Active and Standby RSPs of that chassis. But as described in the preceding paragraphs, that is no longer the case in an nV Edge system: in nV Edge, the primary-DSC and backup-DSC are what run the primary/backup protocol stacks.

 

Supported hardware

 

  1. Only Typhoon and Thor line cards are supported in the chassis; Trident line cards will not work
  2. Only Typhoon 10Gig links allowed as IRLs
  3. Only same chassis types can be connected to form an nV edge system
  4. Only Cisco supported SFPs allowed for all inter-rack connections
  5. The RSP front panel control plane SFPs HAVE TO BE 1GIG SFPs. 10Gig SFPs are NOT supported.
  6. The RSP front panel control plane SFPs MUST be SFP-GE-S (4.2.1+) or SFP-SX-MM (4.3.0+). No other optics are officially supported.

 

Booting with different images on each chassis

 

In an nV Edge system, if for whatever reason the two chassis end up having dissimilar images installed, the chassis that boots up later will tell the already booted chassis about its version details. The already booted chassis will “reject” that version and tell the later chassis to go down to rommon and send a boot request to the already booted chassis, in order to download the image that is present there.

3      nV Edge Control Plane

 

The nV Edge control plane provides software and hardware extensions to create a “unified” control plane for all the RSPs and line cards on both the nV Edge chassis. The control plane packets are forwarded from chassis to chassis entirely “in hardware”, as you will see in the sections below. Control plane multicast etc. is done in hardware for both chassis, so there is no control plane performance impact from having two chassis instead of one.

 

3.1    High  redundancy wiring (Recommended)

 

The nV Edge control plane links HAVE to be direct L1 connections; no network or intermediate routing / switching devices are allowed in between. Some details of the control plane connections are provided below to give a better understanding of the reasoning behind our recommendations. The control Ethernet links (front panel SFP+ ports) are configured in 1Gig mode of operation. The links numbered 1, 2, 3, 4 (red in colour) are the links that need to be wired; the other links are shown for illustration purposes only.

 

[Figure: high redundancy control plane wiring, with links 1 to 4 connecting each RSP to both RSPs of the remote chassis]

 

As seen in the diagram above, each RSP in each chassis has an Ethernet switch to which all the CPUs in the system (Line Card CPUs, RSP CPUs, any other CPUs in the system) connect to. So each CPU connects to two switches – one on each RSP. At any point in time, only one of the switches will be “active” and switching the control plane packets, the other will be “inactive” (regardless of whether system is nV edge or single chassis). And the “active” switch can be on either of the RSPs in the chassis, whichever switch can ensure the best connectivity across all the CPUs in the system.

 

The two SFP+ front panel ports on RSP3 are just direct ports plugging into the switch on the RSP.  So as shown in the diagram, for an nV Edge system, the simple goal is to connect each RSP (switch inside the RSP) to each switch on the remote chassis. So in the above case if any of the links go down, there are three possible backup links. Also at any point in time, only one of the links will be used for forwarding control plane data, all the other three links will be in “standby” state.

 

The control Ethernet is the heart of the system – anything wrong with it can badly degrade the nV edge system. So it is HIGHLY recommended to use all four control Ethernet links.

 

3.2    Low redundancy wiring (NOT SUPPORTED)

 

[Figure: low redundancy control plane wiring, with only links 1 and 2 between the chassis]

 

The above mode of operation is possible in a “steady state”. Even if one link (link 1 or 2) faults, there is one more link that can take over. So assume link 1 is faulty and we have only link 2 left. In this scenario, say the RSP 1/RSP1 reloads because of some software fault. Then we are left with no control Ethernet links at all between the chassis. In that case the chassis hosting the backup-DSC RSP will take itself down and go to rommon, while the chassis hosting the DSC RSP will continue functioning, thus avoiding a Split Node.

3.3    Single RSP in each chassis (NOT recommended)

 

In the case of a single RSP-per-chassis nV Edge topology, the wiring model is as shown below. Again, this is not recommended for resiliency reasons: if the only RSP in a chassis goes down, the entire chassis and all the line cards in the chassis also go down!

 

[Figure: control plane wiring with a single RSP in each chassis]

 

3.4    Control Plane UDLD

 

We run UDLD on the control plane links to ensure bi-directional forwarding health of the links. UDLD runs at a 200 msec interval x 5, i.e. an expiry interval of 1 second. This means that if a control link is uni-directional for 1 second, the RSPs will take action to switch the control plane link to one of the three standby links.

 

Note that the one second detection is only for unidirectional failures – for a physical link fault (like fiber cut), there will be interrupts triggered with the fault and the link switchover to the standby links will happen much faster.
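
The expiry arithmetic above is simply the hello interval multiplied by the number of missed hellos. A minimal sketch follows; the helper name `udld_expiry_ms` is hypothetical and the values are the ones quoted in this section.

```python
def udld_expiry_ms(hello_interval_ms, missed_hellos):
    """Time after which a silent neighbor is declared lost, in milliseconds."""
    return hello_interval_ms * missed_hellos


# Control plane links: 200 msec hello interval, 5 hellos.
CONTROL_LINK_EXPIRY_MS = udld_expiry_ms(200, 5)
print(CONTROL_LINK_EXPIRY_MS)  # 1000, i.e. the 1 second quoted above
```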

3.5    Control Link status CLI

 

The front panel SFP+ ports are referred to as ports “0” and “1” in the show command below. So each RSP has two of these ports, and the command below shows which port on which RSP is connected to which other port on which other RSP.

 

In the example below:

 

Port “0” on 0/RSP0 is connected to port “0” on 1/RSP0.

Port “1” on 0/RSP0 is connected to port “1” on 1/RSP1

Port “0” on 0/RSP1 is connected to port “0” on 1/RSP1

Port “1” on 0/RSP1 is connected to port “1” on 1/RSP0

 

Also, the “port pair” that is “active” and used for forwarding control Ethernet data is the link between port “0” on 0/RSP0 and port “0” on 1/RSP0, as shown by the STP state Forwarding below. All other links are just backup links.

 

The “CLM table version” is also a useful number to note. If this number changes, it means that the control link UDLD is flapping. So in a good “stable” condition, that number should not change.

 

RP/0/RSP0/CPU0:ios# show nv edge control control-link-protocols location 0/rSP0/CPU0

Priority lPort             Remote_lPort      UDLD STP

======== =====             ============      ==== ========

0        0/RSP0/CPU0/0    1/RSP0/CPU0/0    UP   Forwarding

1        0/RSP0/CPU0/1    1/RSP1/CPU0/1    UP   Blocking

2        0/RSP1/CPU0/0    1/RSP1/CPU0/0    UP   On Partner RSP

3        0/RSP1/CPU0/1    1/RSP0/CPU0/1    UP   On Partner RSP

Active Priority is 0

Active switch is   RSP0

CLM Table version is 2
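
For monitoring, the output above can be parsed to find the forwarding link and to record the CLM table version, so that a later change in the version can be flagged as flapping. The sketch below assumes the exact column layout shown above; `parse_control_links` is an illustrative name, not an existing tool.

```python
import re

# Sample copied from the "show nv edge control control-link-protocols" example.
OUTPUT = """\
Priority lPort             Remote_lPort      UDLD STP
======== =====             ============      ==== ========
0        0/RSP0/CPU0/0    1/RSP0/CPU0/0    UP   Forwarding
1        0/RSP0/CPU0/1    1/RSP1/CPU0/1    UP   Blocking
2        0/RSP1/CPU0/0    1/RSP1/CPU0/0    UP   On Partner RSP
3        0/RSP1/CPU0/1    1/RSP0/CPU0/1    UP   On Partner RSP
Active Priority is 0
Active switch is   RSP0
CLM Table version is 2
"""

# priority, local port, remote port, UDLD state, STP state (may contain spaces)
LINK = re.compile(r"^(\d+)\s+(\S+)\s+(\S+)\s+(\S+)\s+(.+?)\s*$")
CLM = re.compile(r"^CLM Table version is (\d+)")


def parse_control_links(text):
    """Return (list of link dicts, CLM table version or None)."""
    links, clm_version = [], None
    for line in text.splitlines():
        match = LINK.match(line)
        if match:
            prio, local, remote, udld, stp = match.groups()
            links.append({"priority": int(prio), "local": local,
                          "remote": remote, "udld": udld, "stp": stp})
            continue
        match = CLM.match(line)
        if match:
            clm_version = int(match.group(1))
    return links, clm_version


links, clm_version = parse_control_links(OUTPUT)
active = [link for link in links if link["stp"] == "Forwarding"]
print("active link:", active[0]["local"], "<->", active[0]["remote"])
print("CLM table version:", clm_version)
```

Comparing `clm_version` between two polls gives a simple flap indicator: in a stable system the value stays the same.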

3.6    Control Link shut/no shut CLIs

 

Each RSP has two front panel Control link ports which we number as 0 and 1. The CLI to shut the links is as below

 

RP/1/RSP0/CPU0:A9K-Cluster-IPE(admin-config)#nv edge control control-link disable <0 or 1 > location <the rsp where we want the port to be shut>

 

The “no nv edge control control-link disable ..” will unshut this link.

 

On shutting a control port, the CLI will also set a rommon variable on that RSP like “CLUSTER_0_DISABLE = 1” if port 0 is disabled and “CLUSTER_1_DISABLE = 1” if port 1 is disabled. As long as this rommon variable is set, neither rommon nor IOS-XR will ever enable that port.

 

The behavior when ALL the control links are shut is, as expected, that both chassis become DSC. But if the IRL links are active, then one of the chassis will reload; each time it boots and the IRL link comes back up, it will reboot again.

 

So if someone configured ALL control links to be shut, how do we recover from that ? Currently this is the recommended procedure at the time of writing this in 4.2.3 24I early image.

 

1.      Shut the IRL links from one of the chassis (whichever chassis doesn’t reboot, remember one chassis comes up and reboots). This will get both chassis to stay UP.

 

2.      Reload one chassis and keep BOTH the RSPs in rommon and unconfigure a rommon variable as below, do this on BOTH the RSPs  

  • a.       rommon> unset CLUSTER_0_DISABLE
  • b.      rommon> unset CLUSTER_1_DISABLE
  • c.       rommon> sync
  • d.      rommon> reset

 

3.      On the other chassis which is still in XR, go to admin config and say “no nv edge control control-link disable <port> <location>” for each port and location where the port was shutdown.

 

4.      On the RSPs in rommon, say the below  

  • a.       rommon> boot mbi:

 

NOTE: The above is indeed a cumbersome and lengthy procedure (but only if we shut all control links), we have an enhancement which will be committed in 4.2.3 where the procedure to unshut would be very simple – on whichever chassis that doesn’t reboot, go to admin config mode and just say “no nv edge control control-link disable <port> <location>” and that will automatically take care of syncing it with the other chassis also.

 

3.7    Miscellaneous control link CLIs

 

  • 1.      show nv edge control control-link-port-counters – this CLI displays the Rx/Tx packet statistics through the control ethernet front panel ports (0 or 1)

 

  • 2.      show nv edge control control-link-sfp – this CLI dumps the SFP EEPROM that’s plugged into the front panel port. In addition it provides the data below

 

SFP Plugged in                           : 0x00000001 (1)

SFP Rx LOS                               : 0x00000000 (0)

SFP Tx Fault                             : 0x00000000 (0)

SFP Tx Enabled                           : 0x00000001 (1)

 

The “SFP Plugged in” value should be 1 if there is an SFP present. The “SFP Rx LOS” should be 0, or else there is Rx Loss of Signal (an error!). The “SFP Tx Fault” should be 0, or else there is an SFP fault (an error!). The “SFP Tx Enabled” should be 1, or else the SFP is not enabled by the control Ethernet driver (also an error!).

 

  • 3.      show nv edge control control-link-debug-counts – this is mostly for the Cisco engineering / support debugging. The values that might be of interest to a customer in there would be as below

 

Admin UP                                    : 0x00000001 (1)

SFP supported cached                  : 0x00000001 (1)

PHY status register                      : 0x00000070 (112)

 

An “Admin UP” value of 0 means that the customer has configured the “nv edge control control-link disable <port> <location>” CLI. Without that config, the value is 1, which is the default. The “SFP supported cached” value indicates whether the user plugged in a Cisco supported SFP: 1 means the SFP is supported, 0 means it is not. If the control link has an SFP plugged in, a cable connected to the remote end, the remote end is up, and the laser and link are good, then the “PHY status register” should have a value of 0x70; it is an internal PHY register which says that the link is all good. If there is no cable, no SFP, a bad cable, a bad link etc., the value will not be 0x70, which can sometimes be useful during debugging.
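
The expected values described for the SFP and debug counters can be turned into a small health check. This is a sketch under the assumption that each counter is printed as “name : 0xHEX (decimal)”, as in the excerpts above; the names `parse_fields` and `problems` are hypothetical.

```python
import re

# Field lines copied from the two excerpts above.
HEALTHY = """\
SFP Plugged in                           : 0x00000001 (1)
SFP Rx LOS                               : 0x00000000 (0)
SFP Tx Fault                             : 0x00000000 (0)
SFP Tx Enabled                           : 0x00000001 (1)
Admin UP                                    : 0x00000001 (1)
SFP supported cached                  : 0x00000001 (1)
PHY status register                      : 0x00000070 (112)
"""

FIELD = re.compile(r"^\s*(.+?)\s*:\s*0x([0-9a-fA-F]+)\s*\((\d+)\)\s*$")

# Expected values for a good control link, as described in the text.
EXPECTED = {
    "SFP Plugged in": 1,
    "SFP Rx LOS": 0,
    "SFP Tx Fault": 0,
    "SFP Tx Enabled": 1,
    "Admin UP": 1,
    "SFP supported cached": 1,
    "PHY status register": 0x70,
}


def parse_fields(text):
    """Parse 'name : 0xHEX (dec)' lines into a {name: int} dict."""
    fields = {}
    for line in text.splitlines():
        match = FIELD.match(line)
        if match:
            fields[match.group(1)] = int(match.group(2), 16)
    return fields


def problems(fields):
    """Return the sorted names of present fields that differ from EXPECTED."""
    return sorted(name for name, want in EXPECTED.items()
                  if name in fields and fields[name] != want)


print(problems(parse_fields(HEALTHY)))  # an empty list means nothing to flag
```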

 

4        nV Inter Rack Link (IRL) connections

 

The IRL connections are required for forwarding traffic that enters one chassis and leaves through an interface on the other chassis of the nV Edge system. The IRL links have to be 10 Gig links and they have to be direct L1 connections; no routed or switched devices are allowed in between. There can be a maximum of 16 such links between the chassis. A minimum of 2 links is recommended for better resiliency, with the two links on two separate line cards so that a fault taking down one line card does not take down all IRLs.

 

The configuration of an interface as an IRL is simple, as shown below:

 

interface tenGigE 0/1/1/1

nv

  edge

   interface

  !

 

Add this config on the IRL interfaces on both chassis. We run UDLD over these links to monitor their bi-directional forwarding health. Only when UDLD reports that the echo and echo response are all fine (standard UDLD state machine) do we place the interface into “Forwarding” state; until then the interface is in “Configured” state. So an IRL interface might be “Configured” but not “Forwarding”. Once it is both, it will be used for forwarding data across the chassis.

 

RP/0/RSP0/CPU0:ios#show nv edge data forwarding location 0/rSP0/CPU0

nV Edge Data interfaces in forwarding state: 1

 

tenGigE 0_1_1_1          <--> tenGigE 1_1_0_1

 

nV Edge Data interfaces in configured state: 2

 

tenGigE 1_1_0_1

tenGigE 0_1_1_1

 

The above CLI says that there are two IRLs in “Configured” state (the two interfaces listed under “configured state: 2”), one on each rack. The CLI also says that there is one “pair” of IRLs in “Forwarding” state (the line with the arrow between two interfaces). The “pair” consists of one interface from each rack. The UDLD protocol automatically detects which interface is connected to which other interface and forms a “pair”.
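
A script that polls this command can extract the counts and the forwarding pairs as follows. The sketch assumes the output format shown above; `parse_irl_state` is an illustrative name.

```python
import re

# Sample copied from the "show nv edge data forwarding" example above.
OUTPUT = """\
nV Edge Data interfaces in forwarding state: 1

tenGigE 0_1_1_1          <--> tenGigE 1_1_0_1

nV Edge Data interfaces in configured state: 2

tenGigE 1_1_0_1
tenGigE 0_1_1_1
"""

COUNT = re.compile(
    r"^nV Edge Data interfaces in (forwarding|configured) state:\s*(\d+)"
)
PAIR = re.compile(r"^(\S+ \S+)\s+<-->\s+(\S+ \S+)\s*$")


def parse_irl_state(text):
    """Return forwarding/configured counts and the list of forwarding pairs."""
    state = {"forwarding": 0, "configured": 0, "pairs": []}
    for raw in text.splitlines():
        line = raw.strip()
        match = COUNT.match(line)
        if match:
            state[match.group(1)] = int(match.group(2))
            continue
        match = PAIR.match(line)
        if match:
            state["pairs"].append((match.group(1), match.group(2)))
    return state


state = parse_irl_state(OUTPUT)
if state["forwarding"] == 0:
    print("IRLs configured but none forwarding: check cables, SFPs, interface state")
else:
    print("forwarding pairs:", state["pairs"])
```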

 

So if you have configured IRLs, but you don’t see  the line “nV Edge Data interfaces in forwarding state:” in your CLI output, then that means that something is wrong. We would recommend going through the standard interface checklist

 

-> Are the cables and SFPs all good ?

-> Are the interfaces unshut and Up/Up ?

-> Are there interface drops or errors ?

-> If you are conversant with the packet path, are there any other packet path drops ?

 

4.1    UDLD on  IRL links

 

The UDLD timers on the IRL links are set to 40 milliseconds times 5 hellos, i.e. around 200 msecs as the expiry timeout. That means that any uni-directional problem with the IRL links will be detected and corrected in around 250 msecs (200 msecs + delta for processing overheads).

 

If you want to see the UDLD state machine on the line card hosting these links, then the below CLI can be used. The first bracketed number after “Interface” (0x60002c0 in the example) is what we call the “ifhandle”. The interface name corresponding to it can be displayed using the CLI “show im database ifhandle <ifhandle> location <line card>”.

 

In the example below, the UDLD state is Bidirectional, which is the desired correct state when things are working fine.

 

RP/0/RSP0/CPU0:ios#show nv edge data protocol all location 0/1/cPU0

 

Interface [0x60002c0][769]

---

Port enable administrative configuration setting: Enabled

Port enable operational state: Enabled

Current bidirectional state: Bidirectional

Current operational state: Advertisement - Single neighbor detected

Message interval: 20 msec

Time out interval: 10000 msec

 

    Entry 1

    ---

    Expiration time: 140 msec

    Device ID: 1

    Current neighbor state: Bidirectional

    Device name: CLUSTER_RACK_01

    Port ID: [0x46000100][769]

    Neighbor echo 1 device: CLUSTER_RACK_00

    Neighbor echo 1 port: [0x60002c0][769]

 

    Message interval: 20 msec

    Time out interval: 100 msec

    CDP Device name: ASR9K CPU

 

4.2    What are the IRL links used for ?

 

The IRL links are used for forwarding packets whose ingress and egress interfaces are on separate racks. They are also used for all protocol Punt packets and protocol Inject packets. As explained in Section 2, the “Primary” protocol stack runs on the primary-DSC RSP in one of the chassis. So if a protocol punt packet comes in on an interface in the other chassis, it has to be punted to the primary-DSC RSP in the remote chassis. This punt is done via the IRL. Similarly, if the protocol stack on the primary-DSC wants to send a packet out of an interface on the other chassis, that is also done via the IRL interfaces.

 

4.3    nV IRL “threshold monitor”

 

If the number of IRL links available for forwarding goes below a certain threshold, the remaining IRLs may get congested and more and more inter-rack traffic will get dropped. So the IRL monitor gives a way of shutting down other ports on the chassis if the number of IRL links goes below a threshold. The commands available are below.

 

RP/0/RSP0/CPU0:ios(admin-config)#nv edge data minimum <minimum threshold> ?

  backup-rack-interfaces    Disable ALL interfaces on backup-DSC rack

  selected-interfaces           Disable only interfaces with nv edge min-disable config

  specific-rack-interfaces   Disable ALL interfaces on a specific rack

 

There are three modes of configuration possible.

 

4.3.1   Backup-rack-interfaces config

 

With this configuration, if the number of IRLs goes below the <minimum threshold> configured, ALL interfaces on whichever chassis is hosting the backup-DSC RSP will be shut down. Again note that the backup-DSC RSP can be on either of the chassis.

 

4.3.2   Specific-rack-interfaces config

 

With this configuration, if the number of IRLs goes below the <minimum threshold> configured, ALL interfaces on the specified rack (0 or 1) will be shut down.

 

4.3.3   selected-interfaces config

 

With this configuration, if the number of IRLs goes below the <minimum threshold> configured, the interfaces on either rack that are explicitly configured to be brought down will be shut down. How do we “explicitly” configure an interface (on any rack) to respond to IRL threshold events?

 

RP/0/RSP0/CPU0:ios(config)#interface gigabitEthernet 0/1/1/0

RP/0/RSP0/CPU0:ios(config-if)#nv edge min-disable

RP/0/RSP0/CPU0:ios(config-if)#commit

 

So in the above example, if the number of IRLs goes below the configured minimum threshold, interface Gig0/1/1/0 will be shut down.

 

4.3.4   What is the default config

 

The default config (if the customer does not configure any of the above explicitly) is the equivalent of having configured “nv edge data minimum 1 backup-rack-interfaces”. This means that if the number of IRLs in forwarding state goes below 1 (that is, no forwarding IRL is left), then ALL the interfaces on whichever rack has the backup-DSC will get shut down.

 

This might make some customers happy, some unhappy. This behaviour can be turned off by just saying “nv edge data minimum 0 backup-rack-interfaces” – basically this says that if the number of IRLs in forwarding state goes below 0 (which will never happen), only then we should bother shutting any interface on any rack.
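
The three modes and the default can be summarised in a small decision model. This is only a sketch of the behaviour described in this section, not the router's implementation; the function name, the interface dictionaries and the example interface names other than Gig0/1/1/0 are assumptions for illustration, and the list passed in is assumed to contain only non-IRL interfaces.

```python
def interfaces_to_shut(forwarding_irls, minimum, mode, interfaces,
                       backup_dsc_rack=None, specific_rack=None):
    """Return the names of interfaces that would be shut down.

    interfaces: list of dicts with keys "name", "rack" and "min_disable"
    (True when "nv edge min-disable" is configured on the interface).
    """
    if forwarding_irls >= minimum:
        return []  # threshold not crossed, nothing is shut
    if mode == "backup-rack-interfaces":
        return [i["name"] for i in interfaces if i["rack"] == backup_dsc_rack]
    if mode == "specific-rack-interfaces":
        return [i["name"] for i in interfaces if i["rack"] == specific_rack]
    if mode == "selected-interfaces":
        return [i["name"] for i in interfaces if i.get("min_disable")]
    raise ValueError(f"unknown mode: {mode}")


PORTS = [
    {"name": "Gig0/1/1/0", "rack": 0, "min_disable": True},
    {"name": "Te0/0/0/0", "rack": 0, "min_disable": False},   # hypothetical
    {"name": "Te1/0/0/0", "rack": 1, "min_disable": False},   # hypothetical
]

# Default behaviour: minimum 1, backup-rack-interfaces, backup-DSC on rack 0.
print(interfaces_to_shut(0, 1, "backup-rack-interfaces", PORTS, backup_dsc_rack=0))
# Minimum 0 can never be undercut, so nothing is ever shut.
print(interfaces_to_shut(0, 0, "backup-rack-interfaces", PORTS, backup_dsc_rack=0))
```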

 

4.4    Default QoS on IRL links

 

When an interface is configured as an IRL link, we install 5 absolute priority queues on the port in both the ingress and egress directions. The priorities are below

 

  1.       All protocol punt / inject packets like protocol Hellos etc..
  2.       Multicast traffic
  3.       Fabric priority  0 traffic
  4.       Fabric priority  1 traffic
  5.       Fabric priority  2 traffic

 

The IRL links do not allow “user configurable” MQC policies on the IRL interface. The classification of “punt / inject” and “multicast” are done “internally” in microcode – that is, other than being a punt/inject or multicast packet, there is no way by which we can “influence/force” a packet to go to the first two queues.

 

What packet gets into the last three queues can be influenced – just by having QoS ingress policies that mark packets appropriately to be acos 0, 1 or 2.  There is no other way by which we can influence what gets into these queues. The queue id selected on the ingress chassis’s IRL links is carried across in the Vlan COS bits, the egress chassis’s IRL that gets this packet will use this queue id encoded in the Vlan COS to select the queues it uses on Ingress (when it receives the packets from the remote chassis).

 

The CLI to display the nV edge qos queues is as below for example using an IRL interface  with configs below. The subslot number 0 in the example is the “subslot” in which the MPA (the pluggable adaptor) is on the MOD-80/160 line card in Viking. If the line card is not of a type that supports pluggable adaptors, just use 0 for subslot. The port number 1 used in the example is simply the last number in the 1/1/0/1 notation.

 

The drops (if any) in these queues are aggregated and reflected in the “show interface” drops also. The standard interface MIBs can be used for monitoring these drops. Note that the individual queue drops are not exported to MIBs, only the aggregate drops are exported as the interface drops. Also the IRL links are just regular interfaces, so the regular interface MIBs will all work on IRLs also.

 

RP/0/RSP0/CPU0:ios#sh running-config interface gigabitEthernet 1/1/0/1

interface GigabitEthernet1/1/0/1

nv

  edge

   interface

  !

 

 

RP/0/RSP0/CPU0:ios#show qoshal cluster subslot 0 port 1 location 1/1/cPU0

 

Cluster Interface Queues : Subslot 0, Port 1

===============================================================

Port 1 NP 0 TM Port 17

    Ingress: QID 0xa8 Entity: 0/0/0/4/21/0 Priority: Priority 1 Qdepth: 0

            StatIDs: commit/fast_commit/drop: 0x5f0348/0x0/0x5f0349

            Statistics(Pkts/Bytes):

              Tx_To_TM  681762/140538069

              Total Xmt 681762/140538069 Dropped 0/0

 

    Ingress: QID 0xa9 Entity: 0/0/0/4/21/1 Priority: Priority 2 Qdepth: 0

            StatIDs: commit/fast_commit/drop: 0x5f034d/0x0/0x5f034e

            Statistics(Pkts/Bytes):

              Tx_To_TM  0/0

              Total Xmt 0/0 Dropped 0/0

 

    Ingress: QID 0xab Entity: 0/0/0/4/21/3 Priority: Priority 3 Qdepth: 0

            StatIDs: commit/fast_commit/drop: 0x5f0357/0x0/0x5f0358

            Statistics(Pkts/Bytes):

              Tx_To_TM  0/0

              Total Xmt 0/0 Dropped 0/0

 

    Ingress: QID 0xaa Entity: 0/0/0/4/21/2 Priority: Priority Normal Qdepth: 0

            StatIDs: commit/fast_commit/drop: 0x5f0352/0x0/0x5f0353

            Statistics(Pkts/Bytes):

              Tx_To_TM  0/0

              Total Xmt 0/0 Dropped 0/0

 

    Ingress: QID 0xac Entity: 0/0/0/4/21/4 Priority: Priority Normal Qdepth: 0

            StatIDs: commit/fast_commit/drop: 0x5f035c/0x0/0x5f035d

            Statistics(Pkts/Bytes):

              Tx_To_TM  0/0

              Total Xmt 0/0 Dropped 0/0

 

    Egress: QID 0xc8 Entity: 0/0/0/4/25/0 Priority: Priority 1 Qdepth: 0

            StatIDs: commit/fast_commit/drop: 0x5f03e8/0x0/0x5f03e9

            Statistics(Pkts/Bytes):

              Tx_To_TM  3372382/697778537

              Total Xmt 3372382/697778537 Dropped 0/0

 

    Egress: QID 0xc9 Entity: 0/0/0/4/25/1 Priority: Priority 2 Qdepth: 0

            StatIDs: commit/fast_commit/drop: 0x5f03ed/0x0/0x5f03ee

            Statistics(Pkts/Bytes):

              Tx_To_TM  0/0

              Total Xmt 0/0 Dropped 0/0

 

    Egress: QID 0xcb Entity: 0/0/0/4/25/3 Priority: Priority 3 Qdepth: 0

            StatIDs: commit/fast_commit/drop: 0x5f03f7/0x0/0x5f03f8

            Statistics(Pkts/Bytes):

              Tx_To_TM  0/0

              Total Xmt 0/0 Dropped 0/0

 

    Egress: QID 0xca Entity: 0/0/0/4/25/2 Priority: Priority Normal Qdepth: 0

            StatIDs: commit/fast_commit/drop: 0x5f03f2/0x0/0x5f03f3

            Statistics(Pkts/Bytes):

              Tx_To_TM  0/0

              Total Xmt 0/0 Dropped 0/0

 

    Egress: QID 0xcc Entity: 0/0/0/4/25/4 Priority: Priority Normal Qdepth: 0

            StatIDs: commit/fast_commit/drop: 0x5f03fc/0x0/0x5f03fd

            Statistics(Pkts/Bytes):

              Tx_To_TM  0/0

              Total Xmt 0/0 Dropped 0/0

 

RP/0/RSP0/CPU0:ios#
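
Since the per-queue drops are not exported to MIBs, one option is to scrape them from this CLI output. The sketch below assumes the line format shown above and uses a shortened two-queue sample; `parse_queues` is an illustrative name.

```python
import re

# Two queues copied from the "show qoshal cluster" example above.
SAMPLE = """\
    Ingress: QID 0xa8 Entity: 0/0/0/4/21/0 Priority: Priority 1 Qdepth: 0
            StatIDs: commit/fast_commit/drop: 0x5f0348/0x0/0x5f0349
            Statistics(Pkts/Bytes):
              Tx_To_TM  681762/140538069
              Total Xmt 681762/140538069 Dropped 0/0

    Egress: QID 0xc8 Entity: 0/0/0/4/25/0 Priority: Priority 1 Qdepth: 0
            StatIDs: commit/fast_commit/drop: 0x5f03e8/0x0/0x5f03e9
            Statistics(Pkts/Bytes):
              Tx_To_TM  3372382/697778537
              Total Xmt 3372382/697778537 Dropped 0/0
"""

QUEUE = re.compile(
    r"^(Ingress|Egress): QID (0x[0-9a-f]+) .*Priority: (.+?) Qdepth: (\d+)"
)
TOTAL = re.compile(r"^Total Xmt (\d+)/(\d+) Dropped (\d+)/(\d+)")


def parse_queues(text):
    """Return one dict per queue with direction, QID, priority and counters."""
    queues, current = [], None
    for raw in text.splitlines():
        line = raw.strip()
        match = QUEUE.match(line)
        if match:
            current = {"direction": match.group(1),
                       "qid": int(match.group(2), 16),
                       "priority": match.group(3),
                       "xmt_pkts": 0, "drop_pkts": 0}
            queues.append(current)
            continue
        match = TOTAL.match(line)
        if match and current is not None:
            current["xmt_pkts"] = int(match.group(1))
            current["drop_pkts"] = int(match.group(3))
    return queues


for queue in parse_queues(SAMPLE):
    print(queue["direction"], hex(queue["qid"]), queue["priority"],
          "drops:", queue["drop_pkts"])
```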

 

 

4.5    Configurable QoS on IRL interfaces

 

To support more flexible QoS options for customers who want more than the default QoS mentioned in Section 4.4, we provide the option of configuring regular MQC policies in the EGRESS direction (no ingress support), with some limitations. The limitation in one simple sentence is that the MQC policy configured on an IRL does not have the ability to access the packet contents; that is, there is no way of figuring out whether the packet that goes out on the IRL is ipv4 or ipv6 etc. So none of the MQC features that need to look into the packet will work. So how exactly is it used?

 

The typical use case is that the customer configures an ingress MQC policy map on a regular (non-IRL) ingress interface. That ingress MQC policy can parse the packet and set a “qos-group” for the packet. The egress IRL policy map can then match on this qos-group and apply features like queueing and shaping. Random detect can also be applied (though not based on dscp, since that needs access to packet contents), and of course marking is not possible either.

 

The user is not prevented from applying any MQC policy on the IRL, regardless of whether that policy has features unsupported on the IRL or not. That is, no config level rejection of policies is done on the IRL interface yet (this might be enforced in later releases), so the user has to take care to configure only supported features or else the behavior is unpredictable. For example, if the user configures an egress MQC policy on the IRL that does marking, then the packet going out of the IRL will have its contents changed in some random location, and that might cause those packets to be dropped!

The configuration of MQC on IRL and the show commands etc.. are exactly the same as MQC on a regular interface (remember IRL is just a regular interface !).

4.6    IRL packet encap and overhead

 

The packet that goes out on the IRL will have a VLAN encap with the vlan hard-coded to vlan-id 1. The vlan-id really doesn’t matter; we just use the VLAN COS bits to carry over the packet priority as mentioned in Section 4.4. That is 18 bytes of overhead. In addition there are around 24 bytes of overhead, which depends very much on the kind of packet (l3 / l2 / mcast etc.) being transported. So on average we have around 42 bytes of overhead.
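
To get a feel for what this overhead means for IRL capacity planning, the arithmetic can be sketched as below. The 18 and 24 byte figures are the ones quoted above; treating them as bytes added on top of the original packet, and ignoring preamble and inter-frame gap, are simplifying assumptions of this sketch. The helper names are hypothetical.

```python
VLAN_ENCAP_BYTES = 18        # VLAN encapsulation added on the IRL
INTERNAL_OVERHEAD_BYTES = 24  # approximate, varies with l3 / l2 / mcast
IRL_OVERHEAD_BYTES = VLAN_ENCAP_BYTES + INTERNAL_OVERHEAD_BYTES  # about 42


def irl_wire_bytes(packet_bytes):
    """Approximate size of a packet once it is carried over an IRL."""
    return packet_bytes + IRL_OVERHEAD_BYTES


def irl_efficiency(packet_bytes):
    """Fraction of IRL bandwidth that carries the original packet."""
    return packet_bytes / irl_wire_bytes(packet_bytes)


for size in (64, 512, 1500):
    print(f"{size:>5} byte packet -> {irl_wire_bytes(size)} bytes on the IRL, "
          f"{irl_efficiency(size):.1%} efficiency")
```

Small packets pay proportionally far more for the encapsulation than large ones, which is worth keeping in mind when sizing the number of IRLs.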

4.7    IRL load balancing

 

IRL load balances packets based on flow. How a “flow” is defined varies from feature to feature. In general, for any given feature, if we ask the question “how does this feature’s packet get load balanced across link bundle members”, the same answer applies to load balancing across IRLs. In other words, IRL load balancing obeys exactly the same principles as link bundle member load balancing: a 32-bit hash value is calculated for each packet/feature, and that 32-bit hash value (with some bit flips etc. to avoid polarization) is used for IRL load balancing as well as for link bundles.

 

Let us examine the different kinds of features in very brief below. This is by no means meant to be an exhaustive documentation of all the load balancing algorithms on the router, rather just to give an overview of the major classes of load balancing.

4.7.1   Ingress IP packet

 

This is the standard tuple used for hash calculation for load balancing across link bundle members: source ip, dest ip, source port, dest port and protocol type. It does not matter whether the egress is IP or MPLS; the ingress is all that matters.
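The idea of flow-based load balancing can be sketched as follows. This is only an illustration of the principle: the real hash is computed in the forwarding hardware and its algorithm is not described in this document, so CRC32 is used here purely as a stand-in, and the function names and interface names are hypothetical.

```python
import zlib
from typing import Sequence


def flow_hash(src_ip: str, dst_ip: str, src_port: int, dst_port: int,
              protocol: int) -> int:
    """Return a 32-bit hash of the 5-tuple (CRC32 is a stand-in, not the real algorithm)."""
    key = f"{src_ip}|{dst_ip}|{src_port}|{dst_port}|{protocol}".encode()
    return zlib.crc32(key) & 0xFFFFFFFF


def pick_irl(irls: Sequence[str], src_ip: str, dst_ip: str, src_port: int,
             dst_port: int, protocol: int) -> str:
    """Map a flow onto one of the available IRLs; the same flow always maps to the same IRL."""
    if not irls:
        raise ValueError("no IRL available")
    return irls[flow_hash(src_ip, dst_ip, src_port, dst_port, protocol) % len(irls)]


if __name__ == "__main__":
    irls = ["TenGigE0/1/0/0", "TenGigE0/2/0/0"]  # hypothetical IRL interfaces
    print(pick_irl(irls, "10.0.0.1", "10.0.0.2", 1234, 80, 6))
```

Because the selection depends only on the packet's flow fields, all packets of one flow stay on one IRL (no reordering), while different flows spread across the available IRLs.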

4.7.2   Ingress MPLS packet

 

If the incoming packet is MPLS, the forwarding engine looks deeper to see if the underlying packet is IP. If it is IP, then the standard IP hash tuple is used for calculating the hash. If the underlying packet is not IP, then just the labels from the label stack are used for calculating the hash. The label allocation mode (per CE or per VRF) has no impact on the hash.

 

4.7.3   L2 Unicast

 

Here load balancing is done based on the src/dst mac addresses. Again, as explained initially, this is not an exhaustive answer, because there are vpls scenarios where the VC label hash is used.

4.7.4   L2 Flood

 

For L2 flood traffic over link bundles, there are multiple elaborate modes of load balancing; for exhaustive documentation it is best to refer to the L2 link bundle documentation. But in general, there are two modes of load balancing, and they are tied to the flooding mode in L2.

4.7.4.1  Flood optimized mode

 

In this mode, to restrict the L2 floods from reaching too many line cards, the hash is “statically” chosen based on bridge group. So some bridge groups will be “tied” to one IRL and others to another IRL, which is the same behaviour chosen for L2 over link bundles.

 

4.7.4.2  Convergence / Resiliency mode

 

In this mode, the L2 flood is hashed in ucode based on the src/dst mac addresses.

 

4.7.5   L3 Multicast

 

L3 Multicast hashes multicast flows based on (S,G) and uses that hash to distribute packets across the IRLs – again the same technique used for distributing multicast packets across link bundle members.

 

5        nV Edge Redundancy model

 

There are four very simple rules that can always help in determining the primary-DSC and backup-DSC RSPs in an nV edge system.

 

1.      Primary-DSC and backup-DSC both are always the “Active” RSP in each chassis. The “Active” here refers to the “Active” we know in the context of a single chassis ASR9K – where one RSP is “Active” and another is “Standby”

 

2.      Primary-DSC and backup-DSC will always be on RSPs in different chassis.

 

3.      If a Primary-DSC goes down, then the backup-DSC becomes primary-DSC. Then the chassis other than the one hosting the primary-DSC will select its “Active” RSP as the next backup-DSC (since the old backup just became primary).

 

4.      If any RSP other than the primary-DSC or backup-DSC goes down, there is no change in the state of the primary-DSC or backup-DSC.

 

With these four rules in place, in any given scenario, we can figure out what happens if any of the RSPs in any of the chassis goes down.
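The four rules can be captured in a small toy model. This is a sketch for reasoning about scenarios, not a description of the actual implementation: the class and attribute names are made up, only two racks (0 and 1) are modelled, and the handling of a backup-DSC failure (its standby peer takes over the backup role) is inferred from rule 1 rather than stated explicitly above.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple

Rsp = Tuple[int, str]  # (rack number, RSP slot), e.g. (0, "RSP0")


@dataclass
class NvEdgeState:
    """Toy model of a two-rack nV Edge system (racks 0 and 1)."""
    active: Dict[int, Optional[Rsp]]    # the "Active" RSP in each rack
    standby: Dict[int, Optional[Rsp]]   # the "Standby" RSP in each rack
    primary_rack: int                   # rack hosting the primary-DSC

    @property
    def primary_dsc(self) -> Optional[Rsp]:
        return self.active[self.primary_rack]        # rule 1

    @property
    def backup_dsc(self) -> Optional[Rsp]:
        return self.active[1 - self.primary_rack]    # rules 1 and 2

    def fail(self, rsp: Rsp) -> None:
        """Apply the failure of one RSP to the model."""
        rack = rsp[0]
        if self.standby[rack] == rsp:
            self.standby[rack] = None                # rule 4: DSC roles unchanged
            return
        if self.active[rack] != rsp:
            return                                   # unknown or already failed RSP
        # Usual single-chassis behaviour: the standby (if any) goes active.
        self.active[rack] = self.standby[rack]
        self.standby[rack] = None
        other = 1 - rack
        if rack == self.primary_rack and self.active[other] is not None:
            self.primary_rack = other                # rule 3: old backup is now primary


if __name__ == "__main__":
    state = NvEdgeState(active={0: (0, "RSP0"), 1: (1, "RSP0")},
                        standby={0: (0, "RSP1"), 1: (1, "RSP1")},
                        primary_rack=0)
    state.fail((0, "RSP0"))                          # primary-DSC goes down
    print(state.primary_dsc, state.backup_dsc)
```

In the usage example, after the primary-DSC 0/RSP0 fails, 1/RSP0 (the old backup-DSC) becomes primary-DSC and 0/RSP1, the newly active RSP in rack 0, becomes the backup-DSC.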

5.1    Redundancy switchover: Control Ethernet readiness

 

Before issuing redundancy switchover, it’s a good practice to check the control links in the system and check that there is at least one backup link available that can take over. For example in the output below, if we decide to issue “redundancy switchover” on 0/RSP0/CPU0, we have three more links (shown as “Blocking” or “On Partner RSP”) and one of them can take over as the link connecting control planes of both chassis (see Section 3.1 for details).

 

Sometimes it might happen that, because of some fault (say a fiber cut or a bad sfp), a few links are down, in which case you won’t see those links at all (neither as “Blocking” nor as “On Partner RSP”). So unless there is at least one backup link, if we issue a switchover, the only link that is “Forwarding” will go away and there won’t be any control plane connectivity left across the chassis.

 

NOTE: We are enhancing the “redundancy switchover” CLI to automatically check this condition and reject the command if there are no backup links. Until that enhancement is done, it is recommended to follow this manual procedure.

 

RP/0/RSP0/CPU0:ios# show nv edge control control-link-protocols location 0/rSP0/CPU0

Priority lPort             Remote_lPort      UDLD STP

======== =====             ============      ==== ========

0        0/RSP0/CPU0/12    1/RSP0/CPU0/12    UP   Forwarding

1        0/RSP0/CPU0/13    1/RSP1/CPU0/13    UP   Blocking

2        0/RSP1/CPU0/12    1/RSP1/CPU0/12    UP   On Partner RSP

3        0/RSP1/CPU0/13    1/RSP0/CPU0/13    UP   On Partner RSP
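The manual check described above can be scripted. The following is a minimal sketch, assuming the command output has already been captured as text (how it is collected is out of scope here) and that its column layout matches the sample above; the function names are made up for this illustration. Counting only links whose UDLD state is UP as usable backups is an extra cautious assumption, not something stated in this document.

```python
import re
from typing import Dict, List

SAMPLE_OUTPUT = """\
Priority lPort             Remote_lPort      UDLD STP
======== =====             ============      ==== ========
0        0/RSP0/CPU0/12    1/RSP0/CPU0/12    UP   Forwarding
1        0/RSP0/CPU0/13    1/RSP1/CPU0/13    UP   Blocking
2        0/RSP1/CPU0/12    1/RSP1/CPU0/12    UP   On Partner RSP
3        0/RSP1/CPU0/13    1/RSP0/CPU0/13    UP   On Partner RSP
"""

# priority, local port, remote port, UDLD state, STP state (may contain spaces)
ROW = re.compile(r"^\s*(\d+)\s+(\S+)\s+(\S+)\s+(\S+)\s+(.+?)\s*$")
BACKUP_STATES = ("Blocking", "On Partner RSP")


def parse_control_links(text: str) -> List[Dict[str, object]]:
    """Parse the table rows; header and separator lines are skipped."""
    links = []
    for line in text.splitlines():
        match = ROW.match(line)
        if match:
            prio, lport, rport, udld, stp = match.groups()
            links.append({"priority": int(prio), "lport": lport,
                          "remote_lport": rport, "udld": udld, "stp": stp})
    return links


def backup_links(links: List[Dict[str, object]]) -> List[Dict[str, object]]:
    """Links that could take over if the Forwarding link goes away."""
    return [l for l in links if l["udld"] == "UP" and l["stp"] in BACKUP_STATES]


def safe_to_switchover(text: str) -> bool:
    """True if at least one backup control link is listed."""
    return len(backup_links(parse_control_links(text))) >= 1


if __name__ == "__main__":
    print(safe_to_switchover(SAMPLE_OUTPUT))
```

For the sample output, three backup links are found (one “Blocking”, two “On Partner RSP”), so the check passes.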

 

5.2    RSP/Chassis failure Detection in ASR9k nV Edge

 

In an ASR-9k nV Edge system, on failure of the Primary DSC node the RSP in the Backup DSC role becomes Primary, with the duties of being the system “master” RSP and hosting the active set of control plane processes.  In the normal case for nV Edge, the Primary and Backup DSC nodes are hosted on separate racks.  This means that the failure detection for the Primary DSC occurs via communication between racks.

 

The following mechanisms are used to detect RSP failures across rack boundaries:

  1.      FPGA state information detected by the Peer RSP in the same chassis is broadcast over the control links.  This is sent if any state change occurs, and periodically every 200ms.
  2.      The UDLD state of the inter-chassis control links to the remote rack, with failures detected at 500ms
  3.      The UDLD state of the inter-chassis data links to the remote rack, with failures detected at 500ms
  4.      A keep-alive message sent between RSP cards via the inter-chassis control links, with a failure detection time of 10 seconds.

 

Additionally messages are sent between racks for the purpose of Split Node avoidance / detection.  These occur at 200ms intervals across the inter-chassis data links, and optionally can be configured redundantly across the RSP Management LAN interfaces.

 

 

Example HA Scenarios:

 

•1.      Single RSP Failure of the Primary DSC node

 

The Standby RSP within the same chassis initially detects the failure via the backplane FPGA.  On failure detection this RSP will transition to the active state and notify the Backup DSC node of the failure via the inter-chassis control link messaging.

 

•2.      Failure of Primary DSC node and its Standby peer RSP.

 

There are multiple situations where this can occur, such as a power-cycle of the Primary DSC rack or a simultaneous soft reset of both RSP cards within the Primary rack.

 

The remote rack failure will initially be detected by UDLD failure on the inter-chassis control link.  The Backup DSC node then checks the state of UDLD on the inter-chassis data link.  If the rack failure is confirmed by failure of the data link as well, then the Backup DSC node becomes active.

 

UDLD failure detection occurs in 500ms, however the time between control link and data link failure can vary since these are independent failures detected by the RSP and LC cards.  A windowing period of up to 2 seconds is needed to correlate the control and data link failures, and to allow for split-brain detection messages to be received.

 

The keep-alive messaging between RSPs acts as a redundant detection mechanism, should the UDLD detection fail to detect a stuck or reset RSP card.

 

•3.      Failure of Inter-Chassis control links (Split Node)

 

Failure is initially detected by the UDLD protocol on the Inter-Chassis control links.  Unlike the rack reload scenario above, the Backup DSC will continue receiving UDLD and keep-alive messages via the inter-chassis data link.  Similar to the rack reload case, a 2 second windowing period is allowed to correlate the control/data link failures.  If after 2 seconds the data link has not failed, or Split Node packets are being received across the Management LAN then the Backup DSC rack will reload to avoid the Split Node condition.
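Scenarios 2 and 3 differ only in what the Backup DSC observes during the 2 second correlation window after the control-link UDLD failure. The decision can be sketched as below. This is a simplified illustration of the behaviour described above, not the actual implementation: the names are made up, the inputs are assumed to be the observations collected by the end of the window, and the 10 second keep-alive mechanism is not modelled.

```python
from enum import Enum

CORRELATION_WINDOW_S = 2.0  # window quoted in the text


class Action(Enum):
    STAY_BACKUP = "stay backup"
    BECOME_PRIMARY = "become primary"
    RELOAD_SELF = "reload own rack"


def backup_dsc_action(control_udld_failed: bool,
                      data_udld_failed: bool,
                      split_node_pkts_on_mgmt: bool) -> Action:
    """Decide what the Backup DSC does once the correlation window has expired."""
    if not control_udld_failed:
        return Action.STAY_BACKUP
    if data_udld_failed and not split_node_pkts_on_mgmt:
        # Control and data links both failed and nothing is heard on the
        # management LAN: the remote rack is considered down (scenario 2).
        return Action.BECOME_PRIMARY
    # The remote rack is still alive, only the control links are lost
    # (scenario 3): reload to avoid the Split Node condition.
    return Action.RELOAD_SELF


if __name__ == "__main__":
    print(backup_dsc_action(True, True, False))   # rack failure
    print(backup_dsc_action(True, False, False))  # control links cut, IRLs up
```

The first call models a rack failure (the Backup DSC becomes primary); the second models a cut of the control links only (the Backup DSC rack reloads).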

 

 

6        Split Node

 

There are primarily two sets of links connecting the chassis in the nV edge system.

 

•1.      Control links (recommended four of them)

•2.      IRL links (minimum one)

So the two sets of links together will be at least FIVE wires. Let us see what can happen when there is a fault and a complete set of control links, IRL links, or both goes away (becomes faulty).

 


 

 

6.1    All IRL links go away

 

In this case, refer to Section 4.3: both chassis will be up and functioning, but the interfaces on one of the chassis “might” get shut down, depending on what config is present on the box (or whether it’s just the default config). Again, refer to Section 4.3 to understand which config is appropriate for you.

 

6.2    All Control links go away

 

The two chassis in the nV edge system cannot function as “one entity” without control links. We have beacons that each chassis periodically exchanges over the IRL links. So if control links go down, then each chassis will know via the IRL beacons that the other chassis is UP, and one of the chassis has to just take itself down and go back to rommon.

 

Which chassis should go back to rommon ? The logical choice is the chassis hosting the Primary DSC RSP stays up, and the Non-Primary rack resets. Reason being that the chassis hosting the primary-DSC has all the “primary” protocol stacks and hence we want to avoid disturbing the protocols as much as possible. So we take the non-primary rack down to rommon and it tries to boot and join the nV edge system again – at some point if one or more control links become healthy again, that chassis will bootup and join the nV edge system again.

 

Since IOS-XR cannot stabilize with the control links severed in this way, the non-primary rack will continue to bootup, detect that the control links are down and reset until the connectivity issue is resolved.

 

The CLI command “show nv edge control control-link-protocols” can be used to assess the current status of the control links in the event of a problem.

 

6.3    All Control AND IRL links go away

 

In this scenario, we can “potentially” enter what is called a “Split Node” – where each chassis thinks that the other chassis has gone down and each of them declares itself as the master. So protocols like OSPF will start having two instances each with the same router-id etc.. and that can be a problem for the network.

 

So to try and mitigate this scenario, we provide one more set of “last gasp” paths via the management LAN network. On EACH RSP in the system, we should connect one of the two management LAN interfaces (any one of them) to an L2 network so that all four of those interfaces (from each RSP) can send L2 packets to each other. Then we can enter the below configuration on each of those management LAN interfaces.

 

interface MgmtEth0/RSP0/CPU0/1

 nv

  edge

   split-brain

  !

 

What this does is that each RSP sends high frequency beacons on these interfaces at 200 millisecond intervals. So if both chassis are functional, each chassis will get beacons from the other. In such a scenario, if both chassis come to know that they are both working independently, they know it’s a problematic scenario and one of them will take itself down.  The chassis to reset will be the one that has been in the primary state for the least amount of time.
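The tie-break rule (the chassis that has been primary for the least amount of time resets) can be written down as a small sketch. The function name and the way timestamps are represented are assumptions made for this illustration; how the real system exchanges and compares these times, and what happens on an exact tie, is not described here.

```python
from typing import Dict


def chassis_to_reset(primary_since: Dict[str, float], now: float) -> str:
    """Return the chassis that has been in the primary state for the shortest time.

    primary_since maps a chassis name to the time (in seconds) at which it
    entered the primary state; now is the current time on the same clock.
    """
    if len(primary_since) != 2:
        raise ValueError("expected exactly two chassis")
    time_as_primary = {rack: now - since for rack, since in primary_since.items()}
    return min(time_as_primary, key=time_as_primary.get)


if __name__ == "__main__":
    # rack0 has been primary for 900 s, rack1 only for 50 s: rack1 resets.
    print(chassis_to_reset({"rack0": 100.0, "rack1": 950.0}, now=1000.0))
```

The effect is that the chassis which only just declared itself primary (typically the old backup) backs off, so the long-standing primary and its protocol sessions are left undisturbed.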

 

So this “Split Node” management lan path provides yet another alternate path to provide additional resiliency to try and avoid a nasty “Split Node” scenario.

 

But if the Control links AND the IRL links AND the split-brain management lan links ALL go away, then there’s no way to exchange any beacons across the chassis, and we will enter the split-brain scenario where both chassis start functioning independently. In a scenario where the mgmt networks of the two chassis are not in the same subnet, or not in the same location, an L2 connection should be provided between them to carry the last gasp messages.

 

NOTE: The Split Node interface messages are meant to be “best effort” messages; currently we do not monitor the “health” of those links. Those links are regular Management Ethernet interfaces and will have all the usual UP/DOWN traps etc. But if, for example, there are intermittent drops of the monitoring messages on those links, we do not raise any alarm or complaint. We might enhance this in future to include some monitoring of the packet drops (if any) on these links to alert the user.

 

7        Feature configuration caveats

 

7.1    Virtual Interfaces (Link bundle / BVI) mac-address

 

The link bundle / BVI configuration on nV Edge requires a manual configuration of the mac-address under the interface. An example for a link bundle is shown below.

 

interface Bundle-Ether15

   mac-address 26.51c5.e602  <== A mac like this needs to be configured explicitly

 

For link bundles, the below lacp global configuration is also required.

 

lacp system mac 0201.debf.0000

 

This caveat / requirement will be fixed in a later release; until then we need to have this configuration for link bundles / BVIs / any virtual interfaces to work on an nV Edge system.

 

7.2    Link Bundle “switchover suppress-flap” : Rack / Chassis reload

 

interface Bundle-Ether15

  lacp switchover suppress-flaps 15000

 

The “bundle manager” is a process that runs on the primary (DSC) and backup (backup-DSC) RSPs and is responsible for the configuration and state maintenance of the link bundle interfaces. When the primary (DSC) chassis in an nV Edge system is reloaded, the bundle-manager on the backup-DSC needs to “go active” and start connections to some external processes that provide other services (ICCP as an example). A chassis reload is a much “heavier” operation than a regular RSP switchover, because a chassis reload involves the going-down of all RSPs and all line cards on that chassis, and this causes quite a lot of control plane churn compared to a regular RSP switchover where there’s only one node that goes away (one RSP). For example, the basic infrastructure processes that handle the IPC (Inter Process Communication) in the system have to do a lot of “cleanups”: they have to clean up data structures corresponding to all the nodes that went away, flush packets from/to those nodes, etc. The routing protocols / rib have to process a lot of interface down notifications and start NSF / GR etc. Owing to this additional control plane load, when the bundle-manager asks to connect to external “services”, those services will take more time to respond because they are already busy processing node down events.

 

Hence, the bundle-manager process might be “blocked” for a longer period of time compared to a regular switchover scenario. During this “blocked” period, the remote end might time out and declare the bundle down. To prevent this, we have the “lacp switchover suppress-flaps <milliseconds>” command (the value 15000 in the example above corresponds to 15 seconds). This needs to be configured on the nV Edge system AND on the remote boxes (if the remote is not an IOS-XR box, use whatever the equivalent config is on that box). This basically tells the link bundle to tolerate more control packet losses during this period.

 

In the example here, we have configured a 15 second tolerance – note that this DOES NOT mean that there will be a 15 second packet drop. Bundle manager will update the data plane to use a newly active link as soon as it gets the event which decides who is active (notification from peer in case of MC-LAG) and data can start flowing. All this does is to prevent bundle from going down if the rest of the bundle manager control plane is busy doing other stuff (like connecting to services) while the peer is expecting some control packets Rx/Tx.

 

NOTE: In the 4.2.3 24I early image, at the time of writing this note, we are trying to optimize these “connection calls to services” and bring down their time requirement so that the suppress-flaps value can also be reduced. There is more than one service involved, and hence we have to optimize several of them to get the suppress-flaps time requirement down. But in the worst case, if there are some services which just refuse to be optimized with minimal work, having the suppress-flaps config at a larger value (like 15 seconds) should not have any other detrimental side effects.

 

7.3    IGP protocols and LFA-FRR

 

ASR9K nV Edge High Availability mode is unique in that it is probably the only High Availability model where we “expect” topology changes during a Backup to Primary Switchover like during a Rack / Chassis reload. If the Primary (DSC) chassis is reloaded, and if that chassis had IGP interface(s) on its line card(s), then when the Backup-DSC takes over as Primary-DSC, it has to do switchover processing AND at the same time process topology changes due to the loss of interfaces.

 

But as we know, to handle switchover cases gracefully, it’s normal for customers to configure Non Stop Forwarding (NSF) under IGP protocols like ISIS. So when the DSC chassis is reloaded, the new DSC (the old backup-DSC) will immediately start NSF on the IGP (say ISIS). As with regular NSF, it can take many seconds (default 90 seconds, which can be changed with the nsf lifetime CLI) for NSF to complete, and the RIB will be informed about topology changes only AFTER NSF is complete.

 

So during this time frame, the new DSC chassis will have stale routes pointing to interfaces that no longer exist (those that were on the chassis that was reloaded). This can lead to a long period of traffic loss. So what is the solution? If we think through this problem, what we are asking for is for the CEF / FIB to change the forwarding tables even though the Routing Protocols / RIB have not asked it to do so. This exactly fits the bill for the LFA-FRR feature. Without LFA-FRR, the convergence time during a chassis reload in an nV Edge system will be bad. LFA-FRR is a simple configuration; a basic example is shown below. Note that LFA FRR can work with ECMP paths: one path in the ECMP list can back up the other path in the ECMP list.

 

router isis Cluster-L3VPN

<snip>

interface Loopback0

  address-family ipv4 unicast

  !

!

interface TenGigE0/1/0/5

  address-family ipv4 unicast

   fast-reroute per-link

 

8        Feature Gaps

 

BFD Multihop is one feature that is supported on a single chassis, but not on the nV Edge system.

 

The nV Edge system also doesn’t support clock / syncing features like syncE.

9        Convergence numbers (subject to change just a datapoint!)

 

After configuring all the required caveats mentioned in Section 7, at the time of writing this in the 4.2.3 24I early image time frame, the convergence number for an L3VPN profile with an Access facing Link bundle (one member from each chassis) and Core facing ECMP (two IGP links, one from each chassis) with 3K eBGP sessions and one million routes is around 8 seconds for a Chassis Reload (of either chassis) in the nV Edge system. The number will certainly be different for different profiles; each profile needs separate measurement and qualification / tuning. The obvious question is how much lower it can get. The natural comparison is with an RSP failover. The factors that are (very) different between an RSP failover and a chassis reload are:

 

  1.      Chassis reload is a “software detected” event. A regular RSP switchover in an ASR9K system is a “hardware detected” event, because both RSPs are in the same chassis and one going down triggers an interrupt for the other, whereas a chassis going away is detected by the loss of keep-alive packets from one chassis to the other. How fast we detect a failure is a fine balance between speed and stability. If we detect keep-alive timeouts too fast, the margin for errors / packet losses in the system is narrow and we might have false triggers. If we detect too slowly, convergence suffers.

  2.      Chassis reload involves a heavy amount of control plane churn: line cards go away, hence interfaces go away, so the control plane protocols, the control plane infrastructure (like IPC, Inter Process Communication) etc. have to do work to update this state and clear up data structures related to the entities that went away. Imagine if the chassis that went away had something like 128K interfaces! That will trigger quite some control plane activity.

  3.      Chassis reload involves updating the data plane on the surviving chassis, whereas RSP failover does not touch the data plane. Depending on scale, this can also be a time consuming activity.

  4.      Chassis reload can involve topology changes and updates triggered by the neighbouring boxes, whereas an RSP switchover is practically unknown to the peers (especially if NSR is enabled for all protocols).

 

Because of all these reasons, it’s almost impossible to achieve anything better than, say, 3 to 4 seconds (currently 8 seconds) for the L3VPN profile mentioned at the beginning of this section. And that improvement of around 5 seconds would probably require quite a high engineering investment.

 

10  “Debugging mode” CLIs – cisco support only

 

These CLIs are visible only to cisco-support users. There are many more CLIs than those explained below; many of them are purely related to tuning the internal control port error-retry logic etc. inside the driver and are unlikely to be of use to anyone other than the engineers. The ones explained below are fairly “generic” (related to the UDLD protocol etc.), which is why they are covered here.

 

  1.      nv edge control control-link udldpriority: this CLI sets the thread priority of the process handling the UDLD packets to a higher / lower value. The maximum is 56 and the minimum is 10. We sometimes try tweaking this to higher values when we find that the CPU is being loaded by some other high priority activity and UDLD flaps as a result. We also sometimes tweak it lower in case we find that the UDLD thread itself is hogging too much CPU.

  2.      nv edge control control-link udldttltomsg: this is a multiplier that affects the UDLD timeout. If for some reason (say high CPU utilization or too many link errors) we want to make UDLD run slower, this can be set to a larger value. The UDLD timeout will be 50msecs times this multiplier.

  3.      nv edge control control-link allowunsupsfp: we allow only Cisco supported 1Gig SFPs in the front panel control ports; this CLI allows any SFP (that the PHY on the board supports) to be plugged in.

  4.      nv edge control control-link noretry: by default, if the front panel control ports have some error, a retry algorithm kicks in with a backoff timer to bring the port up again. If we don’t want a retry, this CLI disables the retry algorithm.

  5.      nv edge data allowunsup: by default only 10Gig interfaces are allowed as IRLs. If some other interface type (like 1Gig) has to be enabled as an IRL for some debugging / testing, this CLI has to be configured first, before the IRL config will be allowed under the unsupported interface.

  6.      nv edge data stopudld: if, for any debugging reason, the UDLD protocol has to be stopped on the IRL, this CLI can be used. Any **configured** IRL interface will then be declared as available for forwarding regardless of the interface state (UP or DOWN). So be careful while using this CLI.

  7.      nv edge data udldpriority: this CLI sets the thread priority of the process handling the UDLD packets (on the line card hosting the IRL) to a higher / lower value. The maximum is 56 and the minimum is 10. We sometimes try tweaking this to higher values when we find that the CPU is being loaded by some other high priority activity and UDLD flaps as a result. We also sometimes tweak it lower in case we find that the UDLD thread itself is hogging too much CPU.

  8.      nv edge data udldttltomsg: this is a multiplier that affects the IRL UDLD timeout. If for some reason (say high CPU utilization on the LC hosting the IRL or too many link errors on the IRL) we want to make UDLD run slower, this can be set to a larger value. The UDLD timeout will be 20msecs times this multiplier.
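The effect of the two udldttltomsg multipliers is simple arithmetic, sketched below. The 50 ms and 20 ms base units are the ones given above; the function and constant names are made up, and the default multiplier values are not stated in this document. If the 500 ms detection time quoted in Section 5.2 is the default, that would correspond to a multiplier of 10 on the control links and 25 on the IRLs, but that is only an inference.

```python
CONTROL_LINK_UDLD_UNIT_MS = 50  # base unit for "nv edge control control-link udldttltomsg"
IRL_UDLD_UNIT_MS = 20           # base unit for "nv edge data udldttltomsg"


def udld_timeout_ms(multiplier: int, unit_ms: int) -> int:
    """Return the UDLD timeout in milliseconds for a given multiplier."""
    if multiplier <= 0:
        raise ValueError("multiplier must be positive")
    return multiplier * unit_ms


if __name__ == "__main__":
    for mult in (10, 20, 40):
        print(f"control link, multiplier {mult}: "
              f"{udld_timeout_ms(mult, CONTROL_LINK_UDLD_UNIT_MS)} ms")
    for mult in (25, 50, 100):
        print(f"IRL, multiplier {mult}: "
              f"{udld_timeout_ms(mult, IRL_UDLD_UNIT_MS)} ms")
```

Doubling a multiplier doubles the time before a link is declared uni-directional, which trades slower failure detection for fewer false flaps on a loaded system.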

 

11  nV Edge MIBs

 

 

The SNMP agent and MIB specific configuration have no differences for the nV Edge scenario.

 

11.1   Redundancy related MIBs

 

With up to four RSPs in an nV Edge system, each chassis having an “Active / Standby” pair of RSPs, and the nV Edge system as a whole having a “primary-DSC / backup-DSC” pair, there are multiple redundancy elements that come into the picture. There is “node redundancy”, which says which node in a given chassis is “Active” and which is “Standby”. There is node-group redundancy, which says which node in the nV Edge system is the “primary-DSC” and which is the “backup-DSC”. And there are “process groups”, which have their own redundancy characteristics: for example, protocol stacks (say ospf) have redundancy across the primary-DSC/backup-DSC pair, whereas some other “system” software elements have redundancy across the “Active / Standby” RSPs in each chassis. This relationship is what is meant by “process groups” and their redundancy. The table below summarises the MIBs.

 

Columns: MIB / Node Redundancy / Process Redundancy / Description

CISCO-RF-MIB
  Node Redundancy:     -
  Process Redundancy:  -
  Description:         Currently provides DSC chassis active/standby node pair info.  In nV Edge scenario should provide DSC primary/backup RP info.  Provides switchover notification.

ENTITY-STATE-MIB
  Node Redundancy:     Status only; no relationships
  Process Redundancy:  -
  Description:         Provides redundancy state info for each node.  No relationships indicated.

CISCO-ENTITY-STATE-EXT-MIB
  Node Redundancy:     -
  Process Redundancy:  -
  Description:         Extension to ENTITY-STATE-MIB which defines notifications (traps) on redundancy status changes.

CISCO-ENTITY-REDUNDANCY-MIB
  Node Redundancy:     Both status and relationships
  Process Redundancy:  Process group redundancy relationships & node status
  Description:         Define redundancy group types:
                         1) Node redundancy group type
                         2) Process group redundancy type
                       Node redundancy pairs would be shown in groups with the node redundancy group type.  Primary/backup nodes for each process group placed on them.

 

11.1.1                     Node Redundancy MIBs

 

CISCO-RF-MIB is currently used to monitor the node redundancy of the DSC chassis’ active/standby RPs.  The MIB definition is limited to representing redundancy relationships, status, and other info of only 2 nodes.

 

CISCO-ENTITY-REDUNDANCY-MIB is used to model the redundancy relationships of pairs of nodes. The redundant node pairs are defined as redundancy groups with a group type indicating the group is a redundant node pair.  The members of the group would be the nodes within the node-redundant pair.

 

11.1.2                     Process Redundancy MIBs

 

The CISCO-ENTITY-REDUNDANCY-MIB is also used to model the redundancy relationships of pairs of nodes pertaining to specific process groups.  The redundant process groups are defined as redundancy groups with a group type indicating that the group is a redundant process group.  The members of the group would be the nodes where the primary and backup processes are placed for that process group.

 

11.1.3                     Inventory Management

The inventory information for each chassis and the respective physical entities will be available just as in the single chassis.  The difference for ASR9K nV Edge (as in CRS multi-chassis) is the presence of a top-level entity in the hierarchy which acts as a container of the chassis entities.  This entity will have entPhysicalClass value of ‘stack’.

 

 

 

11.2   IRL monitoring MIBs

 

IRL interfaces are in ALL respects just regular IOS-XR interfaces. All the standard interface mibs for reporting errors / alarms / faults on the link apply to the IRL links, as do all the standard mibs for interface statistics.

 

One missing MIB is for the “uni-directional” forwarding state of the IRL. For example, if there is excessive packet loss on an IRL which makes it go into the UDLD “uni-directional” state, that is a fault scenario and the IRL is removed from all forwarding tables, even though the physical state of the interface remains UP. An enhancement would be required to get this event reported via a MIB. One approach would be to just shut the link down on a uni-directional fault so that the standard ifmib can trap this event.

11.3   Control Ethernet monitoring MIBs

 

The CRS Multi chassis system has implemented some MIBs for the Control Ethernet aspects of the system; they are currently not implemented for the nV Edge system. But since the nV Edge system control Ethernet is very similar to the CRS Multi Chassis Control Ethernet, we can implement those exact MIBs for the nV Edge system as well. That would be an enhancement work item.

 

The Control Ethernet MIB frontend is a collection of MIBs as below.

 

  1.      IF-MIB implementation upgraded to support Control Ethernet interfaces
  2.      CISCO-CONTEXT-MAPPING MIB implementation.
  3.      Context aware implementation of BRIDGE-MIB
  4.      Implementation of MAU-MIB
  5.      Implementation of CISCO-MAU-EXT-MIB, which will distinguish the MAUs associated with Control Ethernet interfaces from those associated with other data-plane interfaces
  6.      ENTITY-MIB upgraded to support Control Ethernet related entities like Control Ethernet Bridges and associated bridge-ports and all Control Ethernet interfaces


11.4   Control Ethernet Syslog / error messages

 

Below we list the most important syslog error messages that indicate some fault with the control Ethernet module or links.

 

  • 1.      Front panel nV Edge Control Port <port> has unsupported SFP plugged in. Port is disabled, please plug in Cisco support 1Gig SFP for port  to be enabled

 

LOG_INFO message: This message pops up if the user inserts an SFP that is not supported by Cisco into the front panel SFP+ port. The user has to replace the SFP with a Cisco supported one, and the port will then automatically be detected and used again.

 

  • 2.      Front Panel port <port>  error disabled because of UDLD uni directional forwarding. There will be automatic retries to try and bring up the port periodically

 

LOG_CRIT message: This message pops up if a particular control Ethernet link has a fault and keeps “flapping” too frequently. If that happens, the port is disabled and will not be used for control link packet forwarding till the user issues the above mentioned CLI.

 

  • 3.      ce_switch_srv[53]: %PLATFORM-CE_SWITCH-6-UPDN : Interface 12 (SFP+_00_10GE) is up

ce_switch_srv[53]: %PLATFORM-CE_SWITCH-6-UPDN : Interface 12 (SFP+_00_10GE) is down

 

These messages pop up whenever the physical state of a Control Plane link (the front panel links) changes to up/down, much like a regular interface up/down event notification. The “Interface 12” and “Interface 13” (the 12 and 13) are just internal numbers for the two front panel ports. These messages will pop up any time a remote RSP goes down or boots up, because at those instances the remote end laser goes down/up. But during normal operation of the nV Edge system, when there are no RSP reboots etc., these messages are not expected and indicate a problem with the link / sfp etc.

 

11.5   Data Link Syslog / error messages

 

This section describes the syslog / error messages related to the IRLs that can appear in the logs, so that the user knows what they mean.

 

  • 1.      Interface <interface handle> has been uni directional for 10 seconds, this might be a transient condition if a card bootup / oir etc.. is happening and will get corrected automatically without any action. If it’s a real error, then the IRL will not be available for forwarding inter-rack data and will be missing in the output of show nv edge data forwarding CLI.

 

The interface name being referred to can be found with “show im database ifhandle <interface handle>”. That interface has encountered a uni-directional forwarding scenario and is removed from the forwarding tables, so no more data is forwarded across that IRL. UDLD is restarted on the link after 10 seconds to see whether it can become bi-directional again, and this retry repeats every 10 seconds until the link goes bi-directional or the user permanently unconfigures “nv edge interface” on that link.
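When several of these messages show up, the handle-to-name lookup can be prepared in bulk. The sketch below, in Python, pulls the interface handle out of each message and builds the matching “show im database ifhandle” command. It assumes the handle is a single token without spaces; the handle values in the sample are made up for illustration.

```python
import re

# Matches the start of the IRL message quoted above. The handle is captured as
# any non-space token because its exact format is an assumption here.
UNIDIR_RE = re.compile(r"Interface (\S+) has been uni directional")

def lookup_commands(log_lines):
    """Return one lookup command per distinct handle, in first-seen order."""
    handles = []
    for line in log_lines:
        match = UNIDIR_RE.search(line)
        if match and match.group(1) not in handles:
            handles.append(match.group(1))
    return [f"show im database ifhandle {handle}" for handle in handles]

# Hypothetical log lines; the handle values are invented.
sample = [
    "Interface 0x40001c0 has been uni directional for 10 seconds, this might be a transient condition",
    "Interface 0x40001c0 has been uni directional for 10 seconds, this might be a transient condition",
    "Interface 0x4000200 has been uni directional for 10 seconds, this might be a transient condition",
]

for command in lookup_commands(sample):
    print(command)
```

A handle that keeps repeating every 10 seconds is one where the UDLD retry is not succeeding.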

 

 

  • 2.      <count> Inter Rack Links configured all on one slot. Recommended to spread across at least two slots for better resiliency.

 

This message means that all the IRLs are on the same line card (slot), which is bad for resiliency: if that line card goes down, all the IRLs go down with it. The message therefore appears periodically, asking the user to spread the IRLs across at least two slots.

 

  • 3.      Inter Rack Links configured  on <count> slots.Recommended to spread across maximum 5 slots for better manageability and troubleshooting.

 

The IRLs in the system (maximum 16) should be spread across NO MORE than 5 line cards (slots). This is purely for debuggability: debugging problems across more than 5 IRL line cards becomes complex, hence the recommendation to limit the spread to a maximum of 5 slots.

 

  • 4.      Only one Inter Rack Link is configured. For Inter Rack Link resiliency, recommendation is to have at least two links spread across at least two slots.

 

We recommend having at least two IRL links for resiliency reasons.
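The three placement recommendations above (at least two links, at least two slots, no more than 5 slots, maximum 16 links) can be checked against a planned design before it is configured. This is a minimal sketch in Python, not the router's own logic: the input is simply one slot identifier per planned IRL, and the function name and wording of the notes are our own.

```python
def irl_advisories(irl_slots, max_links=16, max_slots=5):
    """Return advisory notes for a planned IRL layout.

    irl_slots holds one slot identifier per IRL (hypothetical input format).
    """
    notes = []
    links = len(irl_slots)
    slots = len(set(irl_slots))
    if links > max_links:
        notes.append(f"{links} IRLs exceed the maximum of {max_links}")
    if links == 1:
        notes.append("only one IRL: use at least two links on at least two slots")
    elif links > 1 and slots == 1:
        notes.append(f"{links} IRLs all on one slot: spread across at least two slots")
    if slots > max_slots:
        notes.append(f"IRLs on {slots} slots: spread across at most {max_slots} slots")
    return notes

# Two IRLs on the same line card trigger the single-slot advisory.
print(irl_advisories(["slot1", "slot1"]))
# Two IRLs on two line cards satisfy all recommendations.
print(irl_advisories(["slot1", "slot2"]))
```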

 

 

12 Debugs and Traces

 

The output of all the CLIs mentioned below can be redirected to a file, a TFTP server, etc. When in doubt as to which module traces to collect, it's better to collect all of the below.

 

12.1   IRL Links

 

If there are issues with the IRLs, please collect the information below. All CLIs are run in regular exec mode.

 

1. show nv edge data trace all error location all

2. show nv edge data trace all event location all

 

12.2   Control Plane Links

 

If there are issues with control plane connectivity, please collect the information below. The CLIs below are run in regular exec mode.

 

 

1. show nv edge control switch links detail location <each of the four RSPs>

2. show nv edge control control-link-protocols location <each of the four RSPs>

3. show nv edge control clm-trace lib error location <each of the four RSPs>

4. show nv edge control clm-trace lib events location <each of the four RSPs>

5. show nv edge control control-link-debug-counts location <each of the four RSPs>
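Because each of these commands has to be repeated for all four RSPs, it is easy to miss one. The Python sketch below expands the list into the 20 individual commands so they can be pasted into a session. The node IDs (0/RSP0/CPU0 and so on) are an assumption about how the four RSPs are named in a two-rack system, and the sketch assumes every command takes the `location` keyword before the node ID; adjust both to what your system shows.

```python
# Assumed node IDs for the two RSPs in each of the two racks.
RSP_LOCATIONS = ["0/RSP0/CPU0", "0/RSP1/CPU0", "1/RSP0/CPU0", "1/RSP1/CPU0"]

# Command stems taken from the list above.
CONTROL_PLANE_COMMANDS = [
    "show nv edge control switch links detail",
    "show nv edge control control-link-protocols",
    "show nv edge control clm-trace lib error",
    "show nv edge control clm-trace lib events",
    "show nv edge control control-link-debug-counts",
]

def expand_commands(commands=CONTROL_PLANE_COMMANDS, locations=RSP_LOCATIONS):
    """Return every command once per RSP location, grouped by command."""
    return [f"{cmd} location {loc}" for cmd in commands for loc in locations]

for line in expand_commands():
    print(line)
```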

 

The CLI below is run in admin exec mode.

 

1. (admin)#show udld trace location <each of the four RSPs>

 

12.3   Redundancy related problems, general RSP bootup issues etc..

 

All the below CLIs are in admin exec mode.

 

1. show tech dsc <each of the four RSPs>

2. show dsc trace <each of the four RSPs>

3. show dsc <each of the four RSPs>

4. show dsc history <each of the four RSPs>

5. show dsc stats <each of the four RSPs>

 

 

 

Related Information

 

Xander Thuijs CCIE #6775

Principal Engineer ASR9000

 

Content courtesy of the ASR9000 nV-edge team

Comments
gerardtorin
Level 1

Hello Xander, I hope that everything is ok for you. Could you give us more information about this setup with the ASR9001 Platform?

Thanks a lot for this guide!!!

BR

Gerard

xthuijs
Cisco Employee

Hi Gerard, the nice thing about the A9K is that, whether we talk 9006, 9010 or 9001, it is the exact same architecture and, well, let me be a bit careful here, MOST things are transparent between all three platforms.

So you can cluster 2 9001's together also.

cheers!

xander

gerardtorin
Level 1

Thanks for your answer Xander, I have a few questions about this platform:

1. Which SFP do I need in the cluster ports? Can I use SFP-10G-SR for these ports?

2. Which ports are used for the IRL? The same cluster ports? Or do I have to use the built-in 10G ports?

Thanks a lot for your help.

Gerard


xthuijs
Cisco Employee

Great question Gerard!

The EOBC extensions are the dedicated ports on the RSP front faceplate with the name "eobc".

They are 1G. We recommend 2 links.

The onboard tenGig's can be used for the IRL (inter rack link for data transport between the two cluster nodes).

Or the onboard ports can be used for connecting "customer traffic".

Also MPA based (ten)gig's can be used for IRL.

cheers

xander

gerardtorin
Level 1

Thanks Xander, is the datasheet wrong? It lists the EOBC ports as 10G and calls them Control Plane Extension Ports. Are they the same ports?

http://www.cisco.com/en/US/prod/collateral/routers/ps9853/ps12074/data_sheet_c78-685687_ps9853_Products_Data_Sheet.html

Again, you're showing me the light!!!

Thanks

Gerard

xthuijs
Cisco Employee

Hi Gerard... apologies for the confusion, that CCO doc is not correct then, we do 1G only...

It's more than enough.

Do me a favor and provide feedback on that doc to have it corrected, and reference my name please.

cheers!!

xander

gerardtorin
Level 1

Sure Xander, I already made the observation on the feedback form.

Have a great weekend!!

Thanks

Gerard

qicui
Community Member

Hi, Xander

Does ASR9K nV support OSPF NSR and NSF? BGP NSR?

xthuijs
Cisco Employee

Hi Qingyan,

NSF is supported natively; fortunately, that is part of the hw architecture.

NSR is supported for OSPF and BGP also, just as in standalone.

cheers

xander

qicui
Community Member

Thank you, Xander

harindhafdo
Level 1

Hi Xander,

In which image will you support ASR9000v Satellite ring termination into an ASR9K nV cluster? In the ring topology, how will the interface numbering be done for the ASR9000v ports?

Rgds

Harin

xthuijs
Cisco Employee

That is an XR 5.1.1 deliverable, Harin.

xander

Cuong Nguyen
Level 1

Hi Xander,

Regarding "7. Feature configuration caveats", I'd like to check:

0. Are those caveats fixed in 4.2.3 and SMU?

If not, then:

1. If only Bundle-Interfaces and EVCs are configured on those bundle-interfaces, is manual MAC address setting under the Bundle-Interface still required?

2. Let's say RACK0 is ready (in production) with all the bundle-interfaces up. Now, RACK1 is added into cluster. Is the LACP system MAC command needed to make the bundle-interfaces in RACK1 up? When that command is issued, what will happen to the bundle-interfaces in RACK0 (which are up and running)?


Besides, we are testing nV now. When adding RACK1 into the cluster, it takes a very long time (in our case, it has been 2.5 hours and is still counting). Is there any way to reduce the time spent?

Do you have any document that explains the steps and behaviors of nV when we need to upgrade IOS without disrupting customer traffic?

Best regards,

Cuong.

xthuijs
Cisco Employee

Hi Cuong,

Not yet, however cluster is a key focus area for us and improvements are being made release by release.

You still need to fix the bundle MAC: virtual interfaces get their MAC from the chassis backplane EEPROM, and obviously the 2 different chassis have different EEPROMs, hence different base MAC addresses. The MAC fix will allow for a more seamless failover.

On the LACP question: in cluster, there is 1 RSP that is the master for the whole cluster (typically rack0). That is the guy that will originate all the messaging as if it were a single system. So a bundle that has members in rack 1 is effectively being mastered out of rack 0, as it has the active/primary RSP.

As for time duration, yes, there are many sync and file transfer improvements already in XR 4.3.1. This is also an ever-evolving topic, so you'll see improvements as we move forward.

If the version on rack1 is the same, the sync duration should be limited.

As for cluster upgrades, today the "solution" is to bring down rack 1, upgrade rack 0 as if it were a single chassis, and after that upgrade bring rack 1 back for a sync. This is not ideal, we recognize that. So we have cluster ISSU in the works, which allows rack 1 to be isolated and upgraded to a new version, upon which you do a failover from rack 0 (old ver) to rack 1 (new ver) with less disruption.

cheers!

xander

Cuong Nguyen
Level 1

Hi Xander,

Thank you for the prompt response.

On the Bundle-Interfaces and LACP, do you know the command to get the EEPROM of the chassis?

I guess that for LACP and the Bundle-Interfaces to keep working without disrupting current traffic, the configured MAC should be the same as the one in use. Since we have bundle-interfaces handling live traffic, changing the system MAC of LACP and the Bundle-Interface would result in a flap of the bundle-interface.

It is great to know that there is a way to upgrade IOS without bringing down the service. That is the purpose of nV, isn't it? Do you know when cluster ISSU will be implemented, or in which IOS-XR version it is supported on the ASR9000?

The other question I have in mind is the possibility of changing/replacing a chassis in cluster. Chassis does break down sometimes. Is there any way to replace a chassis with minimal impact on the service?

Regards,

Cuong.
