ASR9000/XR: Load-balancing architecture and characteristics

xthuijs · ‎08-28-2012

Introduction

This document discusses how the ASR9000 chooses among multiple paths when it can load-balance. It covers IPv4, IPv6, and both ECMP and Bundle/LAG/Etherchannel scenarios in L2 and L3 environments.

Core Issue

The load-balancing architecture of the ASR9000 can seem a bit complex because of the platform's 2-stage forwarding. The scenarios in this article explain how a load-balancing decision is made, so you can architect your network around it.

This document assumes that you are running XR 4.1 at minimum (XR 3.9.X is not discussed). XR 4.2 enhancements are called out where applicable.

Load-balancing Architecture and Characteristics

Characteristics

ASR9000 has the following load-balancing characteristics:

ECMP:

Non recursive or IGP paths : 32-way
Recursive or BGP paths:

8-way for Trident
32-way for Typhoon
64-way for Typhoon in XR 5.1+
64-way for Tomahawk in XR 5.3+ (Tomahawk is only supported from XR 5.3.0 onwards)

Bundle:

64 members per bundle

The way they tie together is shown in this simplified L3 forwarding model:

NRLDI = Non Recursive Load Distribution Index

RLDI = Recursive Load Distribution Index

ADJ = Adjacency (forwarding information)

LAG = Link Aggregation, eg Etherchannel or Bundle-Ether interface

OIF = Outgoing InterFace, eg a physical interface like G0/0/0/0 or Te0/1/0/3

What this picture shows you is that a Recursive BGP route can have 8 different paths, pointing to 32 potential IGP ways to get to that BGP next hop, and EACH of those 32 IGP paths can be a bundle which could consist of 64 members each!

Architecture

The ASR9000 load-balancing implementation revolves around the fact that the load-balancing decision is made on the INGRESS linecard.

This ensures that we ONLY send the traffic to that LC, path or member that is actually going to forward the traffic.

The following picture shows that:

In this diagram, let's assume there are 2 paths: PATH-1 via LC2, and a second path via a Bundle with 2 members on different linecards.

(Note that this is a somewhat artificial example, since a 2-member bundle and a single physical interface would not mathematically produce equal cost paths.)

The Ingress NPU on the LC1 determines based on the hash computation that PATH1 is going to forward the traffic, then traffic is sent to LC2 only.

If the ingress NPU determines that PATH2 (the bundle-ether) is to be chosen, the LAG (link aggregation) selector points directly to the member, and traffic is sent only to the NP on the linecard of the member that is going to forward the traffic.

Based on the forwarding architecture you can see that the adj points to a bundle, which can have multiple members.

With this model, LAG table updates (members appearing/disappearing) do NOT require a FIB update at all!
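The indirection described above can be illustrated with a small Python sketch. This is only a conceptual model: the class names, the prefix, the interface names and the modulo-based member selection are assumptions made for illustration, not the platform's actual data structures or selection logic.

```python
# Hypothetical model of the ADJ -> LAG indirection; all names are illustrative.
from dataclasses import dataclass, field


@dataclass
class LagTable:
    """Member list of one bundle, maintained independently of the FIB."""
    members: list = field(default_factory=list)

    def select(self, hash_value: int) -> str:
        # Pick one member from the current list (modulo is a simplification).
        return self.members[hash_value % len(self.members)]


@dataclass
class Adjacency:
    """Forwarding information; points at a bundle, not at its members."""
    lag: LagTable


@dataclass
class FibEntry:
    prefix: str
    adjacency: Adjacency


lag = LagTable(members=["Te0/1/0/3", "Te0/2/0/3"])
fib = FibEntry(prefix="49.1.1.0/24", adjacency=Adjacency(lag=lag))

before = fib.adjacency.lag.select(hash_value=5)

# A member joins the bundle: only the LAG table changes, the FIB entry is untouched.
lag.members.append("Te0/3/0/3")
after = fib.adjacency.lag.select(hash_value=5)
```

The FIB entry is never rewritten in this sketch; the new member becomes selectable purely through the LAG table.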

What is a HASH and how is it computed

In order to determine which path (ECMP) or member (LAG) to choose, the system computes a hash. Certain bits out of this hash are used to identify the member or path to be taken.

Pre 4.0.x Trident used a folded XOR methodology resulting in an 8 bit hash from which bits were selected
Post 4.0.x Trident uses a checksum based calculation resulting in a 16 bit hash value
Post 4.2.x Trident uses a checksum based calculation resulting in a 32 bit hash value
Typhoon 4.2.0 uses a CRC based calculation of the L3/L4 info and computes a 32 bit hash

8-way recursive means that we are using 3 bits out of that hash result

32-way non recursive means that we are using 5 bits

64 members means that we are looking at 6 bits out of that hash result

It is system defined, by load-balancing type (recursive, non-recursive or bundle member selection) which bits we are looking at for the load-balancing decision.
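The relation between the number of ways and the number of hash bits (8 = 2^3, 32 = 2^5, 64 = 2^6) can be sketched as follows. The function names and the `offset` parameter are hypothetical; which bits the hardware actually looks at is system defined and not modeled here.

```python
def bits_needed(ways: int) -> int:
    """Number of hash bits needed to address `ways` buckets (power of two)."""
    return (ways - 1).bit_length()


def select_bucket(hash_value: int, ways: int, offset: int = 0) -> int:
    """Extract a bucket index from `hash_value`, starting at bit `offset`.

    The real bit position per load-balancing type is system defined;
    offset=0 is only an assumption for illustration.
    """
    mask = (1 << bits_needed(ways)) - 1
    return (hash_value >> offset) & mask


example_hash = 0xDEADBEEF  # arbitrary 32 bit value standing in for a hash result
recursive_path = select_bucket(example_hash, ways=8)       # 3 bits
non_recursive_path = select_bucket(example_hash, ways=32)  # 5 bits
bundle_member = select_bucket(example_hash, ways=64)       # 6 bits
```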

Fields used in ECMP HASH

What is fed into the HASH depends on the scenario:

Incoming Traffic Type	Load-balancing Parameters
IPv4	Source IP, Destination IP, Source port (TCP/UDP only), Destination port (TCP/UDP only), Router ID
IPv6	Source IP, Destination IP, Source port (TCP/UDP only), Destination port (TCP/UDP only), Router ID
MPLS - IP Payload, with < 4 labels	Source IP, Destination IP, Source port (TCP/UDP only), Destination port (TCP/UDP only), Router ID
From 6.2.3 onwards, for Tomahawk + later ASR9K LCs: MPLS - IP Payload, with < 8 labels	Source IP, Destination IP, Source port (TCP/UDP only), Destination port (TCP/UDP only), Router ID. Typhoon LCs retain the original behaviour of supporting IP hashing for only up to 4 labels.
MPLS - IP Payload, with > 9 labels	If 9 or more labels are present, MPLS hashing will be performed on labels 3, 4, and 5 (labels 7, 8, and 9 from 7.1.2 onwards). Typhoon LCs retain the original behaviour of supporting IP hashing for only up to 4 labels.
MPLS - IP Payload, with > 4 labels	4th MPLS Label (or Inner most) and Router ID
MPLS - Non-IP Payload	Inner most MPLS Label and Router ID

* Non IP Payload includes an Ethernet interworking, generally seen on Ethernet Attachment Circuits running VPLS/VPWS.

These have a construction of

EtherHeader-Mpls(next hop label)-Mpls(pseudowire label)-etherheader-InnerIP

In those scenarios the system will use the MPLS based case with non ip payload.

IP Payload in MPLS is a common case for IP based MPLS switching on LSR's whereby after the inner label an IP header is found directly.

Router ID

The router ID is a value taken from an interface address in the system, in an attempt to provide some per-node variation.

This value is determined at boot time only. What the system picks can be seen with:

sh arm router-ids

Example:

RP/0/RSP0/CPU0:A9K-BNG#show arm router-id

Tue Aug 28 11:51:50.291 EDT

Router-ID Interface

8.8.8.8 Loopback0

RP/0/RSP0/CPU0:A9K-BNG#

Bundle in L2 vs L3 scenarios

This section is specific to bundles. A bundle can either be an AC or attachment circuit, or it can be used to route over.

Depending on how the bundle ether is used, different hash field calculations may apply.

When the bundle ether interface has an IP address configured, then we follow the ECMP load-balancing scheme provided above.

When the bundle ether is used as an attachment circuit, meaning it has the "l2transport" keyword associated with it and is used in an xconnect or bridge-domain configuration, L2 based balancing is used by default. That is Source and Destination MAC with Router ID.

If the AC simply has a router on each end, the MACs do not vary at all. In that case you may want to switch to L3 based balancing, which can be configured in the l2vpn configuration:

RP/0/RSP0/CPU0:A9K-BNG#configure

RP/0/RSP0/CPU0:A9K-BNG(config)#l2vpn

RP/0/RSP0/CPU0:A9K-BNG(config-l2vpn)#load-balancing flow ?

src-dst-ip Use source and destination IP addresses for hashing

src-dst-mac Use source and destination MAC addresses for hashing

Use case scenarios

Case 1 Bundle Ether Attachment circuit (downstream)

In this case the bundle ether has a configuration similar to

interface bundle-ether 100.2 l2transport

encap dot1q 2

rewrite ingress tag pop 1 symmetric

And the associated L2VPN configuration such as:

l2vpn

bridge group BG

bridge-domain BD

interface bundle-e100.2

In the downstream direction we load-balance on the L2 information by default, unless "load-balancing flow src-dst-ip" is configured.

Case 2 Pseudowire over Bundle Ether interface (upstream)

The attachment circuit in this case doesn't really matter, whether it is bundle or single interface.

The associated configuration for this in the L2VPN is:

l2vpn

bridge group BG

bridge-domain BD

interface bundle-e100.2

vfi MY_VFI

neighbor 1.1.1.1 pw-id 2

interface bundle-ether 200

ipv4 add 192.168.1.1 255.255.255.0

router static

address-family ipv4 unicast

1.1.1.1/32 192.168.1.2

In this case neighbor 1.1.1.1 is found via routing, which happens to egress out of our bundle Ethernet interface.

This is MPLS encapped (PW) and therefore we will use MPLS based load-balancing.

Case 3 Routing through a Bundle Ether interface

In this scenario we are just routing out the bundle Ethernet interface because our ADJ tells us so (as defined by the routing).

Config:

interface bundle-ether 200

ipv4 add 200.200.1.1 255.255.255.0

show route (OSPF inter area route)

O IA 49.1.1.0/24 [110/2] via 200.200.1.2, 2w4d, Bundle-Ether200

Even if this bundle-ether is MPLS enabled and we assign a label to get to the next hop or do label swapping, in this case the Ether header followed by the MPLS header has IP directly behind it.

We will be able to do L3 load-balancing in that case, as per the chart above.

(Layer 3) Load-balancing in MPLS scenarios

As highlighted throughout this technote, load-balancing in MPLS scenarios, whether based on the MPLS label or on IP, depends on the inner encapsulation.

Depicted in the diagram below, we have an Ethernet frame with IP going into a pseudo wire switched through the LSR (P router) down to the remote PE.

Pseudowires in this case encapsulate the complete frame (with ether header) into MPLS, with an ether header for the next hop from the left PE router to the LSR in the middle.

Although the number of labels is LESS than 4 AND there is IP available, the system can't skip beyond the ether header to read the IP, and therefore falls back to MPLS label based load-balancing.

How the system differentiates between an IP header after the innermost label and a non-IP payload is explained here:

Just to recap, the MPLS header looks like this:

Now the important part of this picture is that it shows MPLS-IP. In the VPLS/VPWS case this "GREEN" field is likely to start with Ethernet headers.

Because hardware forwarding devices are limited in the number of PPS they can handle, and this is a direct equivalent to the number of instructions that are needed to process a packet, we want to make sure we can work with a packet in the LEAST number of instructions possible.

In order to comply with that thought process, we check the first nibble following the MPLS header and if that starts with a 4 (ipv4) or a 6 (ipv6) we ASSUME that this is an IP header and we'll interpret the data following as an IP header deriving the L3 source and destination.

Now this works great in the majority of scenarios, because hey, let's be honest, MAC addresses for the longest time started with 00-0...: in other words not a 4 or 6, so we'd default to MPLS based balancing, which is what we wanted for VPLS/VPWS.

However, these days we see MAC addresses that no longer start with zeros, and in fact 4's or 6's are seen!

This fools the system into believing that the inner packet is IP, while in reality it is an Ether header.

There is no good way to classify an ip header with a limited number of instruction cycles that would not affect performance.

In an ideal world you'd want to use an MD5 hash and all the checks possible to make the perfect decision.

Reality is different, and no one wants to pay what it would cost to design ASICs that can run such a comprehensive set of checks without affecting the PPS rate.

Bottom line is that if your DMAC starts with a 4 or 6 you have a situation.

Solution

Use the MPLS control word.

The control word is negotiated end to end and inserts a special 4 bytes that start with zeros, specifically to accommodate this purpose.

The system will now read a 0 instead of a 4 or 6 and default to MPLS based balancing.
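The first-nibble check and the effect of the control word can be sketched in a few lines of Python. This is a simplified illustration, not the NP microcode: `classify_payload` and the sample byte strings are hypothetical, and the control word is modeled as 4 zero bytes.

```python
def classify_payload(after_label_stack: bytes) -> str:
    """Guess the payload type from the first nibble after the MPLS label stack."""
    first_nibble = after_label_stack[0] >> 4
    if first_nibble == 4:
        return "ipv4"
    if first_nibble == 6:
        return "ipv6"
    return "non-ip"  # fall back to MPLS label based balancing


ipv4_packet = bytes([0x45, 0x00])              # IPv4 header: version 4, IHL 5
benign_frame = bytes.fromhex("001122334455")   # destination MAC starting with 0
unlucky_frame = bytes.fromhex("4c1122334455")  # destination MAC starting with 4
control_word = bytes(4)                        # assumed all-zero control word

print(classify_payload(ipv4_packet))                   # real IP payload
print(classify_payload(benign_frame))                  # Ethernet, classified as intended
print(classify_payload(unlucky_frame))                 # Ethernet mistaken for IPv4
print(classify_payload(control_word + unlucky_frame))  # control word restores the intent
```

With the control word in front of the frame, the first nibble read is 0 regardless of the DMAC, so the sketch returns "non-ip" again.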

Configuration

To enable the control word, use the following template:

l2vpn

pw-class CW

encapsulation mpls

control-word

!

xconnect group TEST

p2p TEST_PW

interface GigabitEthernet0/0/0/0

neighbor 1.1.1.1 pw-id 100

pw-class CW

!

Alternative solutions: Fat Pseudowire

Since you might have little control over the inner label (the PW label), and you probably want to ensure some sort of load-balancing, especially on P routers that have no knowledge of the offered service or of the MPLS packets they transport, another solution is available, known as FAT Pseudowire.

FAT PW inserts a "flow label" whose value is computed like a hash, to provide some hop by hop variation and more granular load-balancing. Special care is taken that there is variation (based on the l2vpn command, see below), that no reserved values are generated, and that the values don't collide with allocated label values.

Fat PW is supported starting XR 4.2.1 on both Trident and Typhoon based linecards. From 6.5.1 onward we support FAT label over PWHE.
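As a rough illustration of a flow label "computed like a hash" that stays out of the reserved range, consider the sketch below. It is an assumption-laden model: `zlib.crc32` is only a stand-in for whatever the NP really computes, the inputs mimic "load-balancing flow src-dst-ip", and avoiding collisions with allocated labels is not modeled. The only facts relied upon are that an MPLS label is 20 bits wide and that values 0 to 15 are reserved.

```python
import zlib

RESERVED_LABELS = 16   # MPLS label values 0-15 are reserved
LABEL_SPACE = 1 << 20  # an MPLS label is 20 bits wide


def flow_label(src_ip: str, dst_ip: str) -> int:
    """Derive a per-flow label from the IP pair, skipping reserved values."""
    key = f"{src_ip}|{dst_ip}".encode()
    return RESERVED_LABELS + zlib.crc32(key) % (LABEL_SPACE - RESERVED_LABELS)


label_a = flow_label("10.0.0.1", "10.0.0.2")
label_a_again = flow_label("10.0.0.1", "10.0.0.2")  # same flow, same label
```

Because the label is a pure function of the flow fields, all packets of a flow carry the same flow label, and P routers hashing on it keep the flow on one path.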

Packet transformation with a Flow Label

Configuration of FAT Pseudowire

The following is a configuration example:

l2vpn

load-balancing flow src-dst-ip

pw-class test

encapsulation mpls

load-balancing

flow-label both static

!

You can also affect the way that the flow label is computed:

Under L2VPN configuration, use the “load-balancing flow” configuration command to determine how the flow label is generated:

l2vpn

load-balancing flow src-dst-mac

This is the default configuration, and will cause the NP to build the flow label from the source and destination MAC addresses in each frame.

l2vpn

load-balancing flow src-dst-ip

This is the recommended configuration, and will cause the NP to build the flow label from the source and destination IP addresses in each frame.

• Note that IPv6 hashing is not supported in the first release.

FAT Pseudowire TLV

The Flow Aware Label (FAT) PW signalled sub-TLV id originally carried value 0x11, as specified in the early versions of draft-ietf-pwe3-fat-pw. This value was later corrected in the draft to 0x17, which is the flow label sub-TLV identifier assigned by IANA.

This matters when interoperating between XR versions 4.3.1 and earlier and XR versions 4.3.2 and later. All XR releases 4.3.1 and prior that support FAT PW default to value 0x11. All XR releases 4.3.2 and later default to value 0x17.

Solution:

Use the following config on XR version 4.3.2 and later to configure the sub-tlv id

pw-class <pw-name>

encapsulation mpls

load-balancing

flow-label both

flow-label code 17

NOTE: We got a lot of questions about the confusion between the 0x11 to 0x17 change (as driven by IANA) and the config requirement for number 17 in this example.

The crux is that the flow label code is configured in DECIMAL, and the IANA/DRAFT numbers mentioned are HEX.

So 0x11, the old value, is 17 decimal, which indeed looks very similar to 0x17, the new IANA assigned number. Very annoying, thank IANA

(or we could have made the knob in hex I guess )
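The hex versus decimal mix-up is easy to verify with plain arithmetic; the variable names below are just labels for this illustration.

```python
old_code = 0x11  # sub-TLV value from the original draft
new_code = 0x17  # sub-TLV value assigned by IANA

print(old_code)  # decimal form of 0x11, the number entered in "flow-label code 17"
print(new_code)  # decimal form of 0x17
```

So "flow-label code 17" selects the legacy 0x11 value, while the IANA value 0x17 corresponds to decimal 23.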

Loadbalancing and priority configurations

In the case of VPWS or VPLS, at the ingress PE side, it’s possible to change the load-balance upstream to MPLS Core in three different ways:

1. At the L2VPN sub-configuration mode with “load-balancing flow” command with the following options:

RP/0/RSP1/CPU0:ASR9000(config-l2vpn)# load-balancing flow ?

src-dst-ip

src-dst-mac [default]

2. At the pw-class sub-configuration mode with “load-balancing” command with the following options:

RP/0/RSP1/CPU0:ASR9000(config-l2vpn-pwc-mpls-load-bal)#?

flow-label [see FAT Pseudowire section]

pw-label [per-VC load balance]

3. At the Bundle interface sub-configuration mode with “bundle load-balancing hash” command with the following options:

RP/0/RSP1/CPU0:ASR9000(config-if)#bundle load-balancing hash ? [For default, see previous sections]

dst-ip

src-ip

It’s important to not only understand these commands but also that: 1 is weaker than 2 which is weaker than 3.

Example:

l2vpn

load-balancing flow src-dst-ip

pw-class FAT

encapsulation mpls

control-word

transport-mode ethernet

load-balancing

pw-label

flow-label both static

interface Bundle-Ether1

(...)

bundle load-balancing hash dst-ip

Because of the priorities, on the egress side of the ingress PE (to the MPLS Core), we will do per-dst-ip load-balance (3).

If the bundle-specific configuration is removed, we will do per-VC load-balance (2).

If the pw-class load-balance configuration is removed, we will do per-src-dst-ip load-balance (1).
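The priority order (1 is weaker than 2, which is weaker than 3) can be summarized in a small sketch. The function and parameter names are hypothetical; the string values simply mirror the CLI keywords used in the example above.

```python
from typing import Optional


def effective_load_balance(
    l2vpn_flow: str = "src-dst-mac",    # 1: l2vpn "load-balancing flow" (default)
    pw_class: Optional[str] = None,     # 2: pw-class "load-balancing", e.g. "pw-label"
    bundle_hash: Optional[str] = None,  # 3: interface "bundle load-balancing hash"
) -> str:
    """Return the setting that wins: 3 overrides 2, which overrides 1."""
    if bundle_hash is not None:
        return bundle_hash
    if pw_class is not None:
        return pw_class
    return l2vpn_flow


full_config = effective_load_balance("src-dst-ip", "pw-label", "dst-ip")
without_bundle_knob = effective_load_balance("src-dst-ip", "pw-label")
without_pw_class = effective_load_balance("src-dst-ip")
```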

with thanks to Bruno Oliveira for this priority section

P2MP MPLS TE Tunnels

Only one bundle member will be selected to forward traffic on the P2MP MPLS TE mid-point node.

Possible alternatives that would achieve better load balancing are: a) increase the number of tunnels or b) switch to mLDP.

IPv6

In pre-4.2.0 releases, the IPv6 hash calculation uses only the last 64 bits of the address, which are folded and fed into the hash together with the regular router ID and L4 info.

In 4.2.0 we made further enhancements so that the full IPv6 address is taken into consideration, along with L4 info and router ID.

Determining load-balancing

You can determine the load-balancing on the router by using the following commands

L3/ECMP

For IP :

RP/0/RSP0/CPU0:A9K-BNG#show cef exact-route 1.1.1.1 2.2.2.2 protocol udp ?

source-port Set source port

You have the ability to only specify L3 info, or include L4 info by protocol with source and destination ports.

It is important to understand that the 9k does FLOW based hashing, that is, all packets belonging to the same flow will take the same path.

If one flow is more active or requires more bandwidth than another flow, path utilization may not be a perfectly equal spread.

Unless you provide enough variation in the L3/L4 fields, this problem can't be alleviated. It is generally seen in lab tests, due to the limited number of flows.
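A minimal sketch of flow based hashing, and of why a single heavy flow skews utilization, is shown below. It assumes `zlib.crc32` and a modulo as stand-ins for the real hash and bucket selection, and `pick_path` is a hypothetical helper.

```python
import zlib
from collections import Counter


def pick_path(src_ip, dst_ip, proto, sport, dport, paths):
    """Map a flow 5-tuple onto one of `paths` (zlib.crc32 is a stand-in hash)."""
    key = f"{src_ip}|{dst_ip}|{proto}|{sport}|{dport}".encode()
    return zlib.crc32(key) % paths


# Every packet of one flow produces the same key, hence the same path.
flow = ("1.1.1.1", "2.2.2.2", "udp", 1000, 2000)
same_path = pick_path(*flow, paths=2) == pick_path(*flow, paths=2)

# Load per path depends on how much each flow sends, not on the flow count.
flows = {("1.1.1.1", "2.2.2.2", "udp", 1000 + i, 2000): 1 for i in range(3)}
flows[("1.1.1.1", "2.2.2.2", "udp", 5000, 2000)] = 100  # one heavy flow
load = Counter()
for five_tuple, volume in flows.items():
    load[pick_path(*five_tuple, paths=2)] += volume
```

Whichever path the heavy flow lands on carries at least 100 of the 103 units, no matter how the three small flows are spread.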

For MPLS based hashing :

RP/0/RSP0/CPU0:A9K-BNG#sh mpls forwarding exact-route label 1234 bottom-label 16000 ... location 0/1/cpu0

This command gives us the output interface chosen as a result of hashing with mpls label 16000. The bottom-label (in this case '16000') is either the VC label (in case of PW L2 traffic) or the bottom label of mpls stack (in case of mpls encapped L3 traffic with more than 4 labels). Please note that for regular mpls packets (with <= 4 labels) encapsulating an L3 packet, only IP based hashing is performed on the underlying IP packet.

Also note that the MPLS hash algorithm is different for Trident and Typhoon. The more varied the labels are, the better the distribution. However, on Trident there is a known behavior of the MPLS hash on bundle interfaces: if a bundle interface has an even number of member links, the MPLS hash causes only half of these links to be utilized. To get around this, you may have to configure the "cef load-balancing algorithm adjust 3" command on the router, or use an odd number of member links within the bundle interface. Note that this limitation applies only to Trident line cards and not to Typhoon.

Bundle member selection

RP/0/RSP0/CPU0:A9K-BNG#bundle-hash bundle-e 100 loc 0/0/cPU0

Calculate Bundle-Hash for L2 or L3 or sub-int based: 2/3/4 [3]: 3

Enter traffic type (1.IPv4-inbound, 2.MPLS-inbound, 3:IPv6-inbound): [1]: 1

Single SA/DA pair or range: S/R [S]:

Enter source IPv4 address [255.255.255.255]:

Enter destination IPv4 address [255.255.255.255]:

Compute destination address set for all members? [y/n]: y

Enter subnet prefix for destination address set: [32]:

Enter bundle IPv4 address [255.255.255.255]:

Enter L4 protocol ID. (Enter 0 to skip L4 data) [0]:

Invalid protocol. L4 data skipped.

Link hashed [hash_val:1] to is GigabitEthernet0/0/0/19 LON 1 ifh 0x4000580

The hash type L2 or L3 depends on whether you are using the bundle Ethernet interface as an Attachment Circuit in a Bridgedomain or VPWS crossconnect, or whether the bundle ether is used to route over (eg has an IP address configured).

Polarization

Polarization pertains mostly to ECMP scenarios and is the effect of routers in a chain making the same load-balancing decision.

The following picture tries to explain that.

In this scenario we assume 2 buckets, so 1 bit out of a 7 bit hash result. Let's say that in this case we only look at bit-0, so it becomes an "EVEN" or "ODD" type decision. The routers in the chain have access to the same L3 and L4 fields; the only varying factor between them is the routerID.

If the RIDs are similar or close (which is not uncommon), the system may not provide enough variation in the hash result. Subsequent routers then compute the same hash and therefore polarize to a "Southern" (in the example above) or "Northern" path.

In XR 4.2.1 via a SMU or in XR 4.2.3 in the baseline code, we provide a knob that allows for shifting the hash result. By choosing a different "shift" value per node, we can make the system look at a different bit (for this example), or bits.

In this example the first line shifts the hash by 1, the second one shifts it by 2.

Considering that we have more buckets in the real implementation and more bits that we look at, the member or path selection can alter significantly based on the same hash but with the shifting, which is what we ultimately want.
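The even/odd example and the effect of the shift can be reproduced with a toy model. Everything here is an assumption made for illustration: the hash values are made up, both routers are assumed to compute identical hashes, and the shift is modeled as a plain right shift.

```python
def pick_bucket(hash_value: int, shift: int = 0, bits: int = 1) -> int:
    """Shift the hash right by `shift`, then use the low `bits` bits as bucket."""
    return (hash_value >> shift) & ((1 << bits) - 1)


# Assumed 7 bit hash results for six flows, identical on every router in the
# chain because the router IDs do not add enough variation.
flow_hashes = [0b0000001, 0b0000010, 0b0000011, 0b0000100, 0b0000101, 0b0000111]

# Router A sends the flows whose bit 0 is 1 ("ODD") towards router B.
to_router_b = [h for h in flow_hashes if pick_bucket(h) == 1]

# Without a shift, router B looks at the same bit: every flow picks bucket 1.
unshifted = [pick_bucket(h) for h in to_router_b]

# With a shift of 1, router B looks at bit 1 and the flows spread again.
shifted = [pick_bucket(h, shift=1) for h in to_router_b]
```

In this toy model the unshifted second hop puts all four flows on one path, while the shifted second hop splits them two and two.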

HASH result Shifting

Trident allows for a shift of maximum of 4 (performance reasons)
Typhoon allows for a shift of maximum of 32.

Command

cef load-balancing algorithm adjust <value>

The command accepts values larger than 4 on Trident. If you configure a value larger than 4 for Trident, a modulo is effectively applied, so that a shift of 5 is the same as a shift of 1.

Fragmentation and Load-balancing

When the system detects fragmented packets, it will no longer use L4 information. The reason is that if L4 info were used, and subsequent fragments don't contain the L4 info anymore (they have an L3 header only!), the initial fragment and subsequent fragments would produce different hash results and could take different paths, resulting in out-of-order delivery.

Regardless of release, regardless of hardware (ASR9K or CRS), when fragmentation is detected we only use L3 information for the hash computation.
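The reasoning can be made concrete with a short sketch of the field selection. `hash_fields` is a hypothetical helper and the router ID is left out for brevity; the point is only that all fragments of a packet end up with the same hash input.

```python
def hash_fields(src_ip, dst_ip, sport=None, dport=None, fragmented=False):
    """Return the tuple of fields fed into the hash for one packet."""
    if fragmented or sport is None or dport is None:
        # Fragments: L3 only, so the first and later fragments hash alike.
        return (src_ip, dst_ip)
    return (src_ip, dst_ip, sport, dport)


# The first fragment still carries the L4 header, later fragments do not.
first_fragment = hash_fields("1.1.1.1", "2.2.2.2", 1000, 2000, fragmented=True)
later_fragment = hash_fields("1.1.1.1", "2.2.2.2", fragmented=True)
unfragmented = hash_fields("1.1.1.1", "2.2.2.2", 1000, 2000)
```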

Hashing updates

- Starting with release 6.4.2, when a layer 2 interface (EFP) receives MPLS encapped IP packets, the hashing algorithm, if configured for src-dst-ip, will pick up the IP from the ingress packet to create the hash. Before 6.4.2 the hash would be based on MAC.

- Starting with XR 6.5, layer 2 interfaces receiving GTP encapsulated packets will automatically pick up the TEID to generate the hash when src-dst-ip is configured.

Related Information

Packet troubleshooting and forwarding architecture

Xander Thuijs, CCIE #6775

Sr Tech Lead ASR9000

Vladimir Pisarenko · ‎08-10-2015

Hi,

We use MGSCP in one of our bundles between two ASR9010 (SCE is located between them and it requires this feature).

After a software upgrade on one router, one of the links stopped forwarding traffic in one direction.

From one side all looks ok:

Bundle-Ether2
Type: Ether (L3)
Members <current/max>: 4/64
Total Weighting: 4
Load balance: Src IP
Locality threshold: 65
Avoid rebalancing? True
Sub-interfaces: 30

Member Information:
Port: LON ULID BW
-------------------- --- ---- --
Te0/2/0/3 3 3 1
Te0/2/0/6 2 2 1
Te0/3/0/3 1 1 1
Te0/3/0/4 0 0 1

From the other side I see this:

Bundle-Ether2
Type: Ether (L3)
Members <current/max>: 4/64
Total Weighting: 4
Load balance: Dst IP
Locality threshold: 65
Avoid rebalancing? True
Sub-interfaces: 29

Member Information:
Port: LON ULID BW
-------------------- --- ---- --
Te0/0/0/0 0 0 1
Te0/1/0/0 2 1 1
Te0/2/0/2 0 3 1
Te0/2/0/3 1 2 1

Interfaces Te0/0/0/0 and Te0/2/0/2 were both assigned LON 0.

I tried replacing one of the links and shutting the entire bundle, but it didn't help.

Maybe this (lacp cisco enable link-order signaled) is not necessary in an MGSCP configuration? Or do I need to do something else?

Thanks!

donalddangelo · ‎10-08-2015

Xander,

A fairly obscure question: the cef load-balancing alg adjust command for the trident card, does it have any negative implications on the system where you issue the command? In my org, we have very strict configurations, as such any new commands must be fully vetted.

Unfortunately we have trident cards in our network, and we have link bundles as well. This is a must have command, and while its working as expected, I'm looking to find out if it degrades other functions or services on the router it is issued.

Great write up by the way, as always. This one saved me a great deal of time on a few occasions.

xthuijs · ‎10-08-2015

Hi Don,

fair question, but although in technical terms yes, this shifting will cost some cycles, the perf impact is very very little if not unmeasurable. the shift is basically in assembler terms an RRC/RLC (rotate left/right through carry bit) which is done in one machine cycle. These NPU's already run at such high speed that the time measured is nanoseconds... (eg a 10MHz processor with 12 tick machine cycles would carry out an assembler instruction in about 1 usec. Thinking these NPU's run at several GHz, we're talking ns here).

oh and also thanks for your comment :), nice to hear!!

cheers!

xander

donalddangelo · ‎10-08-2015

Xander,

Thank you for the quick reply. This is what I expected, and just the answer I was hoping for. Thanks again. Have a great day.

Sergey · ‎10-12-2015

Xander,

You never mention a "bundle load-balancing hash" command.

We have a couple of security appliances that need symmetric flows (e.g. the same subscriber should always pass the same link for both outbound and inbound traffic) placed between ASR9k. The idea to accomplish the goal is to configure src-ip load-balancing on the inside and dst-ip on the outside bundle (this is how it is done in the MGSCP solution).

Could you please clarify how the hardware (typhoon) works in the described setup and what limitations are in place.

Aleksandar Vidakovic · ‎10-13-2015

Typhoon will work fine in that setup. Please note that MGSCP requires bundles to be L3.

regards,

/Aleksandar

donalddangelo · ‎10-28-2015

Hi Xander,

I was hoping you could elaborate on the hash being computed before the flow label imposition. As you may or may not know, in the SP realm, it occurs often enough to mention, that we have PE to PE scenarios, some rural areas simply dictate this type of architecture. These scenarios may or may not have owned assets supporting their interconnects. e.g. Type II. Either way this limitation poses some problems.

Take the following example

PE==ECMP==P==ECMP==PE (All ASR9k)

Let’s assume for a moment that the CE on either side has limited information in which to hash upon, which is often times the case. In a lab I can create 1k's of device pairs, however real world scenarios don’t often cooperate. In fact I've got this setup in the lab with 8 device pairs, with limited variance to the IP address pairs. My load balancing out of the PE to P chooses one path, always.

Can you help me, and probably others, understand why the flow label is not imposed before the hashing decision is made on the PE? Will this be addressed by Cisco? Do you happen to know if this behavior is one only related to Cisco or simply a protocol or technology limitation not vendor specific?

Let’s take the above diagram a step further for a moment, just to help quantify the gravity of the problem I happen to be encountering. The ECMP between the nodes are all Type II, so its cost prohibitive to deal with capacity planning when I should be able to load balance on the ECMP I have.

I could place an ASR9001 as PE on either side of the current PE's

ASR9k1PE===P==TypeII==P==ASR9k1PE

Just so I can impose the flow label on the 9k1 where I can control interconnects with optics and jumpers. I can then have LB decisions being made on the flow label over the Type II paths. Does this make sense?

Look forward to your input.

xthuijs · ‎10-29-2015

hi donald,

part of the input processing is that when the packet comes in, one of the first things that happens is the computation of the hash based on the actual input packet (in Parse)

When the flow label is inserted it happens at a later stage (Resolve/Modify)

So the hash is already there when the flow label gets inserted.

In order to leverage the flow label, in that case you need to effectively recirculate the packet to recompute a hash based on that inserted flow label. Recirculation drops your pps performance.

Now think of this too; the flow label is computed based on the original arriving IP packet, so effectively the flow label is a hash on the ip, if there is no variance in IP the flow label won't be much different either.

The Flow label is only useful for P routers since they have zero knowledge on the inner stuff and would start to balance/hash on the PW label.

the PE router HAS knowledge of the IP and that is what is used.

So it is a sound design decision to do it this way.

It is not a cisco specific thing, other vendors do something pretty similar and is effectively described in RFC too...

does this help explaining?

xander

mauricio.andrade81 · ‎12-11-2015

Hi Xander,

Can you please explain the route installation order in XR for ECMP? I see that the order impacts the BGP next-hop decision in a multipath scenario. For example, OSPF has two equal paths for one route and I redistribute OSPF into BGP for a vpnv4 vrf, but the BGP next-hop is based on the order in which XR installs the OSPF route in the main table:

EX:

show route vrf test 70.6.8.9 (ECMP)

first OSPF NH: 192.168.2.1

second OSPF NH: 192.168.3.1

BGP NH: 192.168.2.1

show route vrf test 70.6.8.9 (ECMP)

first OSPF NH: 192.168.3.1

second OSPF NH: 192.168.2.1

BGP NH: 192.168.3.1

It's a random order, but it affects the next-hop for routes redistributed into BGP.

This is an ASR 9k with XR 4.2.3. Maybe the order depends on the uptime of the route?

thanks,

xthuijs · ‎12-11-2015

hi mauricio,

ah yeah that order that you are looking at is just "time driven" first one learnt/installed and second, third etc.

this doesn't mean that the route is more preferred or anything per-se.

if you look at the show cef <prefix>/<mask> detail location 0/<lc>/cpu0 you can see how it is programmed, you will see the LB index and bucket distribution there.

it is important also that you advertise routes with a loopback, this so that the bgp next hop is a loop that can be found via multipath properly.

and of course if there is BGP mpath, it needs to be configured with max-paths because normally bgp will only select one.

if you check the article on asr9000 route scale architecture or the cisco live ones id 2904 then you get a good overview especially in the orlando 2013 and sanfran 2014 one's for the additional detail on how the fib is built up and how to verify.

the san diego 2015 has even more details on CEF verification and understandings.

cheers

xander

pabloboscariol · ‎12-18-2015

Hello Vladimir,

Which kind of line cards are you trying to use with MGSCP, and which IOS XR version (+Service pack) are you running?

I'm running into an issue where the traffic that comes to my device labeled (MPLS VPN Label), it's not being balanced well in the output way to the SCE (VRF to global routing table).

best regards!

Pablo

smailmilak · ‎02-23-2016

Hello Xander,

recently we did an upgrade from 5.1.3 to 5.3.3 on 9010 with Trident LC.

After the upgrade we have unequal load balancing on the BE link with L3 sub-interfaces. Outbound traffic is 70/30. We have tried to shut down the interface with the higher load and enable it again, but it's still the same.

Do you know if there any changes on 5.3.3 and what we can do to get 50/50 load on the two interfaces?

On other ASR's it's working fine, but they are not running 5.3.3 so we can not say that it is because of the newer version.

I checked the release notes and there is nothing relevant to our case.

Any idea? Can I safely try this command?
RP/0/RSP0/CPU0:A9K-BNG#bundle-hash bundle-e 100 loc 0/0/cPU0

xthuijs · ‎02-23-2016

hi smail! the lb implementation should be the same between 51 and 53 from a hashing perspective.

the bundlehash command will help finding out what a particular source/dest pair would select what member so it wouldnt change the behavior or anything.

few things to check and try are the variation of s/d pairs, and if there is a bundle hash selection configured on the bundle to exclude certain info (such as L4 info). another thing is if you are possibly dealing with fragmented packets, which will exclude l4 info from the hashing (that got fixed in 533, whereby first fragments that do have l4 info, had their l4 info included, which should not since all frags need to follow the same hash).

if this is an l2 bundle, that is l2transport enabled and used under a bridge domain, then it could be that the hash selection is not set correct and may use mac adds instead of l3 info.

if everything is configured correctly, then another option may be to use the hash shift command to rotate the hash around and try to get some more equal spread that way.

cheers!

xander

smailmilak · ‎02-23-2016

Thanks for the prompt reply.

Configuration is ok. We did not change anything before and after the upgrade.

Load balancing method is Dst IP.

I will try with hash shift. Maybe it will solve the issue.

Thanks again and have a good day.

pabloboscariol · ‎03-09-2016

Solved changing vrf-mode to per-vrf ;)