Solved: VXLAN BGP EVPN - Why VLAN for L3 VNI?

blazarov86 · ‎08-06-2021

Hello Team,

While I understand the concept of VXLAN, and particularly the BGP EVPN control plane, which I like very much, there is something that has been bugging me for some time.

Why do we need to create a VLAN and an SVI (interface VLAN XXX with no ip address and "ip forwarding") for each L3 VNI (tenant VRF)? It looks to me more like a "glitch" in the inner workings of IOS-XE/CLI/etc. than a fundamental technology requirement.

I have no experience with VXLAN on other vendors, but based on quick online research for (say) Arista, it seems that you don't need a VLAN for each VRF.

Having a VLAN for each L2 VNI makes perfect sense, of course; no question there.

Am I missing something here?

f00z · ‎08-10-2021

The reason is that vxlan is layer2 only, and to use layer3 the routing needs a destination/nexthop. The nexthop of the routing part is the vrf, which needs a mac address as a nexthop since vxlan is layer2, and the device provides that by using a vlan interface and vlan id.

The reason it needs a vlan is that it's a switch. A router can use a different target (routers have a much wider range of interface targets, i.e. bridge groups. They don't use vlans so much as push and pop tags, so the 4k vlan limitation is not there).

Switches only have space for the 4k vlans in their table and the IRB interface (SVI) uses one of these targets.

Also, the VTEP needs to know whether or not to perform an l3 lookup on the received vxlan encapsulated frame. For example, if the VTEP receives a frame destined to an L2 VNI that is mapped to a vlan endpoint, it knows to do a lookup in the MAC table for this frame and send it out. Likewise, if the frame is destined to the L3 VNI (mapped to another vlan endpoint, of course) and the destination MAC address is the address of the router (the SVI/IRB int), then it must do a layer3 lookup to perform routing locally.

Think of the vlan as a bridge group, an endpoint, where on routers you could have tens or hundreds of thousands but switches are limited.

It has to know whether or not to forward the frame on layer2, or do a routing lookup on it. Every device does independent routing lookups. Because vxlan is layer2 only, this tells it that it needs to do a routing lookup.
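The decision described above can be sketched in a few lines of Python. This is only an illustration of the logic in this post, not how any ASIC or NX-OS/IOS-XE actually implements it; the VNI numbers, VLAN IDs, VRF name, and router MAC are all made-up values.

```python
from dataclasses import dataclass

# Hypothetical values, for illustration only.
ROUTER_MAC = "00:00:5e:00:01:01"   # MAC of the SVI/IRB interface
L2_VNIS = {10010: 10, 10020: 20}   # L2 VNI -> local VLAN
L3_VNIS = {50000: "TENANT-A"}      # L3 VNI -> VRF (reached through its own local VLAN/SVI)


@dataclass
class DecapsulatedFrame:
    vni: int        # VNI taken from the VXLAN header
    dst_mac: str    # destination MAC of the inner Ethernet frame


def classify(frame: DecapsulatedFrame) -> str:
    """Return which lookup the VTEP performs for a received VXLAN frame."""
    if frame.vni in L2_VNIS:
        # L2 VNI: plain bridging inside the mapped VLAN.
        return f"bridge: MAC lookup in VLAN {L2_VNIS[frame.vni]}"
    if frame.vni in L3_VNIS and frame.dst_mac == ROUTER_MAC:
        # L3 VNI and addressed to the router MAC: route in the tenant VRF.
        return f"route: IP lookup in VRF {L3_VNIS[frame.vni]}"
    return "drop"


print(classify(DecapsulatedFrame(vni=10010, dst_mac="aa:bb:cc:00:00:01")))
print(classify(DecapsulatedFrame(vni=50000, dst_mac=ROUTER_MAC)))
```

The point of the sketch is that the VNI alone already selects the lookup type; the VLAN is just the local handle the switch uses to hang the SVI on.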

For example, say you have a server on leaf1 that uses the anycast gw to reach a server on leaf2 (two different subnets). The server sends the traffic to its default gw (the anycast gw mac). That switch does a routing lookup and sees that the nexthop is leaf2. This nexthop has to be an adjacency, which is formed from the information the other vtep advertises over EVPN; the egress encap db gets populated with the tunnel info and nexthop of the other device. It's a one-way tunnel, like mpls.

You can picture it like if the two switches were directly connected without using evpn or vxlan at all. Say they are connected with a trunk port and allowed vlan 50-100. That works fine on layer2 right, now you have another physical port connecting the two switches, but instead it's a router port with ip address x.x.x.x on both sides. That is basically what it is doing so it can differentiate between routed and non routed.

I suspect this design will change in the future because switch asics are no longer designed with constrained limits any more and technically could use bridge domains, but right now they are locked into using vlan ids probably due to BU stuff or just because that is what everyone is used to at the moment.

Ok, I rambled on a bit there, and maybe slightly incoherently. If that info makes sense let me know; if not, I'll try to clarify more.


Sergiu.Daniluk · ‎08-06-2021

Hi @blazarov86

The role of the L3VNI is due to the nature of the IRB (Integrated Routing Bridging) implementation type on Cisco devices. More specifically - symmetric IRB.

How does this work? Very simple!

When two endpoints located in different L2VNIs (different subnets) communicate, the VTEPs encapsulate and forward the routed traffic over the L3VNI, as presented below:

On other vendors (Juniper, Arista) the IRB implementation is asymmetric IRB, meaning the concept of L3VNI does not exist, and the ingress VTEP is performing the routing between L2VNIs, as you see in the following picture:

Because of this different type of IRB mechanism, interconnecting two VXLAN fabrics with different IRB types (Cisco & Juniper for example) is not possible.
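As a rough sketch of the difference between the two IRB types, the function below returns which VNI ends up in the VXLAN header for traffic between two endpoints. It is a simplified model of the two behaviours described above, with hypothetical VNI numbers, and it ignores everything else a real VTEP does.

```python
def vni_on_wire(mode: str, src_l2vni: int, dst_l2vni: int, l3vni: int) -> int:
    """Return the VNI carried in the VXLAN header between two VTEPs.

    mode is "symmetric" or "asymmetric"; all VNI values are hypothetical.
    """
    if src_l2vni == dst_l2vni:
        # Same subnet: the traffic is bridged and stays in its L2 VNI.
        return src_l2vni
    if mode == "symmetric":
        # Ingress routes into the VRF, egress routes out of it;
        # the transit segment is the per-VRF L3 VNI.
        return l3vni
    if mode == "asymmetric":
        # Ingress routes and then bridges straight into the destination
        # L2 VNI; no L3 VNI is involved.
        return dst_l2vni
    raise ValueError(f"unknown IRB mode: {mode}")


print(vni_on_wire("symmetric", 10010, 10020, 50000))   # 50000
print(vni_on_wire("asymmetric", 10010, 10020, 50000))  # 10020
```

One consequence visible in the model: with asymmetric IRB the ingress VTEP must know the destination L2 VNI locally, whereas with symmetric IRB it only needs the VRF's L3 VNI.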

If you want to dig deeper into VXLAN, I would recommend the following book: Building Data Centers with VXLAN BGP EVPN: A Cisco NX-OS Perspective (the above pictures are from this book).

Stay safe,

Sergiu

f00z · ‎08-09-2021

Good reply; however, other vendors support both symmetric and asymmetric IRB along with other things like centralized gw, whereas cisco ONLY supports symmetric and anycast gw.

It is entirely possible to use juniper arista and cisco EVPN/vxlan together, I have it working in my lab; but since cisco forces you to use symmetric and anycast, that's the only supported method that works.

VNI for L3 is because vxlan/EVPN is technically layer 2 only, so it has to create a VNI per VRF (think of it as an MPLS label), and there's a MAC VRF for layer2 and a L3 VRF for Routing, each with (layer2)VNI. EVPN vxlan is similar to VPLS in many cases.

A combination of symmetric and asymmetric is usable as well in certain scenarios to make it behave more like a traditional l3vpn/l2vpn (mpls/vpls or whatever) type setup. Cisco named this 'feature' downstream VNI, but it isn't so much a feature as it is cisco allowing the ability to work instead of blocking it.

Just wanted to add that for clarification.

blazarov86 · ‎08-09-2021

Hello Sergiu,

Thanks for your reply, indeed it is helpful and brings additional clarity to the subject at hand.

However my post is primarily focused on the issue of "wasting" VLANs (scarce resource) for a function that has no real usage of VLAN and L2 switching whatsoever and its negative impact in real life use cases.

Accepting and taking into consideration all the facts around L3VNIs and symmetric/asymmetric IRB, I still find it unnecessary and kind of "arbitrary" to "waste" a VLAN for each L3VNI/VRF/Tenant. Even with all the technicalities, I could still see it implemented with some other type of virtual interface instead of a VLAN interface.

And this is not just a theoretical issue; I believe it has a practical, real-life negative impact.

If you want to implement a typical multitenant DC environment with each tenant in a VRF and each tenant having one or more subnets, and you want to push the scalability "to the edge", the sum of <total number of tenants> (one VLAN each) and <total number of subnets> (one VLAN each) can never exceed 4k within the L2 domain.

Removing this VLAN per L3VNI requirement brings a clear scalability advantage, which can reach 100% in the most extreme case (1 subnet per tenant).

Am I making sense, or am I missing something?
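The arithmetic behind that claim can be worked through with a short Python sketch. It assumes the full 802.1Q range of 4094 usable VLAN IDs (the "4k" above) and ignores any IDs a given platform reserves, so the numbers are an upper bound, not a platform limit.

```python
USABLE_VLANS = 4094  # 802.1Q IDs 1-4094; simplification, ignores reserved ranges


def max_tenants(subnets_per_tenant: int, vlan_per_l3vni: bool = True) -> int:
    """Upper bound on tenants when every subnet (and optionally every VRF) needs a VLAN."""
    vlans_per_tenant = subnets_per_tenant + (1 if vlan_per_l3vni else 0)
    return USABLE_VLANS // vlans_per_tenant


def gain_percent(subnets_per_tenant: int) -> float:
    """Extra tenants, in percent, if the L3VNI did not consume a VLAN."""
    with_vlan = max_tenants(subnets_per_tenant, vlan_per_l3vni=True)
    without_vlan = max_tenants(subnets_per_tenant, vlan_per_l3vni=False)
    return (without_vlan - with_vlan) / with_vlan * 100


for subnets in (1, 4):
    print(subnets, max_tenants(subnets, True), max_tenants(subnets, False),
          round(gain_percent(subnets), 1))
```

With one subnet per tenant this gives 2047 tenants versus 4094, the 100% case; with four subnets per tenant it is 818 versus 1023, roughly 25%.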

f00z · ‎08-10-2021

I think you are missing the fact that it only 'wastes' a vlan on that particular leaf. With EVPN, a VLAN is a local entity PER leaf device. So if you had a customer on each port of a 48 port leaf it would use 48 vlans on that leaf, but those vlans are only used on that device and not elsewhere on the network. If a customer on port 1 of that leaf was assigned vlan 1234, and then the customer added another port on another leaf, that port could be assigned vlan 333, since the vlan # assignment itself is only local to each device and doesn't carry across the network. The VNI is now the real global identifier inside of the EVPN itself.

If you think you'll have 1000 vlans per leaf , maybe your design is wrong. Usually a leaf is a top of rack device serving a certain number of servers, or clients if it's a colo or DCI type setup.

While each leaf can only use 1000 or so vlans itself, the EVPN can have 16 million VNIs; this is how the scale works. What most documentation fails to express is that the vlan #s can be different everywhere, and for ease of understanding the docs use the same vlan # on every leaf as a way to show it mapped to a VNI. If one customer is on 10 different leaf switches, it could be 10 different vlans (a different # on each leaf) but still be the same vlan to the customer.
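A minimal sketch of that idea: each leaf keeps its own VLAN-to-VNI table, and two ports are in the same segment when their local VLANs map to the same VNI. The VLAN numbers 1234 and 333 are the ones from the example above; the leaf names and the VNI value 30001 are hypothetical.

```python
# Per-leaf VLAN -> VNI tables. VLAN IDs are only meaningful on their own leaf.
leaf_vlan_to_vni = {
    "leaf1": {1234: 30001},  # customer port on leaf1 uses VLAN 1234
    "leaf2": {333: 30001},   # same customer on leaf2 uses VLAN 333
}

VNI_SPACE = 2 ** 24  # 24-bit VNI field, roughly 16 million segments


def same_segment(leaf_a: str, vlan_a: int, leaf_b: str, vlan_b: int) -> bool:
    """True when both local VLANs map to the same fabric-wide VNI."""
    vni_a = leaf_vlan_to_vni[leaf_a].get(vlan_a)
    vni_b = leaf_vlan_to_vni[leaf_b].get(vlan_b)
    return vni_a is not None and vni_a == vni_b


print(same_segment("leaf1", 1234, "leaf2", 333))   # True: different VLAN IDs, same VNI
print(same_segment("leaf1", 1234, "leaf2", 1234))  # False: VLAN 1234 is not mapped on leaf2
```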

Does that make more sense?

Sergiu.Daniluk · ‎08-10-2021

To add to @f00z reply about the scalability: on one leaf you can currently configure a maximum of:

VLANs on a VTEP node, for Nexus 9200, 9300, 9300-EX, 9300-FX, 9364C, and 9500 switches and the X9700-EX/FX line cards:

- 1700 (total VLANs)

- 1500 (VXLAN VLANs)

- 200 (non-VXLAN VLANs)

and fabric wide, for the same platforms:

- VXLAN Layer 2 VNIs: 2000

- VXLAN Layer 3 VNIs/VRFs: 500

Reference: https://www.cisco.com/c/en/us/td/docs/switches/datacenter/nexus9000/sw/92x/scalability/guide_923/b_Cisco_Nexus_9000_Series_NX-OS_Verified_Scalability_Guide_923.html#id_91722

Also note that these limits are software dependent, so they might increase in the future.
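To see how these numbers interact, here is a small Python sketch that checks a planned leaf against the limits quoted above. It assumes each L2 VNI and each L3 VNI consumes one VXLAN VLAN on the leaf, as discussed in this thread; the function name and its parameters are hypothetical, and the limits are only the ones listed here for that software release.

```python
# Limits as quoted in this post (9.2(3) verified scalability guide).
MAX_TOTAL_VLANS = 1700
MAX_VXLAN_VLANS = 1500
MAX_NON_VXLAN_VLANS = 200
MAX_L2_VNIS = 2000
MAX_L3_VNIS = 500


def check_leaf(l2_vnis: int, l3_vnis: int, non_vxlan_vlans: int = 0) -> list[str]:
    """Return a list of exceeded limits for one leaf (empty list means OK)."""
    vxlan_vlans = l2_vnis + l3_vnis  # assumption: one VLAN per L2 VNI and per L3 VNI
    problems = []
    if l2_vnis > MAX_L2_VNIS:
        problems.append("too many L2 VNIs")
    if l3_vnis > MAX_L3_VNIS:
        problems.append("too many L3 VNIs/VRFs")
    if vxlan_vlans > MAX_VXLAN_VLANS:
        problems.append("too many VXLAN VLANs")
    if non_vxlan_vlans > MAX_NON_VXLAN_VLANS:
        problems.append("too many non-VXLAN VLANs")
    if vxlan_vlans + non_vxlan_vlans > MAX_TOTAL_VLANS:
        problems.append("too many VLANs in total")
    return problems


print(check_leaf(l2_vnis=1000, l3_vnis=500))  # fits
print(check_leaf(l2_vnis=1200, l3_vnis=400))  # exceeds the VXLAN VLAN limit
```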

Stay safe,

Sergiu

blazarov86 · ‎08-10-2021

Hey @Sergiu.Daniluk and @f00z thanks for your further contribution - indeed on point.

Again, i confirm that I completely understand and accept your points on:

- VLANs being only locally significant (this is what i meant by within the L2 domain.)

- Other scalability bottlenecks on the platform lower than this one. These are obviously software and hardware dependent and hopefully subject to improvement.

Actually i believe these are pretty clear and well documented, so anybody interested would be able to absorb them.

What i wanted to achieve with this thread is to verify my intuition around the relationship between a VLAN and L3VNI being completely disconnected from the underlying technology and EVPN principles.

This has not been explicitly addressed so far in the thread, but it seems to be the case, right?

f00z · ‎08-10-2021