With the growth of AI and VXLAN fabrics in the industry, security demands are increasing as well, especially when deploying K8s clusters. In certain scenarios, particularly in private K8s clouds, you might want to restrict client access to specific workloads and pods. Cilium, with its eBPF-based networking built into the Linux kernel, can certainly provide superior capabilities here.
From the VXLAN networking perspective, it might also be ideal to restrict or isolate your clients into separate VRFs/tenants, where a tenant can only see IP addresses belonging to that tenant, in isolation from any other tenant.
Isovalent Enterprise for Cilium provides true end-to-end VRF isolation between the K8s clusters and the client VRFs using SRv6 L3VPN. This is the true definition of tenancy on a K8s cluster, where an overlapping Service IP range of Pods does not actually conflict with others as long as they are separated into different VRFs/tenants. Such a capability is truly end to end, from the K8s side all the way through the Nexus VXLAN fabric towards any border leaf switches.
In this post, aside from the SRv6 demonstrations, we will also demonstrate a simple implementation of BGP communities between Cilium and a Nexus VXLAN EVPN fabric that can provide a form of multi-tenant capability. This is not true tenancy, however, as opposed to deploying SRv6 L3VPN configurations.
Using BGP Attributes to Steer Routes into VRFs (Not True End-to-End Multi-Tenancy)
While there are many methods to steer BGP routes into separate VRFs, we will make use of BGP communities here. Consider two pods in the deployment: Pod1 belongs to Tenant1 only, and the IP routing table of Tenant1 should only contain Service IPs or CIDRs belonging to this pod or to any other pods in Tenant1. More specifically, Tenant1 should see 20.0.10.1/32 in its routing table, but should never see Pod2's Service IP of 30.0.10.1. We will simply use BGP communities advertised from Cilium to achieve that: Cilium will add the BGP community 64512:301 to any advertisements of pods belonging to Tenant1, while Pod2 in this example will be advertised with a different community, 64512:302.
The Cisco Nexus EVPN fabric can have multiple VRFs deployed. One of them is a generic VRF facing the K8s nodes (Tenant-K8s), and this is the VRF where all BGP sessions are established. Note that in this example I used a loopback in the VRF called "Tenant-K8s"; loopbacks can be needed when deploying vPCs to keep BGP sessions up in case of link failures. In this example, however, I have only one worker node with a single link to a single leaf.
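For reference, here is a minimal sketch of what the leaf-side peering towards the worker node could look like under the Tenant-K8s VRF. Only the ASNs, neighbor address, and inbound soft-reconfiguration are taken from the outputs that follow; the loopback numbering and multihop settings are placeholders for the loopback-based peering described above.
router bgp 65000
  vrf Tenant-K8s
    address-family ipv4 unicast
    neighbor 192.168.16.35
      remote-as 64512
      description Cilium-worker-node
      ! if sourcing the session from a loopback in the VRF, something like:
      ! update-source loopback100
      ! ebgp-multihop 2
      address-family ipv4 unicast
        soft-reconfiguration inbound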
Checking on the BGP sessions:
From the Leaf Switch side:
Leaf01# show bgp vrf Tenant-K8s ipv4 unicast summary
BGP summary information for VRF Tenant-K8s, address family IPv4 Unicast
BGP router identifier 192.168.1.1, local AS number 65000
BGP table version is 325, IPv4 Unicast config peers 1, capable peers 1
6 network entries and 6 paths using 1800 bytes of memory
BGP attribute entries [5/1840], BGP AS path entries [1/6]
BGP community entries [2/88], BGP clusterlist entries [0/0]
4 received paths for inbound soft reconfiguration
0 identical, 4 modified, 0 filtered received paths using 64 bytes
Neighbor V AS MsgRcvd MsgSent TblVer InQ OutQ Up/Down State/
PfxRcd
192.168.16.35 4 64512 13305 13197 325 0 0 1d22h 4
From the worker node:
jawad@ubuntu1:~$ cilium bgp peers
Node Local AS Peer AS Peer Address Session State Uptime Family Received Advertised
ubuntu2.local 64512 65000 192.168.1.1 established 45h49m25s ipv4/unicast 2 6
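For completeness, on recent Cilium releases (BGP control plane v2) a session like this could be defined with resources roughly along these lines. The resource names and node selector below are hypothetical (they are not shown in this post), only the ASNs and peer address come from the outputs above, and older releases would use CiliumBGPPeeringPolicy instead.
apiVersion: cilium.io/v2alpha1
kind: CiliumBGPClusterConfig
metadata:
  name: bgp-cluster-config        # hypothetical name
spec:
  nodeSelector:
    matchLabels:
      bgp: enabled                # assumed node label
  bgpInstances:
    - name: "instance-64512"
      localASN: 64512
      peers:
        - name: "leaf01"
          peerASN: 65000
          peerAddress: "192.168.1.1"
          peerConfigRef:
            name: "leaf-peer-config"
---
apiVersion: cilium.io/v2alpha1
kind: CiliumBGPPeerConfig
metadata:
  name: leaf-peer-config          # referenced above
spec:
  families:
    - afi: ipv4
      safi: unicast
      advertisements:
        matchLabels:
          advertise: bgp          # selects the CiliumBGPAdvertisement shown later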
The worker node has two pods, each with a different Service IP.
jawad@ubuntu1:~$ kubectl get pods
NAME READY STATUS RESTARTS AGE
nginx 1/1 Running 0 20h
nginx2 1/1 Running 0 20h
jawad@ubuntu1:~$ kubectl describe ippools/ip-pool-pod1 | grep Cidr
Cidr: 20.0.10.0/24
jawad@ubuntu1:~$ kubectl describe ippools/ip-pool-pod2 | grep Cidr
Cidr: 30.0.10.0/24
jawad@ubuntu1:~$ kubectl get service
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
kubernetes ClusterIP 10.96.0.1 <none> 443/TCP 4d17h
nginx-app NodePort 10.100.246.185 <none> 80:31705/TCP 4d17h
nginx-service LoadBalancer 10.110.206.139 20.0.10.1 80:32216/TCP 41h
nginx-service2 LoadBalancer 10.109.203.74 30.0.10.1 80:32489/TCP 40h
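The ippools resources shown above are LoadBalancer IP pools that hand out the 20.0.10.0/24 and 30.0.10.0/24 external IPs. As a hedged sketch, ip-pool-pod1 might look roughly like the following; the serviceSelector and API version are assumptions (they vary by Cilium release, and older versions call the blocks field cidrs).
apiVersion: cilium.io/v2alpha1
kind: CiliumLoadBalancerIPPool
metadata:
  name: ip-pool-pod1
spec:
  blocks:
    - cidr: 20.0.10.0/24
  serviceSelector:                # assumed: restrict this pool to Tenant1 services
    matchLabels:
      servicebgp: proxy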
Cilium will advertise each Pod's service IP with a different community. We basically assign labels to each Pod's service (which could represent a tenant), and Cilium matches those labels and attaches the corresponding community. In the CiliumBGPAdvertisement config file below, Pod1's service will be advertised with community 64512:301 and Pod2's with community 64512:302.
jawad@ubuntu1:~/ciliumconfigs$ cat CiliumBGPAdvertisement.yaml
apiVersion: cilium.io/v2alpha1
kind: CiliumBGPAdvertisement
metadata:
  name: bgp-advertisements
  labels:
    advertise: bgp
spec:
  advertisements:
    - advertisementType: "Service"
      service:
        addresses:
          - ClusterIP
          - ExternalIP
          - LoadBalancerIP
      selector:
        matchExpressions:
          - { key: servicebgp, operator: In, values: [ proxy ] }
      attributes:
        communities:
          standard: [ "64512:301" ]
    - advertisementType: "Service"
      service:
        addresses:
          - ClusterIP
          - ExternalIP
          - LoadBalancerIP
      selector:
        matchExpressions:
          - { key: servicebgp, operator: In, values: [ proxy2 ] }
      attributes:
        communities:
          standard: [ "64512:302" ]
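The selectors in the advertisement above match on a servicebgp label carried by the Services themselves. As a hedged example, the two LoadBalancer services from the earlier output could be labelled like this (label keys and values taken from the YAML above):
kubectl label service nginx-service servicebgp=proxy
kubectl label service nginx-service2 servicebgp=proxy2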
Let’s have a look at what routes Cilium is advertising. We can clearly see the two routes of interest: 20.0.10.1/32, which belongs to Tenant1, and 30.0.10.1/32, which belongs to a different tenant according to our requirements. We are also advertising the Cluster IPs for demonstration purposes.
jawad@ubuntu1:~$ cilium bgp routes
(Defaulting to `available ipv4 unicast` routes, please see help for more options)
Node VRouter Prefix NextHop Age Attrs
ubuntu2.local 64512 10.0.1.0/24 0.0.0.0 59h54m37s [{Origin: i} {Nexthop: 0.0.0.0}]
64512 10.109.203.74/32 0.0.0.0 40h29m5s [{Origin: i} {Nexthop: 0.0.0.0}]
64512 10.110.206.139/32 0.0.0.0 40h41m2s [{Origin: i} {Nexthop: 0.0.0.0}]
64512 20.0.10.1/32 0.0.0.0 20h2m1s [{Origin: i} {Nexthop: 0.0.0.0}]
64512 30.0.10.1/32 0.0.0.0 20h2m59s [{Origin: i} {Nexthop: 0.0.0.0}]
Now let’s have a look at what the Nexus reports in the EVPN table for the VRF facing the K8s cluster:
Leaf01# show bgp l2vpn evpn vrf Tenant-K8s
BGP routing table information for VRF default, address family L2VPN EVPN
BGP table version is 3435, Local Router ID is 10.70.0.11
Status: s-suppressed, x-deleted, S-stale, d-dampened, h-history, *-valid, >-best
Path type: i-internal, e-external, c-confed, l-local, a-aggregate, r-redist, I-injected
Origin codes: i - IGP, e - EGP, ? - incomplete, | - multipath, & - backup, 2 - best2
Network Next Hop Metric LocPrf Weight Path
Route Distinguisher: 10.70.0.11:6 (L3VNI 100301)
*>l[5]:[0]:[0]:[24]:[192.168.1.0]/224
10.77.0.111 0 100 32768 ?
*>l[5]:[0]:[0]:[24]:[192.168.16.0]/224
10.77.0.111 0 100 32768 ?
*>l[5]:[0]:[0]:[32]:[10.109.203.74]/224
10.77.0.111 0 64512 i
*>l[5]:[0]:[0]:[32]:[10.110.206.139]/224
10.77.0.111 0 64512 i
*>l[5]:[0]:[0]:[32]:[20.0.10.1]/224
10.77.0.111 0 64512 i
*>l[5]:[0]:[0]:[32]:[30.0.10.1]/224
10.77.0.111 0 64512 i
Now let’s look at the communities for each of the above two routes:
Leaf01# show bgp l2vpn evpn 20.0.10.1 vrf Tenant-K8s | grep Community
Community: 64512:301
Leaf01# show bgp l2vpn evpn 30.0.10.1 vrf Tenant-K8s | grep Community
Community: 64512:302
Now that the communities are advertised correctly, we can act on them. We have another VRF called “Tenant1-Pods”, and the goal is to import into it only the EVPN routes that carry the community 64512:301. Routes with community 64512:302 should not be imported into this VRF, as they represent a different tenant.
Leaf01# sh run | sec "vrf context Tenant-K8s"
vrf context Tenant-K8s
vni 100301
ip pim ssm range 232.0.0.0/8
rd auto
address-family ipv4 unicast
route-target both auto
route-target both auto evpn
export map K8-Export-Others
import map CommunityImport evpn
Leaf01# sh run | sec "vrf context Tenant1-Pods"
vrf context Tenant1-Pods
vni 100302
ip pim ssm range 232.0.0.0/8
rd auto
address-family ipv4 unicast
route-target both auto
route-target both auto evpn
import vrf advertise-vpn
Leaf01# sh run | sec "route-map K8-Export-Others"
route-map K8-Export-Others permit 10
match community Tenant1-Pods-BGP-Community
set extcommunity rt 65000:100302
Leaf01# sh run | sec "route-map CommunityImport"
route-map CommunityImport permit 10
match extcommunity Tenant1-Pods-RT
set community 64512:301
route-map CommunityImport permit 20
match tag 12346
set community 64512:301
route-map CommunityImport deny 30
Leaf01# sh ip community-list
Standard Community List Tenant1-Pods-BGP-Community
10 permit 64512:301
Leaf01# sh ip extcommunity-list
Standard Extended Community List Tenant1-Pods-RT
10 permit RT:65000:100302
Let’s have a look at the VRF belonging to Tenant1 (Tenant1-Pods); only 20.0.10.1/32 should appear there, not the 30.0.10.1/32 route.
Leaf01# show bgp l2vpn evpn vrf Tenant1-Pods
BGP routing table information for VRF default, address family L2VPN EVPN
BGP table version is 3435, Local Router ID is 10.70.0.11
Status: s-suppressed, x-deleted, S-stale, d-dampened, h-history, *-valid, >-best
Path type: i-internal, e-external, c-confed, l-local, a-aggregate, r-redist, I-injected
Origin codes: i - IGP, e - EGP, ? - incomplete, | - multipath, & - backup, 2 - best2
Network Next Hop Metric LocPrf Weight Path
Route Distinguisher: 10.70.0.11:5 (L3VNI 100302)
*>l[5]:[0]:[0]:[24]:[10.71.1.0]/224
10.77.0.111 0 100 32768 ?
*>l[5]:[0]:[0]:[32]:[10.110.206.139]/224
10.77.0.111 0 64512 i
*>l[5]:[0]:[0]:[32]:[20.0.10.1]/224
10.77.0.111 0 64512 i
*>l[5]:[0]:[0]:[32]:[172.16.16.16]/224
10.77.0.111 0 100 32768 ?
It is obvious from the above output that we correctly imported the routes belonging to a specific tenant into the correct routing table, using the BGP communities received from Cilium to filter the routes and place them into the correct VRF/tenant. Although this simple implementation provides a sort of "multi-tenant" capability, it is important to note that it is not complete end-to-end tenancy unless you run real L3VPN services on the worker nodes, for example by deploying Isovalent Enterprise for Cilium, as we will see shortly.
Using SRv6 L3VPN with Native, True End-to-End VRFs/Tenants
Let’s consider a scenario where Isovalent Enterprise for Cilium is running on a K8s node. We might have several VRFs configured directly on the nodes; let’s say Tenant1 is represented by the VRF “blue”. There is seamless integration between EVPN and SRv6 L3VPN, but we will focus here on the L3VPN part only.
Let’s first see how the BGP sessions are established:
root@ubuntuk8:/home/jawad/srv6# cilium bgp peers
Node Local AS Peer AS Peer Address Session State Uptime Family Received Advertised
ubuntuk8.local 65000 65000 2016:40:40:40::40 established 2h39m59s ipv6/unicast 0 2
ipv4/mpls_vpn 1 1
We can see from the above that Cilium has established both the IPv6 unicast and VPNv4 address families with the Cisco switch/router. The IPv6 unicast family, in this setup, provides reachability to the exchanged SRv6 locator prefixes/SIDs, while the VPNv4 family carries the VRF/tenant routes, including any advertised SIDs.
I am using a Cisco XRv virtual router in this lab, but it could be a Nexus 9K switch. Let’s see the IPv6 session from the router’s side:
RP/0/RP0/CPU0:XRV#show bgp ipv6 unicast summary
Process RcvTblVer bRIB/RIB LabelVer ImportVer SendTblVer StandbyVer
Speaker 10 10 10 10 10 0
Neighbor Spk AS MsgRcvd MsgSent TblVer InQ OutQ Up/Down St/PfxRcd
2016:23:23:23:20c:29ff:fe73:137d
0 65000 355 353 10 0 0 02:53:04 2
Similarly, we can check the VPNv4 session between Cilium Enterprise and the XRV router.
RP/0/RP0/CPU0:XRV#show bgp vpnv4 unicast summary
Neighbor Spk AS MsgRcvd MsgSent TblVer InQ OutQ Up/Down St/PfxRcd
2016:23:23:23:20c:29ff:fe73:137d
0 65000 356 354 10 0 0 02:53:43 1
Let’s see which locator/SIDs are being received from Cilium Enterprise.
RP/0/RP0/CPU0:XRV#sh route ipv6
B 2001:23:23:23::/64
[200/0] via 2016:23:23:23:20c:29ff:fe73:137d, 00:13:10
C 2016:23:23:23::/64 is directly connected,
2d10h, GigabitEthernet0/0/0/0
L 2016:23:23:23::40/128 is directly connected,
2d10h, GigabitEthernet0/0/0/0
L 2016:40:40:40::40/128 is directly connected,
2d10h, Loopback0
L cafe:cafe:100::/48, SRv6 Endpoint uN (shift)
[0/0] via ::, 2d10h
L cafe:cafe:100::/64, SRv6 Endpoint uN (PSP/USD)
[0/0] via ::, 2d10h
L cafe:cafe:100:e000::/64, SRv6 Endpoint uDT4
[0/0] via ::ffff:0.0.0.0 (nexthop in vrf blue), 00:21:04
L cafe:cafe:100:e001::/64, SRv6 Endpoint uDT4
[0/0] via ::ffff:0.0.0.0 (nexthop in vrf default), 00:21:04
L cafe:cafe:100:e002::/64, SRv6 Endpoint uDT6
[0/0] via ::, 00:21:04
B cafe:cafe:2be::/48
[200/0] via 2016:23:23:23:20c:29ff:fe73:137d, 00:13:10
The SID advertised above should correspond to the locator prefix configured on Cilium Enterprise, so let's check.
root@ubuntuk8:/home/jawad/srv6# kubectl get sidmanager -o yaml | grep "prefix: cafe"
prefix: cafe:cafe:2be::/48
Pod1 belongs to the blue VRF, and we expect Cilium to advertise it only within the blue VRF. Before that, let’s check the IP address of that pod.
root@ubuntuk8:/home/jawad# kubectl get pods -n blue
NAME READY STATUS RESTARTS AGE
pod1-blue 1/1 Running 2 (41m ago) 167m
root@ubuntuk8:/home/jawad# kubectl describe -n blue pod pod1-blue | grep vrf
Labels: vrf=blue
root@ubuntuk8:/home/jawad# kubectl exec -it -n blue pod1-blue -- sh
/ # ifconfig | grep 10.23
inet addr:10.23.0.98 Bcast:0.0.0.0 Mask:255.255.255.255
So now we know that this pod is running with the IPv4 address 10.23.0.98 in the blue VRF. The XRV router should have this Pod CIDR subnet inside the blue VRF, and it should be received with the locator prefix/SID belonging to the Cilium node.
RP/0/RP0/CPU0:XRV#show bgp vpnv4 unicast received-sids
Status codes: s suppressed, d damped, h history, * valid, > best
i - internal, r RIB-failure, S stale, N Nexthop-discard
Origin codes: i - IGP, e - EGP, ? - incomplete
Network Next Hop Received Sid
Route Distinguisher: 10:10 (default for vrf blue)
Route Distinguisher Version: 23
*>i10.23.0.0/24 2016:23:23:23:20c:29ff:fe73:137d
cafe:cafe:2be:8b71::
*> 172.19.19.19/32 0.0.0.0 NO SRv6 Sid
Indeed, the next hop and the SID belong to this K8s node. For testing, I created a dummy loopback in the blue VRF and advertised it in BGP under the blue VRF.
RP/0/RP0/CPU0:XRV#sh run int lo10
interface Loopback10
vrf blue
ipv4 address 172.19.19.19 255.255.255.255
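For reference, here is a hedged sketch of the XR-side BGP configuration that would line up with what the outputs show for vrf blue (RD 10:10, locator XRV, per-vrf SID allocation, and the connected loopback being redistributed); treat it as an approximation rather than the exact lab config.
router bgp 65000
 vrf blue
  rd 10:10
  address-family ipv4 unicast
   segment-routing srv6
    locator XRV
    alloc mode per-vrf
   !
   redistribute connected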
RP/0/RP0/CPU0:XRV#show bgp vpnv4 unicast vrf blue
Status codes: s suppressed, d damped, h history, * valid, > best
i - internal, r RIB-failure, S stale, N Nexthop-discard
Origin codes: i - IGP, e - EGP, ? - incomplete
Network Next Hop Metric LocPrf Weight Path
Route Distinguisher: 10:10 (default for vrf blue)
Route Distinguisher Version: 10
*>i10.23.0.0/24 2016:23:23:23:20c:29ff:fe73:137d
0 100 0 ?
*> 172.19.19.19/32 0.0.0.0 0 32768 ?
Processed 2 prefixes, 2 paths
Great, so both routes exist in the correct VRF: one of them is advertised by Cilium in the blue VRF, and the other is the dummy loopback on the router. Let’s make sure that they exist only in the blue VRF and not, for example, in the underlay (global routing table).
RP/0/RP0/CPU0:XRV#sh route ipv4
C 192.168.16.0/24 is directly connected, 1d00h, GigabitEthernet0/0/0/0
L 192.168.16.40/32 is directly connected, 1d00h, GigabitEthernet0/0/0/0
L 192.168.40.40/32 is directly connected, 1d00h, Loopback0
You can clearly see that the global routing table of the XRV router has no knowledge of those routes; it truly behaves as the underlay of a fabric.
Let’s test connectivity from the blue pod itself:
root@ubuntuk8:/home/jawad# kubectl exec -it pod1-blue -n blue -- sh
/ # ping 172.19.19.19
PING 172.19.19.19 (172.19.19.19): 56 data bytes
64 bytes from 172.19.19.19: seq=0 ttl=253 time=2.747 ms
64 bytes from 172.19.19.19: seq=1 ttl=253 time=2.272 ms
64 bytes from 172.19.19.19: seq=2 ttl=253 time=2.390 ms
64 bytes from 172.19.19.19: seq=3 ttl=253 time=2.184 ms
64 bytes from 172.19.19.19: seq=4 ttl=253 time=2.255 ms
^C
--- 172.19.19.19 ping statistics ---
5 packets transmitted, 5 packets received, 0% packet loss
round-trip min/avg/max = 2.184/2.369/2.747 ms
Drilling into the packet captures, we can clearly see that the ICMP echo request was sent to a SID locator.
ICMP Packet sent:
ICMP Packet reply:
Note that the ICMP request was destined to cafe:cafe:100:e000::. You’ve probably already guessed that this SID was advertised by the XRV router inside the blue VRF, so let’s check.
RP/0/RP0/CPU0:XRV#show bgp vpnv4 unicast local-sids
Status codes: s suppressed, d damped, h history, * valid, > best
i - internal, r RIB-failure, S stale, N Nexthop-discard
Origin codes: i - IGP, e - EGP, ? - incomplete
Network Local Sid Alloc mode Locator
Route Distinguisher: 10:10 (default for vrf blue)
Route Distinguisher Version: 23
*>i10.23.0.0/24 NO SRv6 Sid - -
*> 172.19.19.19/32 cafe:cafe:100:e000:: per-vrf XRV
Processed 2 prefixes, 2 paths
The prefix 172.19.19.19/32 has a local SID of cafe:cafe:100:e000::, which is the XRV router’s own local SID, while 10.23.0.0/24 carries a received SID from the Cilium side, as shown previously.
Cilium Enterprise supports SRv6 L3VPN, literally acting as a PE router and exchanging routes using Segment Routing over IPv6 (SRv6). This feature allows us to create virtual private networks that provide true end-to-end segmentation and multi-tenant environments, delivering secure and isolated connectivity between Kubernetes clusters, data centers, and even public clouds.
Since we now have a complete, proper L3VPN/VRF routing table, we can configure seamless integration with our EVPN fabric.
Special thanks to the Cisco teams who created the internal Cilium dCloud demo, which included several examples of Isovalent Cilium configuration!