Just to preface this, I'm using Nexus 9300v virtual switches running NX-OS 10.3(5), deployed on VMware ESXi.
I am configuring VXLAN EVPN multi-site and have it all working with ARP suppression enabled (host A in DC1 can ping host B in DC2). The reason it works like this is that the DCI no longer has to flood the ARP, so pretty much all VXLAN-encapsulated traffic over the DCI is unicast directly between the DCI loopbacks, using the ARP suppression cache locally on the VTEPs.
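For reference, ARP suppression is just the per-VNI suppress-arp knob under the NVE interface on the VTEPs, along these lines (showing one of the L2 VNIs from my BGW config further down):

interface nve1
  member vni 2001000
    suppress-arp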
Reading the documentation, since 10.2 we can use ingress replication or multicast for site-external BUM. I am using PIM Anycast RP (ASM) within each site, and this is working fine. The border gateways have the L2 VNIs configured to listen to this group. Using packet captures, I can see the traffic reaching the BGWs destined for the multicast group. However, no matter which variant I use on the DCI (IR or mcast), I cannot get the DCI to forward these BUM packets. Like I say, with ARP suppression enabled everything works fine, because no BUM traffic needs to go over the DCI.
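For what it's worth, when I tried the ingress replication variant on the DCI, the member VNI config on the BGWs was roughly this instead of the multisite mcast-group version shown further down:

interface nve1
  member vni 2001000
    multisite ingress-replication
    mcast-group 239.0.0.1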
When ingress replication didn't work, I wondered if it was something to do with the virtual hardware. However, now that multicast isn't working either, I suspect there is something else in the mix. Reason being, I have mcast working intra-site just fine, so it should work on the DCI too?!
This is my topology:
And this is the configuration on the BGWs for NVE and multicast:
ip pim rp-address 10.0.0.98 group-list 239.0.0.0/24
ip pim rp-address 10.255.0.1 group-list 233.0.0.0/24
ip pim ssm range 232.0.0.0/8
ip pim anycast-rp 10.255.0.1 10.0.0.1
ip pim anycast-rp 10.255.0.1 10.0.0.2
ip pim anycast-rp 10.255.0.1 10.0.0.3
ip pim anycast-rp 10.255.0.1 10.0.0.4
interface loopback1
  ip address 10.255.0.1/32
  ip router ospf UNDERLAY area 0.0.0.0
  ip pim sparse-mode

interface nve1
  no shutdown
  host-reachability protocol bgp
  source-interface loopback0
  multisite border-gateway interface loopback100
  member vni 900101 associate-vrf
  member vni 900102 associate-vrf
  member vni 2001000
    multisite mcast-group 233.1.1.192
    mcast-group 239.0.0.1
  member vni 2001001
    multisite mcast-group 233.1.1.192
    mcast-group 239.0.0.1
The 10.0.0.x IPs are the loopback0 addresses for each core. All interfaces are configured with ip pim sparse-mode.
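The BGWs also have the multisite site-id and interface tracking applied; it looks roughly like this (the site-id, interface numbers and descriptions here are placeholders rather than my exact config):

evpn multisite border-gateway 100

interface Ethernet1/1
  description DCI link towards DC core
  ip pim sparse-mode
  evpn multisite dci-tracking

interface Ethernet1/3
  description fabric link towards spine
  ip pim sparse-mode
  evpn multisite fabric-tracking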
Like I say, I can see the ARP BUM traffic coming from a leaf all the way up to the BGW, destined for the non-DCI mcast group:
However, the DCI multicast group doesn't seem to work.
Not sure if I'm missing any configuration. Multicast IS functioning in the core:
DC1-CORE1(config)# ping multicast 233.1.1.192 interface eth1/1
PING 233.1.1.192 (233.1.1.192): 56 data bytes
64 bytes from 10.100.100.2: icmp_seq=0 ttl=254 time=5.752 ms
64 bytes from 10.100.100.2: icmp_seq=1 ttl=254 time=6.07 ms
64 bytes from 10.100.100.2: icmp_seq=2 ttl=254 time=6.598 ms
64 bytes from 10.100.100.2: icmp_seq=3 ttl=254 time=12.066 ms
64 bytes from 10.100.100.2: icmp_seq=4 ttl=254 time=10.432 ms
--- 233.1.1.192 ping multicast statistics ---
5 packets transmitted,
From member 10.100.100.2: 5 packets received, 0.00% packet loss
--- in total, 1 group member responded ---
DC2-CORE1(config)# ping multicast 233.1.1.192 interface eth1/1
PING 233.1.1.192 (233.1.1.192): 56 data bytes
64 bytes from 10.100.100.1: icmp_seq=0 ttl=254 time=4.403 ms
64 bytes from 10.100.100.1: icmp_seq=1 ttl=254 time=5.891 ms
64 bytes from 10.100.100.1: icmp_seq=2 ttl=254 time=7.207 ms
64 bytes from 10.100.100.1: icmp_seq=3 ttl=254 time=4.54 ms
64 bytes from 10.100.100.1: icmp_seq=4 ttl=254 time=4.838 ms
--- 233.1.1.192 ping multicast statistics ---
5 packets transmitted,
From member 10.100.100.1: 5 packets received, 0.00% packet loss
--- in total, 1 group member responded ---
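Beyond the pings, the commands I assume are most relevant here (and can paste output from if needed) are along the lines of:

show nve multisite dci-links
show nve multisite fabric-links
show nve vni
show nve peers
show ip mroute 233.1.1.192
show ip mroute 239.0.0.1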
I don't have any spare physical hardware that I could configure to test with, so I'm kind of stuck assuming it's something with how the virtual platform handles this internally. Shame there is no ELAM on them, that would have been handy. Either way, I can't prove it.
Hoping someone can shed some light for me. Happy to provide more info and show commands etc but didn't want to overload the post.
Edit: just reviewing the screenshots, and I noticed the source of the multicast packet in the screenshot is the primary BGW in DC1, which is where the packet capture was taken from. That explains why there are so many ARP messages: each one that comes in from the spines seems to be reflected back out:
In (10.0.1.10 is the source VTEP):
Out (10.0.0.1 is DC1-CORE1):
So it's as though the BGW is intentionally not forwarding it over the DCI for some reason, but is keeping it inside the fabric by forwarding it back out based on the fabric multicast group rather than the multisite mcast-group?
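One thing I still want to rule out is the multisite designated-forwarder election for the L2 VNIs on the BGWs, since (as I understand it) only the DF BGW should forward BUM for a given VNI towards/from the site. I believe that can be checked with something like:

show nve ethernet-segment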