
BUM blackholing with N9K-C9364C as a RP and N3K-C3164Q as leafs

ss1
Level 1

Dear friends,

Good day, everybody. We have run into BUM blackholing in our topology. After troubleshooting, I think we have narrowed it down to a probable mroute issue on our N3K-C3164Q switches, which run as leafs.

Here is the topology:

[Figure: topology diagram, supportforums.cisco.com.4.png]

The 9364s work as anycast RPs and underlay routers for the ECMP paths towards the three VPC domains shown below: the first domain consists of two N3K-C3164Q, and the other two domains are N9K-C9396PX. The links between the 9364C and the leafs are Layer 3 multipath ports, and every port runs OSPF and PIM to provide redundancy in the underlay fabric. Each host shown below represents a switchport port-channel towards an end device.

We have detected unidirectional BUM issues affecting some hosts on the 3164. They can't get ARP replies from each other, so traffic drops when an ARP entry expires and is restored when an ARP request takes place.

 

We did some diagnosis and think the problem is most probably an mroute issue on the 3164. I'm not sure whether both devices in a VPC domain have to register all multicast sources in their mroute tables (i.e. populate more or less the same routing table), but our 3164 aren't doing so. The mroute tables match 1:1 between the two switches in each 9396 domain, but the 3164 do not register all multicast sources.

Let me show a real example with an underlay group on the 3164.
10.128.122.12 is the secondary IP address on the loopback designated for nve, and 10.128.3.215 is its counterpart on the 9396 side. The 3164-1 does not register 10.128.3.215 as a multicast source.

3164-1# show ip mroute 225.161.215.1
IP Multicast Routing Table for VRF "default"

(*, 225.161.215.1/32), uptime: 00:01:06, nve pim ip 
  Incoming interface: Ethernet1/27, RPF nbr: 10.183.161.13
  Outgoing interface list: (count: 1)
    nve1, uptime: 00:01:06, nve


(10.128.122.12/32, 225.161.215.1/32), uptime: 00:01:06, nve mrib pim ip 
  Incoming interface: loopback3, RPF nbr: 10.128.122.12
  Outgoing interface list: (count: 1)
    Ethernet1/45, uptime: 00:00:39, pim

3164-1# 

3164-2# show ip mroute 225.161.215.1
IP Multicast Routing Table for VRF "default"

(*, 225.161.215.1/32), uptime: 01:42:14, nve pim ip 
  Incoming interface: Ethernet1/59, RPF nbr: 10.183.162.17
  Outgoing interface list: (count: 1)
    nve1, uptime: 01:42:14, nve


(10.128.3.215/32, 225.161.215.1/32), uptime: 01:42:12, ip pim mrib 
  Incoming interface: Ethernet1/49, RPF nbr: 10.183.162.21
  Outgoing interface list: (count: 1)
    nve1, uptime: 01:42:12, mrib


(10.128.122.12/32, 225.161.215.1/32), uptime: 01:42:14, nve mrib pim ip 
  Incoming interface: loopback3, RPF nbr: 10.128.122.12
  Outgoing interface list: (count: 1)
    Ethernet1/55, uptime: 01:41:44, pim


3164-2#
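
The comparison above gets tedious once many groups are involved, so it can be scripted. The following is a minimal Python sketch, not an official tool: it assumes the `show ip mroute` output of each VPC peer has been saved as text, and the names `parse_sources` and `missing_sources` are hypothetical. The sample strings contain only the entry header lines from the two outputs above.

```python
import re

# Matches the header line of an mroute entry, e.g.
# "(10.128.3.215/32, 225.161.215.1/32), uptime: ..." or "(*, 225.161.215.1/32), ..."
ENTRY_RE = re.compile(r"^\((\*|[\d.]+)(?:/\d+)?,\s*([\d.]+)/\d+\)")


def parse_sources(mroute_text):
    """Return {group: set of sources} for the (S,G) entries; (*,G) is skipped."""
    sources = {}
    for line in mroute_text.splitlines():
        m = ENTRY_RE.match(line.strip())
        if m and m.group(1) != "*":
            sources.setdefault(m.group(2), set()).add(m.group(1))
    return sources


def missing_sources(local_text, peer_text):
    """Per group, list the sources the peer has that the local switch lacks."""
    local = parse_sources(local_text)
    peer = parse_sources(peer_text)
    result = {}
    for group, peer_srcs in peer.items():
        diff = peer_srcs - local.get(group, set())
        if diff:
            result[group] = sorted(diff)
    return result


# Entry header lines copied from the 3164-1 and 3164-2 outputs above.
OUT_3164_1 = """\
(*, 225.161.215.1/32), uptime: 00:01:06, nve pim ip
(10.128.122.12/32, 225.161.215.1/32), uptime: 00:01:06, nve mrib pim ip
"""
OUT_3164_2 = """\
(*, 225.161.215.1/32), uptime: 01:42:14, nve pim ip
(10.128.3.215/32, 225.161.215.1/32), uptime: 01:42:12, ip pim mrib
(10.128.122.12/32, 225.161.215.1/32), uptime: 01:42:14, nve mrib pim ip
"""

# Sources present on 3164-2 but absent on 3164-1
print(missing_sources(OUT_3164_1, OUT_3164_2))
```

Run against the full `show ip mroute` output of both peers, this would list every remote VTEP source that one 3164 has learned and the other has not.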

 

I think this issue appeared after our upgrade from NX-OS 7 to NX-OS 9, but I can't be sure: I didn't expect it, so I saved no mroute output before the upgrade.

The 9396 domains don't have this issue, though: both the local and remote sources are registered in the mroute table (sorry, I have to show the output for another multicast group, but the situation is the same for all others).

9396-1# show ip mroute 225.213.215.3
IP Multicast Routing Table for VRF "default"

(*, 225.213.215.3/32), uptime: 3w1d, nve ip pim 
  Incoming interface: Ethernet2/10, RPF nbr: 10.184.213.1
  Outgoing interface list: (count: 1)
    nve1, uptime: 3w1d, nve


(10.128.2.12/32, 225.213.215.3/32), uptime: 3w1d, nve mrib ip pim 
  Incoming interface: loopback3, RPF nbr: 10.128.2.12
  Outgoing interface list: (count: 0)


(10.128.3.215/32, 225.213.215.3/32), uptime: 3w1d, ip pim mrib 
  Incoming interface: Ethernet2/10, RPF nbr: 10.184.213.1
  Outgoing interface list: (count: 1)
    nve1, uptime: 3w1d, mrib

9396-2# show ip mroute 225.213.215.3
IP Multicast Routing Table for VRF "default"

(*, 225.213.215.3/32), uptime: 3w1d, nve ip pim 
  Incoming interface: Ethernet2/10, RPF nbr: 10.184.212.1
  Outgoing interface list: (count: 1)
    nve1, uptime: 3w1d, nve


(10.128.2.12/32, 225.213.215.3/32), uptime: 3w1d, nve mrib ip pim 
  Incoming interface: loopback3, RPF nbr: 10.128.2.12
  Outgoing interface list: (count: 1)
    Ethernet2/10, uptime: 01:55:33, pim


(10.128.3.215/32, 225.213.215.3/32), uptime: 3w1d, ip pim mrib 
  Incoming interface: Ethernet2/10, RPF nbr: 10.184.212.1
  Outgoing interface list: (count: 1)
    nve1, uptime: 3w1d, mrib

To sum it up we get the following situation on the 3164:

3164-1# show ip multicast vrf default
Multicast Routing VRFs (2 VRFs)
VRF Name              VRF      Table       Route   Group   Source  (*,G)   State
                      ID       ID          Count   Count   Count   Count

default               1        0x00000001  240     114     126     113     Up
    Multipath configuration (1): s-g-hash
    Resilient configuration: Disabled

3164-2# show ip multicast vrf default
Multicast Routing VRFs (2 VRFs)
VRF Name              VRF      Table       Route   Group   Source  (*,G)   State
                      ID       ID          Count   Count   Count   Count

default               1        0x00000001  363     114     249     113     Up
    Multipath configuration (1): s-g-hash
    Resilient configuration: Disabled

The same thing looks considerably better on the 9396:

9396-1# show ip multicast vrf default
Multicast Routing VRFs (3 VRFs)
VRF Name              VRF      Table       Route   Group   Source  (*,G)   State
                      ID       ID          Count   Count   Count   Count

default               1        0x00000001  326     109     216     109     Up
    Multipath configuration (1): s-g-hash
    Resilient configuration: Disabled

9396-2# show ip multicast vrf default
Multicast Routing VRFs (2 VRFs)
VRF Name              VRF      Table       Route   Group   Source  (*,G)   State
                      ID       ID          Count   Count   Count   Count

default               1        0x00000001  325     109     215     109     Up
    Multipath configuration (1): s-g-hash
    Resilient configuration: Disabled
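
To put numbers on the summaries above, the counters can be compared per VPC pair. This is a small Python sketch under the assumption that the `default` row always has the eight columns shown above; `parse_counts` and `source_gap` are hypothetical helper names, and the rows are copied from the outputs in this post.

```python
def parse_counts(row):
    """Extract the counters from a 'default' VRF row of 'show ip multicast vrf default'."""
    fields = row.split()
    if len(fields) != 8 or fields[0] != "default":
        raise ValueError("unexpected row format")
    route, group, source, star_g = (int(x) for x in fields[3:7])
    return {"route": route, "group": group, "source": source, "star_g": star_g}


def source_gap(row_a, row_b):
    """Absolute difference in Source Count between two VPC peers."""
    return abs(parse_counts(row_a)["source"] - parse_counts(row_b)["source"])


# Rows copied from the outputs above.
ROW_3164_1 = "default               1        0x00000001  240     114     126     113     Up"
ROW_3164_2 = "default               1        0x00000001  363     114     249     113     Up"
ROW_9396_1 = "default               1        0x00000001  326     109     216     109     Up"
ROW_9396_2 = "default               1        0x00000001  325     109     215     109     Up"

print("3164 source gap:", source_gap(ROW_3164_1, ROW_3164_2))
print("9396 source gap:", source_gap(ROW_9396_1, ROW_9396_2))
```

With these rows the 3164 pair differs by 123 sources and the 9396 pair by 1. The route counts on the 3164 pair also differ by 123 (363 versus 240), while the group and (*,G) counts are identical, so the whole gap is in (S,G) entries.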

I'm not sure whether the difference in source count can cause BUM blackholing, but any feedback on how to diagnose this further would be appreciated.

Thank you!

 

2 Replies

ss1
Level 1

Hello,

 

I have just read about the following command: 'ip pim pre-build-spt'
Perhaps it would be a good option to try? What do you think?

 

Thank you!

ss1
Level 1

ip pim pre-build-spt didn't help.
I can also try ip multicast multipath s-g-hash next-hop-based, as I really think this is some RPF issue. Does anybody think it would be worth trying as well?
