Randomly Duplicated Multicast packets

daxsmiddy
Level 1

I'm looking for some help interpreting mstat/mtrace output and tracking down the source of randomly duplicated multicast packets from one site in a 'WAN' network.

The network:

Two physical sites, connected via an ISP private fiber link.

Site1: 4510R-E core. Many VLANs and subnets. Multicast tx and rx on this side is all fine.

Exit route on the private fiber is via an SVI (10.99.1.2 in the samples below) in a dedicated VLAN that exists on each end. The static RP configured in both switches is an SVI on this switch. The receivers are directly connected to this switch (the 4510).

Site2: 3850 core. VLANs and subnets. For production, multicast on this side is transmit only. Inbound from site1 is via an SVI (10.99.1.1 in the samples below). Multicast traffic originates from access switches behind this switch.

'ip pim sparse-mode' is set on all the VLAN SVIs that need to participate in multicast traffic.
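
For reference, a minimal sketch of what that per-SVI config looks like (the VLAN number here is just assumed from the 10.10.152.x addressing in the traces below, not copied from the actual switches):

(config)# interface Vlan152
(config-if)# ip pim sparse-mode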

 

The issue: at multicast receivers in site1, I frequently see multicast packets coming from site2 duplicated. Nowhere near 100%, but a significant amount, maybe as high as 15-20% at times. Any packets originating in site1 are normal, no duplication.

If I set up a test receiver in site2, I see the same kind of duplication from transmitters in both sites.

Both switches have 'ip multicast-routing' enabled and 'ip pim rp-address 10.1.131.1' configured; that address is an SVI on the 4510 with sparse-mode set. (This was initially pointed at a different SVI that was not PIM-enabled, but the change seems to have made no difference.)
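
For completeness, the global side of that config, plus the standard command to confirm both switches agree on the RP (addresses are the ones from this post):

(config)# ip multicast-routing
(config)# ip pim rp-address 10.1.131.1

SHCORE01# show ip pim rp mapping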

Here are some mstat outputs. I don't really know how to interpret these, and googling hasn't provided any great explanations. Something that's interesting to me is that 10.10.152.1 and 10.99.1.1 are two SVIs on the same switch, so I guess that shows as a hop because traffic is routed from one VLAN to another? Likewise with 10.99.1.2 and 10.1.40.1.
I've been digging at this for several days and I still don't feel like I know what to look for.
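
If it helps anyone spot something I'm missing, the standard IOS commands for checking the mroute state and forwarding counters for this group would be along these lines (group address taken from the traces below):

SHCORE01# show ip mroute 239.192.1.1
SHCORE01# show ip mroute 239.192.1.1 count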

 

NOCORE01>mstat 10.10.152.50 10.1.40.70 239.192.1.1
Type escape sequence to abort.
Mtrace from 10.10.152.50 to 10.1.40.70 via group 239.192.1.1
From source (?) to destination (?)
Waiting to accumulate statistics......
Results after 10 seconds:

  Source        Response Dest   Packet Statistics For     Only For Traffic
10.10.152.50    10.99.1.2       All Multicast Traffic     From 10.10.152.50
     |       __/  rtt 3    ms   Lost/Sent = Pct  Rate     To 239.192.1.1
     v      /     hop 3    ms   ---------------------     --------------------
10.10.152.1
10.99.1.1       ?
     |     ^      ttl   0
     v     |      hop 0    ms    0/2 = --%      0 pps    0/0 = --%  0 pps
10.99.1.2
10.1.40.1       ? Reached RP/Core
     |      \__   ttl   1
     v         \  hop 0    ms        11         1 pps           0    0 pps
10.1.40.70      10.99.1.2
  Receiver      Query Source

SHCORE01>mstat 10.10.152.50 10.1.40.70 239.192.1.1
Type escape sequence to abort.
Mtrace from 10.10.152.50 to 10.1.40.70 via group 239.192.1.1
From source (?) to destination (?)
Waiting to accumulate statistics......
Results after 10 seconds:

  Source        Response Dest   Packet Statistics For     Only For Traffic
10.10.152.50    10.10.152.1     All Multicast Traffic     From 10.10.152.50
     |       __/  rtt 21   ms   Lost/Sent = Pct  Rate     To 239.192.1.1
     v      /     hop 653  ms   ---------------------     --------------------
10.10.152.1
10.99.1.1       ?
     |     ^      ttl   0
     v     |      hop 0    ms    0/2 = --%      0 pps    0/0 = --%  0 pps
10.99.1.2       ? Reached RP/Core
     |      \__   ttl   1
     v         \  hop -633 ms        0         0 pps           0    0 pps
10.1.40.70      10.10.152.1
  Receiver      Query Source

2 Replies

daxsmiddy
Level 1

I noticed late yesterday that for a lot of the "duplicate" packets, the TTL is incremented by one, but for a significant number of them, it is not.  They are EXACTLY duplicated when they arrive at the receiver.  Unsure what that might mean.
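
In case anyone wants to reproduce that check: a SPAN session mirroring the receiver's switch port, with the capture filtered on the group address, makes it easy to compare the TTL and IP ID of the duplicate copies. The interface names below are placeholders, not my actual ports:

(config)# monitor session 1 source interface GigabitEthernet1/0/10
(config)# monitor session 1 destination interface GigabitEthernet1/0/48

A Wireshark display filter of ip.dst == 239.192.1.1 then narrows the capture to this group.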

daxsmiddy
Level 1

This appears to have been resolved by changing the sg-expiry-timer.

The environment is many multicast sources to 1 receiver. Apparently the core switch on the far side was having difficulty keeping the mroute table updated.

(config)# ip pim sparse sg-expiry-timer 3600

The software vendor finally came through with information that this is "common" and that they've seen it before. I didn't run into any information online during my troubleshooting and testing where anyone suggested this as a cause of duplicated mcast packets. Best guess is that's because most environments are going to be few sources to many receivers.
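
For anyone who hits the same thing, confirming the change took effect is just a matter of checking the running config, and the (S,G) expiry timers shown in the mroute table should then count down from the new value rather than the default (roughly three minutes, as I understand it):

SHCORE01# show running-config | include sg-expiry-timer
SHCORE01# show ip mroute 239.192.1.1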