PIM Dense mode unexplainable problem

gianluca82 · ‎02-17-2012

Hi all,

I have two Cisco 3845 routers that receive a multicast stream via a tunnel interface, i.e. Tunnel163 (PIM Dense mode is enabled).

Both routers are connected to a LAN segment (FastEthernet0/1/0) where the receivers are.

I observe the following very strange behavior:

Router1# show ip mroute 224.100.6.163 output (extract):

(192.168.163.22, 224.100.6.163), 00:32:57/00:02:38, flags: T
  Incoming interface: Tunnel163, RPF nbr 173.1.163.2
  Outgoing interface list:
    FastEthernet0/1/0, Prune/Dense, 00:31:33/00:02:40, A

Router1# show ip pim neighbor fastEthernet 0/1/0 (extract)

Neighbor          Interface                Uptime/Expires       Ver   DR
Address                                                               Prio/Mode
100.1.6.252       FastEthernet0/1/0        00:23:21/850 msec    v2    1 / S P G

Router1#show ip igmp groups 224.100.6.163 (extract)

IGMP Connected Group Membership
Group Address    Interface            Uptime    Expires   Last Reporter   Group Accounted
224.100.6.163    FastEthernet0/1/0    5d01h     00:02:33  100.1.6.11

Router1 is the assert winner (highest IP address) and it sees the IGMP joins, yet it is pruning the interface.

It's really confusing to me. It happens sometimes and it lasts until I manually issue clear ip mroute *
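Since the condition is intermittent, one low-impact way to catch it without debugs is to compare saved CLI output offline. The following is a minimal sketch, assuming the two extracts quoted above have been captured as plain text; the function names and the sample strings are hypothetical and only mirror the format of those extracts.

```python
import re

# Hypothetical captured CLI text; the lines mirror the extracts quoted above.
MROUTE_TEXT = """\
(192.168.163.22, 224.100.6.163), 00:32:57/00:02:38, flags: T
  Incoming interface: Tunnel163, RPF nbr 173.1.163.2
  Outgoing interface list:
    FastEthernet0/1/0, Prune/Dense, 00:31:33/00:02:40, A
"""

IGMP_TEXT = """\
224.100.6.163 FastEthernet0/1/0 5d01h 00:02:33 100.1.6.11
"""

ENTRY_RE = re.compile(r"^\((\S+), (\S+)\),")
OIF_RE = re.compile(r"^\s*(\S+), (Forward|Prune)/(\S+?),")


def pruned_oifs(mroute_text):
    """Return {(source, group): [outgoing interfaces in Prune state]}."""
    result = {}
    current = None
    for line in mroute_text.splitlines():
        entry = ENTRY_RE.match(line.strip())
        if entry:
            current = (entry.group(1), entry.group(2))
            result[current] = []
            continue
        oif = OIF_RE.match(line)
        if oif and current and oif.group(2) == "Prune":
            result[current].append(oif.group(1))
    return result


def igmp_members(igmp_text):
    """Return {group: {interfaces}} from 'group interface uptime expires reporter' rows."""
    members = {}
    for line in igmp_text.splitlines():
        fields = line.split()
        if len(fields) >= 5 and fields[0][0].isdigit():
            members.setdefault(fields[0], set()).add(fields[1])
    return members


def suspicious(mroute_text, igmp_text):
    """List (source, group, interface) entries that are pruned despite IGMP members."""
    members = igmp_members(igmp_text)
    hits = []
    for (source, group), ifaces in pruned_oifs(mroute_text).items():
        for iface in ifaces:
            if iface in members.get(group, set()):
                hits.append((source, group, iface))
    return hits


print(suspicious(MROUTE_TEXT, IGMP_TEXT))
```

A non-empty result corresponds to the state described here: an outgoing interface in Prune state although a receiver is still reporting membership on it.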

Unfortunately I cannot migrate to Sparse Mode, so I have to fix this problem.

Any help is really appreciated.

rsimoni · ‎02-17-2012

Ciao Gianluca,

While this happens, is the twin router (100.1.6.252) also in prune state, or is it forwarding traffic downstream?

R

gianluca82 · ‎02-17-2012

Ciao,

the other router is also in prune state and thus there is no multicast traffic on the LAN (I checked with Wireshark).

Thanks for your interest,

Gianluca

Peter Paluch · ‎02-17-2012

Hi Gianluca,

When this issue occurs, what does the show ip igmp group command tell you when issued on both these routers?

Best regards,

Peter

gianluca82 · ‎02-17-2012

Hi,

please find the output of show ip igmp groups below.

Thanks for your interest,

Gianluca

Router1#show ip igmp groups 224.100.6.163 (extract)

IGMP Connected Group Membership
Group Address    Interface            Uptime    Expires   Last Reporter   Group Accounted
224.100.6.163    FastEthernet0/1/0    5d01h     00:02:33  100.1.6.11

Router2#show ip igmp groups 224.100.6.163 (extract)

IGMP Connected Group Membership
Group Address    Interface            Uptime    Expires   Last Reporter   Group Accounted
224.100.6.163    FastEthernet0/1/0    00:08:12  00:02:54  100.1.6.12

Peter Paluch · ‎02-17-2012

Hi Gianluca,

Thank you for your response. Hmmm... the group is subscribed indeed. I see only two logical reasons for an interface to be in a pruned state for a particular group:

- No active IGMP join state is on the interface (not your case here)
- Arrival of a PIM Prune message, either from Router2 or from some possibly illegitimate source

I am considering exploring the second option. If you temporarily disabled PIM on Router2 (if that is permissible) and cleared the mroute table on Router1, would the situation stabilize, and would the interface remain in the Forward state?

Also consider using the ip pim neighbor-filter on your interface to decrease the possibility of PIM spoofing attacks.

Riccardo, any further ideas on this?

Best regards,

Peter

rsimoni · ‎02-17-2012

I was thinking of enabling some debugs on that group to see if we can get something useful out of them.

debug ip pim 224.100.6.163

Also, I would check whether there is a known issue in the IOS release Gianluca is running. Gianluca, which release do you have on your c3845?

Riccardo

gianluca82 · ‎02-20-2012

Hi Peter and Riccardo,

thanks for your valuable help. Coming to your hints:

1) Even assuming that an illegitimate source is sending prune messages, wouldn't Router1 go on forwarding 224.100.6.163 traffic nevertheless because of local IGMP receivers?

2) Router1 and Router2 are quite loaded and I am a bit afraid about turning on debugging commands (this has already been cause of troubles in the past). I would prefer to keep this option as a last resort.

3) Current release is:

C3845-ADVIPSERVICESK9-M V15.0(1)M4 Release SW (fc1)

Ciao,

Gianluca

Peter Paluch · ‎02-20-2012

Hello Gianluca,

1) Even assuming that an illegitimate source is sending prune messages, wouldn't Router1 go on forwarding 224.100.6.163 traffic nevertheless because of local IGMP receivers?

Good question. If Router1 were not the Assert winner, then receiving a Prune message would cause it to stop forwarding the multicast stream despite its knowledge of the subscribed stations.

2) Router1 and Router2 are quite loaded and I am a bit afraid about turning on debugging commands (this has already been cause of troubles in the past). I would prefer to keep this option as a last resort.

I am afraid we are getting close to the last-resort options. I believe that the debug commands Riccardo suggested should not produce excessive output or load, and they will most probably give us some more hints about what is happening.

Best regards,

Peter

Peter Paluch · ‎02-23-2012

Hello Gianluca,

Any news in this matter?

Best regards,

Peter

gianluca82 · ‎02-25-2012

Hello Peter,

I have been analysing some sniffer captures on the LAN to which Router1 and Router2 are connected. I'm trying to check the exchange of PIM and OSPF packets to understand if something strange happens when the problem occurs. However, it is not an easy task.

I'm also waiting for Riccardo to let me know about any known issue with this IOS release. Finally, I'm monitoring the CPU load of the routers. I wonder if the problem is more likely to appear under stress conditions, but the average CPU load is around 40%, which I think should not be a critical value.

Ciao,

Gianluca

Peter Paluch · ‎02-25-2012

Hello Gianluca,

I have a feeling that Riccardo merely suggested that it would be reasonable to look for known issues - you can visit the bug toolkit yourself at http://cisco.com/go/bugs . Needless to say, though, I'll try to reach out to him and find out if he did some internal search on this issue.

Regarding your sniffing work - I can imagine that it is difficult. I hope that the sniffer traces will help us narrow down the cause of the problem, though.

Best regards,

Peter

gianluca82 · ‎03-01-2012

Hi again Peter,

actually I'm not entitled to use the Cisco bug toolkit (I guess you need SMARTNET support or similar). This is why I would be extremely grateful if someone could help with this check.

Gianluca

Peter Paluch · ‎03-05-2012

Hello Gianluca,

I have found a couple of bug reports that could theoretically pertain to this behavior; however, they all should have been fixed in the IOS version you have now, so this leaves me somewhat confused. Still, do you have the option of upgrading to a 15.1M IOS?

Did you arrive at any conclusion after analyzing your sniffer traces?

Best regards,

Peter

gianluca82 · ‎03-11-2012

Dear Peter,

I have some news indeed. I have isolated the problem and understood what it is related to (but I'm not sure yet if it is a bug or expected behavior). I have also reproduced it in a simulation scenario. My feeling is that two factors play a fundamental role:

- having PIM State-Refresh configured on all the routers

- having an equal-cost multi-path situation

Please refer to the diagram above, which is more or less the topology I have to deal with. C2 is the multicast source. R4 is the multicast IGMP receiver (actually it's a host, but it's represented as a router because that made it easier to set up the simulation: it just joins the multicast group and does not take part in PIM or OSPF). R1<->R2 and R1<->R3 have the same cost. Both LAN paths between R2 and R3 also have the same cost. Under normal conditions R3 is the assert winner (equal metric towards the source but higher IP address) and the multicast forwarder.
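The assert election mentioned here can be sketched as a comparison on (route preference, metric, IP address): the lowest preference wins, then the lowest metric, and the highest IP address breaks a tie. The following is a minimal illustration only; the preference of 110 (OSPF), the metrics, and the LAN addresses of R2 and R3 are hypothetical values, not taken from the simulation.

```python
from ipaddress import IPv4Address
from typing import NamedTuple


class AssertCandidate(NamedTuple):
    name: str
    preference: int        # administrative distance of the route to the source
    metric: int            # unicast metric towards the source
    address: IPv4Address   # address on the shared LAN


def assert_winner(candidates):
    """Lowest preference wins, then lowest metric, then highest IP address."""
    return max(
        candidates,
        key=lambda c: (-c.preference, -c.metric, int(c.address)),
    )


# Hypothetical values: OSPF distance 110, made-up metrics and addresses.
normal = [
    AssertCandidate("R2", 110, 20, IPv4Address("100.1.6.2")),
    AssertCandidate("R3", 110, 20, IPv4Address("100.1.6.3")),
]
link_down = [
    AssertCandidate("R2", 110, 20, IPv4Address("100.1.6.2")),
    AssertCandidate("R3", 110, 30, IPv4Address("100.1.6.3")),
]

print(assert_winner(normal).name)     # R3
print(assert_winner(link_down).name)  # R2
```

With equal metrics the higher address (R3) wins; once R3's metric towards the source gets worse, R2 wins regardless of address.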

Now, if I shut down the R1<->R3 connection, R2 becomes the assert winner (best metric towards the source, of course), but it remains in the pruned state (this is already strange). If I reactivate the R1<->R3 link, R3 becomes the assert winner again, but it also remains in the pruned state and no traffic is forwarded on SW2, at least until I manually issue clear ip mroute *.

Now let's come to the interesting thing. The subnet on SW3 is higher (100.1.8.0/24) than the subnet on SW2 (100.1.6.0/24). When R3 loses the direct connection to R1, it has two equal-cost paths towards C2, but it has to choose only one RPF interface. In this case it selects the neighbor with the higher IP address (as expected according to PIM behavior) and thus the interface attached to SW3. Amazingly enough, if I change the subnet on SW3 so that it is lower than the SW2 subnet (e.g. 100.1.5.0/24), there is no issue at all and everything works perfectly!
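The RPF tie-break described in this paragraph can be illustrated with a small sketch: among equal-cost paths, the neighbor with the numerically highest IP address is chosen. The interface labels and host addresses below are hypothetical; only the subnets come from the description above.

```python
from ipaddress import IPv4Address


def select_rpf(equal_cost_paths):
    """Pick the (interface, neighbor) pair whose neighbor address is numerically highest."""
    return max(equal_cost_paths, key=lambda path: IPv4Address(path[1]))


# Hypothetical interface labels and host addresses inside the subnets from the post.
failing_case = [("to-SW2", "100.1.6.2"), ("to-SW3", "100.1.8.2")]
working_case = [("to-SW2", "100.1.6.2"), ("to-SW3", "100.1.5.2")]

print(select_rpf(failing_case)[0])  # to-SW3
print(select_rpf(working_case)[0])  # to-SW2
```

This only models which interface becomes the RPF interface in the two addressing plans; it does not by itself explain why the pruned state persists in the first one.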

What's more, even if I disable PIM state refresh everywhere (without changing the subnets), there is no issue.

Sorry for the very long post. Things are a bit less obscure now, but still I don't clearly understand what happens. I can post the output of debug ip pim 224.100.6.163 if you think this can help!

Thanks,

Gianluca