Hello all,
If a PIM DR router fails, the backup router takes over the PIM DR role after a PIM hello timeout and the IGMP querier role after the querier timeout.
The querier timeout is set to 120 seconds by default and cannot be configured lower than 60 seconds. That is a problem in our environment, because we have a requirement to limit traffic interruptions to at most 5 seconds if a router fails.
The setup is a traditional three-layer network design (core, distribution, access). Each distribution router is connected to one of the two core routers. The access switches are connected to both distribution routers. The link to the secondary router is blocked by spanning tree. The distribution routers are 6880-X-LE.
[CORE1] --- [DISTRI1] \     <- active PIM DR and IGMP querier
   |           |       [SWITCH]
[CORE2] --- [DISTRI2] /
After the primary router fails, the secondary router takes over the PIM DR role after about 3 seconds (depending on the PIM timers), but the receivers do not receive the traffic yet. The IGMP snooping forwarding port on the backup router still points to the primary router, which has now failed. Because the backup router's uplink to the access switch was blocked by spanning tree the whole time, no IGMP messages have been seen on this interface so far. The system therefore has to wait until the backup router takes over the querier role, sends out its queries, and receives the responses, so that IGMP snooping can add the uplink to its forwarding table. In the worst case this can take up to 60 seconds, because the querier timeout cannot be set lower than that.
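To make the timing above concrete, here is a rough Python sketch of the worst case. It is only a simplified model, not the platform's actual behavior: it assumes the backup sends a general query as soon as its querier timeout expires and that the receivers answer within the query max response time. The function names and the 1-second response time are made up for illustration; the 3 s, 60 s and 5 s values are the ones from my description.

```python
def worst_case_interruption(pim_dr_takeover_s, querier_timeout_s, max_response_time_s):
    """Upper bound (seconds) on the multicast outage after the primary fails.

    Simplified model (assumption): the backup queries immediately once the
    querier timeout expires, and reports arrive within the max response time.
    """
    # IGMP path: wait for the querier timeout, then for the receivers' reports.
    igmp_path_s = querier_timeout_s + max_response_time_s
    # DR takeover and querier takeover run in parallel; traffic resumes only
    # when the slower of the two has completed.
    return max(pim_dr_takeover_s, igmp_path_s)


def meets_requirement(interruption_s, limit_s=5.0):
    """True if the outage stays within the allowed limit (5 s in our case)."""
    return interruption_s <= limit_s


if __name__ == "__main__":
    # 3 s PIM DR takeover, 60 s minimum querier timeout,
    # 1 s max response time (hypothetical value).
    bound = worst_case_interruption(3.0, 60.0, 1.0)
    print(bound, meets_requirement(bound))
```

In this model the querier timeout dominates the bound, so tuning the PIM timers or the response time alone cannot bring the worst case under 5 seconds.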
I already tried lowering the query-interval and query-max-response-time values, but this only increases the probability that the failover happens faster. The failover time remains random, and in many cases it does not meet the requirement.
Is there any way to tune the system so that the IGMP querier failover happens faster?
Many thanks in advance!
Regards - Hakan