11-07-2011 01:16 PM - edited 03-07-2019 03:15 AM
We run a college campus with 4000+ clients and two core internal 7206VXRs set up to load balance each VLAN's default gateway between them via GLBP. We use weighted load balancing, as both routers are essentially identical in their specs and connections to the core. Our core switching is actually a stack of ether-channelled 3560Gs (sad, I know), and our fiber distribution plant is a set of redundant 3750G stacks that run dual fiber links out to each network closet. The closets are all Layer 2. For the most part GLBP does exactly what I want: it evenly balances traffic between the two core routers, and if one dies (tested) the other picks up the slack seamlessly. However, on occasion I see a massive number of GLBP state changes where the routers change state from Active to Listen repeatedly.
For example:
Nov 2 09:53:24 172.16.0.2 4123445: Nov 2 09:52:10: %GLBP-6-FWDSTATECHANGE: GigabitEthernet0/3.254 Grp 254 Fwd 1 state Listen -> Active
Nov 2 09:53:25 172.16.0.2 4123448: Nov 2 09:52:10: %GLBP-6-FWDSTATECHANGE: GigabitEthernet0/3.254 Grp 254 Fwd 1 state Active -> Listen
======================
Sometimes this occurs on multiple sub-interfaces at the same time, causing some pretty hefty router overhead. Here is how that particular sub-interface is configured on both routers:
Router 1:
interface GigabitEthernet0/3.254
encapsulation dot1Q 254
ip address 192.168.254.5 255.255.255.0
ip nbar protocol-discovery
ip flow ingress
ip pim sparse-dense-mode
glbp 254 ip 192.168.254.1
glbp 254 timers msec 250 msec 750
glbp 254 priority 150
glbp 254 preempt delay minimum 180
glbp 254 load-balancing weighted
glbp 254 authentication text ******
!
Router 2:
interface GigabitEthernet0/3.254
encapsulation dot1Q 254
ip address 192.168.254.7 255.255.255.0
ip nbar protocol-discovery
ip flow ingress
ip pim sparse-dense-mode
glbp 254 ip 192.168.254.1
glbp 254 timers msec 250 msec 750
glbp 254 priority 140
glbp 254 preempt delay minimum 180
glbp 254 load-balancing weighted
glbp 254 authentication text *****
!
====================================
All of our sub-interfaces are set up the same way. Does anything stand out as dead wrong? I don't have many peers that run full Cisco shops with GLBP implemented... Is the high frequency of GLBP events normal? Any advice would be great, even further reading or training on campus design (aside from the generic Cisco Campus HA docs and First Hop Redundancy docs; I've been through all of them). Real-world info on large-scale GLBP deployments is very hard to find.
Thanks,
Jim Phillips
Network and Communications Support Technologist
Cambrian College
11-09-2011 07:33 PM
Hi Jim,
The GLBP config looks fine. As you noted, it usually works fine and then randomly starts to flap (from Active to Listen). That means GLBP hellos are being missed on the way between the two routers. Those hellos should use the L2 path between the routers, which I suppose runs through those 3750 switches. So there are several possible reasons for hellos being missed:
- High CPU on one of the switches in the path causing hellos to be dropped
- High CPU on one of the routers causing hellos to be dropped
- Uni-directional link occurrences dropping some traffic occasionally
- STP instability can also affect this
So if it recurs, I would first check these possibilities to see whether any of them correlates with the flapping, and if one does, troubleshoot that.
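To look for that kind of correlation, it can help to bucket the state changes by time and interface and then compare the busy minutes against CPU, UDLD, or STP events from the same period. A minimal sketch in Python, assuming the syslog lines look like the two quoted in the original post; the function name `count_flaps` and the per-minute bucketing are illustrative choices, not part of any Cisco tool:

```python
import re
from collections import Counter

# Matches the router-side timestamp and the GLBP forwarder state change,
# based only on the two sample lines quoted in this thread.
PATTERN = re.compile(
    r"(?P<stamp>\w{3}\s+\d+\s+\d\d:\d\d):\d\d:\s+%GLBP-6-FWDSTATECHANGE:\s+"
    r"(?P<intf>\S+)\s+Grp\s+(?P<grp>\d+)\s+Fwd\s+(?P<fwd>\d+)\s+"
    r"state\s+(?P<old>\w+)\s+->\s+(?P<new>\w+)"
)


def count_flaps(lines):
    """Count forwarder state changes per (minute, interface)."""
    counts = Counter()
    for line in lines:
        match = PATTERN.search(line)
        if match:
            counts[(match.group("stamp"), match.group("intf"))] += 1
    return counts


sample = [
    "Nov 2 09:53:24 172.16.0.2 4123445: Nov 2 09:52:10: "
    "%GLBP-6-FWDSTATECHANGE: GigabitEthernet0/3.254 Grp 254 Fwd 1 "
    "state Listen -> Active",
    "Nov 2 09:53:25 172.16.0.2 4123448: Nov 2 09:52:10: "
    "%GLBP-6-FWDSTATECHANGE: GigabitEthernet0/3.254 Grp 254 Fwd 1 "
    "state Active -> Listen",
]

flaps = count_flaps(sample)
for (minute, intf), n in sorted(flaps.items()):
    print(minute, intf, n)
```

Minutes with many changes across several sub-interfaces at once would point toward something shared (router CPU, a trunk, STP) rather than a single VLAN.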
Hope this helps,
Nik
11-10-2011 07:13 AM
Thanks for the reply,
The 7206VXRs tie into several ether-channelled 3560Gs which make up our core switch stack. The 3750s act as our fiber distribution (and also our STP root); the routers do not have direct connections to the 3750s.
I've also considered the possibility of dropped GLBP hellos. The configuration above has the hello timer tuned down to 250 milliseconds and the hold timer to 750 milliseconds. I am considering restoring the default timers of 3 and 10 seconds respectively, to reduce the chance of a missed hello causing multiple GLBP state changes.
Does anyone else have any weird experiences with GLBP tuning / state changes?
11-10-2011 09:50 PM
Hi Jim,
One more recommendation: it is always good to make the hold timer longer than 3 hellos, e.g. Hold = 3*Hello + Y, where Y is some extra interval.
Imagine this situation. We trigger a GLBP state change whenever we lose 3 hellos.
One router receives the 1st hello at time 0. The next hello is expected one hello time later (250 ms in your case), but the L2 switch also adds some variable delay to the delivery, so the actual arrival time of the next hello would be "250 + X", where X is that variable delay.
Thus if 2 hellos were dropped on the way, the router might already have been waiting for 250 ms + X1 + 250 ms + X2, and the 3rd hello will arrive no earlier than 250 ms + X3 after that. So there is a good chance of the hold timer firing even if the 3rd hello is delivered successfully, as the hold time is 750 ms and the 3rd hello can be delivered at "750 + X1 + X2 + X3".
Those X values are stochastic, but I have really seen these problems in practice. This is actually why the default hello is 3 sec but the default hold time is 3*hello + 1 sec, which is 10 sec.
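The arithmetic can be checked with a small Python sketch. This is a simplified model and an assumption on my part, not Cisco's implementation: each hello is delayed independently rather than the delays accumulating, and the hold timer restarts whenever a hello arrives. The function name `hold_timer_expires` and the 5 ms delay are hypothetical values chosen for illustration.

```python
def hold_timer_expires(hello_ms, hold_ms, lost, last_delay_ms, next_delay_ms):
    """Return True if the hold timer fires before the next hello arrives.

    Assumed model: the peer sends hello k at k * hello_ms, the last hello
    received was number 0, and the following `lost` hellos are dropped.
    """
    # Hello 0 was sent at t=0 and arrived after last_delay_ms.
    last_arrival = last_delay_ms
    # The next hello to get through is number lost + 1.
    next_arrival = (lost + 1) * hello_ms + next_delay_ms
    # The timer fires if the gap between arrivals exceeds the hold time.
    return next_arrival - last_arrival > hold_ms


# Timers from the posted config: two lost hellos plus 5 ms of extra delay.
print(hold_timer_expires(250, 750, lost=2, last_delay_ms=0, next_delay_ms=5))

# Same loss with Hold = 3*Hello + Y, using a hypothetical Y of 250 ms.
print(hold_timer_expires(250, 1000, lost=2, last_delay_ms=0, next_delay_ms=5))

# Default timers (3 s hello, 10 s hold) under the same loss and delay.
print(hold_timer_expires(3000, 10000, lost=2, last_delay_ms=0, next_delay_ms=5))
```

Under this model, with hold set to exactly 3 hellos, two lost hellos plus any extra delay on the third are enough to expire the timer, while any positive Y larger than the delay difference absorbs it.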
Hope this helps.
Nik