HSRP is unstable

Erik Boon · 04-15-2013

Hello all,

I've got a problem that I hope someone can help me with. In our datacenter we've implemented HSRP on two 6500s for redundancy. Both switches are connected via a trunk. When an interface is administratively brought up, HSRP becomes unstable. Below is some selected logging:

12:58:01.759 CET: %HSRP-5-STATECHANGE: Vlan32 Grp 32 state Standby -> Active
12:58:01.919 CET: %HSRP-5-STATECHANGE: Vlan21 Grp 21 state Standby -> Active
12:58:02.031 CET: %HSRP-5-STATECHANGE: Vlan42 Grp 42 state Standby -> Active
12:58:02.031 CET: %HSRP-5-STATECHANGE: Vlan18 Grp 18 state Standby -> Active
12:58:02.223 CET: %HSRP-5-STATECHANGE: Vlan4 Grp 4 state Standby -> Active

12:58:04.331 CET: %HSRP-5-STATECHANGE: Vlan32 Grp 32 state Active -> Speak
12:58:04.667 CET: %HSRP-5-STATECHANGE: Vlan4 Grp 4 state Active -> Speak
12:58:04.847 CET: %HSRP-5-STATECHANGE: Vlan18 Grp 18 state Active -> Speak
12:58:05.367 CET: %HSRP-5-STATECHANGE: Vlan21 Grp 21 state Active -> Speak
12:58:07.767 CET: %HSRP-5-STATECHANGE: Vlan42 Grp 42 state Active -> Speak

12:58:14.855 CET: %HSRP-5-STATECHANGE: Vlan4 Grp 4 state Speak -> Standby
12:58:15.415 CET: %HSRP-5-STATECHANGE: Vlan21 Grp 21 state Speak -> Standby
12:58:15.479 CET: %HSRP-5-STATECHANGE: Vlan32 Grp 32 state Speak -> Standby
12:58:16.599 CET: %HSRP-5-STATECHANGE: Vlan18 Grp 18 state Speak -> Standby
12:58:18.536 CET: %HSRP-5-STATECHANGE: Vlan42 Grp 42 state Speak -> Standby

Basically, both switches become active and thus both forward traffic. After a few seconds everything is back to normal. It seems they are missing each other's hello messages, so the state change is the expected outcome in that case. What I can't figure out is the root cause. Since it is triggered by bringing up a random interface configured as a dot1q trunk, I'm thinking of STP limits. But the limits I found are 10,000 active STP logical ports and 1800 virtual ports per slot. In my case there are 2591 logical ports, and the virtual ports per slot are all below 1800. This suggests the switch should be capable of running this setup without a problem.
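One way to quantify the flap from the log above is to measure how long each group stayed Active on this switch before dropping back to Speak. The following is a minimal Python sketch, not a tool used in this thread; the function name `active_durations` and the regular expression are assumptions based only on the log layout shown in the post.

```python
import re
from datetime import datetime

# Log lines copied from the post (Speak -> Standby lines are ignored by the parser).
LOG = """\
12:58:01.759 CET: %HSRP-5-STATECHANGE: Vlan32 Grp 32 state Standby -> Active
12:58:01.919 CET: %HSRP-5-STATECHANGE: Vlan21 Grp 21 state Standby -> Active
12:58:02.031 CET: %HSRP-5-STATECHANGE: Vlan42 Grp 42 state Standby -> Active
12:58:02.031 CET: %HSRP-5-STATECHANGE: Vlan18 Grp 18 state Standby -> Active
12:58:02.223 CET: %HSRP-5-STATECHANGE: Vlan4 Grp 4 state Standby -> Active
12:58:04.331 CET: %HSRP-5-STATECHANGE: Vlan32 Grp 32 state Active -> Speak
12:58:04.667 CET: %HSRP-5-STATECHANGE: Vlan4 Grp 4 state Active -> Speak
12:58:04.847 CET: %HSRP-5-STATECHANGE: Vlan18 Grp 18 state Active -> Speak
12:58:05.367 CET: %HSRP-5-STATECHANGE: Vlan21 Grp 21 state Active -> Speak
12:58:07.767 CET: %HSRP-5-STATECHANGE: Vlan42 Grp 42 state Active -> Speak
"""

# Assumed layout: "<time> <tz>: %HSRP-5-STATECHANGE: <intf> Grp <n> state <old> -> <new>"
LINE_RE = re.compile(
    r"(?P<ts>\d{2}:\d{2}:\d{2}\.\d{3}) \w+: %HSRP-5-STATECHANGE: "
    r"(?P<intf>\S+) Grp (?P<grp>\d+) state (?P<old>\w+) -> (?P<new>\w+)"
)


def active_durations(log_text):
    """Return {interface: seconds spent in Active} for each completed flap."""
    became_active = {}
    durations = {}
    for line in log_text.splitlines():
        m = LINE_RE.search(line)
        if not m:
            continue
        ts = datetime.strptime(m["ts"], "%H:%M:%S.%f")
        if m["new"] == "Active":
            # Remember when this interface took over the Active role.
            became_active[m["intf"]] = ts
        elif m["old"] == "Active" and m["intf"] in became_active:
            # Leaving Active: record how long the role was held.
            delta = ts - became_active.pop(m["intf"])
            durations[m["intf"]] = round(delta.total_seconds(), 3)
    return durations


if __name__ == "__main__":
    for intf, secs in sorted(active_durations(LOG).items()):
        print(f"{intf}: Active for {secs:.3f} s")
```

On the log above, every group held the Active role for somewhere between about 2.4 and 5.7 seconds before stepping back.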

Some extra information:
- Sup 720 10GE
- Version 12.2(33)SXH2a
- No VSS used
- No drops on trunked interfaces between the two core switches
- 83 standby groups (max 256)
- R-PVST

I hope someone has some useful tips.

Thanks in advance!

InayathUlla Sharieff · 04-15-2013

Hi Erik,

Since the issue has currently cleared, we cannot look at exactly what was happening at that time.

The main thing to check:

HSRP flapping is mainly due to missed hellos from the active HSRP router.

A random, momentary loss of data communication between the peers is the most common problem that results in these messages. There are several possible causes for the loss of HSRP packets between the peers:

1. High CPU utilization (a frequent cause of HSRP state changes)
2. Physical layer problems
3. Excessive network traffic caused by spanning tree issues
4. Excessive traffic caused by each VLAN

These error messages describe a situation in which a standby HSRP router did not receive three successive HSRP hello packets from its HSRP peer. The output shows that the standby router moves from the standby state to the active state. Shortly thereafter, the router returns to the standby state. Unless this error message occurs during the initial installation, an HSRP issue probably does not cause the error message. The error messages signify the loss of HSRP hellos between the peers. When you troubleshoot this issue, you must verify the communication between the HSRP peers. A random, momentary loss of data communication between the peers is the most common problem that results in these messages. HSRP state changes are often due to high CPU utilization. If the error message is due to high CPU utilization, put a sniffer on the network and trace the system that causes the high CPU utilization.

There are several possible causes for the loss of HSRP packets between the peers. The most common problems are physical layer problems, excessive network traffic caused by spanning tree issues, or excessive traffic caused by each VLAN. As with Case Study #1, all the troubleshooting modules are applicable to the resolution of HSRP state changes, particularly the Layer 3 HSRP Debugging.

If the loss of HSRP packets between peers is due to excessive traffic caused by each VLAN as mentioned, you can tune or increase the SPD and hold queue sizes to overcome the input queue drop problem.

In order to increase the Selective Packet Discard (SPD) size, go to the configuration mode and execute these commands on the Cat6500 switches:

(config)# ip spd queue max-threshold 600

!--- Hidden Command

(config)# ip spd queue min-threshold 500

!--- Hidden Command

Note: Refer to Understanding Selective Packet Discard (SPD) for more information on the SPD.

In order to increase the hold queue size, go to the VLAN interface mode and execute this command:

(config-if)# hold-queue 500 in

After you increase the SPD and hold queue size, you can clear the interface counters with the 'clear counter interface' command and then watch whether the input queue drops return.
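To judge whether the hold queue on an SVI is actually overflowing before and after the change, you can read the drop counter from the "Input queue" line of the interface output. Below is a rough Python sketch for pulling those numbers out of captured text. The sample line is illustrative only, written in the usual IOS layout, and is not output from the switches in this thread; the function name `input_queue_stats` is hypothetical.

```python
import re

# Illustrative line in the usual IOS "show interfaces" layout.
# NOT taken from the switches discussed in this thread.
SAMPLE = "  Input queue: 12/75/340/0 (size/max/drops/flushes); Total output drops: 0"

QUEUE_RE = re.compile(
    r"Input queue: (\d+)/(\d+)/(\d+)/(\d+) \(size/max/drops/flushes\)"
)


def input_queue_stats(show_interface_text):
    """Return the input queue counters as a dict, or None if the line is absent."""
    m = QUEUE_RE.search(show_interface_text)
    if m is None:
        return None
    size, maximum, drops, flushes = map(int, m.groups())
    return {"size": size, "max": maximum, "drops": drops, "flushes": flushes}


if __name__ == "__main__":
    stats = input_queue_stats(SAMPLE)
    # A drop counter that keeps growing is the symptom the hold-queue change targets.
    print(stats)
```

Comparing the "drops" value from two captures taken a few minutes apart shows whether the counter is still incrementing.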

HTH

Regards

Inayath

*Please rate all useful posts.

Erik Boon · 04-16-2013

Hi Inayath,

Thanks for your great response!

Looking at the CPU, the utilization is 7%, with 20% at peak moments. I'm not 100% sure, though, because the shortest interval I can check for CPU utilization is a 1-minute average. The flapping happens within a few seconds, so the 1-minute average could be, for example, 20% while the CPU utilization was actually high for a few seconds. That said, I don't think CPU is the cause, given the normal utilization and the fact that the flapping is always triggered by bringing up an interface.
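To make the averaging concern concrete, here is a small worked calculation in Python. The 8-second spike at 100% is a made-up figure chosen only for illustration; the 7% baseline is the value reported above.

```python
def one_minute_average(spike_seconds, spike_pct, baseline_pct, window=60):
    """Average CPU utilization over `window` seconds containing one spike.

    spike_seconds: how long the spike lasts (hypothetical input)
    spike_pct:     utilization during the spike
    baseline_pct:  utilization for the rest of the window
    """
    if not 0 <= spike_seconds <= window:
        raise ValueError("spike must fit inside the averaging window")
    busy = spike_seconds * spike_pct
    idle = (window - spike_seconds) * baseline_pct
    return (busy + idle) / window


if __name__ == "__main__":
    # 8 s at 100% on top of a 7% baseline: (8*100 + 52*7) / 60
    print(round(one_minute_average(8, 100, 7), 1))
```

With these assumed numbers the 1-minute figure comes out at about 19.4%, so a reading near 20% cannot rule out a few seconds of full CPU load.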

I checked the physical layer following the document and all looks good. No errors whatsoever.

I will focus on STP and debugging and let you know the outcome (can only debug in a maintenance window).

As for your question below ("Do you still see the HSRP neighbours flapping on these devices?"):

The flapping is not continuous; most of the time it is triggered by bringing up an interface, as described.

paul driver · 04-15-2013

Hello

Can you post the output for the 6500 interfaces configured with HSRP, and also show standby?

Do you have HSRP timers configured (the defaults are hello 3 sec, hold time 10 sec)? If so, make sure they are not too low, otherwise HSRP may think its neighbour isn't active.
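As a rough illustration of why low timers matter, the Python sketch below counts how many consecutive hellos can be lost before the hold timer expires. The function names are hypothetical, and the "hold time at least three times the hello time" check is the commonly quoted rule of thumb, applied here as an assumption rather than a hard limit.

```python
import math


def missed_hellos_tolerated(hello_s, hold_s):
    """Count hellos that fall due strictly before the hold timer expires.

    Time zero is the last hello received; further hellos are due at
    hello_s, 2*hello_s, ... and the peer is declared down at hold_s.
    """
    if hello_s <= 0 or hold_s <= 0:
        raise ValueError("timers must be positive")
    return math.ceil(hold_s / hello_s) - 1


def timers_look_sane(hello_s, hold_s):
    """Rule-of-thumb check (an assumption): hold time >= 3 x hello time."""
    return hold_s >= 3 * hello_s


if __name__ == "__main__":
    # Default timers: hellos due at 3, 6 and 9 s, hold timer expires at 10 s.
    print(missed_hellos_tolerated(3, 10), timers_look_sane(3, 10))
```

With the defaults, three hellos in a row have to go missing before the standby takes over, so a takeover implies roughly 10 seconds without a hello from the active router.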

Regards

Paul

Please don't forget to rate any posts that have been helpful.

Thanks.


Erik Boon · 04-16-2013

Hi pdriver,

Below is the output for one standby group/interface. The timers are default.

Switch 1

Vlan32 - Group 32
State is Standby
    833 state changes, last state change 6d21h
Virtual IP address is 172.16.1.1
Active virtual MAC address is 0000.0c07.ac20
    Local virtual MAC address is 0000.0c07.ac20 (v1 default)
Hello time 3 sec, hold time 10 sec
    Next hello sent in 2.272 secs
Preemption enabled
Active router is 172.16.1.3, priority 110 (expires in 9.984 sec)
Standby router is local
Priority 90 (configured 90)
IP redundancy name is "hsrp-Vl32-32" (default)

interface Vlan32
ip vrf forwarding SERVE
ip address 172.16.1.2 255.255.255.0
no ip redirects
no ip unreachables
no ip proxy-arp
standby 32 ip 172.16.1.1
standby 32 priority 90
standby 32 preempt

Switch 2

Vlan32 - Group 32
State is Active
    7 state changes, last state change 3y29w
Virtual IP address is 172.16.1.1
Active virtual MAC address is 0000.0c07.ac20
    Local virtual MAC address is 0000.0c07.ac20 (v1 default)
Hello time 3 sec, hold time 10 sec
    Next hello sent in 2.352 secs
Preemption enabled
Active router is local
Standby router is 172.16.1.2, priority 90 (expires in 11.696 sec)
Priority 110 (configured 110)
IP redundancy name is "hsrp-Vl32-32" (default)

interface Vlan32
ip vrf forwarding SERVE
ip address 172.16.1.3 255.255.255.0
no ip redirects
no ip unreachables
no ip proxy-arp
standby 32 ip 172.16.1.1
standby 32 priority 110
standby 32 preempt
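The two outputs can be compared side by side with a short Python sketch like the one below. It is not part of the original troubleshooting; the function name `summarize` and the regular expressions are assumptions based only on the lines pasted above (trimmed here to the fields being parsed).

```python
import re

# Relevant lines copied from the two outputs above.
SWITCH1 = """\
Vlan32 - Group 32
State is Standby
    833 state changes, last state change 6d21h
Hello time 3 sec, hold time 10 sec
Priority 90 (configured 90)
"""

SWITCH2 = """\
Vlan32 - Group 32
State is Active
    7 state changes, last state change 3y29w
Hello time 3 sec, hold time 10 sec
Priority 110 (configured 110)
"""


def summarize(show_standby_text):
    """Extract state, state-change count, timers and priority for one group."""
    state = re.search(r"State is (\w+)", show_standby_text).group(1)
    changes = re.search(r"(\d+) state changes", show_standby_text).group(1)
    hello, hold = re.search(
        r"Hello time (\d+) sec, hold time (\d+) sec", show_standby_text
    ).groups()
    priority = re.search(r"Priority (\d+)", show_standby_text).group(1)
    return {
        "state": state,
        "state_changes": int(changes),
        "hello": int(hello),
        "hold": int(hold),
        "priority": int(priority),
    }


if __name__ == "__main__":
    for name, text in (("Switch 1", SWITCH1), ("Switch 2", SWITCH2)):
        print(name, summarize(text))
```

What stands out is 833 state changes on Switch 1 against 7 on Switch 2, whose last change was 3y29w ago. That would suggest Switch 2 never leaves Active during a flap and only Switch 1 stops hearing hellos, although this is an inference from the counters and not something confirmed in the thread.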

InayathUlla Sharieff · 04-16-2013

Hi Erik,

Do you still see the HSRP neighbours flapping on these devices?

Regards

Inayath