HSRP-Issue: Both Routers Active

mullzkBern_2
Level 1

I have a strange issue with HSRP on my Nexus 7000s that results in an Active/Active state.
Does anyone see where the problem lies, or where I should look next?

Thx in advance and Greetings from Berne,
Stefan Mueller

Layout

  • 2 Nexus 7000 with NX-OS 5.1(3) as distribution switches, with all the access switches attached to each Nexus and bundled with vPC.
  • The N7Ks provide L3 with SVIs on 49 VLANs. Nexus1 always takes the IP x.11, Nexus2 takes x.12. The default gateway is x.10, provided via HSRP. 48 VLANs work fine. 1 VLAN (with identical configuration) has a problem:

Issue

  • Both Nexus think that they are HSRP Active on Vl 783. Standby-Router is unknown.

Config-Snippet Nexus 1

interface Vlan783
  ip address 10.34.195.11/25
  ip router eigrp 41
  ip passive-interface eigrp 41
  hsrp 1
    authentication text somethingelse
    preempt
    priority 150
    timers msec 300 msec 1000
    ip 10.34.195.10
  no shutdown

Config-Snippet Nexus 2

interface Vlan783
  ip address 10.34.195.12/25
  ip router eigrp 41
  ip passive-interface eigrp 41
  hsrp 1
    authentication text somethingelse
    preempt
    priority 130
    timers msec 300 msec 1000
    ip 10.34.195.10
  no shutdown

debug hsrp engine packet hello interface vlan 783

=> On N2 (which should be Standby, IP .12), only the following lines repeat:

2011 Oct 11 16:58:36.880624 hsrp: Vlan783[1/V4]: Hello out Active pri 130 ip 10.34.195.10
2011 Oct 11 16:58:36.880651 hsrp: Vlan783[1/V4]: hel 0 hol 0 auth somethingelse
2011 Oct 11 16:58:37.184802 hsrp: Vlan783[1/V4]: Hello out Active pri 130 ip 10.34.195.10
2011 Oct 11 16:58:37.184827 hsrp: Vlan783[1/V4]: hel 0 hol 0 auth somethingelse

=> On N1 (which should be Active, IP .11), I receive two Hellos for each Hello sent:

2011 Oct 11 17:07:56.405711 hsrp: Vlan783[1/V4]: Hello out Active pri 150 ip 10.34.195.10
2011 Oct 11 17:07:56.405735 hsrp: Vlan783[1/V4]: hel 0 hol 0 auth somethingelse
2011 Oct 11 17:07:56.491349 hsrp: Vlan783[1/V4]: Hello in from 10.34.195.12 State Active pri 130 ip 10.34.195.10
2011 Oct 11 17:07:56.491450 hsrp: Vlan783[1/V4]: hel 0 hol 0 auth somethingelse
2011 Oct 11 17:07:56.491546 hsrp: Vlan783[1/V4]: Hello in from 10.34.195.12 State Active pri 130 ip 10.34.195.10
2011 Oct 11 17:07:56.491559 hsrp: Vlan783[1/V4]: hel 0 hol 0 auth somethingelse
2011 Oct 11 17:07:56.705691 hsrp: Vlan783[1/V4]: Hello out Active pri 150 ip 10.34.195.10
2011 Oct 11 17:07:56.705715 hsrp: Vlan783[1/V4]: hel 0 hol 0 auth somethingelse
2011 Oct 11 17:07:56.791414 hsrp: Vlan783[1/V4]: Hello in from 10.34.195.12 State Active pri 130 ip 10.34.195.10
2011 Oct 11 17:07:56.791437 hsrp: Vlan783[1/V4]: hel 0 hol 0 auth somethingelse
2011 Oct 11 17:07:56.791532 hsrp: Vlan783[1/V4]: Hello in from 10.34.195.12 State Active pri 130 ip 10.34.195.10
2011 Oct 11 17:07:56.791546 hsrp: Vlan783[1/V4]: hel 0 hol 0 auth somethingelse
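
For anyone comparing captures like these from both peers, here is a minimal Python sketch that tallies hellos by direction and sender. It assumes the debug output has been saved as plain text and that every hello line follows the exact format shown above; the function name `tally_hellos` and the label `"local"` for outgoing hellos are illustrative choices, not part of any Cisco tooling.

```python
import re
from collections import Counter

# Matches the first line of each hello pair in the debug output above.
# Outgoing hellos have no "from <ip> State" part, so that group is optional.
HELLO_RE = re.compile(
    r"hsrp: (?P<intf>\S+)\[(?P<group>\d+)/V4\]: Hello (?P<direction>in|out)"
    r"(?: from (?P<peer>\S+) State)? (?P<state>\w+) pri (?P<pri>\d+)"
)


def tally_hellos(debug_text):
    """Count hello lines per (direction, sender) in saved debug output.

    Outgoing hellos are counted under the hypothetical sender label "local".
    """
    counts = Counter()
    for line in debug_text.splitlines():
        match = HELLO_RE.search(line)
        if match:
            sender = match.group("peer") or "local"
            counts[(match.group("direction"), sender)] += 1
    return counts
```

Run against the N2 capture, a tally like this would contain only `("out", "local")` entries, which is the asymmetry described in the post: N2 sends hellos but logs none coming in.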

Further Observations:

  • sh ip arp: N1 sees the SVI address of N2 and vice versa. Both of course have an ARP entry for the HSRP address.
  • sh mac add: N1 sees the N2 SVI MAC on the vPC peer-link and vice versa.
  • Both N1 and N2 can ping all involved addresses 10.34.195.10, 10.34.195.11 and 10.34.195.12 (and all host addresses as well).
  • Earlier this morning, N1 could not ping the SVI of N2 and vice versa, although they could see each other in the MAC address table (I don't remember about the ARP table). This also caused issues for end-host traffic, notably DHCP. I then deleted HSRP group 1 and created HSRP group 2 without authentication and with default timers. This led to the same situation as above (ping possible, HSRP active on both), so I changed back to our standard configuration.
  • The VLAN worked until at least three weeks ago. We are not aware of any relevant changes since then (we did attach more access switches via vPC uplinks, though).

Accepted Solution

mullzkBern_2
Level 1

Hi there

As expected, the problem was not HSRP but vPC. As seen in the debug above, one Nexus reported sending the HSRP packets, but the other one did not receive them. In fact, the primary Nexus did not send anything over the vPC link in one specific VLAN, although the VLAN was allowed on the trunk.

Under the hood, it seems to be related to bug CSCti95293, though this bug has quite different symptoms (but this is what TAC and huge show-tech files are for). What is funny is that the bug was resolved in 5.0(5) and 5.1(1), and we run 5.1(3). The reason seems to be that we upgraded from 5.0(3) via ISSU and the bug survived this soft update. A reboot of the system would probably have cleared it. However...

We could resolve the problem fairly easily by removing the VLAN from the vPC trunk and re-adding it (interface port-channel101; switchport trunk allowed vlan remove 783; switchport trunk allowed vlan all). Since then, the VLAN is forwarded on the vPC trunk, and the secondary Nexus receives the HSRP packets from the primary and changes its state to HSRP standby. Everything is fine again.

Greetings from Berne

Stefan

37 Replies

lffrwatson
Level 1

I know it might sound simple, but did you confirm the authentication strings match on both devices once you recreated the HSRP groups again?

Hi Iffrwatson

Just confirmed the strings again, they match.

mullzkBern_2
Level 1

Addendum: I had suspected that we hit some internal limit, just as when we created the 65th VLAN and the 2950 switches dropped Spanning Tree from a random existing VLAN because they could handle only 64 STP instances.

But that is not the case here: We created another new Vlan today, and HSRP is working perfectly there.

Hi Stefan,

I know this may sound strange, but can you change the IP address on one of the VLAN interfaces to 10.34.195.10 and check whether the issue persists?

PS: This is just a hit-and-trial troubleshooting suggestion.

Regards,

Smitesh

Hi Smitesh

Changing the SVI IP to 10.34.195.10 does not work, as this address is already configured as the HSRP address. And if I suspend the HSRP IP in order to change the SVI IP to .10, HSRP works even less.

Another troubleshooting step: I added another group, HSRP 2, with just the IP 10.34.195.9 and no other custom configuration. This resulted in the same symptoms. So the problem is not the HSRP authentication, timers, etc.

The important thing seems to be the debug: the Active router has Hello in + out, while the Standby router has only Hello out. It does not see any Hello in and therefore changes to HSRP active. The question is why the HSRP Hellos from one of two routers on one of many VLANs get lost in transit. I suspect this is not really an HSRP problem but a vPC problem.
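
The reasoning in that last paragraph can be sketched as a small Python model. This is a deliberate simplification for a two-router group with preempt enabled, not the full HSRP state machine: the function name `expected_role` and its inputs are hypothetical, and it assumes the standard tie-break where the higher interface IP wins on equal priority.

```python
import ipaddress


def expected_role(own_priority, own_ip, heard_hellos):
    """Return the role a router should settle into in a two-router group.

    heard_hellos is a list of (priority, source_ip) tuples for hellos
    actually received within the hold time. Simplified model: a router
    that hears a better peer becomes Standby, otherwise it is Active.
    """
    own_key = (own_priority, ipaddress.ip_address(own_ip))
    for priority, source_ip in heard_hellos:
        if (priority, ipaddress.ip_address(source_ip)) > own_key:
            return "Standby"
    # No better peer heard (or no peer heard at all): claim Active.
    return "Active"


# N1 hears N2's lower-priority hellos and stays Active.
n1_role = expected_role(150, "10.34.195.11", [(130, "10.34.195.12")])
# N2 hears nothing, so it also claims Active: the split-brain seen here.
n2_role = expected_role(130, "10.34.195.12", [])
```

In this model the only way both routers end up Active is that the lower-priority one receives no hellos, which points at the path between the peers and not at the HSRP settings.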

Anyhow, I am opening a TAC-Case with it...

Greetings from Switzerland

Stefan Mueller

Hi,

Have you tried changing to HSRP version 2? You are using subsecond timers, and with version 1 that sometimes doesn't work.

Regards.

Alain.

Don't forget to rate helpful posts.

Thanks for the hint, but changing to HSRP v2 did not help.

Hi Stefan,

Also, can you verify that the port on the switch connected opposite this router is in the same VLAN?

Regards,

Smitesh

nicholsm
Cisco Employee

As a side note, aggressive timers for HSRP are not required for vlans mapped to the peer-link in vPC configurations as both switches are actively forwarding in hardware for the virtual ip.

Try setting timers to default, remove authentication text (the default authentication password should be used as per http://www.cisco.com/en/US/docs/switches/datacenter/sw/5_x/nx-os/unicast/configuration/guide/l3_hsrp.html#wp1509606) and collect debugs from both sides and post them here.

Sent from Cisco Technical Support iPad App

Hi Matthew

Thanks for the hint with the default Timers, but the problem did not lie there. Here is the Debug with default Timers and Passwords:

Active Router

n91005-BGUSER# debug hsrp engine packet hello interface vl 783
n91005-BGUSER# 2011 Oct 21 08:35:11.096811 hsrp: Vlan783[1/V4]: Hello in from 10.34.195.12 State Active pri 130 ip 10.34.195.10 
2011 Oct 21 08:35:11.096834 hsrp: Vlan783[1/V4]: hel 3 hol 10 auth cisco 
2011 Oct 21 08:35:11.096899 hsrp: Vlan783[1/V4]: Hello in from 10.34.195.12 State Active pri 130 ip 10.34.195.10 
2011 Oct 21 08:35:11.096912 hsrp: Vlan783[1/V4]: hel 3 hol 10 auth cisco 
2011 Oct 21 08:35:13.255665 hsrp: Vlan783[1/V4]: Hello out Active pri 150 ip 10.34.195.10 
2011 Oct 21 08:35:13.255689 hsrp: Vlan783[1/V4]: hel 3 hol 10 auth cisco 
2011 Oct 21 08:35:14.096805 hsrp: Vlan783[1/V4]: Hello in from 10.34.195.12 State Active pri 130 ip 10.34.195.10 
2011 Oct 21 08:35:14.096828 hsrp: Vlan783[1/V4]: hel 3 hol 10 auth cisco 
2011 Oct 21 08:35:14.096893 hsrp: Vlan783[1/V4]: Hello in from 10.34.195.12 State Active pri 130 ip 10.34.195.10 
2011 Oct 21 08:35:14.096906 hsrp: Vlan783[1/V4]: hel 3 hol 10 auth cisco 
2011 Oct 21 08:35:16.255611 hsrp: Vlan783[1/V4]: Hello out Active pri 150 ip 10.34.195.10 
2011 Oct 21 08:35:16.255793 hsrp: Vlan783[1/V4]: hel 3 hol 10 auth cisco 
2011 Oct 21 08:35:17.078768 hsrp: Vlan783[1/V4]: Hello in from 10.34.195.12 State Active pri 130 ip 10.34.195.10 
2011 Oct 21 08:35:17.078794 hsrp: Vlan783[1/V4]: hel 3 hol 10 auth cisco 

=> The Nexus with HSRP Prio 150 sees the packets from the Nexus with HSRP Prio 130 and therefore goes to HSRP active

Passive Router

n91006-BGUSER# debug hsrp engine packet hello interface vl 783
n91006-BGUSER# 2011 Oct 21 08:35:14.096183 hsrp: Vlan783[1/V4]: Hello out Active pri 130 ip 10.34.195.10 
2011 Oct 21 08:35:14.096223 hsrp: Vlan783[1/V4]: hel 3 hol 10 auth cisco 
2011 Oct 21 08:35:17.078017 hsrp: Vlan783[1/V4]: Hello out Active pri 130 ip 10.34.195.10 
2011 Oct 21 08:35:17.078040 hsrp: Vlan783[1/V4]: hel 3 hol 10 auth cisco 
2011 Oct 21 08:35:20.078013 hsrp: Vlan783[1/V4]: Hello out Active pri 130 ip 10.34.195.10 
2011 Oct 21 08:35:20.078038 hsrp: Vlan783[1/V4]: hel 3 hol 10 auth cisco 
2011 Oct 21 08:35:23.088461 hsrp: Vlan783[1/V4]: Hello out Active pri 130 ip 10.34.195.10 
2011 Oct 21 08:35:23.088487 hsrp: Vlan783[1/V4]: hel 3 hol 10 auth cisco 

=> The Nexus with HSRP Prio 130 sees no packets from the Nexus with HSRP Prio 150 and therefore goes to HSRP active as well

Possible Conclusions

I see three possibilities:

1. n91005 (Active in design) thinks that it sends the HSRP packets, but does not really do so. I can't see any way to troubleshoot this scenario.

2. The L2 communication between the two Nexus on this VLAN is faulty: the Active's HSRP packets are sent but not received.

3. n91006 (Passive in design) receives the Hellos, but does not classify them as HSRP traffic and therefore ignores them. Is there any way to rule out this possibility?

The L2-Connection between two Nexus is my suspect #1, but there is neither real evidence nor a solution:

- The Vlan is setup exactly as all the other, troublefree Vlans.

- Both Nexus see the other one in the ARP-Table

- Both Nexus see the other one in the MAC table over the vPC peer-link:

n91006-BGUSER# sh mac addr vl 783
Legend: 
        * - primary entry, G - Gateway MAC, (R) - Routed MAC, O - Overlay MAC
        age - seconds since last seen,+ - primary entry using vPC Peer-Link
   VLAN     MAC Address      Type      age     Secure NTFY Ports/SWID.SSID.LID
---------+-----------------+--------+---------+------+----+------------------
G 783      0000.0c07.ac01    static       -       F    F  sup-eth1(R)
G 783      0026.980b.8c42    static       -       F    F  sup-eth1(R)
* 783      18ef.63e2.cc42    static       -       F    F  vPC Peer-Link

The last entry is the MAC address of the Active-per-design Nexus.

- We see no other L2-Problems in this Vlan.

- Why then suspect the VLAN? The VLAN is configured in VTP (both Nexus are VTP servers), and we have had our share of VTP issues on the Nexus 7000 (although on another release).

- The VLAN is mapped to vPC, and we have had our share of vPC issues (although in another design).

- One week ago, the Nexus had each other in the MAC table but not in ARP, and therefore could not ping each other. The issue was resolved by changing and reconfiguring HSRP, but something obviously was wrong.

Thanks for the data.

There are ways to troubleshoot this in the system, specifically running debugs for netstack and using ELAM on both the bridge ASIC and the forwarding engine on the egress linecard to verify whether the hellos are sent from the CPU, and using the same tools on the switch that appears not to be seeing them.

Can you ping the SVI from one node to the other? That would start to rule out the L2 connectivity. In vPC, MAC address learning is disabled on the peer-link, and CFSoE is relied on to program the MACs. This is true for the SVI MACs as well, as it looks like you have peer-gateway configured. Nonetheless, HSRP v1 hellos are sent to 224.0.0.2 and should be flooded to the VLAN. One way to check this is to sniff on any member port of VLAN 783; you should see hellos from both peers.
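
When checking a sniffer capture for those hellos, it can help to decode the HSRP payload directly. The following Python sketch decodes the 20-byte HSRP version 1 message as laid out in RFC 2281 (sent over UDP port 1985 to 224.0.0.2). It assumes you have already extracted the raw UDP payload from the capture; the function name `parse_hsrp_v1` and the returned dictionary keys are illustrative.

```python
import socket
import struct

HSRP_V1_GROUP = "224.0.0.2"  # all-routers multicast address used by HSRP v1
HSRP_UDP_PORT = 1985

OPCODES = {0: "Hello", 1: "Coup", 2: "Resign"}
STATES = {0: "Initial", 1: "Learn", 2: "Listen",
          4: "Speak", 8: "Standby", 16: "Active"}


def parse_hsrp_v1(payload):
    """Decode an HSRP v1 UDP payload (RFC 2281 layout) into a dict."""
    if len(payload) < 20:
        raise ValueError("HSRP v1 message must be at least 20 bytes")
    # 8 single-byte fields, 8 bytes of authentication data, 4-byte virtual IP.
    (version, opcode, state, hello, hold, priority, group, _reserved,
     auth, vip) = struct.unpack("!8B8s4s", payload[:20])
    return {
        "version": version,
        "opcode": OPCODES.get(opcode, opcode),
        "state": STATES.get(state, state),
        "hello": hello,
        "hold": hold,
        "priority": priority,
        "group": group,
        "auth": auth.rstrip(b"\x00").decode("ascii", "replace"),
        "virtual_ip": socket.inet_ntoa(vip),
    }
```

A hello from each peer with state Active and the same group and virtual IP would confirm on the wire what the debugs suggest. Note that seeing both hellos on an access port does not by itself show that they cross the peer-link.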

Another question I have (forgive me if you stated this previously): are you seeing this behavior on other VLANs, i.e. where both nodes think they are active for the group?

If you have not done so already, might be time to open a TAC case and get some help, as this kind of troubleshooting could be too involved for the forum. I've been looking for a bug that looks familiar in 5.1(3) but can't seem to find one at present. More data will be required.

Thanks,

Matt

Hi Matt

- Pinging SVI from one node to another: Successful

- Sniffing on member Port of Vlan 783: I see hellos from both peers.

- Behaviour exists only in one Vlan - all others (with the same configuration) work fine.

- A TAC case has been open since yesterday. I too think that this is going too deep for the forum, but I would like to thank you very much for your input. If we find a reason or solution with TAC, I will post it here.

Greetings from Bern,

Stefan

Ven Taylor
Level 4

I remember seeing this a long time ago.

It sounds like you've got a typical "V" design where the two HSRP peers see one another through an access switch.

We use this too.  When we had this problem, it boiled down to no L2 path between the HSRP peers.

This happened to us if we had a trunking problem between one of our core switches and our access switches.

It would also happen if we didn't have the vlan created at the access switch.

Check that particular vlan and make sure it is configured everywhere it should be configured.

Ven

Ven Taylor

Hi Ven

I too suspect the L2-path, but I can't pin it down to some specific problem. The Direct Connection between the Distribution Switches seems to be ok, and there is no hint, that the traffic makes a detour to an Access Switch...
