06-04-2012 04:36 PM - edited 03-01-2019 04:49 PM
vPC Auto-Recovery Feature in Nexus 7000.
There were two main reasons for this vPC enhancement:
These two enhancements were merged into a single feature, called vPC auto-recovery, starting from release 5.2(1).
Configuring auto-recovery is straightforward.
You only need to configure auto-recovery under the vPC domain on both vPC peers.
For example:
On Switch S1
S1(config)# vpc domain 1
S1(config-vpc-domain)# auto-recovery
S1# show vpc
Legend:
(*) - local vPC is down, forwarding via vPC peer-link
vPC domain id : 1
Peer status : peer adjacency formed ok
vPC keep-alive status : peer is alive
Configuration consistency status : success
Per-vlan consistency status : success
Type-2 consistency status : success
vPC role : primary
Number of vPCs configured : 5
Peer Gateway : Enabled
Peer gateway excluded VLANs : -
Dual-active excluded VLANs : -
Graceful Consistency Check : Enabled
Auto-recovery status : Enabled (timeout = 240 seconds)
vPC Peer-link status
---------------------------------------------------------------------
id Port Status Active vlans
-- ---- ------ --------------------------------------------------
1 Po1 up 1-112,114-120,800,810
vPC status
----------------------------------------------------------------------
id Port Status Consistency Reason Active vlans
-- ---- ------ ----------- ------ ------------
10 Po40 up success success 1-112,114-1
20,800,810
On Switch S2
S2(config)# vpc domain 1
S2(config-vpc-domain)# auto-recovery
S2# show vpc
Legend:
(*) - local vPC is down, forwarding via vPC peer-link
vPC domain id : 1
Peer status : peer adjacency formed ok
vPC keep-alive status : peer is alive
Configuration consistency status : success
Per-vlan consistency status : success
Type-2 consistency status : success
vPC role : secondary
Number of vPCs configured : 5
Peer Gateway : Enabled
Peer gateway excluded VLANs : -
Dual-active excluded VLANs : -
Graceful Consistency Check : Enabled
Auto-recovery status : Enabled (timeout = 240 seconds)
vPC Peer-link status
---------------------------------------------------------------------
id Port Status Active vlans
-- ---- ------ --------------------------------------------------
1 Po1 up 1-112,114-120,800,810
vPC status
----------------------------------------------------------------------
id Port Status Consistency Reason Active vlans
-- ---- ------ ----------- ------ ------------
40 Po40 up success success 1-112,114-1
20,800,810
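The 240-second timeout shown in the output above is the default reload delay. As a minimal sketch, assuming your release accepts the optional reload-delay keyword (valid range 240 to 3600 seconds) and using 300 seconds purely as an illustrative value, the timer could be raised like this. Apply the same value on both peers:

```text
S1(config)# vpc domain 1
S1(config-vpc-domain)# auto-recovery reload-delay 300

S2(config)# vpc domain 1
S2(config-vpc-domain)# auto-recovery reload-delay 300
```

A longer delay gives a slow-booting peer more time to come up before the surviving switch assumes the primary role, at the cost of a longer outage when the peer really is gone.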
We will take each behavior discussed in the "Why do we need vPC auto-recovery?" section separately.
The assumption is that vPC auto-recovery is configured and saved to the startup configuration on both switches, S1 and S2.
1. A power outage shuts down both Nexus 7000 vPC peers simultaneously, and only one switch is able to come back up.
2. The vPC peer-link is lost first, and then the primary vPC peer is powered down.
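One way to meet this assumption, sketched with standard commands (output omitted because it varies by platform and release), is to confirm that auto-recovery appears in the running vPC configuration and then save it on each peer:

```text
S1# show running-config vpc all
S1# copy running-config startup-config

S2# show running-config vpc all
S2# copy running-config startup-config
```

If the configuration is not saved, a switch that reloads after a power outage will come up without auto-recovery and will not unsuspend its vPCs on its own.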
Note:
As explained in both scenarios, the switch that unsuspends its vPCs through vPC auto-recovery will remain primary even after the peer-link comes back up. The other peer will take the secondary role and will suspend its own vPCs until the consistency check is done.
For example:
S1 is powered off. S2 becomes operational primary, as expected. The peer-link, the keepalive link, and all vPC links are then disconnected from S1. S1 is now powered up. Since S1 is completely isolated, it will bring its vPCs up (although the physical links are down) due to auto-recovery and will take the primary role. Now, if we connect the peer-link or the keepalive between S1 and S2, S1 will keep the primary role and S2 will become secondary. This will cause S2 to suspend its vPCs until both the vPC peer-link and the keepalive are up and the consistency check is done. The result is black-holing of traffic, since S2's vPCs are suspended in the secondary role and S1's physical links are down.
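Before reconnecting a previously isolated peer, as in the example above, it may help to check which role each switch currently holds and whether the keepalive is seen as alive. A sketch using standard show commands (output omitted, since it depends on platform and release):

```text
S1# show vpc role
S1# show vpc peer-keepalive

S2# show vpc role
S2# show vpc peer-keepalive
```

If both switches report the primary role, reconnecting the peer-link or keepalive will force one of them to become secondary and suspend its vPCs, so the reconnection is best planned with that in mind.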
It is a good practice to enable auto-recovery in your vPC environment.
Although it is rare, there is a chance that the vPC auto-recovery feature may put you in a dual-active scenario. For eg, if you first lost the peer-link and then you lost the keep-alive then you will have dual active scenario.
In this situation each vPC member port keeps advertising the same LACP ID as before the dual-active failure.
A vPC topology intrinsically protects from loops in case of dual-active scenarios. In the worst case, there will be duplicate frames. Despite this, as a loop prevention mechanism, each switch starts forwarding BPDUs with the same BPDU Bridge ID as prior to the vPC dual active failure.
While not intuitive, it is still possible and desirable to continue forwarding traffic from the access layer to the aggregation layer without drops for existing traffic flows, provided that the Address Resolution Protocol (ARP) tables are already populated on both Cisco Nexus 7000 Series peers for all needed hosts.
If new MAC addresses are to be learned by the ARP table, issues may arise because the ARP response from the server may always be hashed to one Cisco Nexus 7000 Series device and not the other, making it impossible for the traffic to flow correctly.
Suppose, however, that before the failure in the situation just described, traffic was equally distributed to both Cisco Nexus 7000 Series devices by a correct PortChannel and Equal Cost Multipath (ECMP) configuration. In that case, server-to-server and client-to-server traffic continues, with the caveat that single-attached hosts connected directly to the Cisco Nexus 7000 Series will not be able to communicate (for lack of the peer link). Also, new MAC addresses learned on one Cisco Nexus 7000 Series device cannot be learned on the peer, which would cause flooding for the return traffic that arrives on the peer Cisco Nexus 7000 Series device.
For further details please refer to page 19 of
http://www.cisco.com/en/US/prod/collateral/switches/ps9441/ps9670/C07-572835-00_NX-OS_vPC_DG.pdf
Very good explanation.
We had auto-recovery enabled on the operational secondary but disabled on the operational primary.
We overlooked this, and during one of the QC tests we shut down the peer-link from the secondary side.
This brought down most of the network. We are still not sure what happened there. Could you please help us understand the behavior in this scenario?
Hi,
If you can give more details, then it may be helpful in finding out what exactly went wrong.
When you shut down the peer-link, was the peer-keepalive still active? If yes, then auto-recovery does not even kick in and, by design, all vPCs on the operational secondary are suspended.
So if you have orphan devices, or devices connected only to the operational secondary, then an outage would be expected behavior.
Viral
This is a really good document.
I'm deploying a 5K solution for an office desktop LAN at the moment. Each PC is plugged into a dual-homed FEX. One thing that I have found is that in a particular scenario, when the primary goes down and the secondary is still up, the secondary would not bring the orphan ports back online. I have had to enter the vpc orphan-port suspend command to allow them to come back online. It seems counterintuitive. Any ideas why it would behave like that?
Thanks.
Auto-recovery works only for vPC port-channels.
Moreover, by default, an orphan port does not get suspended when the peer-link goes down. You need to have the "vpc orphan-port suspend" command configured if you want orphan ports to be suspended when the peer-link goes down.
If orphan ports are getting suspended without this configuration, then it is probably a bug. You can run show run vpc all to check whether you have it configured.
Thanks for this, excellent and very useful.
"For eg, if you first lost the peer-link and then you lost the keep-alive then you will have dual active scenario." I think this statement in the "Should I enable vPC auto-recovery" section is wrong, because a dual-active scenario occurs when the keepalive goes down first and then the peer-link.
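As a sketch of the check described above, the following lists the orphan ports, shows the full vPC configuration, and, only if suspension is actually wanted, enables it on one port. Ethernet1/10 is a hypothetical interface used for illustration, and command availability may differ between Nexus platforms and releases:

```text
S1# show vpc orphan-ports
S1# show running-config vpc all

S1(config)# interface Ethernet1/10
S1(config-if)# vpc orphan-port suspend
```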
Regards,
Ganapareddy Sudhakar