I'm trying to understand various vPC/Nexus failure scenarios and I'm not sure I understand the purpose of vPC auto-recovery.
Suppose I have a pair of Nexus 7k's (n7k-1 and n7k-2) with a 3rd device dual-homed via a functional vPC. If I powered off both 7k's, and only powered up n7k-1 (kept n7k-2 powered-off) then of course both (both!!!) the peer-link and keep alive link would be down. So why wouldn't n7k-1 enable the vPC? Isn't it a safe assumption to make if both (both!!!) the peer-link and keep alive link are down, that the far side (n7k-2) is really down?
In this explanation it says one of the reasons for auto-recovery is:
In a data center outage or power outage, both vPC peers comprising of Nexus 7000 Switches are down. Occasionally, only one of the peers can be restored. Since the other Nexus 7000 is still down, vPC peer-link as well as vPC peer-keepalive link are also down. In this scenario, vPC will not come up even for the Nexus 7000 which is already up. We had to remove all vpc configurations from the port-channel on that Nexus 7000 to get the port-channel working. When the other Nexus 7000 comes up then we have to again make configuration changes to include the vpc configuration for all vPC. Starting with 5.0(2), this behavior was taken care of by configuring reload restore command under vpc domain configuration.
That post mentions page 19 from this VPC fundamentals doc from 2010 where it states:
vPC Complete Dual-Active Failure (Double Failure) -
In case both the peer link and the peer keepalive link get disconnected, the Cisco Nexus switch does not bring down the vPC, because each Cisco Nexus Switch cannot discriminate between a vPC device reload and a peer-link- plus peer-keepalive failure. This means that each vPC member port keeps advertising the same LACP ID as before the dual-active failure.
That doesn't make sense to me.
So to get my desired behavior, I need to enable vpc auto-recovery? Why wouldn't that be enabled by default? Am I overlooking something obvious?
Another reason I ask is I've heard of people adding another trunk between the Nexus switches for "non VPC" vlans.
Can anyone comment on this? To me this just seems unnecessarily complex.
The two quotes describe 2 different situations:
Let's start with the 2nd one. If both the peer-link and the peer-keepalive get disconnected, it means that one peer went down, so the surviving peer takes the primary role and forwards the traffic for the vPC.
The main point here is that BOTH peers were online first and then ONE peer went down. This means that both peers ran the election for the role, then the secondary peer ran the consistency check for the vPC configuration, the vPC came up, and only after that did one peer go down. The surviving peer already knows all the vPC parameters, the election results, etc., and can run the vPC without any issues.
Now the 1st quote, the auto-recovery description: BOTH peers went down and only one has come back online, so the peer-link and peer-keepalive are down. The N7K cannot run the election, cannot perform a consistency check, and cannot verify the vPC config. That's why the vPC won't come up and the vPC links won't be brought online. If you run "sh vpc" at that moment, you will see messages similar to "VPC-peer never been online". Auto-recovery was created to overcome this problem.
With it enabled, the switch will recover the vPC state even without the peer.
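For reference, here is a minimal sketch of what enabling it looks like. The domain ID 10 is a made-up value for illustration, so substitute your own. As far as I know the command lives under the vPC domain configuration and should be applied on both peers, but check the configuration guide for your NX-OS release for the exact syntax and timers.

```
! Hypothetical domain ID; use the one already configured on your pair
vpc domain 10
  ! Lets a lone peer bring up its vPCs after a timeout when the other peer never appears
  auto-recovery
```

After that, "sh vpc" should report whether auto-recovery is enabled.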