We are running Nexus 93180YC switches on 7.0(3)I4(2). Setting aside that we need to upgrade, I think what we experienced comes down to our misunderstanding of the auto-recovery concept. The primary vPC switch had a hardware failure: the vPC peer link was lost first, then the keepalive. The secondary switch suspended its interfaces as expected when the peer link went down, but did not react immediately when the keepalive failed, despite the auto-recovery setting. Connected hosts lost network connectivity for 2.5 minutes before switch B brought its ports back up.
On May 8, at 14:05:22, switch A was vPC primary and experienced a hardware failure. Its ports began to shut down (Po1 is our vPC peer link):
May 8 14:05:22 switch-A : 2021 May 8 14:05:22 PDT: %ETH_PORT_CHANNEL-5-PORT_DOWN: port-channel1: Ethernet1/53 is down
May 8 14:05:22 switch-A : 2021 May 8 14:05:22 PDT: %ETH_PORT_CHANNEL-5-FOP_CHANGED: port-channel1: first operational port changed from Ethernet1/53 to Ethernet1/51
May 8 14:05:22 switch-A : 2021 May 8 14:05:22 PDT: %ETHPORT-5-IF_DOWN_INITIALIZING: Interface Ethernet1/53 is down (Initializing)
May 8 14:05:22 switch-A : 2021 May 8 14:05:22 PDT: %VPC-2-PEER_KEEP_ALIVE_RECV_FAIL: In domain 1, VPC peer keep-alive receive has failed
Switch B began to react. It saw the failure of the peer link first, then the keepalive.
2021 May 8 14:05:24 switch-B %VPC-2-VPC_SUSP_ALL_VPC: Peer-link going down, suspending all vPCs on secondary. If vfc is bound to vPC, then only ethernet vlans of that VPC shall be down.
2021 May 8 14:05:31 switch-B %VPC-2-PEER_KEEP_ALIVE_RECV_FAIL: In domain 1, VPC peer keep-alive receive has failed
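To make the sequence easier to follow, here is a minimal Python sketch that computes the gaps between the three events logged above. The timestamps are copied from the log lines; the variable names and the helper `seconds_between` are only illustrative, and the comparison across switches assumes both clocks were synchronized.

```python
from datetime import datetime

# Timestamps copied from the log lines above (local time on each switch).
# Assumption: the clocks on switch A and switch B were in sync.
FMT = "%Y %b %d %H:%M:%S"
first_port_down_a = datetime.strptime("2021 May 8 14:05:22", FMT)  # switch A: Ethernet1/53 down
peer_link_down_b = datetime.strptime("2021 May 8 14:05:24", FMT)   # switch B: VPC_SUSP_ALL_VPC
keepalive_fail_b = datetime.strptime("2021 May 8 14:05:31", FMT)   # switch B: PEER_KEEP_ALIVE_RECV_FAIL


def seconds_between(start: datetime, end: datetime) -> int:
    """Return the whole number of seconds from start to end."""
    return int((end - start).total_seconds())


# Time from the first failure on A until B suspended its vPCs.
suspend_delay = seconds_between(first_port_down_a, peer_link_down_b)
# Time from B suspending its vPCs until B logged the keepalive failure.
keepalive_gap = seconds_between(peer_link_down_b, keepalive_fail_b)

print(f"A failure -> B suspends vPCs: {suspend_delay} s")
print(f"B suspends vPCs -> B keepalive failure: {keepalive_gap} s")
```

So switch B suspended its vPCs about 2 seconds after the first failure on switch A, and logged the keepalive failure about 7 seconds after that.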
We have auto-recovery configured. Our scenario is exactly the one described in Figure 103 of this document, save that these are 93180YCs, not 7Ks: https://www.cisco.com/c/dam/en/us/td/docs/switches/datacenter/sw/design/vpc_design/vpc_best_practices_design_guide.pdf
The vPC configuration works as expected in day-to-day operation. Switch B's configuration:
vpc domain 1
  peer-switch
  role priority 8192
  system-priority 8192
  peer-keepalive destination 10.1.1.1 source 10.1.1.2 vrf MAIN interval 400 timeout 3
  delay restore 120
  peer-gateway
  auto-recovery
  ip arp synchronize
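So, keepalives are sent every 400 ms, with a timeout of 3 (in NX-OS the keepalive timeout value is given in seconds, not as a count of missed keepalives).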
switch-B# show vpc
Legend:
(*) - local vPC is down, forwarding via vPC peer-link
vPC domain id : 1
Peer status : peer adjacency formed ok
vPC keep-alive status : peer is alive
Configuration consistency status : success
Per-vlan consistency status : success
Type-2 consistency status : success
vPC role : secondary, operational primary
Number of vPCs configured : 44
Peer Gateway : Enabled
Dual-active excluded VLANs : -
Graceful Consistency Check : Enabled
Auto-recovery status : Enabled, timer is off.(timeout = 240s)
Delay-restore status : Timer is off.(timeout = 120s)
Delay-restore SVI status : Timer is off.(timeout = 10s)
The show vpc output reports an auto-recovery timeout of 240 seconds.
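As a quick sanity check, the following Python sketch compares the outage we observed with the timers reported above. The three values come from this post (the `show vpc` output, the `delay restore` setting, and the roughly 2.5 minutes reported by the hosts); the constant names are only illustrative, and the outage figure is approximate. The arithmetic alone does not show which timer, if any, governed the recovery.

```python
# Values taken from this post; names are illustrative only.
AUTO_RECOVERY_TIMEOUT_S = 240      # "Auto-recovery status ... (timeout = 240s)"
DELAY_RESTORE_S = 120              # "delay restore 120" in the vpc domain config
OBSERVED_OUTAGE_S = int(2.5 * 60)  # about 2.5 minutes of host connectivity loss

# Positive result: the outage ended before that timer would have expired.
# Negative result: the outage lasted longer than that timer.
margin_vs_auto_recovery = AUTO_RECOVERY_TIMEOUT_S - OBSERVED_OUTAGE_S
margin_vs_delay_restore = DELAY_RESTORE_S - OBSERVED_OUTAGE_S

print(f"Observed outage: {OBSERVED_OUTAGE_S} s")
print(f"Auto-recovery timeout minus outage: {margin_vs_auto_recovery} s")
print(f"Delay-restore timeout minus outage: {margin_vs_delay_restore} s")
```

The observed outage of about 150 seconds is 90 seconds shorter than the 240-second auto-recovery timeout and 30 seconds longer than the 120-second delay-restore timeout, so it does not line up exactly with either value.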
Were we wrong to expect switch B to take over immediately upon loss of keepalive communication? If so, what exactly is the auto-recovery feature for? And should we manually tune that timeout value?