Nexus 93180YC running 7.0(3)I4(2). That we need to upgrade notwithstanding, I think what we experienced comes down to our misunderstanding of the auto-recovery concept. We had a hardware failure of the primary vPC switch. The vPC peer link (PL) was lost first, then the keepalive. The secondary switch suspended its interfaces as expected on loss of the PL, but did not react immediately when the keepalive failed, despite the auto-recovery setting. Connected hosts lost network connectivity for 2.5 minutes before switch B brought its ports back up.
On May 8, at 14:05:22, switch A was vPC primary and experienced a hardware failure. Its ports began to shut down (Po1 is our vPC peer link):
May 8 14:05:22 switch-A : 2021 May 8 14:05:22 PDT: %ETH_PORT_CHANNEL-5-PORT_DOWN: port-channel1: Ethernet1/53 is down
May 8 14:05:22 switch-A : 2021 May 8 14:05:22 PDT: %ETH_PORT_CHANNEL-5-FOP_CHANGED: port-channel1: first operational port changed from Ethernet1/53 to Ethernet1/51
May 8 14:05:22 switch-A : 2021 May 8 14:05:22 PDT: %ETHPORT-5-IF_DOWN_INITIALIZING: Interface Ethernet1/53 is down (Initializing)
May 8 14:05:22 switch-A : 2021 May 8 14:05:22 PDT: %VPC-2-PEER_KEEP_ALIVE_RECV_FAIL: In domain 1, VPC peer keep-alive receive has failed
Switch B began to react. It saw the failure of the peer link first, then the keepalive.
2021 May 8 14:05:24 switch-B %VPC-2-VPC_SUSP_ALL_VPC: Peer-link going down, suspending all vPCs on secondary. If vfc is bound to vPC, then only ethernet vlans of that VPC shall be down.
2021 May 8 14:05:31 switch-B %VPC-2-PEER_KEEP_ALIVE_RECV_FAIL: In domain 1, VPC peer keep-alive receive has failed
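For anyone who wants to check the sequence, here is a rough sketch that lines up the three key events on one timeline. The timestamps and syslog mnemonics are copied from the excerpts above; the `EVENTS` list and the `offsets_from_first` helper are just names I made up for illustration, and the comparison assumes both switches' clocks were in sync.

```python
from datetime import datetime

# (switch, syslog mnemonic, timestamp) copied from the log excerpts above.
# Assumption: the clocks on switch-A and switch-B were synchronized.
EVENTS = [
    ("switch-A", "ETH_PORT_CHANNEL-5-PORT_DOWN", "2021 May 8 14:05:22"),
    ("switch-B", "VPC-2-VPC_SUSP_ALL_VPC", "2021 May 8 14:05:24"),
    ("switch-B", "VPC-2-PEER_KEEP_ALIVE_RECV_FAIL", "2021 May 8 14:05:31"),
]


def offsets_from_first(events):
    """Return (switch, mnemonic, whole seconds after the earliest event)."""
    times = [datetime.strptime(ts, "%Y %b %d %H:%M:%S") for _, _, ts in events]
    start = min(times)
    return [
        (switch, mnemonic, int((t - start).total_seconds()))
        for (switch, mnemonic, _), t in zip(events, times)
    ]


if __name__ == "__main__":
    for switch, mnemonic, offset in offsets_from_first(EVENTS):
        print(f"+{offset:>2}s  {switch}  {mnemonic}")
```

By this arithmetic, switch B suspended its vPCs 2 seconds after switch A's first port went down, and logged the keepalive failure 7 seconds after that.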
The scenario is exactly as described in Figure 103, save that these are 93180YCs, not 7Ks. We have auto-recovery configured. The vPC configuration works as expected in day-to-day operation. Switch B's configuration:
vpc domain 1
  peer-switch
  role priority 8192
  system-priority 8192
  peer-keepalive destination 10.1.1.1 source 10.1.1.2 vrf MAIN interval 400 timeout 3
  delay restore 120
  peer-gateway
  auto-recovery
  ip arp synchronize
So, keepalives are sent every 400 ms, with a timeout of 3 seconds (the timeout value is in seconds, not a count of missed keepalives).
switch-B# show vpc
Legend:
                (*) - local vPC is down, forwarding via vPC peer-link

vPC domain id                     : 1
Peer status                       : peer adjacency formed ok
vPC keep-alive status             : peer is alive
Configuration consistency status  : success
Per-vlan consistency status       : success
Type-2 consistency status         : success
vPC role                          : secondary, operational primary
Number of vPCs configured         : 44
Peer Gateway                      : Enabled
Dual-active excluded VLANs        : -
Graceful Consistency Check        : Enabled
Auto-recovery status              : Enabled, timer is off.(timeout = 240s)
Delay-restore status              : Timer is off.(timeout = 120s)
Delay-restore SVI status          : Timer is off.(timeout = 10s)
The auto-recovery status line reports a timeout of 240 seconds.
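To put the roughly 2.5 minute outage next to the timers shown above, here is a quick back-of-the-envelope check. The timer values come from the show vpc output; the outage figure is the approximate host-side number from the description, and `margin_vs_timers` is just a hypothetical helper name. This only compares numbers; it does not claim to show which timer actually governed the recovery.

```python
# Timer values copied from the "show vpc" output above.
TIMERS = {
    "auto-recovery": 240,  # seconds
    "delay-restore": 120,  # seconds
}

# Approximate outage seen by connected hosts (2.5 minutes, from the description).
OBSERVED_OUTAGE_S = int(2.5 * 60)


def margin_vs_timers(observed_s, timers):
    """Return observed outage minus each timer, in seconds.

    A negative value means the outage ended before that timer would have expired.
    """
    return {name: observed_s - value for name, value in timers.items()}


if __name__ == "__main__":
    for name, margin in margin_vs_timers(OBSERVED_OUTAGE_S, TIMERS).items():
        print(f"{name}: {margin:+d}s relative to the observed outage")
```

So the outage ended about 90 seconds short of the 240 second auto-recovery timeout and about 30 seconds past the 120 second delay-restore value, which is part of why the behavior confuses us.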
Were we wrong to expect switch B to take over immediately upon loss of keepalive communication? If so, what exactly is the auto-recovery feature for? And should we manually tune that timeout value?