
vPC Switch Failure, auto-recovery Behavior

mfarrenkopf
Level 1

Nexus 93180YC running 7.0(3)I4(2).  That we need to upgrade notwithstanding, I think what we experienced comes down to our misunderstanding of the auto-recovery concept.  We had a hardware failure of the primary vPC switch.  The vPC peer link was lost first, then the keepalive.  The secondary switch suspended its interfaces as expected when the peer link was lost, but did not react immediately when the keepalive failed, despite the auto-recovery setting.  Connected hosts lost network connectivity for 2.5 minutes before switch B brought its ports back up.

 

On May 8, at 14:05:22, switch A was vPC primary and experienced a hardware failure.  Its ports began to shut down (Po1 is our vPC peer link):

 

May 8 14:05:22 switch-A : 2021 May 8 14:05:22 PDT: %ETH_PORT_CHANNEL-5-PORT_DOWN: port-channel1: Ethernet1/53 is down
May 8 14:05:22 switch-A : 2021 May 8 14:05:22 PDT: %ETH_PORT_CHANNEL-5-FOP_CHANGED: port-channel1: first operational port changed from Ethernet1/53 to Ethernet1/51
May 8 14:05:22 switch-A : 2021 May 8 14:05:22 PDT: %ETHPORT-5-IF_DOWN_INITIALIZING: Interface Ethernet1/53 is down (Initializing)
May 8 14:05:22 switch-A : 2021 May 8 14:05:22 PDT: %VPC-2-PEER_KEEP_ALIVE_RECV_FAIL: In domain 1, VPC peer keep-alive receive has failed

 

Switch B began to react.  It saw the failure of the peer link first, then the keepalive.

 

2021 May 8 14:05:24 switch-B %VPC-2-VPC_SUSP_ALL_VPC: Peer-link going down, suspending all vPCs on secondary. If vfc is bound to vPC, then only ethernet vlans of that VPC shall be down.
2021 May 8 14:05:31 switch-B %VPC-2-PEER_KEEP_ALIVE_RECV_FAIL: In domain 1, VPC peer keep-alive receive has failed
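
For reference, the gap between those two switch-B events can be worked out from the timestamps. This is a minimal Python sketch that assumes the log format shown above; the parse_ts helper is illustrative only and not part of any Cisco tooling.

```python
from datetime import datetime


def parse_ts(line: str) -> datetime:
    """Parse the leading 'YYYY Mon D HH:MM:SS' timestamp of a switch-B log line."""
    fields = line.split()
    return datetime.strptime(" ".join(fields[:4]), "%Y %b %d %H:%M:%S")


# Log line prefixes copied from the switch-B output above.
peer_link_down = parse_ts("2021 May 8 14:05:24 switch-B %VPC-2-VPC_SUSP_ALL_VPC")
keepalive_fail = parse_ts("2021 May 8 14:05:31 switch-B %VPC-2-PEER_KEEP_ALIVE_RECV_FAIL")

gap_seconds = (keepalive_fail - peer_link_down).total_seconds()
print(f"Keepalive failure logged {gap_seconds:.0f} s after vPC suspension")
```

So switch B logged the keepalive failure 7 seconds after it suspended its vPCs.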

 

We have auto-recovery configured, and we are working from this design guide:  https://www.cisco.com/c/dam/en/us/td/docs/switches/datacenter/sw/design/vpc_design/vpc_best_practices_design_guide.pdf

 

The scenario is exactly as described in Figure 103 of that guide, except that these are 93180YCs, not 7Ks.  The vPC configuration works as expected in day-to-day operation.  Switch B's configuration:

 

vpc domain 1
peer-switch
role priority 8192
system-priority 8192
peer-keepalive destination 10.1.1.1 source 10.1.1.2 vrf MAIN interval 400 timeout 3
delay restore 120
peer-gateway
auto-recovery
ip arp synchronize

 

So, keepalives are sent every 400 ms, with a timeout of 3 seconds.

 

switch-B# show vpc
Legend:
(*) - local vPC is down, forwarding via vPC peer-link

vPC domain id : 1
Peer status : peer adjacency formed ok
vPC keep-alive status : peer is alive
Configuration consistency status : success
Per-vlan consistency status : success
Type-2 consistency status : success
vPC role : secondary, operational primary
Number of vPCs configured : 44
Peer Gateway : Enabled
Dual-active excluded VLANs : -
Graceful Consistency Check : Enabled
Auto-recovery status : Enabled, timer is off.(timeout = 240s)
Delay-restore status : Timer is off.(timeout = 120s)
Delay-restore SVI status : Timer is off.(timeout = 10s)

 

The show vpc output reports an auto-recovery timeout of 240 seconds.
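
To put the numbers side by side, here is a small Python sketch comparing the outage we observed with the timers reported by show vpc. The values are copied from this post; the 2.5-minute figure is approximate, and the compare helper is purely illustrative.

```python
# Values copied from the post: ~2.5-minute outage, timers from "show vpc".
observed_outage_s = 2.5 * 60
auto_recovery_timeout_s = 240
delay_restore_s = 120


def compare(observed: float, timer: float) -> float:
    """Return observed minus timer; a negative result means the outage was shorter."""
    return observed - timer


vs_auto_recovery = compare(observed_outage_s, auto_recovery_timeout_s)
vs_delay_restore = compare(observed_outage_s, delay_restore_s)

print(f"Observed outage:          {observed_outage_s:.0f} s")
print(f"vs auto-recovery timeout: {vs_auto_recovery:+.0f} s")
print(f"vs delay restore:         {vs_delay_restore:+.0f} s")
```

The observed outage does not line up exactly with either configured timer: it was about 90 seconds shorter than the auto-recovery timeout and about 30 seconds longer than delay restore.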

 

Were we wrong to expect switch B to take over immediately upon loss of keepalive communication?  If so, what exactly is the auto-recovery feature for?  And should we manually tune that timeout value?

 
