<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic vPC Switch Failure, auto-recovery Behavior in Server Networking</title>
    <link>https://community.cisco.com/t5/server-networking/vpc-switch-failure-auto-recovery-behavior/m-p/4401398#M13234</link>
    <description></description>
    <pubDate>Tue, 11 May 2021 19:25:05 GMT</pubDate>
    <dc:creator>mfarrenkopf</dc:creator>
    <dc:date>2021-05-11T19:25:05Z</dc:date>
    <item>
      <title>vPC Switch Failure, auto-recovery Behavior</title>
      <link>https://community.cisco.com/t5/server-networking/vpc-switch-failure-auto-recovery-behavior/m-p/4401398#M13234</link>
      <description>&lt;P&gt;Nexus 93180YC running 7.0(3)I4(2).&amp;nbsp; Setting aside that we need to upgrade, I think what we experienced comes down to our misunderstanding of the auto-recovery concept.&amp;nbsp; We had a hardware failure of the primary vPC switch.&amp;nbsp; The vPC peer link was lost first, then the keepalive.&amp;nbsp; The secondary switch suspended its vPC interfaces as expected on loss of the peer link, but did not react immediately when the keepalive failed, despite the auto-recovery setting.&amp;nbsp; Connected hosts lost network connectivity for 2.5 minutes before switch B brought its ports back up.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;On May 8, at 14:05:22, switch A was vPC primary and experienced a hardware failure.&amp;nbsp; Its ports began to shut down (Po1 is our vPC peer link):&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;&lt;FONT face="courier new,courier"&gt;May 8 14:05:22 switch-A : 2021 May 8 14:05:22 PDT: %ETH_PORT_CHANNEL-5-PORT_DOWN: port-channel1: Ethernet1/53 is down&lt;/FONT&gt;&lt;BR /&gt;&lt;FONT face="courier new,courier"&gt;May 8 14:05:22 switch-A : 2021 May 8 14:05:22 PDT: %ETH_PORT_CHANNEL-5-FOP_CHANGED: port-channel1: first operational port changed from Ethernet1/53 to Ethernet1/51&lt;/FONT&gt;&lt;BR /&gt;&lt;FONT face="courier new,courier"&gt;May 8 14:05:22 switch-A : 2021 May 8 14:05:22 PDT: %ETHPORT-5-IF_DOWN_INITIALIZING: Interface Ethernet1/53 is down (Initializing)&lt;/FONT&gt;&lt;BR /&gt;&lt;FONT face="courier new,courier"&gt;May 8 14:05:22 switch-A : 2021 May 8 14:05:22 PDT: %VPC-2-PEER_KEEP_ALIVE_RECV_FAIL: In domain 1, VPC peer keep-alive receive has failed&lt;/FONT&gt;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Switch B then began to react.&amp;nbsp; It saw the peer link fail first, then the keepalive.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;&lt;FONT face="courier new,courier"&gt;2021 May 8 14:05:24 switch-B %VPC-2-VPC_SUSP_ALL_VPC: Peer-link going down, suspending all vPCs on 
secondary. If vfc is bound to vPC, then only ethernet vlans of that VPC shall be down.&lt;/FONT&gt;&lt;BR /&gt;&lt;FONT face="courier new,courier"&gt;2021 May 8 14:05:31 switch-B %VPC-2-PEER_KEEP_ALIVE_RECV_FAIL: In domain 1, VPC peer keep-alive receive has failed&lt;/FONT&gt;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;We have auto-recovery configured, based on this document:&amp;nbsp; &lt;A href="https://www.cisco.com/c/dam/en/us/td/docs/switches/datacenter/sw/design/vpc_design/vpc_best_practices_design_guide.pdf" target="_blank"&gt;https://www.cisco.com/c/dam/en/us/td/docs/switches/datacenter/sw/design/vpc_design/vpc_best_practices_design_guide.pdf&lt;/A&gt;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Our scenario is exactly as described in Figure 103 of that guide, save that these are 93180YCs, not 7Ks.&amp;nbsp; The vPC configuration works as expected in day-to-day operation.&amp;nbsp; Switch B's configuration:&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;&lt;FONT face="courier new,courier"&gt;vpc domain 1&lt;/FONT&gt;&lt;BR /&gt;&lt;FONT face="courier new,courier"&gt;peer-switch&lt;/FONT&gt;&lt;BR /&gt;&lt;FONT face="courier new,courier"&gt;role priority 8192&lt;/FONT&gt;&lt;BR /&gt;&lt;FONT face="courier new,courier"&gt;system-priority 8192&lt;/FONT&gt;&lt;BR /&gt;&lt;FONT face="courier new,courier"&gt;peer-keepalive destination 10.1.1.1 source 10.1.1.2 vrf MAIN interval 400 timeout 3&lt;/FONT&gt;&lt;BR /&gt;&lt;FONT face="courier new,courier"&gt;delay restore 120&lt;/FONT&gt;&lt;BR /&gt;&lt;FONT face="courier new,courier"&gt;peer-gateway&lt;/FONT&gt;&lt;BR /&gt;&lt;FONT face="courier new,courier"&gt;auto-recovery&lt;/FONT&gt;&lt;BR /&gt;&lt;FONT face="courier new,courier"&gt;ip arp synchronize&lt;/FONT&gt;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;So, keepalives are sent every 400 ms, with a timeout of 3 
seconds.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;&lt;FONT face="courier new,courier"&gt;switch-B# show vpc&lt;/FONT&gt;&lt;BR /&gt;&lt;FONT face="courier new,courier"&gt;Legend:&lt;/FONT&gt;&lt;BR /&gt;&lt;FONT face="courier new,courier"&gt;(*) - local vPC is down, forwarding via vPC peer-link&lt;/FONT&gt;&lt;/P&gt;&lt;P&gt;&lt;FONT face="courier new,courier"&gt;vPC domain id : 1 &lt;/FONT&gt;&lt;BR /&gt;&lt;FONT face="courier new,courier"&gt;Peer status : peer adjacency formed ok &lt;/FONT&gt;&lt;BR /&gt;&lt;FONT face="courier new,courier"&gt;vPC keep-alive status : peer is alive &lt;/FONT&gt;&lt;BR /&gt;&lt;FONT face="courier new,courier"&gt;Configuration consistency status : success &lt;/FONT&gt;&lt;BR /&gt;&lt;FONT face="courier new,courier"&gt;Per-vlan consistency status : success &lt;/FONT&gt;&lt;BR /&gt;&lt;FONT face="courier new,courier"&gt;Type-2 consistency status : success &lt;/FONT&gt;&lt;BR /&gt;&lt;FONT face="courier new,courier"&gt;vPC role : secondary, operational primary&lt;/FONT&gt;&lt;BR /&gt;&lt;FONT face="courier new,courier"&gt;Number of vPCs configured : 44 &lt;/FONT&gt;&lt;BR /&gt;&lt;FONT face="courier new,courier"&gt;Peer Gateway : Enabled&lt;/FONT&gt;&lt;BR /&gt;&lt;FONT face="courier new,courier"&gt;Dual-active excluded VLANs : -&lt;/FONT&gt;&lt;BR /&gt;&lt;FONT face="courier new,courier"&gt;Graceful Consistency Check : Enabled&lt;/FONT&gt;&lt;BR /&gt;&lt;FONT face="courier new,courier"&gt;Auto-recovery status : Enabled, timer is off.(timeout = 240s)&lt;/FONT&gt;&lt;BR /&gt;&lt;FONT face="courier new,courier"&gt;Delay-restore status : Timer is off.(timeout = 120s)&lt;/FONT&gt;&lt;BR /&gt;&lt;FONT face="courier new,courier"&gt;Delay-restore SVI status : Timer is off.(timeout = 10s)&lt;/FONT&gt;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;The auto-recovery status shows a timeout of 240 seconds.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Were we wrong to expect switch B to take over immediately upon loss of keepalive 
communication?&amp;nbsp; If so, what exactly is the auto-recovery feature for?&amp;nbsp; And should we manually tune that timeout value?&lt;/P&gt;</description>
      <pubDate>Tue, 11 May 2021 19:25:05 GMT</pubDate>
      <guid>https://community.cisco.com/t5/server-networking/vpc-switch-failure-auto-recovery-behavior/m-p/4401398#M13234</guid>
      <dc:creator>mfarrenkopf</dc:creator>
      <dc:date>2021-05-11T19:25:05Z</dc:date>
    </item>
  </channel>
</rss>

