04-20-2013 09:40 PM - edited 03-07-2019 12:56 PM
I have a pair of Nexus 5548UPs that have some high-priority servers running on them. The servers are ESX hosts running Nexus 1000vs, and each host has multiple connections in a vPC to both 5548s. We have been having intermittent ping loss and slow traffic to the VMs on these hosts. While digging into the issue I found that the peer-keepalive command is not set to send the heartbeat across the mgmt0 interface. I would like to change this so it points across mgmt0. Can I do this live without causing any issues? Does anyone have tips or advice on making this change with production servers running on the switches? I do not want to cause any loss to any systems when I make this change. The current config is below, followed by a sketch of the change I'm planning.
"Switch 2"
vpc domain 101
role priority 22222
peer-keepalive destination 172.27.1.18 source 172.27.1.19
auto-recovery
"Switch 1"
vpc domain 101
role priority 11111
peer-keepalive destination 172.27.1.19 source 172.27.1.18
auto-recovery
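The change I have in mind is just re-pointing the keepalive at the mgmt0 addresses in the management VRF, along these lines (the 10.1.1.x addresses are placeholders; I'd substitute our actual mgmt0 IPs):

Switch 1:
conf t
 vpc domain 101
  peer-keepalive destination 10.1.1.2 source 10.1.1.1 vrf management

Switch 2:
conf t
 vpc domain 101
  peer-keepalive destination 10.1.1.1 source 10.1.1.2 vrf management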
I've also just noticed tonight that we are getting a lot of input errors on one of the 10G links going from 5548-2 back to Core 6513-1. The link from 5548-1 back to Core 6513-1 does not have any input errors. The log also shows the interface going down and coming back up. I'm wondering whether the peer-keepalive not running over mgmt0 is the culprit for this link flapping in the vPC.
2013 Apr 20 21:47:35 GWCP0-2 %ETH_PORT_CHANNEL-5-PORT_DOWN: port-channel98: Ethernet1/1 is down
2013 Apr 20 21:47:35 GWCP0-2 %ETH_PORT_CHANNEL-5-PORT_DOWN: port-channel98: port-channel98 is down
2013 Apr 20 21:47:35 GWCP0-2 %ETH_PORT_CHANNEL-5-FOP_CHANGED: port-channel98: first operational port changed from Ethernet1/1 to none
2013 Apr 20 21:47:35 GWCP0-2 %ETHPORT-5-IF_DOWN_PORT_CHANNEL_MEMBERS_DOWN: Interface port-channel98 is down (No operational members)
2013 Apr 20 21:47:35 GWCP0-2 %ETHPORT-5-IF_DOWN_INITIALIZING: Interface Ethernet1/1 is down (Initializing)
2013 Apr 20 21:47:35 GWCP0-2 %ETHPORT-5-IF_DOWN_PORT_CHANNEL_MEMBERS_DOWN: Interface port-channel98 is down (No operational members)
2013 Apr 20 21:47:35 GWCP0-2 %ETHPORT-5-SPEED: Interface port-channel98, operational speed changed to 10 Gbps
2013 Apr 20 21:47:35 GWCP0-2 %ETHPORT-5-IF_DUPLEX: Interface port-channel98, operational duplex mode changed to Full
2013 Apr 20 21:47:35 GWCP0-2 %ETHPORT-5-IF_RX_FLOW_CONTROL: Interface port-channel98, operational Receive Flow Control state changed to off
2013 Apr 20 21:47:35 GWCP0-2 %ETHPORT-5-IF_TX_FLOW_CONTROL: Interface port-channel98, operational Transmit Flow Control state changed to off
2013 Apr 20 21:47:39 GWCP0-2 %ETH_PORT_CHANNEL-5-PORT_UP: port-channel98: Ethernet1/1 is up
2013 Apr 20 21:47:39 GWCP0-2 %ETH_PORT_CHANNEL-5-FOP_CHANGED: port-channel98: first operational port changed from none to Ethernet1/1
2013 Apr 20 21:47:39 GWCP0-2 %ETHPORT-5-IF_UP: Interface Ethernet1/1 is up in mode trunk
2013 Apr 20 21:47:39 GWCP0-2 %ETHPORT-5-IF_UP: Interface port-channel98 is up in mode trunk
Ethernet1/1 is up
Dedicated Interface
Belongs to Po98
93 interface resets
30 seconds input rate 118480 bits/sec, 34 packets/sec
30 seconds output rate 61744 bits/sec, 18 packets/sec
Load-Interval #2: 5 minute (300 seconds)
input rate 113.92 Kbps, 28 pps; output rate 230.39 Kbps, 17 pps
RX
761957889 unicast packets 20849861 multicast packets 6478172 broadcast packets
789285922 input packets 349145216994 bytes
171626124 jumbo packets 0 storm suppression bytes
6 runts 0 giants 3557670 CRC 0 no buffer
3557676 input error 0 short frame 0 overrun 0 underrun 0 ignored
0 watchdog 0 bad etype drop 0 bad proto drop 0 if down drop
0 input with dribble 0 input discard
0 Rx pause
TX
336576988 unicast packets 107914 multicast packets 1665274 broadcast packets
338350176 output packets 189154059253 bytes
91051993 jumbo packets
0 output errors 0 collision 0 deferred 0 late collision
0 lost carrier 0 no carrier 0 babble 0 output discard
0 Tx pause
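These are the commands I've been using to watch the counters (standard NX-OS; the exact output layout varies by release):

show interface ethernet 1/1 counters errors
show interface ethernet 1/1 | include CRC|error
clear counters interface ethernet 1/1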
04-21-2013 01:18 AM
Hi Phillip,
You should be fine to migrate the peer-keepalive to use the mgmt0 interface with no loss of connectivity on any vPC. Per the section "vPC Peer-Keepalive Failure" on page 29 of the Cisco NX-OS Virtual PortChannel: Fundamental Design Concepts with NX-OS 5.0 design guide:
If connectivity of the peer-keepalive link is lost but peer-link connectivity is not changed, nothing happens; both vPC peers continue to synchronize MAC address tables, IGMP entries, and so on. The peer-keepalive link is mostly used when the peer link is lost, and the vPC peers use the peer keepalive to resolve the failure and determine which device should shut down the vPC member ports.
And just to show it's OK, here's an example of what happens when I failed a vPC peer-keepalive link. Initially we can see the vPC is operational, the peer-link (Po1) is up, as are Po101 and Po102 which are vPC to my FEX:
ocs5548-1# sh vpc
Legend:
(*) - local vPC is down, forwarding via vPC peer-link
vPC domain id : 1
Peer status : peer adjacency formed ok
vPC keep-alive status : peer is alive
Configuration consistency status: success
Per-vlan consistency status : success
Type-2 consistency status : success
vPC role : secondary
Number of vPCs configured : 67
Peer Gateway : Enabled
Peer gateway excluded VLANs : -
Dual-active excluded VLANs : -
Graceful Consistency Check : Enabled
vPC Peer-link status
---------------------------------------------------------------------
id Port Status Active vlans
-- ---- ------ --------------------------------------------------
1 Po1 up 10,171-178
vPC status
----------------------------------------------------------------------------
id Port Status Consistency Reason Active vlans
------ ----------- ------ ----------- -------------------------- -----------
101 Po101 up success success -
102 Po102 up success success -
102400 Eth101/1/1 down* Not Consistency Check Not -
Applicable Performed
102401 Eth101/1/2 up success success 171
[..]
At this point my peer-keepalive link (via mgmt0 in this case) is operational:
ocs5548-1# sh vpc peer-keep
vPC keep-alive status : peer is alive
--Peer is alive for : (231853) seconds, (729) msec
--Send status : Success
--Last send at : 2013.04.21 08:37:34 620 ms
--Sent on interface : mgmt0
--Receive status : Success
--Last receive at : 2013.04.21 08:37:34 620 ms
--Received on interface : mgmt0
--Last update from peer : (0) seconds, (346) msec
vPC Keep-alive parameters
--Destination : 192.168.1.6
--Keepalive interval : 1000 msec
--Keepalive timeout : 5 seconds
--Keepalive hold timeout : 3 seconds
--Keepalive vrf : management
--Keepalive udp port : 3200
--Keepalive tos : 192
When I shut the port on my out-of-band switch that connects to the mgmt0 interface I then see the peer-keepalive fail:
ocs5548-1# ter mon
ocs5548-1# 2013 Apr 21 08:39:01.068 ocs5548-1 %IM-5-IM_INTF_STATE: mgmt0 is DOWN in vdc 1
2013 Apr 21 08:39:01.600 ocs5548-1 %VPC-2-PEER_KEEP_ALIVE_RECV_FAIL: In domain 1, VPC peer keep-alive receive has failed
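For completeness, the failure was induced with an ordinary interface shutdown on the out-of-band switch port facing mgmt0; a sketch, assuming a Catalyst-style switch where Gi0/1 is a hypothetical port:

oob-sw(config)# interface GigabitEthernet0/1
oob-sw(config-if)# shutdown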
Looking at the peer-keepalive I can confirm it has failed, but the vPC operational state is unchanged from before the failure:
ocs5548-1# sh vpc peer-keep
vPC keep-alive status : peer is not reachable through peer-keepalive
--Send status : Success
--Last send at : 2013.04.21 08:39:58 620 ms
--Sent on interface :
--Receive status : Failed
--Last update from peer : (62) seconds, (910) msec
vPC Keep-alive parameters
--Destination : 192.168.1.6
--Keepalive interval : 1000 msec
--Keepalive timeout : 5 seconds
--Keepalive hold timeout : 3 seconds
--Keepalive vrf : management
--Keepalive udp port : 3200
--Keepalive tos : 192
ocs5548-1# sh vpc
Legend:
(*) - local vPC is down, forwarding via vPC peer-link
vPC domain id : 1
Peer status : peer adjacency formed ok
vPC keep-alive status : peer is not reachable through peer-keepalive
Configuration consistency status: success
Per-vlan consistency status : success
Type-2 consistency status : success
vPC role : secondary
Number of vPCs configured : 67
Peer Gateway : Enabled
Peer gateway excluded VLANs : -
Dual-active excluded VLANs : -
Graceful Consistency Check : Enabled
vPC Peer-link status
---------------------------------------------------------------------
id Port Status Active vlans
-- ---- ------ --------------------------------------------------
1 Po1 up 10,171-178
vPC status
----------------------------------------------------------------------------
id Port Status Consistency Reason Active vlans
------ ----------- ------ ----------- -------------------------- -----------
101 Po101 up success success -
102 Po102 up success success -
102400 Eth101/1/1 down* Not Consistency Check Not -
Applicable Performed
102401 Eth101/1/2 up success success 171
[..]
My FEX are still on-line...
ocs5548-1# sh fex
FEX FEX FEX FEX
Number Description State Model Serial
------------------------------------------------------------------------
101 FEX0101 Online N2K-C2232PP-10GE SSI155001QZ
102 FEX0102 Online N2K-C2232PP-10GE SSI15460AT7
And when the peer-keepalive is re-established, again we see no operational state changes to the vPC:
2013 Apr 21 08:47:19.068 ocs5548-1 %IM-5-IM_INTF_STATE: mgmt0 is UP in vdc 1
ocs5548-1# sh vpc peer-keep
vPC keep-alive status : peer is alive
--Peer is alive for : (29) seconds, (749) msec
--Send status : Success
--Last send at : 2013.04.21 08:47:49 30 ms
--Sent on interface : mgmt0
--Receive status : Success
--Last receive at : 2013.04.21 08:47:48 763 ms
--Received on interface : mgmt0
--Last update from peer : (0) seconds, (719) msec
vPC Keep-alive parameters
--Destination : 192.168.1.6
--Keepalive interval : 1000 msec
--Keepalive timeout : 5 seconds
--Keepalive hold timeout : 3 seconds
--Keepalive vrf : management
--Keepalive udp port : 3200
--Keepalive tos : 192
Regards
04-21-2013 04:15 AM
Hello Phillip,
I believe something doesn't add up here. To the best of my knowledge a vPC domain never comes up without a working peer keep-alive link, so either the keep-alive link was originally there and was removed later for whatever reason, or the peer adjacency has never really formed. Before you change anything, have a look at 'show vpc' and compare it with the output provided by steve-fuller.
Besides, a disruption of the peer keep-alive link does not explain interface flapping, CRC errors, or packet loss at all. All of that points to a Layer 1 problem. Check the wiring of e1/1, replace the transceivers on both sides and the fiber if necessary, then clear the error counters and check whether they increase again. That would be my highest priority here.
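A rough sequence for that check, with e1/1 taken from your log output (the transceiver command only reports optical levels if the SFP+ supports DOM):

clear counters interface ethernet 1/1
show interface ethernet 1/1 transceiver details
show interface ethernet 1/1 | include CRC|error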
Regards
Pille
04-23-2013 08:18 PM
I ended up shutting down the interface that had all the errors, and that corrected my issues. I traced it to a bad X2 module in my 6513.
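If anyone else hits this, transceiver diagnostics are worth a look before swapping hardware, assuming the optics report DOM data (not all X2s do; Te1/1 below is a placeholder for the actual uplink port):

On the 6513 (IOS):   show interfaces Te1/1 transceiver detail
On the 5548 (NX-OS): show interface ethernet 1/1 transceiver details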
Thanks for all the help!
Phil
04-07-2017 02:29 AM
Hello, I know this thread is quite old, but I need your help. I've configured two N5Ks sharing the same vPC domain, with the peer-keepalive attached to mgmt0. When one keepalive link fails, all vPCs (port channels) become unavailable.
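For reference, these are the outputs I can capture the next time it happens (sh vpc and sh vpc peer-keepalive as used earlier in the thread, plus the consistency check):

show vpc
show vpc peer-keepalive
show vpc consistency-parameters global
show port-channel summary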
Hope someone can point me in the right direction.
Regards,
Lucas Miguel