04-20-2013 09:40 PM - edited 03-07-2019 12:56 PM
I have a pair of Nexus 5548UPs that have some high-priority servers running on them. The servers are ESX hosts running Nexus 1000vs, and each host has multiple connections in a vPC to both 5548s. We have been having intermittent ping loss and slow traffic to the VMs on these hosts. While digging into the issue I found that the peer-keepalive command is not set to send the heartbeat across the mgmt0 interface. I would like to change this so it points across mgmt0. Can I do this live without causing any issues? Does anyone have tips or advice on making this change with production servers running on the switches? I do not want to cause any loss to any systems when I make this change. The current config is below, followed by a sketch of the change I'm planning.
"Switch 2"
vpc domain 101
role priority 22222
peer-keepalive destination 172.27.1.18 source 172.27.1.19
auto-recovery
"Switch 1"
vpc domain 101
role priority 11111
peer-keepalive destination 172.27.1.19 source 172.27.1.18
auto-recovery
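The change I have in mind is just re-pointing the keepalive at the mgmt0 addresses in the management VRF, along these lines (the 10.1.1.x addresses are placeholders; I'd substitute our actual mgmt0 IPs):

Switch 1:
conf t
 vpc domain 101
  peer-keepalive destination 10.1.1.2 source 10.1.1.1 vrf management

Switch 2:
conf t
 vpc domain 101
  peer-keepalive destination 10.1.1.1 source 10.1.1.2 vrf management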
I've also just noticed tonight that we are getting a lot of input errors on one of the 10G links going from 5548-2 back to Core 6513-1. The link from 5548-1 back to Core 6513-1 does not have any input errors. The log also shows the interface going down and coming back up. I'm wondering whether the peer-keepalive not running over mgmt0 is the culprit for this link flapping in the vPC.
2013 Apr 20 21:47:35 GWCP0-2 %ETH_PORT_CHANNEL-5-PORT_DOWN: port-channel98: Ethernet1/1 is down
2013 Apr 20 21:47:35 GWCP0-2 %ETH_PORT_CHANNEL-5-PORT_DOWN: port-channel98: port-channel98 is down
2013 Apr 20 21:47:35 GWCP0-2 %ETH_PORT_CHANNEL-5-FOP_CHANGED: port-channel98: first operational port changed from Ethernet1/1 to none
2013 Apr 20 21:47:35 GWCP0-2 %ETHPORT-5-IF_DOWN_PORT_CHANNEL_MEMBERS_DOWN: Interface port-channel98 is down (No operational members)
2013 Apr 20 21:47:35 GWCP0-2 %ETHPORT-5-IF_DOWN_INITIALIZING: Interface Ethernet1/1 is down (Initializing)
2013 Apr 20 21:47:35 GWCP0-2 %ETHPORT-5-IF_DOWN_PORT_CHANNEL_MEMBERS_DOWN: Interface port-channel98 is down (No operational members)
2013 Apr 20 21:47:35 GWCP0-2 %ETHPORT-5-SPEED: Interface port-channel98, operational speed changed to 10 Gbps
2013 Apr 20 21:47:35 GWCP0-2 %ETHPORT-5-IF_DUPLEX: Interface port-channel98, operational duplex mode changed to Full
2013 Apr 20 21:47:35 GWCP0-2 %ETHPORT-5-IF_RX_FLOW_CONTROL: Interface port-channel98, operational Receive Flow Control state changed to off
2013 Apr 20 21:47:35 GWCP0-2 %ETHPORT-5-IF_TX_FLOW_CONTROL: Interface port-channel98, operational Transmit Flow Control state changed to off
2013 Apr 20 21:47:39 GWCP0-2 %ETH_PORT_CHANNEL-5-PORT_UP: port-channel98: Ethernet1/1 is up
2013 Apr 20 21:47:39 GWCP0-2 %ETH_PORT_CHANNEL-5-FOP_CHANGED: port-channel98: first operational port changed from none to Ethernet1/1
2013 Apr 20 21:47:39 GWCP0-2 %ETHPORT-5-IF_UP: Interface Ethernet1/1 is up in mode trunk
2013 Apr 20 21:47:39 GWCP0-2 %ETHPORT-5-IF_UP: Interface port-channel98 is up in mode trunk
Ethernet1/1 is up
Dedicated Interface
Belongs to Po98
93 interface resets
30 seconds input rate 118480 bits/sec, 34 packets/sec
30 seconds output rate 61744 bits/sec, 18 packets/sec
Load-Interval #2: 5 minute (300 seconds)
input rate 113.92 Kbps, 28 pps; output rate 230.39 Kbps, 17 pps
RX
761957889 unicast packets 20849861 multicast packets 6478172 broadcast packets
789285922 input packets 349145216994 bytes
171626124 jumbo packets 0 storm suppression bytes
6 runts 0 giants 3557670 CRC 0 no buffer
3557676 input error 0 short frame 0 overrun 0 underrun 0 ignored
0 watchdog 0 bad etype drop 0 bad proto drop 0 if down drop
0 input with dribble 0 input discard
0 Rx pause
TX
336576988 unicast packets 107914 multicast packets 1665274 broadcast packets
338350176 output packets 189154059253 bytes
91051993 jumbo packets
0 output errors 0 collision 0 deferred 0 late collision
0 lost carrier 0 no carrier 0 babble 0 output discard
0 Tx pause
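These are the commands I've been using to watch the counters (standard NX-OS; the exact output layout varies by release):

show interface ethernet 1/1 counters errors
show interface ethernet 1/1 | include CRC|error
clear counters interface ethernet 1/1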
04-21-2013 01:18 AM
Hi Phillip,
You should be fine to migrate the peer-keepalive to use the mgmt0 interface with no loss of connectivity on any vPC. Per the section "vPC Peer-Keepalive Failure" on page 29 of the Cisco NX-OS Virtual PortChannel: Fundamental Design Concepts with NX-OS 5.0 design guide:
If connectivity of the peer-keepalive link is lost but peer-link connectivity is not changed, nothing happens; both vPC peers continue to synchronize MAC address tables, IGMP entries, and so on. The peer-keepalive link is mostly used when the peer link is lost, and the vPC peers use the peer keepalive to resolve the failure and determine which device should shut down the vPC member ports.
And just to show it's OK, here's an example of what happens when I failed a vPC peer-keepalive link. Initially we can see the vPC is operational, the peer-link (Po1) is up, as are Po101 and Po102 which are vPC to my FEX:
ocs5548-1# sh vpc
Legend:
(*) - local vPC is down, forwarding via vPC peer-link
vPC domain id : 1
Peer status : peer adjacency formed ok
vPC keep-alive status : peer is alive
Configuration consistency status: success
Per-vlan consistency status : success
Type-2 consistency status : success
vPC role : secondary
Number of vPCs configured : 67
Peer Gateway : Enabled
Peer gateway excluded VLANs : -
Dual-active excluded VLANs : -
Graceful Consistency Check : Enabled
vPC Peer-link status
---------------------------------------------------------------------
id Port Status Active vlans
-- ---- ------ --------------------------------------------------
1 Po1 up 10,171-178
vPC status
----------------------------------------------------------------------------
id Port Status Consistency Reason Active vlans
------ ----------- ------ ----------- -------------------------- -----------
101 Po101 up success success -
102 Po102 up success success -
102400 Eth101/1/1 down* Not Consistency Check Not -
Applicable Performed
102401 Eth101/1/2 up success success 171
[..]
At this point my peer-keepalive link (via mgmt0 in this case) is operational:
ocs5548-1# sh vpc peer-keep
vPC keep-alive status : peer is alive
--Peer is alive for : (231853) seconds, (729) msec
--Send status : Success
--Last send at : 2013.04.21 08:37:34 620 ms
--Sent on interface : mgmt0
--Receive status : Success
--Last receive at : 2013.04.21 08:37:34 620 ms
--Received on interface : mgmt0
--Last update from peer : (0) seconds, (346) msec
vPC Keep-alive parameters
--Destination : 192.168.1.6
--Keepalive interval : 1000 msec
--Keepalive timeout : 5 seconds
--Keepalive hold timeout : 3 seconds
--Keepalive vrf : management
--Keepalive udp port : 3200
--Keepalive tos : 192
When I shut the port on my out-of-band switch that connects to the mgmt0 interface I then see the peer-keepalive fail:
ocs5548-1# ter mon
ocs5548-1# 2013 Apr 21 08:39:01.068 ocs5548-1 %IM-5-IM_INTF_STATE: mgmt0 is DOWN in vdc 1
2013 Apr 21 08:39:01.600 ocs5548-1 %VPC-2-PEER_KEEP_ALIVE_RECV_FAIL: In domain 1, VPC peer keep-alive receive has failed
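For completeness, the failure was induced with an ordinary interface shutdown on the out-of-band switch port facing mgmt0; a sketch, assuming a Catalyst-style switch where Gi0/1 is a hypothetical port:

oob-sw(config)# interface GigabitEthernet0/1
oob-sw(config-if)# shutdown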
Looking at the peer-keepalive I can confirm it has failed, but the vPC operational state is unchanged from before the failure:
ocs5548-1# sh vpc peer-keep
vPC keep-alive status : peer is not reachable through peer-keepalive
--Send status : Success
--Last send at : 2013.04.21 08:39:58 620 ms
--Sent on interface :
--Receive status : Failed
--Last update from peer : (62) seconds, (910) msec
vPC Keep-alive parameters
--Destination : 192.168.1.6
--Keepalive interval : 1000 msec
--Keepalive timeout : 5 seconds
--Keepalive hold timeout : 3 seconds
--Keepalive vrf : management
--Keepalive udp port : 3200
--Keepalive tos : 192
ocs5548-1# sh vpc
Legend:
(*) - local vPC is down, forwarding via vPC peer-link
vPC domain id : 1
Peer status : peer adjacency formed ok
vPC keep-alive status : peer is not reachable through peer-keepalive
Configuration consistency status: success
Per-vlan consistency status : success
Type-2 consistency status : success
vPC role : secondary
Number of vPCs configured : 67
Peer Gateway : Enabled
Peer gateway excluded VLANs : -
Dual-active excluded VLANs : -
Graceful Consistency Check : Enabled
vPC Peer-link status
---------------------------------------------------------------------
id Port Status Active vlans
-- ---- ------ --------------------------------------------------
1 Po1 up 10,171-178
vPC status
----------------------------------------------------------------------------
id Port Status Consistency Reason Active vlans
------ ----------- ------ ----------- -------------------------- -----------
101 Po101 up success success -
102 Po102 up success success -
102400 Eth101/1/1 down* Not Consistency Check Not -
Applicable Performed
102401 Eth101/1/2 up success success 171
[..]
My FEX are still on-line...
ocs5548-1# sh fex
FEX FEX FEX FEX
Number Description State Model Serial
------------------------------------------------------------------------
101 FEX0101 Online N2K-C2232PP-10GE SSI155001QZ
102 FEX0102 Online N2K-C2232PP-10GE SSI15460AT7
And when the peer-keepalive is re-established, again we see no operational state changes to the vPC:
2013 Apr 21 08:47:19.068 ocs5548-1 %IM-5-IM_INTF_STATE: mgmt0 is UP in vdc 1
ocs5548-1# sh vpc peer-keep
vPC keep-alive status : peer is alive
--Peer is alive for : (29) seconds, (749) msec
--Send status : Success
--Last send at : 2013.04.21 08:47:49 30 ms
--Sent on interface : mgmt0
--Receive status : Success
--Last receive at : 2013.04.21 08:47:48 763 ms
--Received on interface : mgmt0
--Last update from peer : (0) seconds, (719) msec
vPC Keep-alive parameters
--Destination : 192.168.1.6
--Keepalive interval : 1000 msec
--Keepalive timeout : 5 seconds
--Keepalive hold timeout : 3 seconds
--Keepalive vrf : management
--Keepalive udp port : 3200
--Keepalive tos : 192
Regards
04-21-2013 04:15 AM
Hello Phillip,
I believe something doesn't add up here. To the best of my knowledge a vPC domain never comes up without a working peer keep-alive link, so either the keep-alive link was originally there and was removed later for whatever reason, or the peer adjacency has never really formed. Before you change anything, have a look at 'show vpc' and compare it with the output provided by steve-fuller.
Besides, a disruption of the peer keep-alive link does not explain interface flapping, CRC errors, or packet loss at all. All of that points to a Layer 1 problem. Check the wiring of e1/1, replace the transceivers on both sides and the fiber if necessary, then clear the error counters and check whether they increase again. That would be my highest priority here.
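A rough sequence for that check, with e1/1 taken from your log output (the transceiver command only reports optical levels if the SFP+ supports DOM):

clear counters interface ethernet 1/1
show interface ethernet 1/1 transceiver details
show interface ethernet 1/1 | include CRC|error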
Regards
Pille
04-23-2013 08:18 PM
I ended up shutting down the interface that had all the errors, and that corrected my issues. I traced it to a bad X2 module in my 6513.
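If anyone else hits this, transceiver diagnostics are worth a look before swapping hardware, assuming the optics report DOM data (not all X2s do; Te1/1 below is a placeholder for the actual uplink port):

On the 6513 (IOS):   show interfaces Te1/1 transceiver detail
On the 5548 (NX-OS): show interface ethernet 1/1 transceiver details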
Thanks for all the help!
Phil
04-07-2017 02:29 AM
Hello, I know this thread is quite old, but I need your help. I've configured two N5Ks sharing the same vPC domain, with the peer-keepalive attached to mgmt0. When one keepalive link fails, all vPCs (port channels) become unavailable.
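For reference, these are the outputs I can capture the next time it happens (sh vpc and sh vpc peer-keepalive as used earlier in the thread, plus the consistency check):

show vpc
show vpc peer-keepalive
show vpc consistency-parameters global
show port-channel summary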
Hope someone can point me in the right direction.
Regards,
Lucas Miguel