Re: Cisco Nexus vPC Portchannel issues

motorbass · ‎06-27-2023

Hi! I'm currently facing a network issue between Cisco Nexus 93108 switches (9.3.11) in an LACP port-channel configuration.

The replica below helps us reproduce the issue we had (using 4x Nexus 9000v, qcow2 image version 9.3.9). As you can see, my network topology spans 2 datacenters.

SWC001, SWC002, SWC011 and SWC012 are Nexus 9000v.
Switch1, Switch2, Switch3, OP and OP1 are basic GNS3 switches.

Configuration on each site is similar: 1 vPC domain between 2 switches (pretty basic: 1 peer-link, 1 peer-keepalive, no layer 3, no peer-gateway) and a WAN link through vPC Po100 between both datacenters that allows ALL VLANs to transit.

vPC configuration is consistent and works well: when SWC001 is directly linked to SWC011 and SWC002 to SWC012, everything runs smoothly with no issues. The thing is, in reality there's an ISP between both datacenters, and that's where the pain starts. All we know from the ISP is that they use a QinQ configuration in their own network; as a datacenter client, we don't know which configuration or devices they're using.

Also, before using Nexus, we had old HP core switches and didn't set any particular configuration regarding QinQ. After this HP=>Nexus migration, everything was fine except the status of port-channel 100 (i.e. the extended LAN between both DCs).

To simulate an ISP in between, I set up a basic GNS3 switch configured with QinQ on e0 and e1 (VLAN 1 and Ethertype 0x88A8).

In my example Po100 is configured with only one physical interface (Eth1/13), so the current configuration of Eth1/13 on all 4 switches is:

version 9.3(9) Bios:version

interface Ethernet1/13
  lacp rate fast
  switchport mode trunk
  spanning-tree port type edge trunk
  spanning-tree bpdufilter enable
  channel-group 100 mode active

interface port-channel100
  switchport mode trunk
  spanning-tree port type edge trunk
  spanning-tree bpdufilter enable
  no lacp suspend-individual

With a channel-group active/active configuration (after enabling feature lacp), the port-channel protocol is LACP, but the port-channel stays Switched and Down (SD), with the member port in Individual (I) state:

SWC001# sh port-channel summary interface port-channel 100
Flags:  D - Down        P - Up in port-channel (members)
        I - Individual  H - Hot-standby (LACP only)
        s - Suspended   r - Module-removed
        b - BFD Session Wait
        S - Switched    R - Routed
        U - Up (port-channel)
        p - Up in delay-lacp mode (member)
        M - Not in use. Min-links not met
--------------------------------------------------------------------------------
Group Port-       Type     Protocol  Member Ports
      Channel
--------------------------------------------------------------------------------
100   Po100(SD)   Eth      LACP      Eth1/13(I)

If we set "channel-group 100 mode on" on both sides, the port-channel protocol is NONE, but the interface is up and the Po is Switched/Up:

FRHD01SWC001(config-if)# sh port-channel summary interface port-channel 100
Flags:  D - Down        P - Up in port-channel (members)
        I - Individual  H - Hot-standby (LACP only)
        s - Suspended   r - Module-removed
        b - BFD Session Wait
        S - Switched    R - Routed
        U - Up (port-channel)
        p - Up in delay-lacp mode (member)
        M - Not in use. Min-links not met
--------------------------------------------------------------------------------
Group Port-       Type     Protocol  Member Ports
      Channel
--------------------------------------------------------------------------------
100   Po100(SU)   Eth      NONE      Eth1/13(P)
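
For readers less used to this output, here is a minimal Python sketch that decodes tokens such as Po100(SD) using the flag legend printed above. The function name `decode` and the regular expression are illustrative assumptions, not part of any Cisco tooling; the legend itself is copied from the output in this post.

```python
import re

# Flag legend copied from the "show port-channel summary" output above.
# Flags are case-sensitive: "s" (Suspended) is not the same as "S" (Switched).
FLAGS = {
    "D": "Down",
    "P": "Up in port-channel (members)",
    "I": "Individual",
    "H": "Hot-standby (LACP only)",
    "s": "Suspended",
    "r": "Module-removed",
    "b": "BFD Session Wait",
    "S": "Switched",
    "R": "Routed",
    "U": "Up (port-channel)",
    "p": "Up in delay-lacp mode (member)",
    "M": "Not in use. Min-links not met",
}


def decode(token):
    """Split a token such as 'Po100(SD)' into (name, [flag meanings])."""
    match = re.fullmatch(r"(\S+?)\((\w+)\)", token)
    if match is None:
        raise ValueError(f"unexpected token: {token!r}")
    name, letters = match.groups()
    return name, [FLAGS[letter] for letter in letters]


# The four tokens that appear in the two outputs above.
for token in ("Po100(SD)", "Eth1/13(I)", "Po100(SU)", "Eth1/13(P)"):
    print(decode(token))
```

With mode active the member is reported as Individual (no LACP partner seen), whereas with mode on it is simply bundled without any negotiation, which matches the two outputs.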

Now, if we look at the LACP counters, LACPDUs are sent from both sides, but 0 are received in either direction.

SWC001# show lacp counters
NOTE: Clear lacp counters to get accurate statistics

------------------------------------------------------------------------------
                             LACPDUs                      Markers/Resp LACPDUs
Port              Sent                Recv                  Recv Sent  Pkts Err
------------------------------------------------------------------------------
port-channel1
Ethernet1/11       28                   24                     0      0    0


port-channel100
Ethernet1/13       12                   0                      0      0    0
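
To spot this pattern across many ports, a small Python sketch like the one below could flag members that send LACPDUs but never receive any. It assumes the column layout shown in the output above (port name, Sent, Recv, then three more counters); the function name `one_way_ports` and the `SAMPLE` text are made up for illustration, and other NX-OS releases may format the table differently.

```python
import re

# Data rows copied from the "show lacp counters" output above.
SAMPLE = """\
port-channel1
Ethernet1/11       28                   24                     0      0    0

port-channel100
Ethernet1/13       12                   0                      0      0    0
"""


def one_way_ports(text):
    """Return (port_channel, member, sent) for members with Sent > 0 and Recv == 0."""
    suspects = []
    channel = None
    for line in text.splitlines():
        fields = line.split()
        if len(fields) == 1 and fields[0].startswith("port-channel"):
            # A line holding only the port-channel name opens a new group.
            channel = fields[0]
        elif len(fields) == 6 and re.fullmatch(r"Ethernet\d+/\d+", fields[0]):
            sent, recv = int(fields[1]), int(fields[2])
            if sent > 0 and recv == 0:
                suspects.append((channel, fields[0], sent))
    return suspects


print(one_way_ports(SAMPLE))
```

On the sample above only Ethernet1/13 in port-channel100 is reported; the peer-link member Ethernet1/11 both sends and receives.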

Finally, when I look at sh vpc brief, the Po100 status is down whereas consistency is success:

vPC Peer-link status
---------------------------------------------------------------------
id    Port   Status Active vlans
--    ----   ------ -------------------------------------------------
1     Po1    up     1,5,10-11,13-14,22-23,30,41,44,50-51,55,80-81,90,
                    100,110-114,120-121,130-135,140-141,150-154,
                    160-174,176,180,201-205,230,256-257,999


vPC status
----------------------------------------------------------------------------
Id    Port          Status Consistency Reason                Active vlans
--    ------------  ------ ----------- ------                ---------------
100   Po100         down*  success     success               -

We tried different configurations in GNS3 to narrow down where the issue could be (active/active, active/passive and on/on), but we still have no idea what causes it. We also had a look at the spanning-tree part, but it seems fine. It seems that as soon as anything is connected between the two Nexus, LACP fails, even though the link may be up (but no packets are received).

Has anyone already had an issue like this, with or without QinQ in between?

(PS: I know GNS3 is an emulator, but at the moment it's the best way to test configuration outside of production, and it's a nice tool for showing you the topology.)

MHM Cisco World · ‎06-27-2023

If you remove the SW in the middle, then Po100 interconnecting the two sites DC1 and DC2 should come UP, with the member port showing "P".

motorbass · ‎06-27-2023

Hi

I didn't mention it, but yes, it works that way; that was my first try in GNS3. Unfortunately this is not the reality: as I wrote, as soon as we have any device between the DCs, the Po won't come up.

MHM Cisco World · ‎06-28-2023

Friend, this is not an issue with vPC or with GNS3.
LACP needs to see LACP frames from the same MAC. When you use two SWs you have two MACs, so to make your network work you need to merge both SWs into one virtual SW via VSS or a stack.
Or, for the interconnect between the DCs, ask the ISP whether they support mLAG. mLAG can merge the two ISP SWs/routers into ONE virtual device, so that your Po between the two sites comes UP and works.

motorbass · ‎06-28-2023

"LACP needs to see LACP frames from the same MAC. When you use two SWs you have two MACs, so to make your network work you need to merge both SWs into one virtual SW via VSS or a stack."

With our old HP core switches, the configuration was basic (1x LACP with 2 interfaces, as you can see in the screenshot) and everything worked fine. I mean, I'd like to know what could cause the difference in behaviour between the HP core switches and the Nexus, as the configuration is pretty simple (from an interface perspective).

(About the DC and ISPs, I'll keep that in mind, but unfortunately I know they won't change anything, as the datacenter hosts many clients.)

MHM Cisco World · ‎06-28-2023

Does the HP core work fine with the same SP?

show lacp count
show lacp neighbor
I need to see both outputs from the real network.

motorbass · ‎06-28-2023

Yes, the HP core switches work well with the same service provider here; that's why we don't understand why we have an issue with Nexus. Here's a screenshot from the HP core switch, plus another one of the LACP neighbors and counters from Nexus (unfortunately from GNS3 at the moment).

MHM Cisco World · ‎06-28-2023

For HP, LACPDUs are both sent and received, which is why it works fine. I don't clearly understand the system-id in the photo you shared, but if the neighbor system-id ends with f0 00, then LACP sees the same neighbor on all interfaces.
For the Nexus (NSK) in GNS3, you can see the partner system-id is 0 0 0 0 and the LACP counters show a few LACPDUs sent without anything received.
So, as I mentioned before, it depends on the SP.

NOTE: in GNS3, try removing the SW and check the LACP counters and the LACP neighbor system-id.

motorbass · ‎06-29-2023

Yes, I agree: it works on HP because both sides receive LACPDUs from each other, and that's the issue on the Nexus 9000v, they send but never receive LACPDUs. Still, in GNS3 it works when you connect 2 Nexus without any equipment in between, just as it works with 2 real Nexus directly connected. However, I put a switch or a third Nexus between them to check why they can't receive LACPDUs.

MHM Cisco World · ‎06-29-2023

NSK-1xSW-NSK <<- this is what you run in your lab. It will never work: the SW in the middle never passes LACP through from one side to the other.
NSK-1xCSR1000-NSK <<- this can work if we configure a bridge domain on the CSR1000 (l2vpn), which passes LACP through from one interface to the other.
NSK-2xCSR1000-NSK <<- this works if both CSR1000s are configured with mLAG and with a bridge domain.

You configured the two DCs, but you forgot how the SP is configured.
In the real network HP works because the SP surely runs l2vpn, which passes LACP through from one interface to the other.
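
One likely reason a plain switch in the middle does not pass LACP: LACPDUs are sent to the Slow Protocols multicast address 01-80-C2-00-00-02, and a standard 802.1D/802.1Q customer bridge does not forward frames addressed to the reserved block 01-80-C2-00-00-00 through 01-80-C2-00-00-0F. The Python sketch below only illustrates that address check; the function name is hypothetical, and how the real provider (or the GNS3 switch) actually treats these frames is not known from this thread.

```python
# Destination MAC used by LACPDUs (IEEE Slow Protocols multicast address).
SLOW_PROTOCOLS_MAC = "01:80:c2:00:00:02"


def is_reserved_link_local(mac):
    """True if mac lies in 01-80-C2-00-00-00 .. 01-80-C2-00-00-0F."""
    octets = [int(part, 16) for part in mac.replace("-", ":").split(":")]
    if len(octets) != 6:
        raise ValueError(f"not a 6-octet MAC address: {mac!r}")
    return octets[:5] == [0x01, 0x80, 0xC2, 0x00, 0x00] and octets[5] <= 0x0F


# LACP falls in the reserved block, so a standard bridge would not relay it.
print(is_reserved_link_local(SLOW_PROTOCOLS_MAC))
# An ordinary broadcast frame is outside the block and is flooded as usual.
print(is_reserved_link_local("ff:ff:ff:ff:ff:ff"))
```

This is why the link can be up and carry user traffic with mode on, while LACP negotiation still sees no partner unless the device in between tunnels or passes these frames through.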

Then what can I do in this case in my lab?
1- use one SW between the two DCs
2- configure LACP on the SW toward each side
2-A make sure the DCs and the SW use the same STP mode
2-B make sure the SW sends and receives LACP

Friend, this summarizes what you are facing.