Server LACP SystemID change, Cisco VPC Port Suspension

brucesmithit · ‎05-25-2023

Good Day All,

Our server has an aggregation "externalaggr0" that runs LACP active with 2 members.

6c:fe:54:2:b9:21 and 6c:fe:54:4:5c:61.

LACP was using the MAC address of member 6c:fe:54:4:5c:61 as the SystemID, but we removed that interface from the aggregation, so the aggregation and LACP started using 6c:fe:54:2:b9:21 as the SystemID. Interface 6c:fe:54:2:b9:21 stayed up and wasn't suspended; I validated in a packet capture that it was using this SystemID. When 6c:fe:54:4:5c:61 was added back into the aggregation, the Cisco switch vPC suspended the switch port due to a lag-id mismatch.

Server:

externalaggr0 -- 10000Mb full up 6c:fe:54:2:b9:21 --
i40e3 10000Mb full up 6c:fe:54:2:b9:21 attached
i40e1 0Mb unknown down 6c:fe:54:4:5c:61 standby

Switch:

Name    Type  Local Value                            Peer Value
------  ----  -------------------------------------  -------------------------------------
lag-id  1     [(1000, 6c-fe-54-4-5c-61, 3e9, 0, 0),  [(1000, 6c-fe-54-2-b9-21, 3e9, 0, 0),
               (7f9b, 0-23-4-ee-be-b4, 82ac, 0, 0)]   (7f9b, 0-23-4-ee-be-b4, 82ac, 0, 0)]
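
To make the mismatch easier to spot, here is a minimal Python sketch that compares the two lag-id values field by field. The tuples are copied from the switch output above. The field labels in `FIELDS` are assumptions for illustration (the usual system priority, system MAC, key, port priority, port ordering); the switch itself prints bare tuples.

```python
# Field labels are assumptions for illustration; the switch prints bare tuples.
FIELDS = ("system_priority", "system_mac", "key", "port_priority", "port")

# Values copied from the lag-id row above (kept as printed, not converted).
local_lag_id = [
    ("1000", "6c-fe-54-4-5c-61", "3e9", "0", "0"),
    ("7f9b", "0-23-4-ee-be-b4", "82ac", "0", "0"),
]
peer_lag_id = [
    ("1000", "6c-fe-54-2-b9-21", "3e9", "0", "0"),
    ("7f9b", "0-23-4-ee-be-b4", "82ac", "0", "0"),
]


def diff_lag_ids(local, peer):
    """Return (tuple_index, field_name, local_value, peer_value) for each differing field."""
    diffs = []
    for idx, (l_end, p_end) in enumerate(zip(local, peer)):
        for name, l_val, p_val in zip(FIELDS, l_end, p_end):
            if l_val != p_val:
                diffs.append((idx, name, l_val, p_val))
    return diffs


for idx, name, l_val, p_val in diff_lag_ids(local_lag_id, peer_lag_id):
    print(f"tuple {idx}: {name} differs: local={l_val} peer={p_val}")
```

With the values posted, the only difference is the system MAC in the first tuple, which carries the server's MAC addresses; the second tuple is identical on both vPC peers.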

I'd like to understand whether there is a way to configure our switch ports so that if the SystemID changes on the server side, the switch carries on and accepts the new SystemID, or whether there is a command that can be run to clear out the cached values. I assumed this information would be learnt from the LACP protocol.

From what I've read, the lag-id is generated as follows:
When creating a port channel, LACP does its due diligence and compares a few values. The first is the System ID. This is a unique value on each switch. It is a 64-bit value which can be manually assigned or automatically generated. Auto assignment comes from the system priority and the System MAC address. The second value is an identifier for the LAG, such as the port-channel ID. These four values (two at each end) create a LAGID (Link Aggregation Group Identifier). All ports in a port-channel must have the same LAGID. If there is a mismatch, the ports are excluded or the LAG stays down, depending on the implementation.
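
As a rough illustration of the 64-bit System ID described above, the sketch below packs a 16-bit system priority and a 48-bit MAC into one integer. The helper names `parse_mac` and `system_id` are hypothetical, and the priority `0x1000` assumes the `1000` printed in the switch's lag-id is hexadecimal.

```python
def parse_mac(text: str) -> bytes:
    """Parse a MAC written with ':' or '-' separators; short octets like '4' are allowed."""
    octets = text.replace("-", ":").split(":")
    if len(octets) != 6:
        raise ValueError(f"expected 6 octets, got {len(octets)}: {text!r}")
    return bytes(int(o, 16) for o in octets)


def system_id(priority: int, mac: str) -> int:
    """Pack a 16-bit system priority and a 48-bit MAC into one 64-bit integer."""
    if not 0 <= priority <= 0xFFFF:
        raise ValueError("priority must fit in 16 bits")
    return (priority << 48) | int.from_bytes(parse_mac(mac), "big")


# Assumption: priority 0x1000, taken from the "1000" in the posted lag-id.
old = system_id(0x1000, "6c:fe:54:4:5c:61")
new = system_id(0x1000, "6c:fe:54:2:b9:21")
print(f"{old:016x}")  # 10006cfe54045c61
print(f"{new:016x}")  # 10006cfe5402b921
print(old == new)     # False
```

Because the MAC makes up the low 48 bits, swapping which member's MAC the aggregation uses changes the System ID even though the priority stays the same.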

Thank You,

Bruce Smith

brucesmithit · ‎05-25-2023

In other setups we usually set a unique identifier for MLAG / CLAG. This being dynamic is cool, but if it doesn't update when LACP changes, I feel like this is a bug?

Reza Sharifi · ‎05-25-2023

Hi,

Can you post the Portchannel config from the switch?

HTH

MHM Cisco World · ‎05-26-2023

The System ID must be the same for all member ports of an LACP PO.
show lacp internal <<- this will give us some detail.
For now, I think the PO between the server and the NSK is using one link, not two, i.e. the server is using active/standby teaming.

brucesmithit · ‎05-26-2023

Thanks for the quick response,

I'm working with a network team internally to get information (separate teams), so pardon the latency.

They've sent me some screen shots "_".
Link Aggr Type: VPC
Attached Entity Profile: $name
CDP Policy: cdp-disabled
Link Level Policy: $name
LLDP Policy: enabled
Port Channel Policy: lacp-active_int-pol

lacp-active_int-pol
Mode: LACP Active
Control: Fast Select Hot Standby Ports, Graceful Convergence
Min: 1
Max: 16

template port-channel $Removed
vlan-domain member $Removed type phys
vlan-domain member $Removed type l3ext
switchport trunk allowed vlan $vlan tenant $tenant application $app epg $epg
... all vlans allowed here
channel-mode active
no lacp suspend-individual
spanning-tree bpdu-filter enable
spanning-tree bpdu-guard enable
speed 10G
media-type auto
exit

brucesmithit · ‎05-26-2023

Tried to embed the image in the message, but attached it as well. Bad aspect ratio.

brucesmithit · ‎05-26-2023

Everyone,

I believe I may have found some useful information about this (but maybe not). It looks like there is a service called CFS that is supposed to track and update these things; maybe there is a knob that can be adjusted so we don't get lag-id mismatches.

https://itnetworkingpros.files.wordpress.com/2014/04/vpc_best_practices_design_guide.pdf
(Pages 18-20)

Cisco Fabric Services (CFS) Protocol
Cisco Fabric Services (CFS) protocol provides reliable synchronization and consistency check mechanisms
between the 2 peer devices and runs on top of vPC peer-link. The protocol was first implemented on MDS products
(network storage devices) and then ported to NEXUS 7000.
Cisco Fabric Services (CFS) protocol performs the following functions:
- Configuration validation and comparison (consistency check)
- Synchronization of MAC addresses for vPC member ports
- vPC member port status advertisement
- Spanning Tree Protocol management
- Synchronization of HSRP and IGMP snooping

Checking vPC Configuration Consistency When You Build a vPC Domain
This section contains recommendations to help ensure that there are no incompatible parameters when building a
vPC domain.

Both switches in the vPC domain maintain distinct control planes. Cisco Fabric Services protocol will take care of
state synchronization between both peers (including the MAC address table, Internet Group Management Protocol
(IGMP) state, vPC states and so on.)
System configuration must be kept in sync. Currently this is a manual process (configuration is done separately on
each device) with an automated consistency check to help ensure correct network behavior.

There are two types of consistency checks:
- Type 1 - Puts peer device or interface into a suspended state to prevent invalid packet forwarding behavior.
With vPC Graceful Consistency check, suspension occurs only on the secondary peer device.
- Type 2 - Peer device or Interface still forward traffic. However they are subject to undesired packet
forwarding behavior.
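
As a rough illustration of the difference between the two check types, the sketch below models them in Python. It only mirrors the description quoted above and is not how CFS is implemented. The `check_consistency` helper and the contents of `TYPE1_PARAMS` are hypothetical; `lag-id` is listed as type 1 because the switch output earlier in this thread shows it with Type 1.

```python
# Hypothetical classification for illustration; real switches define their own lists.
TYPE1_PARAMS = {"lag-id", "mode", "speed"}


def check_consistency(local: dict, peer: dict, role: str, graceful: bool = True) -> dict:
    """Model the outcome of comparing one vPC's parameters on both peers.

    role is "primary" or "secondary" for the switch evaluating the check.
    """
    keys = local.keys() | peer.keys()
    mismatched = sorted(k for k in keys if local.get(k) != peer.get(k))
    type1 = [k for k in mismatched if k in TYPE1_PARAMS]
    type2 = [k for k in mismatched if k not in TYPE1_PARAMS]
    # Type 1 mismatches suspend the vPC; with the graceful check enabled,
    # only the secondary peer suspends. Type 2 mismatches never suspend.
    suspend = bool(type1) and (not graceful or role == "secondary")
    return {"type1": type1, "type2": type2, "suspend": suspend}


result = check_consistency(
    {"lag-id": "server-mac-A"},
    {"lag-id": "server-mac-B"},
    role="secondary",
)
print(result)
```

In this model a lag-id mismatch suspends the port on the secondary peer only, which is the behaviour the guide attributes to the graceful consistency check.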

brucesmithit · ‎05-26-2023

Ok,

Discovered something interesting. I'm waiting to hear back from the networking team, but maybe someone already knows the answer:
One of my servers receives LACP PDUs while it's in its vPC lag-id mismatch, which eventually leads to it learning the LACP SystemID that matches and resolving the suspension.

The other server receives no LACP PDUs while it's suspended.

I've requested the vPC, LACP, and Portchannel configurations for the switch ports they are both plugged into.

#Is getting PDUs after suspension
[root@ac-1f-6b-bb-d0-53 (us-east-4) ~]# dladm show-aggr -x
LINK PORT SPEED DUPLEX STATE ADDRESS PORTSTATE
adminaggr0 -- 10000Mb full up ac:1f:6b:bb:d0:53 --
ixgbe3 10000Mb full up ac:1f:6b:bb:d0:53 attached
ixgbe1 10000Mb full up ac:1f:6b:f8:2b:e7 attached
externalaggr0 -- 10000Mb full up ac:1f:6b:bb:d0:52 --
ixgbe2 10000Mb full up ac:1f:6b:bb:d0:52 attached
ixgbe0 10000Mb full up ac:1f:6b:f8:2b:e6 attached
[root@ac-1f-6b-bb-d0-53 (us-east-4) ~]# dladm remove-aggr adminaggr0 -l ixgbe3
[root@ac-1f-6b-bb-d0-53 (us-east-4) ~]# dladm show-phys
LINK MEDIA STATE SPEED DUPLEX DEVICE
ixgbe0 Ethernet up 10000 full ixgbe0
ixgbe2 Ethernet up 10000 full ixgbe2
ixgbe1 Ethernet up 10000 full ixgbe1
ixgbe3 Ethernet down 0 unknown ixgbe3
(reverse-i-search)`pcap': snoop -r -d ixgbe3 -o /var/tmp/woot.^Cap ethertype 0x8809
(reverse-i-search)`ix': dladm remove-aggr adminaggr0 -l ^Cgbe3
[root@ac-1f-6b-bb-d0-53 (us-east-4) ~]# snoop -r -d ixgbe3 -o /var/tmp/ixgbe3-woot.pcap ethertype 0x8809
Using device ixgbe3 (promiscuous mode)
10 ^C

#Is not getting PDUs after being suspended
[root@3c-ec-ef-d8-89-68 (us-east-4) ~]# dladm show-aggr -x
LINK PORT SPEED DUPLEX STATE ADDRESS PORTSTATE
adminaggr0 -- 25000Mb full up 3c:ec:ef:d8:ba:aa --
i40e0 25000Mb full up 3c:ec:ef:d8:ba:aa attached
i40e2 25000Mb full up 3c:ec:ef:d8:89:68 attached
storageaggr0 -- 100000Mb full up e8:eb:d3:a7:ab:da --
mlxcx0 100000Mb full up e8:eb:d3:a7:ab:da attached
mlxcx1 100000Mb full up e8:eb:d3:a7:ab:db attached
externalaggr0 -- 25000Mb full up 3c:ec:ef:d8:ba:ab --
i40e1 25000Mb full up 3c:ec:ef:d8:ba:ab attached
i40e3 25000Mb full up 3c:ec:ef:d8:89:69 attached
[root@3c-ec-ef-d8-89-68 (us-east-4) ~]# dladm remove-aggr adminaggr0 -l i40e0
[root@3c-ec-ef-d8-89-68 (us-east-4) ~]# ^C
[root@3c-ec-ef-d8-89-68 (us-east-4) ~]#
[root@3c-ec-ef-d8-89-68 (us-east-4) ~]# dladm show-phys
LINK MEDIA STATE SPEED DUPLEX DEVICE
i40e2 Ethernet up 25000 full i40e2
i40e0 Ethernet down 0 unknown i40e0
i40e3 Ethernet up 25000 full i40e3
i40e1 Ethernet up 25000 full i40e1
mlxcx0 Ethernet up 100000 full mlxcx0
mlxcx1 Ethernet up 100000 full mlxcx1
[root@3c-ec-ef-d8-89-68 (us-east-4) ~]# snoop -r -d i40e0 -o /var/tmp/wooti40e.pcap ethertype 0x8809
Using device i40e0 (promiscuous mode)
0 ^C
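
For checking which System IDs actually appear in captured LACP PDUs, here is a minimal Python sketch that pulls the actor and partner system fields out of one raw frame. It assumes an untagged Ethernet frame with the standard LACPDU layout (actor TLV at byte offset 16, partner TLV at offset 36). Reading the capture file itself is left out; `parse_lacpdu` is a hypothetical helper that takes the bytes of a single frame.

```python
import struct

SLOW_PROTOCOLS_ETHERTYPE = 0x8809  # same ethertype used in the snoop filter
LACP_SUBTYPE = 0x01


def _fmt_mac(raw: bytes) -> str:
    return ":".join(f"{b:02x}" for b in raw)


def parse_lacpdu(frame: bytes) -> dict:
    """Extract actor and partner system info from one untagged frame carrying a LACPDU."""
    if len(frame) < 56:
        raise ValueError("frame too short for a LACPDU")
    (ethertype,) = struct.unpack_from("!H", frame, 12)
    if ethertype != SLOW_PROTOCOLS_ETHERTYPE or frame[14] != LACP_SUBTYPE:
        raise ValueError("not a LACP frame")
    result = {}
    # Each TLV: type (1), length (1), system priority (2), system MAC (6), key (2), ...
    for name, tlv_type, offset in (("actor", 0x01, 16), ("partner", 0x02, 36)):
        t, length, prio, mac, key = struct.unpack_from("!BBH6sH", frame, offset)
        if t != tlv_type or length != 20:
            raise ValueError(f"unexpected {name} TLV header")
        result[name] = {
            "system_priority": prio,
            "system_mac": _fmt_mac(mac),
            "key": key,
        }
    return result
```

Comparing the `actor` entry in PDUs sent by the server with the `partner` entry in PDUs sent back by the switch shows whether the switch has picked up the server's new System ID.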

MHM Cisco World · ‎05-27-2023

Here NSK3 connects to NSK1 and NSK2 with two POs.
You can see in show lacp neighbor that NSK3 sees only one MAC address as the system ID: 00-23-4-ee-be-1

Screenshot (723).png

This is the same as the vPC system-mac shown with the vPC role: 00:23:04:ee:be:01

Screenshot (724).png

So seeing two system IDs means there is an issue on the NSK PO side.