Solved: Re: Cisco UCS/Nexus 802.1P tagging (VLAN0) on traffic between blades

Jonathan Bayless · ‎11-29-2018

I have a UCS system set up with 6248UP fabric interconnects (2 x 6248UP connected to a chassis with 2 x 2208XP fabric extenders), running the latest 4.x release on everything. It's basically a fresh setup, nothing fancy, uplinked to a pair of Nexus 3064 switches.

I have 3 blades in it. They run Ubuntu 18 LTS and seem mostly fine: I can ping/ssh/whatever between them and it all works great. I need to PXE boot some virtual machines on them to run a variety of tasks, so I run dnsmasq as the DHCP server and PXE service. When I try to PXE boot clients that use the iPXE client software, they always fail to get DHCP and PXE boot.

After some research I have found that the iPXE client doesn't like any traffic with 802.1q (VLAN) tagging in it (anything other than IPv4 really). After some network captures, we found that indeed the traffic coming from one blade to another is getting a VLAN tag.

You can see below the capture of a basic ping from one blade to the other. Incoming packets have the 802.1q (0x8100) ethertype encapsulation. Outgoing (reply) packets do not.

tcpdump: listening on enp6s0, link-type EN10MB (Ethernet), capture size 262144 bytes
20:34:09.387714 fe:e6:cb:3a:c4:2d > 9a:19:21:3c:ba:41, ethertype 802.1Q (0x8100), length 102: vlan 0, p 0, ethertype IPv4, (tos 0x0, ttl 64, id 49047, offset 0, flags [DF], proto ICMP (1), length 84)
    10.20.20.11 > 10.20.20.10: ICMP echo request, id 25376, seq 31, length 64
20:34:09.387796 9a:19:21:3c:ba:41 > fe:e6:cb:3a:c4:2d, ethertype IPv4 (0x0800), length 98: (tos 0x0, ttl 64, id 60393, offset 0, flags [none], proto ICMP (1), length 84)
    10.20.20.10 > 10.20.20.11: ICMP echo reply, id 25376, seq 31, length 64
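
For anyone who wants to check their own captures for the same thing, here is a minimal Python sketch that reads the 802.1Q header from a raw Ethernet frame and flags a priority tag (VLAN ID 0), which is what tcpdump prints above as "vlan 0, p 0". The function names are made up for illustration. The two sample headers are synthetic, built from the MAC addresses in the capture, and are not bytes taken from the actual capture.

```python
import struct

ETHERTYPE_8021Q = 0x8100
ETHERTYPE_IPV4 = 0x0800


def parse_vlan_tag(frame: bytes):
    """Return (pcp, dei, vid, inner_ethertype) for an 802.1Q frame, else None."""
    if len(frame) < 14:
        raise ValueError("frame shorter than an Ethernet header")
    # Bytes 12-13 hold the ethertype (or the 802.1Q TPID when a tag is present).
    (ethertype,) = struct.unpack_from("!H", frame, 12)
    if ethertype != ETHERTYPE_8021Q:
        return None
    if len(frame) < 18:
        raise ValueError("truncated 802.1Q header")
    # The 16-bit TCI is followed by the real (inner) ethertype.
    tci, inner_ethertype = struct.unpack_from("!HH", frame, 14)
    pcp = tci >> 13          # 3-bit priority code point (802.1p)
    dei = (tci >> 12) & 0x1  # 1-bit drop eligible indicator
    vid = tci & 0x0FFF       # 12-bit VLAN ID
    return pcp, dei, vid, inner_ethertype


def is_priority_tagged(frame: bytes) -> bool:
    """True when the frame carries an 802.1Q tag whose VLAN ID is 0."""
    tag = parse_vlan_tag(frame)
    return tag is not None and tag[2] == 0


# Synthetic headers (assumption: no payload, MACs copied from the capture).
DST = bytes.fromhex("9a19213cba41")
SRC = bytes.fromhex("fee6cb3ac42d")
TAGGED_HEADER = DST + SRC + struct.pack("!HHH", ETHERTYPE_8021Q, 0x0000, ETHERTYPE_IPV4)
UNTAGGED_HEADER = SRC + DST + struct.pack("!H", ETHERTYPE_IPV4)

if __name__ == "__main__":
    print(parse_vlan_tag(TAGGED_HEADER), is_priority_tagged(TAGGED_HEADER))
    print(parse_vlan_tag(UNTAGGED_HEADER), is_priority_tagged(UNTAGGED_HEADER))
```

The 4-byte tag also accounts for the length difference in the capture: 102 bytes for the tagged request versus 98 for the untagged reply.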

I then verified this happens with 2 new CentOS installs as well, and I am working on a Windows 2016 install too.

From what I have learned, the UCS system puts all server ports in Trunk mode on its Nexus internal switch. That cannot be changed. I have the vNICs for my 3 blades configured for a single allowed VLAN and that is marked as the native VLAN. I have verified this on the CLI of the NXOS part of the fabric interconnect.

interface Vethernet2424
  description server 1/3, VNIC eth0
  switchport mode trunk
  no lldp transmit
  no lldp receive
  no pinning server sticky
  pinning server pinning-failure link-down
  switchport trunk native vlan 17
  switchport trunk allowed vlan 17
  bind interface port-channel1287 channel 2424
  no shutdown

interface Vethernet2426
  description server 1/2, VNIC eth0
  switchport mode trunk
  no lldp transmit
  no lldp receive
  no pinning server sticky
  pinning server pinning-failure link-down
  switchport trunk native vlan 17
  switchport trunk allowed vlan 17
  bind interface port-channel1286 channel 2426
  no shutdown

interface Vethernet2428
  description server 1/1, VNIC eth0
  switchport mode trunk
  no lldp transmit
  no lldp receive
  no pinning server sticky
  pinning server pinning-failure link-down
  switchport trunk native vlan 17
  switchport trunk allowed vlan 17
  bind interface port-channel1285 channel 2428
  no shutdown

I know that the ports in trunk mode will pass VLAN tags unmolested so that explains why an 802.1q tag with VLAN 0 is being allowed to hit the server operating system.

The question is where is the tag coming from in the first place? The 'trunk native vlan 17' configuration should mean all traffic is tagged as 17 on ingress from a server with that interface config and 17 should be stripped on delivery to the egress port for a server with the same config. Right?

So where is VLAN 0 coming from?
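
To make the expectation concrete, here is a small Python sketch of textbook 802.1Q trunk behavior with a native VLAN. It is a model of the standard rules only, with made-up function names, and it says nothing about what the UCS fabric interconnect or the adapter actually does internally. In this model a frame in VLAN 17 leaves a port whose native VLAN is 17 with no tag at all, which is why a VLAN 0 tag showing up at the server is surprising.

```python
from typing import Optional


def classify_ingress(tag_vid: Optional[int], native_vlan: int) -> int:
    """Pick the VLAN for a frame arriving on a trunk port.

    tag_vid is None for an untagged frame. A priority-tagged frame
    (VLAN ID 0) carries no VLAN membership, so it is classified into
    the native VLAN just like an untagged frame.
    """
    if tag_vid is None or tag_vid == 0:
        return native_vlan
    return tag_vid


def egress_is_tagged(vlan: int, native_vlan: int) -> bool:
    """A trunk port sends native-VLAN frames untagged and everything else tagged."""
    return vlan != native_vlan


if __name__ == "__main__":
    native = 17  # matches 'switchport trunk native vlan 17' in the config above
    vlan = classify_ingress(None, native)
    print(vlan, egress_is_tagged(vlan, native))
```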

The only possible explanation I can find is basically this Cisco document that covers VLAN 0 priority tagging. However, I have done nothing to configure or cause that tagging, that I know of.

Anyone have any thoughts? I have tried everything I can think of:

Changed native VLAN on both the fabric interconnects and upstream switches
Added more allowed VLANs on the vNICs
Rebuilt the vNIC config several times
Rebooted the servers
Different OS
Different MTU/QoS settings on the UCS Fabric Interconnects (including changing QoS policies for the server service profiles)
ICMP and other traffic types (ssh)
Used entirely different VLAN that only exists on the UCS system and not on the upstream switches

If this is by design, that somehow the system defaults to tagging traffic with VLAN 0 and then not removing that tag on egress, that seems like a strange mistake that could cause problems in a lot of ways.

mojafri · ‎12-01-2018

Hi @Jonathan Bayless,

You will find this behavior in all Linux distros. This issue has been documented under https://bst.cloudapps.cisco.com/bugsearch/bug/CSCuu29425/?reffering_site=dumpcr

You may want to try setting the sysctl "net.bridge.bridge-nf-filter-vlan-tagged = 1", but I haven't tested it.

Please rate if you find it helpful.

Regards,

MJ

Jonathan Bayless · ‎12-01-2018

That is very helpful. Thanks so much for the information. I will look into that further. I thought it might be something like that because I had CentOS 6 machines on UCS before and never had a problem like this.

Jonathan Bayless · ‎12-05-2018

I was able to do additional testing to verify. The issue happens with CentOS 7 (with the latest patches from the included yum repos) and with Ubuntu 18 LTS, also with the latest updates.

It does not happen when using 2 CentOS 6 systems. I installed CentOS 6.10 and all updates on 2 blades and tested between them: no 802.1Q tags on the traffic. So it is for sure a Linux bug that was either never really fixed or has regressed.

I handed it up the chain to people at Ubuntu to look at and hopefully they will pursue it further.

Thanks