11-29-2018 01:26 PM
I have a UCS system set up with two 6248UP fabric interconnects connected to a chassis with two 2208XP fabric extenders, running the latest 4.x release on everything. It's basically a fresh setup, nothing fancy, uplinked to a pair of Nexus 3064 switches.
I have 3 blades in it running Ubuntu 18 LTS, and they seem mostly fine: I can ping/ssh/whatever between them and it all works great. I need to PXE boot some virtual machines on them to run a variety of tasks, so I run dnsmasq as the DHCP server and PXE service. Clients that use the iPXE client software always fail to get DHCP and PXE boot.
After some research I have found that the iPXE client doesn't like any traffic with 802.1q (VLAN) tagging in it (anything other than IPv4 really). After some network captures, we found that indeed the traffic coming from one blade to another is getting a VLAN tag.
You can see below the capture of a basic ping from one blade to the other. Incoming packets have the 802.1q (0x8100) ethertype encapsulation. Outgoing (reply) packets do not.
tcpdump: listening on enp6s0, link-type EN10MB (Ethernet), capture size 262144 bytes
20:34:09.387714 fe:e6:cb:3a:c4:2d > 9a:19:21:3c:ba:41, ethertype 802.1Q (0x8100), length 102: vlan 0, p 0, ethertype IPv4, (tos 0x0, ttl 64, id 49047, offset 0, flags [DF], proto ICMP (1), length 84)
    10.20.20.11 > 10.20.20.10: ICMP echo request, id 25376, seq 31, length 64
20:34:09.387796 9a:19:21:3c:ba:41 > fe:e6:cb:3a:c4:2d, ethertype IPv4 (0x0800), length 98: (tos 0x0, ttl 64, id 60393, offset 0, flags [none], proto ICMP (1), length 84)
    10.20.20.10 > 10.20.20.11: ICMP echo reply, id 25376, seq 31, length 64
I then verified this also happens with 2 new CentOS installs, and I am working on a Windows 2016 install as well.
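To make the difference between the two frames concrete, here is a minimal Python sketch that decodes just the Ethernet header of each one. The function name parse_ethernet and the two sample frames are illustrative only; the frames are rebuilt by hand from the MAC addresses and ethertypes in the capture above, not taken from a real pcap. The 4-byte difference in frame length (102 vs 98) matches the size of a single 802.1Q tag.

```python
import struct

ETHERTYPE_VLAN = 0x8100  # 802.1Q tag protocol identifier
ETHERTYPE_IPV4 = 0x0800


def parse_ethernet(frame: bytes) -> dict:
    """Decode the Ethernet header and, if present, one 802.1Q tag."""
    if len(frame) < 14:
        raise ValueError("frame shorter than an Ethernet header")
    dst, src, ethertype = struct.unpack("!6s6sH", frame[:14])
    info = {
        "dst": dst.hex(":"),
        "src": src.hex(":"),
        "tagged": False,
        "vlan": None,
        "pcp": None,
    }
    if ethertype == ETHERTYPE_VLAN:
        if len(frame) < 18:
            raise ValueError("truncated 802.1Q tag")
        # TCI = 3 bits priority (PCP), 1 bit DEI, 12 bits VLAN ID.
        tci, ethertype = struct.unpack("!HH", frame[14:18])
        info["tagged"] = True
        info["pcp"] = tci >> 13
        info["vlan"] = tci & 0x0FFF
    info["ethertype"] = ethertype
    # A tag with VLAN ID 0 carries only priority information.
    info["priority_tagged"] = info["tagged"] and info["vlan"] == 0
    return info


# Header bytes reconstructed from the capture (payload omitted).
request = bytes.fromhex("9a19213cba41" "fee6cb3ac42d" "8100" "0000" "0800")
reply = bytes.fromhex("fee6cb3ac42d" "9a19213cba41" "0800")

if __name__ == "__main__":
    print(parse_ethernet(request))
    print(parse_ethernet(reply))
```

The echo request decodes as priority-tagged (VLAN 0, priority 0) with an inner ethertype of IPv4, while the reply is plain untagged IPv4.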
From what I have learned, the UCS system puts all server ports in Trunk mode on its Nexus internal switch. That cannot be changed. I have the vNICs for my 3 blades configured for a single allowed VLAN and that is marked as the native VLAN. I have verified this on the CLI of the NXOS part of the fabric interconnect.
interface Vethernet2424
  description server 1/3, VNIC eth0
  switchport mode trunk
  no lldp transmit
  no lldp receive
  no pinning server sticky
  pinning server pinning-failure link-down
  switchport trunk native vlan 17
  switchport trunk allowed vlan 17
  bind interface port-channel1287 channel 2424
  no shutdown

interface Vethernet2426
  description server 1/2, VNIC eth0
  switchport mode trunk
  no lldp transmit
  no lldp receive
  no pinning server sticky
  pinning server pinning-failure link-down
  switchport trunk native vlan 17
  switchport trunk allowed vlan 17
  bind interface port-channel1286 channel 2426
  no shutdown

interface Vethernet2428
  description server 1/1, VNIC eth0
  switchport mode trunk
  no lldp transmit
  no lldp receive
  no pinning server sticky
  pinning server pinning-failure link-down
  switchport trunk native vlan 17
  switchport trunk allowed vlan 17
  bind interface port-channel1285 channel 2428
  no shutdown
I know that the ports in trunk mode will pass VLAN tags unmolested so that explains why an 802.1q tag with VLAN 0 is being allowed to hit the server operating system.
The question is where is the tag coming from in the first place? The 'trunk native vlan 17' configuration should mean all traffic is tagged as 17 on ingress from a server with that interface config and 17 should be stripped on delivery to the egress port for a server with the same config. Right?
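As a rough illustration of that expectation, here is a simplified Python model of generic 802.1Q trunk-port behavior. It is a sketch of the textbook rules only, not of how the fabric interconnect or the adapter actually implements them, and the function names ingress_vlan and egress_tag are made up for this example.

```python
from typing import Optional, Set, Tuple


def ingress_vlan(tag_vid: Optional[int], native_vlan: int,
                 allowed: Set[int]) -> Optional[int]:
    """Classify a frame arriving on a trunk port.

    tag_vid is None for an untagged frame. Returns the VLAN the frame
    is placed in, or None if the port would drop it.
    """
    # Untagged and priority-tagged (VID 0) frames join the native VLAN.
    vid = native_vlan if tag_vid in (None, 0) else tag_vid
    return vid if vid in allowed else None


def egress_tag(vid: int, native_vlan: int,
               allowed: Set[int]) -> Tuple[bool, Optional[int]]:
    """Return (forwarded, tag) for a frame leaving a trunk port.

    A tag of None means the frame leaves untagged.
    """
    if vid not in allowed:
        return (False, None)
    # Native VLAN traffic is sent untagged; everything else keeps its tag.
    return (True, None if vid == native_vlan else vid)


if __name__ == "__main__":
    vlan = ingress_vlan(None, native_vlan=17, allowed={17})
    print(vlan, egress_tag(vlan, native_vlan=17, allowed={17}))
```

Under this model an untagged frame from one blade is classified into VLAN 17 and leaves the other blade's port untagged, so a VLAN 0 tag would have to be added somewhere outside these rules.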
So where is VLAN 0 coming from?
The only possible explanation I can find is basically this Cisco document that covers VLAN 0 priority tagging. However, I have done nothing to configure or cause that tagging, that I know of.
Anyone have any thoughts? I have tried everything I can think of:
If this is by design, that somehow the system defaults to tagging traffic with VLAN 0 and then not removing that tag on egress, that seems like a strange mistake that could cause problems in a lot of ways.
12-01-2018 06:12 AM
You will find this behavior in all Linux distros. This issue has been documented under https://bst.cloudapps.cisco.com/bugsearch/bug/CSCuu29425/?reffering_site=dumpcr
You may wanna try "net.bridge.bridge-nf-filter-vlan-tagged = 1" but I haven't tested it.
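Before changing that setting, it may help to see what it is currently set to. The following is a small Python sketch (the helper names sysctl_path and read_sysctl are made up here) that maps a sysctl name to its file under /proc/sys and reads it. It assumes a Linux host; the net.bridge.* keys are typically only present when the bridge netfilter module is loaded, so a missing key is reported as None rather than as an error. Whether this setting has any effect on the VLAN 0 tag is not confirmed anywhere in this thread.

```python
from pathlib import Path
from typing import Optional

KEY = "net.bridge.bridge-nf-filter-vlan-tagged"


def sysctl_path(name: str, root: str = "/proc/sys") -> Path:
    """Map a dotted sysctl name to its file under /proc/sys."""
    return Path(root, *name.split("."))


def read_sysctl(name: str, root: str = "/proc/sys") -> Optional[str]:
    """Return the current value as text, or None if it cannot be read."""
    try:
        return sysctl_path(name, root).read_text().strip()
    except (FileNotFoundError, PermissionError):
        return None


if __name__ == "__main__":
    value = read_sysctl(KEY)
    if value is None:
        print(f"{KEY} is not available on this host")
    else:
        print(f"{KEY} = {value}")
```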
Please rate if you find it helpful.
Regards,
MJ
12-01-2018 10:31 AM
That is very helpful. Thanks so much for your information. I will look into that more. I thought it might be something like that because I had CentOS 6 machines on UCS before and never had a problem like this.
12-05-2018 08:18 AM
I was able to do additional testing to verify. The issue happens with CentOS 7 (up to the latest patches from the included yum repos) and with Ubuntu 18 LTS, also with the latest updates.
It does not happen when using 2 CentOS 6 systems. I installed CentOS 6.10 with all updates on 2 blades and tested between them: no 802.1Q tags on the traffic. So it is for sure a Linux bug that was either never really fixed or has regressed.
I handed it up the chain to people at Ubuntu to look at, and hopefully they will pursue it further.
Thanks