
Cannot get VLAN Working SG500 (L2) with UniFi USG (L3)

shelzmike
Level 1

As I'm sure many of you can relate, I am at my wits' end with this setup. It is in a home lab, so nothing major, but it should be working and I cannot tell where my problem lies. I am quite familiar with Cisco, but mainly Catalyst; I haven't really worked with an SG500 before, though it doesn't seem that much different. I am quite new to the UniFi platform, however, so I am not sure which side my problem is on.

 

Since the UniFi USG handles L3 routing efficiently and does it by default whenever you create a network/VLAN, I am running the SG500 in L2 mode for simplicity.

 

Currently almost everything is on the default VLAN 1. What I want to do is create a VLAN that I plan to use for testing PiHole. Since I don't want every device to use it as its DNS, and I eventually want the PiHole to also handle DHCP for that VLAN's subnet, I am creating a separate VLAN for it. (Note: for the initial setup, the PiHole has a static IP and the USG is handling DHCP.)

 

Here is the topology:

Unifi:

VLAN 1 - 192.168.10.0/24

GW - 192.168.10.5

 

VLAN 172 - 172.16.254.0/26

GW - 172.16.254.1

 

(I currently also have a hidden SSID that is on the VLAN ID 172 as well).

 

SG500:

(Note: I have other VLANs created the same way that I don't mention in this post. They are in the same boat and will be used for other things later; once I get this one figured out, those will be set up the same way.)

The client is on Gi1/9

The USG is on Gi1/11

The APs are on Gi1/12 and Gi1/20

switch01#show running-config

config-file-header
switch01
v1.4.8.6 / R800_NIK_1_4_202_008
CLI v1.0
set system mode switch queues-mode 4 

file SSD indicator encrypted
@
ssd-control-start
ssd config
ssd file passphrase control unrestricted
no ssd file integrity control

!
vlan database
vlan 2,172,192,2092
exit
voice vlan oui-table add 0001e3 Siemens_AG_phone________
voice vlan oui-table add 00036b Cisco_phone_____________
voice vlan oui-table add 00096e Avaya___________________
voice vlan oui-table add 000fe2 H3C_Aolynk______________
voice vlan oui-table add 0060b9 Philips_and_NEC_AG_phone
voice vlan oui-table add 00d01e Pingtel_phone___________
voice vlan oui-table add 00e075 Polycom/Veritel_phone___
voice vlan oui-table add 00e0bb 3Com_phone______________
ip dhcp relay enable
hostname switch01
line console
exec-timeout 30
exit
line ssh
exec-timeout 30
exit
line telnet
exec-timeout 30
exit
logging host 192.168.10.10 severity debugging
logging buffered debugging
logging origin-id ip
username admin password encrypted 580078003fd4025c65e privilege 15
username cisco password encrypted 8f0eeb8542065ada069 privilege 15
ip ssh server
ip ssh password-auth
ip ssh pubkey-auth auto-login
                                                      
crypto key pubkey-chain ssh
user-key admin rsa
key-string row AAAAB3NzaC1yc2EAAAABJQAAAQEA
key-string row evmMERjGzDKMdXR0OMlaRItT40Rn
key-string row Qtdn0CUaFVHa8coRECpa/xrYWMrY
key-string row OBY667ReQoTi0RAWWEO7HZfu
key-string row oLWmmlvc1GVYekwjk5JSvgpxxbfP
key-string row 0wieTa0pCFVyuOIkuJmL4d++V9uN
key-string row vgJSzRLcZaZQCc+Pq8zcFbG9Z34G
key-string row LwHTaGgNSvOUa41l1qyzhjSCyg7g
key-string row ky1CqaPEpAbtnCxz076SNFG79gDj
key-string row jnQykPVxkTbmU2yydw==
exit
exit
snmp-server server
snmp-server location Haer
snmp-server contact "Mike"
snmp-server community int_m ro 192.168.10.10 view Default
ip http timeout-policy 1800
clock timezone " " 0 minutes 0
clock summer-time web recurring usa
!
                                                      
interface vlan 1
 ip address 192.168.10.2 255.255.255.0
 no ip address dhcp
 no ipv6 address autoconfig
 no ipv6 unreachables
 no ipv6 dhcp client stateless
!
interface vlan 2
 name netman
 shutdown
!
interface vlan 172
 name pihole
!
interface vlan 192
 name primary-lan
 shutdown
!
interface vlan 2092
 name iot
 shutdown
!
                                                      
interface gigabitethernet1/9
 description RaspberryPi
 switchport mode access
 switchport access vlan 172
!
interface gigabitethernet1/11
 switchport trunk allowed vlan add 2,172,192,2092
!
interface gigabitethernet1/12
 switchport trunk allowed vlan add 2,172,192,2092
!
interface gigabitethernet1/20
 switchport trunk allowed vlan add 2,172,192,2092
!
exit

ip dhcp snooping
switch01#
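For reference, these are the commands I have been using on the switch to sanity-check that VLAN 172 exists and that the trunk to the USG is carrying it (syntax is from memory on the 1.4.x CLI, so forgive any small differences):

switch01#show vlan
switch01#show interfaces switchport GigabitEthernet1/11

The second one should list VLAN 172 as a tagged member on the trunk toward the USG.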

The problem I am having is as follows. When I connect a laptop to the SSID with VLAN 172, I get an IP Address from the USG. I can ping both the 172.16.254.1 and 192.168.10.5 addresses (by default, USG is all inter-vlan allowed..will change that later)

I cannot ping the PiHole client from the laptop, nor can I ping it from the USG.

I cannot ping anything from the PiHole client, not even the 172.16.254.1 address.

From the switch I am able to ping 192.168.10.5 but cannot ping 172.16.254.1 (immediately get PING: net-unreachable).

 

I believe my problem is at the link between the USG and the switch. However, my frustration is that I do not know where the problem lies - if it is on the switch side or the USG side. While I'd appreciate an actual solution, if, at a minimum someone can at least confirm, yes, my switch config is correct and should be working, then I think I will at least know my problem is at the USG.

 

Thanks!

28 Replies

I was on a very long, very thorough support call with UniFi today, and we have more or less definitively come to the conclusion that the problem is not the USG and is more likely the SG500. What exactly, I have no idea yet, but at least it narrows things down.

My USG is the DHCP server for VLAN 172. When I plug a client into a hard-wired port that is an access port on VLAN 172, it actually picks up a DHCP address, and I see this traffic traverse the trunk port. However, once I have the address, I am unable to ping the gateway at 172.16.254.1: I see the request going out the trunk port, but no response comes back in.

Additionally, I am unable to ping from the USG to the client PC (wired or wireless). Yes, this makes no sense: DHCP traffic from the USG crosses the trunk, but nothing else does (no firewall rules on the USG are blocking this; it is basically set to allow all in both directions, which is the default).

I am also unable to ping between two wired clients on the switch, both on VLAN 172 access ports. I am going to assume this is because there is no ARP entry on the switch, and since the switch is L2 and the USG is L3, all of this traffic has to traverse the trunk even though both clients are on the same switch (this is correct, right?).

The ARP table the switch knows about shows no VLAN 172 addresses. As a matter of fact, it shows only three entries:

 VLAN    Interface     IP address        HW address          status
--------------------- --------------- ------------------- ---------------
vlan 1                192.168.10.10   00:0a:f7:0a:78:36   dynamic
vlan 1                192.168.10.22   88:51:fb:5e:51:3b   dynamic
vlan 1                192.168.10.130  44:85:00:4d:49:37   dynamic

10.10 is a static server, as is 10.22. 10.130 is the personal laptop I do all of my work on.

I see the RPi looking for 172.16.254.1 but never getting an answer, so something seems to be stopping the traffic from coming back into the switch at the trunk port to the USG. I think this may be a switchport trunk config issue. Still researching it.

I am determined now. This won't beat me :)

The easy part to respond to is the part about arp. The arp table on your switch would only contain entries for devices that have communicated with the management interface of the switch. The switch forwards traffic using layer 2 addressing and does not need arp entries to do this. So there is no reason why there would be arp entries for vlan 172.
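If you want to see the forwarding information the switch actually uses for vlan 172, look at the mac address table rather than arp. On that platform it should be something close to this (syntax from memory, so adjust as needed for your firmware):

show mac address-table vlan 172

You should see the learned MAC addresses of the clients and of the USG on the trunk port if layer 2 forwarding is working.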

 

There is a lot to think about here, and I would like to start with your comment that you connected 2 devices in vlan 172, that they did get IP addresses, but that they were not able to ping each other. Is there any possibility that the devices were running a local firewall or had some other security policy that did not allow ping? Can you test again? Try the ping both ways (each device originating the ping and being the destination of the ping) and then immediately show the arp table of each device. Do they have arp entries for each other?

 

I also wonder about disabling dhcp snooping on the switch and whether that might change the behavior.
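If you want to try that, it should be as simple as this from the switch CLI (from memory):

configure
no ip dhcp snooping
exit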

HTH

Rick

The clients did not have any local firewalls enabled. There wouldn't be any issues related to that.

Turned off snooping, didn't change a thing.

I should be getting my USB Ethernet dongle today so that I can do packet sniffing on a Linux machine and make sure I can see the 802.1Q tagging.
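For reference, here is roughly how I plan to set up the capture once the dongle arrives. The SPAN syntax is from memory, gi1/10 is just a placeholder for whichever port the Linux box ends up on, and eth0 is whatever the USB adapter shows up as, so treat this as a sketch:

configure
interface gigabitethernet1/10
 port monitor GigabitEthernet1/11
exit

Then on the Linux machine:

sudo tcpdump -i eth0 -e -nn vlan

(The -e prints the link-level header so the 802.1Q tag shows up; if the tags still don't appear, the NIC may be stripping them, in which case ethtool -K eth0 rxvlan off should turn that offload off.)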

I am leaning toward 2 things at this point, either some sort of tagging issue or a bug (either in the switch or, possibly, the USG).

I tested last night with a brand-new VLAN on a different, more standard subnet (192.168.0.0/26) and got the exact same results, so I know the problem isn't specific to the original VLAN or subnet.

Once I am able to see the VLAN tags (or the absence of them), I should have more info to report. Thanks again for the assistance!

I totally feel your pain. When I was going through a possibly similar issue, I was also not able to diagnose the problem using port mirroring.

 

I was very confused as to why I was able to ping my edge router, access all its interfaces, etc., from all subnets on the SG500, but was not able to receive any packets back from the internet. This led me to strongly suspect that the adjacency was somehow incomplete, or perhaps only L2 in nature. I rarely, if ever, run into situations where something "partly" works; I'm used to a given operation either succeeding or failing, with essentially no instances of "partial compatibility" when it comes to these kinds of things.

 

Back then, this suggested to me that the router itself, by tagging or by some other constraint, was not being allowed to fully operate on the ISP's network.

 

This is all speculation, of course, because I was never able to get confirmation from ATT as to why everything began working, one day, miraculously. I asked several times so I would not go through the headache again and was never given a response. 

 

Please keep us posted on how things go. I may decide to use that router again and I'd like to know more about potential resolutions.

 

I, too, disabled snooping and all non-required multicast (ISP-specific multicast was left enabled), but I did implement firewalling via "default block" rules that blocked all inter-VLAN routing on the SG500 except for the LAN of the edge router (192.168.1.0/24), on which the Cisco device had an IP address. I blocked all other unnecessary protocols everywhere, as far as I could identify them. I wish I could access that config to give better information... For the sake of simplicity, I allowed TCP and UDP only, plus outgoing ports 53, 123, 80, 443, 500, 4500, and 8080 for basic internet access, host-based VPN, and alt-HTTP for some antivirus software I was running at the time.

OK, now we are getting a little further with the troubleshooting. No resolution yet, but at least I have more information.

I finally got my USB Ethernet adapter and set up port mirroring to my Linux machine. NOW I can see the tagging, and I have discovered that tags ARE INDEED being passed around, including from the USG.

In summary, the pings from the USG to the switch on its trunk do get passed with the 172 tag, but they fail. I see no response from the client being pinged, so as far as I can tell the packet goes no further than that trunk port.

(Key: 172.16.254.1 = USG, 172.16.254.25 = wireless client on VLAN 172, 172.16.254.10 = wired client on VLAN 172.)

 

Capture1.png

This is the traffic on the switch trunk port, with the filters of icmp && ip.addr==172.16.254.1 (or whatever IP address I am looking at) applied.

 

So as you can see, the switch doesn't seem to direct the traffic properly after its trunk port.

However, when I ping from the client on VLAN 172 to the USG, traffic not only passes from the client, through the trunk port to the USG, IT COMES BACK, tagged appropriately.

 

From client to USG:

Capture2.png

And back again (response from USG to client)

 

Capture3.png

How does that even jibe with the first capture screenshot above? Clearly traffic can get from the USG to the access ports on the switch through the trunk port, tagged as VLAN 172 (I have a theory, down below).

 

Here is something weird also. When I ping from the wireless VLAN 172 client to the wired VLAN 172 client, the only traffic I see on the switchport trunk is the response from the gateway back to the wireless client, and nothing else. (Note: the ping was from 172.16.254.54 to 172.16.254.10.) I don't know if this is expected behavior (i.e., why doesn't the traffic go through the switch trunk port to the gateway?), but it may be due to the UniFi infrastructure: both the USG and the APs are managed by the UniFi controller, so perhaps there is something going on in the back end with the controller. Not sure about this one.

 

Capture3.png

 

Here is the output from a wired 172 client to the wireless 172 client

Capture3.png

 

When I try to ping from wired client to wired client, I get nothing at all on the trunk port. Additionally, even when I drill down and monitor JUST the port of the wired client that is the source of the ping, I see no traffic at all. I do, however, see traffic for a successful ping to the gateway at 172.16.254.1.

 

One of my theories (besides the switch just being screwed up or having a major bug) is that the ARP tables on the USG are causing problems, and I think that is because the USG can't reach the hosts behind the switch. All of my 172.x entries on the USG are listed as incomplete:

Capture3.png

However, that is a thin theory at best, because when I inspect the destination portion of the failed USG-to-wireless-client packet, it has the proper MAC address.

So what I have sort of proven, I think, is that the packets are tagged but don't make it from the USG side to the switch side when the traffic is initiated from the USG. However, the packets are also tagged and DO make it from the USG side back to the switch side when the traffic is initiated from the switch side.

This is a tough one for sure. I have more information but fewer answers!

 

EDIT: After further testing this evening, I have discovered the same problems exist with any VLAN. I tested using 2 other different VLANs (using different subnets) and I get the same exact results, which are:

Traffic flows from the VLAN to the untagged (native) subnet, but it does not flow the opposite way. Anything that is tagged does not seem to be working at all.

There are several things to discuss. First let me say a few things about the arp entries. I think it may be significant that the arp entries on the USG for the 172.16.254 network are incomplete. The USG would need the arp entry for any traffic initiated from the USG to a host in that network. The USG would also need that arp entry for any traffic initiated from a different vlan (different subnet) to a host in that network. The incomplete entries sure do look like there is some kind of problem. Second let me explain that we do not need the USG arp entries (and in fact we should not need the USG at all) when a host in that network attempts to communicate with another host in the same network. When 2 hosts in the same vlan (same subnet) want to communicate they should just arp for each other and then communicate directly. The switch would do layer 2 forwarding of the traffic and it should not need to go over the trunk at all.

 

I am interested in what I think I understand about 2 devices in the same subnet not being able to ping each other, and I would like to suggest a test to investigate this. Would you connect 2 devices in the same wired vlan (perhaps something different from 172 so that wireless is not part of the environment) and verify that they receive IP addresses using DHCP? Have each device attempt to ping the other, then immediately show the content of the arp table on each device and post it. Also would you post the output of ipconfig (or whatever the appropriate command is if these are not Windows machines)?
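On Windows machines the commands would be something like this, run from a command prompt on each device:

ipconfig /all
ping <address of the other device>
arp -a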

HTH

Rick

EDIT: You can certainly read through all of my replies below to see the madness that I descended into, but if you want to know what I found the problem was and hopefully get a good laugh, jump to the last reply.

 

 

I agree that the ARP is significant, because the underlying problem can be summed up as "the traffic can't get to its destination because the destination is unknown." Since the routes look correct, ARP entries are the only other real decision point. Based on the ICMP responses, the traffic seems to be getting to the network just fine (so routing is fine), but it can't get to the host on that network... well, because it doesn't know where the heck the host is.

Here is something I came up with on a tcpdump of the USG on the VLAN 172 interface. From it, I gather 2 things:

1.) It does not appear that the USG is responding to the ARP request that the RPi (wired 172 client) is so desperately sending. As you can see below, the request comes into the USG, but no answer ever goes back out.

2.) This arp request has no tag, even though it is on an access port for 172 and it traverses the tagged trunk from the switch.

Granted, I am assuming tcpdump would include that information since the USG is running a Linux kernel, though I am unsure if there is some setting that would turn that off (I bet not). I do see it tagged on the trunk port of the switch going out. So, since the USG port is effectively a tagged port, when I see the traffic come in, I should still see the tag, no?

Capture.JPG

 

The above is the USG dump. I failed to screenshot the frame info below it showing no tag, but believe me when I say the tag isn't there.

 

Here is the same traffic on the switchport trunk, and it is tagged.

Capture2.JPG

 

Just when I was about ready to throw in the towel, new info came up! I will try your suggestion. While I keep going back and forth, I am now thinking the problem is in the USG, some sort of bug. I may try rolling back the firmware at some point to see if that solves anything; sadly, I updated the firmware just before implementing the VLANs, so I have no clue whether the same problem existed before the update. I will also check whether the USG should even be capturing tag info.

 

So, new info. Either the USG isn't displaying the 802.1Q tagging, or the tag is somehow stripped before the packet is captured (and I don't understand why or how it would do that).
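One thing I want to rule out (this is me guessing based on how Linux handles VLANs generally): if the USG capture is running on the VLAN sub-interface (eth1.172, or whatever it calls it), the kernel has already stripped the 802.1Q header by the time tcpdump sees the frame, so the tag would never show there no matter what. To see the tags I think I would have to capture on the parent interface with -e, something like:

sudo tcpdump -i eth1 -e -nn 'vlan and icmp'

(eth1 being a guess at the LAN interface name; hardware VLAN offload could also strip the tag, which ethtool -K eth1 rxvlan off should disable.)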

I set up a mirror on the switch trunk port and also ran a tcpdump on the USG in tandem.

I pinged from the USG (172.16.254.1) to a wireless client (a new one, which is my mobile phone, 172.16.254.26). First interesting thing: this one responded to the ping appropriately (and it is on a DIFFERENT AP than the one my laptop connects to, which may be significant). As you can see, the traffic on the trunk port shows up tagged, as expected:

 

Capture-trunkport.JPG

 

On the USG, the capture shows the successful frames, but they are not tagged:

USGCapture.JPG

 

So, I ALSO pinged from the USG to the wired client (172.16.254.10). It does not show up on the switchport trunk capture, but here is the kicker - IT DOESN'T SHOW UP ON THE USG PORT EITHER!

This tells me the USG is thinking: OK, ping this address; check ARP for where it is (ARP says incomplete); I have no idea where to send it, so don't send it out the USG port at all. This is significant, I believe, at least in narrowing down where the problem might lie. I wonder if I can statically add an ARP entry, though that still may not help. I know I can on Windows (tried yesterday, didn't help), but maybe doing so on the USG would do something, at least so we could say once and for all that it is definitely related to ARP.

There is no entry at all in the USG's ARP table for this wired device.

 

 

OK, so I was able to add a static ARP entry (the process of which is super strange in the UniFi ecosystem: you have to add a custom json file on the controller and reprovision the gateway). I verified it was there. When I tried the ping from the USG this time, it actually passed traffic, BUT only from the USG port to the switchport, not back. So before, I wasn't getting any traffic even on the USG port because the ARP was missing; now, with the ARP forced, it travels to the switch (and is indeed tagged at the switch) but doesn't make it back.

2 steps forward, 3 steps back!
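(For anyone trying this later: since the USG is Linux underneath, I believe a quicker throwaway test of a static entry, without the config.gateway.json/reprovision dance, would be from the USG shell, something along these lines, where the MAC is a placeholder for the client's real address and eth1.172 is my guess at the LAN sub-interface name:

sudo ip neigh replace 172.16.254.10 lladdr aa:bb:cc:dd:ee:ff dev eth1.172 nud permanent

It would not survive a reprovision, which is why the json file is the "proper" route.)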

So I tried your suggested test.

I have a completely different VLAN (ID 2). The trunks are passing it. The USG is doing DHCP. I have 2 hard-wired ports, both access ports on VLAN 2.

I connected two clients and both immediately received IP addresses: 10.100.100.6 and 10.100.100.7.

When I ping from 10.100.100.6 to 10.100.100.7, I get replies appropriately.

However, when I ping from 10.100.100.7 to 10.100.100.6, I get the same unreachable error! I do not see this traffic on the trunk port of the switch, which, as you mentioned, is expected.

Here are the ARP entries in Windows:

arp.JPG

That is from the 10.100.100.7 PC. I am not going to include the other PC because it would be a pain to do so, but it looks basically the same; it has the entry for 10.100.100.7.
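About the only other things I can think to check on the failing machine are its firewall profile and its route/neighbor state, e.g. from an elevated prompt:

netsh advfirewall show allprofiles
route print
arp -d *
arp -a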

I also completely turned off the APs just to be sure they weren't somehow interfering; it made no difference.

I feel I may have exhausted all options at this point. Everything seems so random, in a specific way, if that makes sense.

While I keep going back and forth on where the problem lies, I am now back to thinking the issue is somehow on the switch.

Strangely, I see the following entries on the switchport trunk, even though I am not initiating this traffic.

 

Capture.JPG

 

The 10.100.100.1 is the USG.

The really weird thing is that if I actually initiate a ping from the gateway to the 10.100.100.7 client, it is successful. If I ping from 10.100.100.6 to the USG on 10.100.100.1, it is successful, but if I ping 10.100.100.6 from the USG, it fails!

 

I keep finding this message in the Wireshark captures and I haven't quite figured it out. Meaning, I basically know what it is, but I'm not sure why it is happening. I'm also not sure whether it is related to the actual issue, or whether the issue exists because the USG doesn't have a fully formed ARP entry for this (wired VLAN 172) client.

I do see ARP broadcasts flowing from the client looking for the USG and I also see ARP broadcasts coming from the USG asking for the address of the client, but never see any answers either way.

The source is the USG and the client is just a Linux server on my network, but it is going across VLAN 1.

Capture.JPG

Well, I figured it out (part of it at least, but the biggest part; I haven't done further testing yet, as there were a few other weird things on other VLANs).

For VLAN 172:

You know that saying, "can't see the forest for the trees"? Well, that applies here. Looking back through my screenshots it is GLARING now, but I didn't check the simplest thing. Actually, somehow I DID recheck this and STILL didn't see it. I can't believe how stupidly simple the issue was. The martian alerts are what caused me to refocus and check again... here we go... ready?

The RPi address was misconfigured!!!

It had the correct mask and gateway, but the WRONG address. Instead of 172.16.254.10, I had it set to 117.16.254.10. That 7 sure is a powerful deterrent to actually seeing those trees. Side by side, it is obvious, but in the midst of walls of text, it completely fooled me, every single time. 
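For posterity, the ten-second check on the Pi that would have caught this on day one (assuming the wired NIC is eth0):

ip -4 addr show eth0
ip route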

That being said, all is not lost. I really sharpened my troubleshooting and packet-capture skills, and I learned some pretty deep, nuanced networking theory. Not wasted time; indeed the opposite, and hopefully I will remember this and never overlook it again in the future!
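One concrete takeaway: as I understand it, those martian alerts are the Linux kernel on the USG flagging packets whose source address should not be possible on the interface they arrived on, which is exactly what a 117.16.254.10 source showing up on the 172.16.254.0/26 LAN looks like. If anyone wants to chase similar alerts, I believe the logging is controlled by a sysctl and the messages end up in the kernel log, roughly:

sudo sysctl -w net.ipv4.conf.all.log_martians=1
dmesg | grep -i martian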

I still think there were some strange things going on with the other VLANs, but I will check on those later.

Thanks again for the assistance. It is much appreciated!

Thanks for the update. Glad to know that you figured out the main issue. Interesting that the main problem turned out to be a misconfigured address. When you did the post mentioning the martian address, I asked about the 117 address and whether it was perhaps on vlan 172; I wish we had looked at that a bit more closely. But our investigation did cover some good things, and it was a good learning experience. There were a few other things that went by in the discussion, like some other devices that could not ping in both directions (or at all), but they are probably not worth chasing at this point. So congratulations on finding the solution to your own problem, and a well deserved +5 for describing the solution for other participants in the community. This is helpful.

HTH

Rick

Yeah, it literally was a brain block. All of the messages related to the RPi have 117 in them, clear as day now, but I literally could not see it! The contributing factor, I think, was the gateway being correct while the address was wrong; the correct gateway meant traffic was flowing, or trying to, which caused these errors. Lesson learned on that one. This was all on my home network; at work I am a corporate enterprise infrastructure lead for a Global 500 company and have been in the business for 20 years at this point, and I think sometimes, when you work at that level all the time, the little things get skipped right past, causing the biggest (and most easily solvable) problems!
