DHCP not working on default VLAN1 but works on other VLANS

mlord
Level 1

Hello,

   I've spent about two weeks reading threads on this subject but have yet to find what solves my issue. We recently had DHCP stop working on our VLAN1 - well, not entirely. Sometimes it takes a long time for an IP address to be issued, sometimes one is never issued at all, and sometimes an address is issued that's already in use even though most of the scope is available. The configurations we're using have been in place for years; we've had this problem maybe twice in the past, but a reboot of the core stack would typically clear it up. We deleted the exclusions (bad idea) because I assumed having reservations was roughly the same thing; they've since been re-added, so there are no more conflicts.

   My laptop is connected to a small switch, which is then connected to our Admin Core stack of three 9300-48Ts (16.8.1a). An ipconfig /renew does nothing but return an error that it couldn't find our DHCP server. I'm trying to get an IP address from our scope (192.168.0.1-254/24) on default VLAN1 (192.168.0.150) from our DHCP server VM (10.1.90.3). The DHCP server VM runs on a VMware host; that host has two NICs in use, each a trunk carrying all VLANs. Our SonicWall firewall is 192.168.0.1, our switch stack is 192.168.0.150 (if it even should be), and our DHCP server VM is 10.1.90.3.

   I'm sure the two switch stacks don't have matching configurations; they were set up before my time, so between reverse-engineering how they were built and not being fully versed in Cisco switches and their management, this has been an uphill climb. I appreciate any and all questions and insights. Thank you.

Attachments: LaptopDHCPWireshark.JPG, DHCPServer.JPG, DHCPServerWiresharkNoFilter.JPG, DHCPServerWireSharkBOOTP.JPG, NetworkTopology.JPG

54 Replies

Hi @mlord ,

Hmmm... The memory buffer for the capture was allocated at 50 MB. It is possible that, if enough time elapsed after starting the capture and the switch is being hammered by traffic, the buffer filled up and the capture stopped automatically. Try exporting whatever is in the buffer into a file and let's see if this theory holds water.
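If the capture is still defined on the switch, exporting it should look roughly like this - "CAP" and the file name below are only placeholders for whatever capture name and destination you actually used:

monitor capture CAP stop
monitor capture CAP export flash:dhcp-debug.pcap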

Thanks!

Best regards,
Peter

 

 

@Peter Paluch 

Okay, I finally got them. I saw some DHCP packets populate the DHCP server's Wireshark, but I can't confirm whether the switch capture was still running or had stopped automatically before those packets appeared on the DHCP server. I've attached what I have here as a .zip.

Hi @mlord ,

So, this capture proved most insightful!

Indeed, the PCAP from the switch was filled with 50 MB of transit data within a mere 0.2 seconds. That amount of traffic hitting the CPU is absolutely abnormal; a switch should not be handling transit traffic in the CPU at all.

It took me a while to understand why this was happening: the hosts we see in the PCAP from the switch are in VLAN1 and its address space 192.168.0.0/24, and they are obviously using the switch - 192.168.0.150 - as their default gateway. However, the switch itself has its default route configured to point to 192.168.0.1. Hence, traffic from hosts in VLAN1 destined to the internet is getting "tromboned" or "hairpinned": it comes into the switch through VLAN1, and the switch has to send it out through VLAN1 again, toward 192.168.0.1.
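For illustration, the relevant part of the switch configuration presumably looks something like this - a sketch reconstructed from the addresses above, not your actual config:

interface Vlan1
 ip address 192.168.0.150 255.255.255.0
!
ip route 0.0.0.0 0.0.0.0 192.168.0.1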

Now, IP rules dictate: Whenever you need to route a packet out the same routed interface that you received it on, you must also send an ICMP Redirect to the sender to inform it that it can save one hop by using the proper gateway directly.

And this is the problem! With hardware-based switches, ICMP Redirects are generated in the operating system of the switch, not in its switching/routing hardware, so for the switch to be able to generate those redirects, the hairpinned traffic must be punted to the CPU for the operating system to see it and generate the ICMP Redirects. This is why the traffic from VLAN1 exiting toward the internet is badly hammering the CPU on your switch, leaving it no time to process more useful traffic in VLAN1 - like your DHCP clients, for example.
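As a side note, two read-only commands can help confirm this (assuming a recent IOS XE release; the exact output wording may vary):

show processes cpu sorted
show ip interface Vlan1 | include redirects

The first shows whether the CPU is being kept busy, and the second shows whether ICMP redirects are currently enabled on the SVI.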

It's clear that the end hosts do not honor the ICMP Redirect messages originated by the switch because they are not changing their routing - they're still blasting the traffic across the switch. Ignoring ICMP Redirect messages is the usual behavior of today's operating systems, so no surprise here. Hence, while the proper solution would be to fix the routing on the hosts inside VLAN1, it may not be easily possible, and as a quick interim fix, we can disable generating the ICMP Redirects in VLAN1. This will prevent the switch from punting the traffic to the CPU.

The interim fix is easy:

configure terminal
interface Vlan1
no ip redirects
end

Please try to configure this and then try getting an IP address on your laptop in VLAN1 through DHCP again.

Best regards,
Peter

 

 

@Peter Paluch 

Yep, instant DHCP. That worked, Peter. I had eyeballed that line too and didn't think much of it. We're a small business; few employees and mostly servers. I'm not sure what you mean by "routing on the hosts inside VLAN1", but I'm curious whether that involves just not using VLAN1 for anything other than the switches and the SonicWall? I do honestly appreciate the time you dedicated to this.

Hi @mlord ,

I am so glad to hear that!

I'm not sure what you mean by "routing on the hosts inside VLAN1"

Ah, right. We disabled the ICMP Redirects, but the hosts in VLAN1 still route traffic through the switch, and the switch passes it on - still in VLAN1 - to the firewall.

I am not sure if it is possible to configure those hosts in VLAN1 to use the SonicWall as their default gateway instead of the switch. This would save one unnecessary hop and prevent the traffic from being hairpinned through the switch.

And I suppose this configuration change is not possible because, for the hosts in VLAN1, some destinations are reachable through the SonicWall and other destinations are reachable through the switch, so there is no single gateway in VLAN1 that serves all possible destinations without causing the hairpinning. This is because you have two gateways in VLAN1 (the firewall and the switch), and each of them leads to different destinations.

As you mentioned, one possible solution would be to leave the VLAN1 purely for the switch and the firewall, and move out all other hosts from VLAN1 to a dedicated VLAN routed by the switch alone. This would incur some readdressing and some additional routing since the firewall would need to have a route toward that new VLAN, pointing to the switch across VLAN1.
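Just to illustrate the idea - the VLAN number and subnet below are made up, and the SonicWall side would be configured in its own UI:

vlan 50
 name SERVERS
!
interface Vlan50
 ip address 192.168.50.1 255.255.255.0
!
! On the SonicWall: a static route for 192.168.50.0/24 via 192.168.0.150 (the switch in VLAN1).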

The current setup is not tragic, by the way - since we've disabled those ICMP Redirects, the hairpinned traffic is just like any other for the switch. If the extra hop is not an issue for you, it's not an issue for the switch now, either.

You are very cordially welcome!

Best regards,
Peter

 

@Peter Paluch 

   Thank you, that all makes sense. You've helped me learn much more than I thought I would. One last question: due to the situation with VLAN1, is it possible the SonicWall is receiving extra traffic that it shouldn't be? We have all these other VLANs for a reason, so if so, I'll start moving hosts that just do not need to be on VLAN1.

Hello @mlord ,

It's my sincere pleasure.

due to the situation with VLAN1, is it possible the Sonicwall then is receiving extra traffic that it shouldn't be?

It is certainly receiving broadcasts in VLAN1 that it has no use for, such as the DHCP client messages. But aside from that type of traffic, I don't think it is receiving any other significant traffic. If the switch sends traffic through the firewall, it is because the routing table on the switch says so. So aside from some flooded traffic in VLAN1 (and we can discuss whether that is of any concern), I don't believe the firewall is receiving any significant traffic it is not supposed to.

Best regards,
Peter

 

@Peter Paluch 

   We have had random high Data Plane usage issues; both we and SonicWall support have noticed a large amount of UDP traffic being measured on the SonicWall. Sometimes a reboot of the primary SonicWall clears it up and we won't see it again for weeks. We have partners that monitor video streams, and if we get an email from them, we know it's due to the SonicWall having high data plane usage. We do a majority of our video streaming over SRT, though, which is UDP.

Hello @mlord ,

I understand.

I do not believe, however, that your SonicWall is hit by some random traffic that only arbitrarily finds its way to the firewall. Traffic that arrives at the SonicWall is either flooded in VLAN1, or it is sent there on purpose - meaning a device is using the SonicWall as its routed next hop.

Your AdminStack has its default route pointing to the SonicWall, so anytime it receives internet-bound traffic from its VLANs, it will forward it to the SonicWall. But this is not a random, transient process - your AdminStack is statically configured to point to the SonicWall and treats all internet-bound traffic this way.

So I don't think the current routing alone could explain the high data plane utilization on the firewall. To troubleshoot that, the easiest way would be to check whether the SonicWall can report flow statistics that would show the most aggressive flows - then we could investigate their senders.
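If the firewall cannot report flows, a rough Flexible NetFlow sketch on the Catalyst side could surface the top talkers as well - the record/monitor names and the choice of Vlan1 below are placeholders, and the exact supported match/collect fields depend on your release:

flow record TOP-TALKERS-REC
 match ipv4 source address
 match ipv4 destination address
 match ipv4 protocol
 match transport source-port
 match transport destination-port
 collect counter bytes long
 collect counter packets long
!
flow monitor TOP-TALKERS-MON
 record TOP-TALKERS-REC
!
interface Vlan1
 ip flow monitor TOP-TALKERS-MON input
!
! Inspect the collected flows with: show flow monitor TOP-TALKERS-MON cache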

Best regards,
Peter

 

 

 

