WiFi clients issuing 200 DHCP Requests per second

stuartkendrick · ‎02-18-2020

BRIEF

- Anyone else seeing WiFi clients issuing bursts of DHCP Requests which trigger 'ip dhcp snooping limit rate xx' thresholds on Catalysts?

DETAIL

- The flock of Catalyst 2960X at the access-layer have several hardening features enabled, including the DHCP Snooping feature

interface GigabitEthernet1/0/1
 [...]
 storm-control broadcast level 1.00
 storm-control multicast level 1.00
 storm-control action shutdown
 storm-control action trap
 spanning-tree portfast edge
 spanning-tree guard root
 ip dhcp snooping limit rate 25

Intermittently, some ports see bursts of DHCP and trigger the err-disable behavior

12:27:34 5n-2-esx dhcp-rate-limit error detected on Gi1/0/15, putting Gi1/0/15 in err-disable state
12:29:34 5n-2-esx-mgmt Attempting to recover from dhcp-rate-limit err-disable state on Gi1/0/15

Tracking these down, I find that the Catalyst ports affected by this event feed Wireless Access Points (Meraki MR33)

A typical day might include ~5-25 of these events (from a population of ~70 WAPs servicing ~600 WiFi clients)
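For anyone who wants to tally these events from their own syslog, here is a minimal Python sketch. It assumes the log lines look exactly like the two quoted above (time, hostname, message); the names PATTERN and count_events are made up for illustration, and a real syslog feed will likely need a different prefix pattern.

```python
import re
from collections import Counter

# Matches the err-disable lines in the format quoted above (assumed format).
PATTERN = re.compile(
    r"^(?P<time>\d{2}:\d{2}:\d{2})\s+(?P<host>\S+)\s+"
    r"dhcp-rate-limit error detected on (?P<port>\S+?),"
)

def count_events(lines):
    """Count dhcp-rate-limit err-disable events per (switch, port)."""
    counts = Counter()
    for line in lines:
        m = PATTERN.match(line.strip())
        if m:
            counts[(m.group("host"), m.group("port"))] += 1
    return counts

sample = [
    "12:27:34 5n-2-esx dhcp-rate-limit error detected on Gi1/0/15, "
    "putting Gi1/0/15 in err-disable state",
    "12:29:34 5n-2-esx-mgmt Attempting to recover from dhcp-rate-limit "
    "err-disable state on Gi1/0/15",
]
# Only the "error detected" line counts as an event; the recovery line is ignored.
print(dict(count_events(sample)))
```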

I have run ethanalyzer on the upstream L3 gear to capture DHCP frames on the Sup card (aka 'in-band')

nxos# ethanalyzer local interface inband capture-filter "udp port 67 or (vlan and udp port 67)" limit-captured-frames 500000 capture-ring-buffer files 64 write bootflash:2020-02-16/mdf-a-rtr-inband-dhcp.pcap

In these pcaps, I see bursts of DHCP Discovers (and corresponding DHCP Offers from the DHCP server) -- 25 or 26 in a row and then silence... I interpret this behavior as consonant with the dhcp-rate-limit behavior, i.e. the Cat2960X forwarded ~25 DHCP frames within a single second, then disabled the WAP port.

For grins, I increased the dhcp-rate-limit parameter from 25 to 200 ... and captured a pcap in which 200 DHCP Discovers appeared in a single second
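Here is a rough Python sketch of how I read these two captures. It assumes the limiter behaves like a trailing one-second window counter, which is my guess and not a documented description of how the Cat2960X counts. The function name first_trip and the evenly spaced 200-frame burst are hypothetical, chosen only to mirror the numbers above.

```python
from typing import List, Optional, Tuple

def first_trip(timestamps: List[float], limit_pps: int) -> Optional[Tuple[int, float]]:
    """Return (index, time) of the first packet that pushes the count in the
    trailing one-second window above limit_pps, or None if the limit holds.
    Timestamps must be sorted in ascending order (seconds)."""
    window_start = 0  # index of the oldest packet still inside the window
    for i, t in enumerate(timestamps):
        # Slide the window forward so it covers only packets newer than t - 1.0.
        while timestamps[window_start] <= t - 1.0:
            window_start += 1
        if i - window_start + 1 > limit_pps:
            return i, t
    return None

# Hypothetical burst: 200 Discovers evenly spread across one second.
burst = [k * 0.005 for k in range(200)]

print(first_trip(burst, 25))   # trips on the 26th packet (index 25)
print(first_trip(burst, 200))  # None: 200 frames in a second do not exceed 200
```

Under that assumption, a limit of 25 lets roughly 25 frames through before the port shuts, and a limit of 200 lets the whole burst pass, which matches what the pcaps show.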

BTW: in these pcaps, the offending client sets the DHCP Transaction ID in an apparently random fashion. [Normally, a client which wants a DHCP address will set the DHCP Transaction ID to some starting number and then increment it by one with each fresh DHCP Discover.] This smells like buggy behavior to me: (a) excessively rapid series of Discovers, plus (b) chaotic Transaction ID

==> Question: does anyone know how I would go about filing a bug report with the Android or iOS folks? (I have seen both Android phones and iPhones/iPads as sources of these DHCP bursts)

Total downtime for the WAP is ~3-4 minutes, i.e. (2) minutes during which the Cat2960X has the port disabled, and ~2 minutes for the MR33 to reboot, once the Cat2960X re-enables the port
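To put a rough number on the aggregate outage, here is a back-of-the-envelope calculation in Python using only the figures above (~5-25 events per day, ~3-4 minutes each, ~70 WAPs). The helper name wap_minutes_lost is made up, and the result ignores any overlap between events.

```python
def wap_minutes_lost(events_per_day: int, minutes_per_event: int) -> int:
    """Total WAP-minutes of downtime per day."""
    return events_per_day * minutes_per_event

low = wap_minutes_lost(5, 3)      # quiet day, short outages
high = wap_minutes_lost(25, 4)    # busy day, long outages
fleet_minutes = 70 * 24 * 60      # WAP-minutes available per day across ~70 WAPs

print(low, high)  # 15 100
print(f"{low / fleet_minutes:.3%} to {high / fleet_minutes:.3%}")  # 0.015% to 0.099%
```

So fleet-wide the loss is small; the pain is concentrated on whichever area sits behind the disabled port.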

IMPACT

I'm not clear on the impact to WiFi clients. For areas serviced by a single WAP, I suspect that the impact is substantial (i.e. no WiFi connectivity for ~4 minutes for clients located in that area). For areas serviced by multiple WAPs, I don't know what the end-user experience looks like, as the device roams away from the now silent WAP to an active one

WISHFUL THINKING

What I really want is the 'dhcp-rate-limit' feature implemented on the Meraki WAPs -- they don't currently support this feature. [I have submitted an enhancement request.] i.e. I want the WAPs to shut off aberrant clients

==> Question: Does anyone know if Cisco WAPs (aka the Catalyst 9000s) implement dhcp-rate-limit?

REMEDIATION

Seems to me that I could pursue one or more of the following:

(a) Investigate traffic-shaping on the MR33s (perhaps this feature would support throttling DHCP bursts to something that the Domain Controllers can tolerate)

(b) Disable dhcp-rate-limit on the Catalysts and just let the Domain Controllers get pounded by DHCP Discovers. Perhaps they can handle this just fine

(c) Live with the intermittent loss of WAPs -- it's only for a few minutes per incident

==> Question: does anyone see another remediation option?

ANYONE ELSE?

==> Question: Anyone else seeing this behavior?

--sk

Philip D'Ath · ‎02-18-2020

First, any security measure that frequently false-triggers and prevents actual users from using the infrastructure is not worth keeping. I'd simply turn off DHCP snooping on the AP ports.

My next thought: are the MR33s running a recent stable firmware release? The current stable release is 26.6.1. If you are not on a stable release train, start by upgrading to that.

Next, you will probably get more help over on the Meraki community.

https://community.meraki.com/

stuartkendrick · ‎02-19-2020

Yes, the flock of Meraki WAPs are running 26.6.1: but why does this attract your attention? Seems to me that the Meraki units are bystanders here -- where do you see room for a Meraki bug coming into play?

Yup, I started at the Meraki community -- that's where I picked up the traffic shaping idea (which I have yet to explore)

I would argue that the Catalyst 2960X are not 'falsely triggering' -- from the pcaps, I can see DHCP Discover bursts which hit the 'ip dhcp snooping limit rate 25' threshold which I have set (and which occasionally fires to block random Wired stations)

What I'm looking for:

==> Is anyone else seeing this? And if so, what did you do about it?

I suspect that most folks land into one of the following camps:

(1) Running a WAP OS with its own 'ip dhcp snooping limit rate 25' or similar threshold in place, such that this problem is blocked right at the edge, clamping these bursts as they get retransmitted into the Wired network (Do the Cisco Catalyst WAPs sport this function?)

(2) Saw this issue and removed the 'ip dhcp snooping limit rate 25' protection ... and their DHCP servers are either shrugging off the bursts or are stumbling under the load (and the site is living with the resulting problems)

(3) Have never enabled the 'ip dhcp snooping limit rate 25' feature on Wired ports feeding WAPs ... and land somewhere inside the consequences mentioned in #2 above

--sk

YDOT1Q · ‎04-10-2023

Yep I have seen this.

Ports connected to APs are constantly going into err-disable.

But I wonder how that could happen, since CAPWAP tunnels all traffic from the end device. Does snooping still glean down to that level?

I am curious. Any ideas?

Scott Fella · ‎04-10-2023

Look at what is causing the err-disable; it might be a configuration line item you have on the AP switch port. Take one AP switch port and strip it down to a basic configuration. See if the port stays up; it should.

-Scott

stuartkendrick · ‎02-19-2020

Here is an example of what I see on the wire -- notice the (24) DHCP Discover / Requests pumped out by the client (bc:83:85:db:a3:2e) within .399675s, starting in Frame 4040. I claim that the following then happened:

The Catalyst 2960X then disabled the port feeding the relevant WAP
The client re-associated to another WAP
Then the client issued its first DHCP Request via the second WAP, in Frame 4148
Eventually, the client roams back to the first WAP ... and about 45 minutes later, the client repeats this experience

patoberli · ‎02-19-2020

Is this a Surface device?
I'm a tad confused by the mac address which is supposed to be "Microsoft".
I think this is more of a Windows or LAN adapter driver issue.

stuartkendrick · ‎02-24-2020

Yes, the fact that Wireshark decodes the first (3) bytes of the MAC address (the OUI, i.e. the first 6 hex digits) as Microsoft suggests that this device is a Microsoft tablet rather than, say, a laptop with an {Intel | Broadcom | something else} NIC in it

I would agree that the underlying pathology probably starts with a bug in the device's IP stack (well, DHCP stack) ... but I don't know how to get that fixed -- I figure my job is to build an infrastructure that can survive a lot of these bursts of DHCP Requests

Massimo Baschieri · ‎02-19-2020

Do you have ip device tracking enabled on the 2960X's?

I've experienced similar behaviour in the past with some specific devices which suffered from that feature

The link below describes exactly the issue I had and the solution; the only difference is that I experienced the issue on 2960X's and 2960S's without dot1x:

https://www.dslreports.com/forum/r31878285-802-1x-IPDT-issues

stuartkendrick · ‎02-24-2020

I hadn't heard of 'ip device tracking' ... that is a neat feature. But no, we don't run it (looks like it ships disabled by default)

--sk

Massimo Baschieri · ‎02-24-2020

Have you checked with the "show ip device tracking all" command?

That feature is not disabled by default on all releases, and even where it is, it gets enabled without warning as soon as you enable certain other features.

Take a look at the following thread:

https://community.cisco.com/t5/switching/ip-device-tracking/td-p/3020135

stuartkendrick · ‎02-25-2020

I have just manually checked a handful of my Cat2960X Stacks (all of which are reporting these dhcp-rate-limit events on ports feeding WAPs); ip device tracking is disabled

p1n-esx#sh ip device tracking all
Global IP Device Tracking for clients = Disabled
-----------------------------------------------------------------------------------------------
  IP Address    MAC Address   Vlan  Interface           Probe-Timeout      State    Source
-----------------------------------------------------------------------------------------------

p1n-esx#
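Since I only checked a handful of stacks by hand, here is a small Python sketch for checking saved output from the rest. It assumes every stack prints the same 'Global IP Device Tracking for clients = ...' line shown above; the function name ipdt_globally_enabled is made up, and collecting the output from each switch is left out.

```python
from typing import Optional

def ipdt_globally_enabled(show_output: str) -> Optional[bool]:
    """Read the 'Global IP Device Tracking' line from saved CLI output.
    Returns True/False, or None if the line is not present at all."""
    for line in show_output.splitlines():
        if line.strip().startswith("Global IP Device Tracking"):
            # Everything after '=' is the state, e.g. 'Disabled'.
            _, _, state = line.partition("=")
            return state.strip().lower() == "enabled"
    return None

sample = """p1n-esx#sh ip device tracking all
Global IP Device Tracking for clients = Disabled
"""
print(ipdt_globally_enabled(sample))  # False
```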

stuartkendrick · ‎04-10-2023

We use Meraki Access Points, so no CAPWAP involved

--sk