06-28-2016 04:36 AM - edited 03-08-2019 06:23 AM
Hi there,
Today we had an incident that started with people complaining they couldn't connect to our wifi after coming back from lunch. Long story short, clients were no longer able to get IP addresses from our DHCP server; most people noticed it when they woke their laptops from sleep and couldn't get online. Although most of the complaints came from wifi users (99% of our user base), wired users were also unable to get a new IP address.
After digging deeper, we found a huge number of log messages generated by our core switch (a 6500 with Sup720s):
dhcp_snoop_redQ, the queue is most likely full and packet will be dropped.
In the span of a couple of hours we logged 452 occurrences of that message, along with a spike (compared to normal days) in this message:
Host aaaa.bbbb.cccc in vlan 200 is flapping between port Po4 and port Po1
On a typical day we see about 30 of these messages, and they mostly correspond to wireless clients roaming (we have Meraki access points in bridge mode on each floor; Po4 is the port channel connected to one floor, Po1 is the port channel to another floor, and so on).
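(For anyone chasing the same symptom: you can check which port a flapping MAC is currently learned on, and compare what is learned on each uplink port channel. The MAC address below is just the placeholder from the log message, and the VLAN/port names are from our setup.)

```
! Where is this MAC currently learned?
show mac address-table address aaaa.bbbb.cccc

! What is learned on each uplink port channel in VLAN 200?
show mac address-table interface Po1 vlan 200
show mac address-table interface Po4 vlan 200
```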
Checking snooping statistics
sh ip dhcp snooping statistics detail
we see:
Packets Processed by DHCP Snooping = 5050352
Packets Dropped Because
IDB not known = 0
Queue full = 150724
Interface is in errdisabled = 0
Rate limit exceeded = 0
Received on untrusted ports = 0
Nonzero giaddr = 4648
Source mac not equal to chaddr = 0
No binding entry = 0
Insertion of opt82 fail = 0
Unknown packet = 0
Interface Down = 0
Unknown output interface = 6
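(During an incident it's handy to watch whether the "Queue full" counter is still climbing; the standard IOS output filter works on this show command, so something like the following, repeated a minute or two apart, tells you whether drops are ongoing or historical.)

```
show ip dhcp snooping statistics detail | include Queue full
```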
On a normal day we never see the dhcp_snoop_redQ error. We'd seen it only once before on this switch, about 3 months ago; that instance was out of hours with very few clients in the building, so we rebooted the switch, which immediately resolved the issue.
Today's incident occurred at the peak usage period, so we couldn't power-cycle the switch. We stood up a workaround wifi solution for 99% of users, and they got back to work.
A few hours later the issue is no longer present and IP addresses are allocated as normal without these messages in the logs.
Our topology has a number of wireless access points connected to a Cisco 4500 switch on each level. Each 4500 is connected by 10GE fibre to the core 6500, which hosts the VLAN interfaces, including the helper addresses pointing to our DHCP servers. The DHCP servers are Windows DHCP servers, which allocate addresses for each VLAN.
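For context, the relevant core-switch configuration looks roughly like the sketch below (addresses and interface numbers are illustrative, not our real ones): the SVI carries the helper address, DHCP snooping is enabled on the client VLAN, and the port facing the DHCP servers is trusted so the relayed OFFER/ACK packets aren't dropped.

```
ip dhcp snooping
ip dhcp snooping vlan 200
!
interface Vlan200
 ! illustrative addressing
 ip address 192.0.2.1 255.255.255.0
 ! Windows DHCP server (illustrative address)
 ip helper-address 192.0.2.10
!
interface GigabitEthernet1/1
 description Link toward DHCP servers (illustrative)
 ip dhcp snooping trust
```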
The Sup720s in the 6500 are running ADVENTERPRISEK9-M 15.1(2)SY5
Does anyone have any suggestions for further troubleshooting, thoughts as to a possible cause, and tips for preventing this from occurring again?
06-29-2016 12:32 AM
Although the software version is different, it sounds a bit related to this bug:
https://quickview.cloudapps.cisco.com/quickview/bug/CSCtg94023
Thinking sideways - how good is your DHCP server? Could you upgrade it to something that responds faster? The faster the DHCP transactions get turned around the less full the queue should be.
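Another knob that might keep the queue from filling under a burst, as a suggestion only: rate-limit DHCP on the untrusted client-facing ports, so a misbehaving client or a broadcast storm can't flood the snooping process. Test before deploying, because exceeding the limit err-disables the port unless you also enable errdisable recovery. Interface numbers below are illustrative.

```
! illustrative access-facing ports
interface range GigabitEthernet2/1 - 48
 ! cap DHCP packets per second per port
 ip dhcp snooping limit rate 15
!
errdisable recovery cause dhcp-rate-limit
errdisable recovery interval 300
```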
I see 15.1.2-SY7 is out. Could you upgrade to that? When I search the release notes for DHCP, about half a dozen DHCP-related caveats appear to have been resolved.
http://www.cisco.com/c/en/us/td/docs/switches/lan/catalyst6500/ios/15-1SY/release_notes.html