06-28-2016 04:36 AM - edited 03-08-2019 06:23 AM
Hi there,
Today we had an incident that started with people complaining they couldn't connect to our wifi after coming back from lunch. Long story short, clients were no longer able to get IP addresses from our DHCP server; most people noticed it when they woke their laptops from sleep and couldn't get online. Although most of the complaints came from wifi users (99% of our user base), wired users were also unable to get a new IP address.
After digging deeper, we found a huge number of log messages generated by our core switch (a 6500 with Sup720s):
dhcp_snoop_redQ, the queue is most likely full and packet will be dropped.
In the span of a couple of hours we logged 452 occurrences of that message, along with a spike (compared to normal days) in this message:
Host aaaa.bbbb.cccc in vlan 200 is flapping between port Po4 and port Po1
On a typical day we see about 30 of these messages, and they mostly correspond to wireless clients roaming (we have Meraki access points in bridge mode on each floor; Po4 is the port channel connected to one floor, Po1 is the port channel to another floor, and so on).
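(For anyone chasing the same symptom: you can check which port a flapping MAC is currently learned on, and compare what is learned on each uplink port channel. The MAC address below is just the placeholder from the log message, and the VLAN/port names are from our setup.)

```
! Where is this MAC currently learned?
show mac address-table address aaaa.bbbb.cccc

! What is learned on each uplink port channel in VLAN 200?
show mac address-table interface Po1 vlan 200
show mac address-table interface Po4 vlan 200
```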
Checking snooping statistics
sh ip dhcp snooping statistics detail
we see:
Packets Processed by DHCP Snooping = 5050352
Packets Dropped Because
IDB not known = 0
Queue full = 150724
Interface is in errdisabled = 0
Rate limit exceeded = 0
Received on untrusted ports = 0
Nonzero giaddr = 4648
Source mac not equal to chaddr = 0
No binding entry = 0
Insertion of opt82 fail = 0
Unknown packet = 0
Interface Down = 0
Unknown output interface = 6
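(During an incident it's handy to watch whether the "Queue full" counter is still climbing; the standard IOS output filter works on this show command, so something like the following, repeated a minute or two apart, tells you whether drops are ongoing or historical.)

```
show ip dhcp snooping statistics detail | include Queue full
```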
On a normal day we never see the dhcp_snoop_redQ error. We'd seen it only once before on this switch, about 3 months ago; that instance was out of hours with very few clients in the building, so we rebooted the switch, which immediately resolved the issue.
Today's incident occurred at the peak usage period, so we couldn't power-cycle the switch. We stood up a workaround wifi solution for 99% of users, and they got back to work.
A few hours later the issue is no longer present and IP addresses are allocated as normal without these messages in the logs.
Our topology has a number of wireless access points connected to a Cisco 4500 switch on each level. Each 4500 is connected by 10GE fibre to the core 6500, which hosts the VLAN interfaces, including the helper addresses pointing to our DHCP servers. The DHCP servers are Windows DHCP servers, which allocate addresses for each VLAN.
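For context, the relevant core-switch configuration looks roughly like the sketch below (addresses and interface numbers are illustrative, not our real ones): the SVI carries the helper address, DHCP snooping is enabled on the client VLAN, and the port facing the DHCP servers is trusted so the relayed OFFER/ACK packets aren't dropped.

```
ip dhcp snooping
ip dhcp snooping vlan 200
!
interface Vlan200
 ! illustrative addressing
 ip address 192.0.2.1 255.255.255.0
 ! Windows DHCP server (illustrative address)
 ip helper-address 192.0.2.10
!
interface GigabitEthernet1/1
 description Link toward DHCP servers (illustrative)
 ip dhcp snooping trust
```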
The Sup720s in the 6500 are running ADVENTERPRISEK9-M 15.1(2)SY5
Does anyone have any suggestions for further troubleshooting, thoughts as to a possible cause, and tips for preventing this from occurring again?
06-29-2016 12:32 AM
Although the software version is different, it sounds a bit related to this bug:
https://quickview.cloudapps.cisco.com/quickview/bug/CSCtg94023
Thinking sideways - how good is your DHCP server? Could you upgrade it to something that responds faster? The faster the DHCP transactions get turned around the less full the queue should be.
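Another knob that might keep the queue from filling under a burst, as a suggestion only: rate-limit DHCP on the untrusted client-facing ports, so a misbehaving client or a broadcast storm can't flood the snooping process. Test before deploying, because exceeding the limit err-disables the port unless you also enable errdisable recovery. Interface numbers below are illustrative.

```
! illustrative access-facing ports
interface range GigabitEthernet2/1 - 48
 ! cap DHCP packets per second per port
 ip dhcp snooping limit rate 15
!
errdisable recovery cause dhcp-rate-limit
errdisable recovery interval 300
```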
I see 15.1.2-SY7 is out. Could you upgrade to that? When I search the release notes for DHCP, about half a dozen DHCP-related caveats appear to have been resolved.
http://www.cisco.com/c/en/us/td/docs/switches/lan/catalyst6500/ios/15-1SY/release_notes.html