Solved: Switch not reliably forwarding DHCP requests

RyanMcC · ‎07-14-2020

Hi, hopefully this is the right place for this question.

We have a switch stack of three 9300 switches running IOS XE that handle networking for the office area. The office machines sit on a single vlan. DHCP is provided by a Linux DHCP server sitting on the same vlan on another switch. Everything had been working fine for the last few years, then, a few weeks ago, machines in the office stopped reliably receiving DHCP addresses. Leave a machine for 5-10 minutes and it would probably get an IP address and then work flawlessly until it released it.

I've been having a very enjoyable time of troubleshooting and packet capturing. We have rebooted the switch stack (I'm a big fan of turning it off and on again) with no change, and there have been no changes to its config. If you plug a machine into the other switch (or any other switch with the same vlan), the machine gets a DHCP address almost instantly.

Running packet captures at various points shows that machines plugged into the switch stack are sending out DHCP requests, but the majority of those DHCP broadcasts don't make it out of the switch. If one does, the DHCP server responds straight away and the response makes it back to the client, which then responds (via a broadcast), and most likely that packet disappears as well. The negotiation times out and the cycle begins again. Eventually, during one of those cycles, the packets magically get forwarded, the machine gets an IP, the broadcasts stop and everything continues fine.

I can't find any counters, errors or logs showing that the switch is dropping or not forwarding packets, but it seems to be. In the last packet capture, over a period of about five minutes, I saw:

35 DHCP discovery packets

1 DHCP offer packet from the server

8 DHCP request packets from the client

This cycle repeated once, then:

10 DHCP discovery packets

1 DHCP Offer from the server

1 DHCP request from the client

1 DHCP Ack from the server

and then everything worked fine.

On a different switch, there might be two DHCP discover packets, then one of each of the others, and it is close to instantaneous.
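For anyone who wants to produce counts like the ones above from their own capture, here is a minimal sketch. It is not what was used in this thread. It assumes you have already extracted the raw UDP payloads (ports 67/68) from the capture by some other means, and the names `dhcp_message_type`, `tally` and `fake_dhcp` are made up for illustration. It reads the message type from DHCP option 53.

```python
from collections import Counter

# DHCP message types carried in option 53 (RFC 2132).
DHCP_TYPES = {1: "DISCOVER", 2: "OFFER", 3: "REQUEST", 4: "DECLINE",
              5: "ACK", 6: "NAK", 7: "RELEASE", 8: "INFORM"}
MAGIC_COOKIE = b"\x63\x82\x53\x63"
OPTIONS_OFFSET = 236  # length of the fixed BOOTP header before the cookie


def dhcp_message_type(payload):
    """Return the DHCP message type name for one UDP payload, or None."""
    if payload[OPTIONS_OFFSET:OPTIONS_OFFSET + 4] != MAGIC_COOKIE:
        return None
    i = OPTIONS_OFFSET + 4
    while i < len(payload):
        code = payload[i]
        if code == 255:          # end option
            break
        if code == 0:            # pad option has no length byte
            i += 1
            continue
        if i + 1 >= len(payload):
            break                # truncated option, give up
        length = payload[i + 1]
        value = payload[i + 2:i + 2 + length]
        if code == 53 and len(value) == 1:
            return DHCP_TYPES.get(value[0], f"UNKNOWN({value[0]})")
        i += 2 + length
    return None


def tally(payloads):
    """Count message types across an iterable of UDP payloads."""
    return Counter(t for t in map(dhcp_message_type, payloads) if t)


def fake_dhcp(msg_type):
    """Build a minimal synthetic payload, for demonstration only."""
    return bytes(OPTIONS_OFFSET) + MAGIC_COOKIE + bytes([53, 1, msg_type, 255])
```

Feeding it ten synthetic discovers plus one offer, one request and one ack (the shape of the final successful cycle above) gives a `Counter` with `DISCOVER` at 10 and the other three at 1. Comparing the tally taken on the client side with one taken on the server side shows which message types are going missing in between.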

Has anyone got any hints or tips on what to look for? It seems to have something to do with the DHCP broadcasts, but what happens to them and where they go, I have no idea. The silence with which the switch seems to be dropping these packets is one of the things that gets me. I wouldn't feel as bad if there was a counter going up somewhere that at least provided a clue as to what is going on. The fact that it had been working and now is not is one of the puzzling things.

Cheers.

RyanMcC · ‎07-16-2020

Just for anyone who finds this: the network guys updated to the latest and greatest IOS XE Gibraltar 16.12.4, which was only released a few days ago. It fixes bug CSCvs91593. We don't explicitly match all of its criteria, but there are some very close similarities. After the update and a reboot, the switch appears to be back to working normally. I'm not sure if we were an edge case for the bug or if something else in the release fixed it, but I'm just happy it's behaving itself again.

paul driver · ‎07-14-2020

Hello

Do you have a large enough DHCP scope for that vlan? Have you checked that the scope isn’t getting exhausted by too many requests or too short a lease time?
Do you have STP portfast enabled on all access ports?
Do you have any layer 2 security features enabled, such as port-security, DHCP snooping, IPSG or DAI?

Kind Regards
Paul

RyanMcC · ‎07-15-2020

Plenty of room left in the DHCP scope (about 100 IPs available for currently about 15 machines). From the packet dumps, when a request makes it to the DHCP server, the server responds with an IP pretty much immediately.

The default port settings have a lot of options:

interface GigabitEthernet3/0/5
 description DHCP testing
 switchport access vlan 3
 switchport mode access
 switchport port-security maximum 3
 switchport port-security aging time 5
 switchport port-security aging type inactivity
 switchport port-security
 trust device cisco-phone
 snmp trap mac-notification change added
 snmp trap mac-notification change removed
 auto qos voip cisco-phone
 spanning-tree portfast
 spanning-tree bpduguard enable
 service-policy input AutoQos-4.0-CiscoPhone-Input-Policy
 service-policy output AutoQos-4.0-Output-Policy
 ip dhcp snooping limit rate 12
end

But I've also tried stripping it down to the bare minimum (I think that's the bare minimum):

interface GigabitEthernet3/0/5
 description DHCP testing
 switchport access vlan 3
 switchport mode access
end

And I get the same result.
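As a rough aid for this kind of comparison, the sketch below scans an interface stanza for lines that belong to features which can interfere with DHCP traffic. The keyword table and the name `features_in` are illustrative assumptions, not an exhaustive or official list, and features such as DHCP snooping and DAI are also enabled globally or per vlan, so a clean interface stanza does not prove they are off.

```python
# Line prefix -> feature name. Illustrative only, not exhaustive.
FEATURES = {
    "switchport port-security": "port-security",
    "ip dhcp snooping": "DHCP snooping",
    "ip verify source": "IP Source Guard",
    "ip arp inspection": "Dynamic ARP Inspection",
    "spanning-tree bpduguard": "BPDU guard",
    "service-policy": "QoS service policy",
}

# Security and QoS related lines taken from the full port config above.
FULL_STANZA = """\
interface GigabitEthernet3/0/5
 switchport access vlan 3
 switchport mode access
 switchport port-security maximum 3
 switchport port-security
 spanning-tree portfast
 spanning-tree bpduguard enable
 service-policy input AutoQos-4.0-CiscoPhone-Input-Policy
 ip dhcp snooping limit rate 12
"""

MINIMAL_STANZA = """\
interface GigabitEthernet3/0/5
 description DHCP testing
 switchport access vlan 3
 switchport mode access
"""


def features_in(stanza):
    """Return feature names found in the stanza, in order of first appearance."""
    found = []
    for line in stanza.splitlines():
        line = line.strip()
        for prefix, name in FEATURES.items():
            if line.startswith(prefix) and name not in found:
                found.append(name)
    return found
```

`features_in(FULL_STANZA)` lists port-security, BPDU guard, the QoS service policy and DHCP snooping, while `features_in(MINIMAL_STANZA)` returns an empty list, which matches the point being made here: the problem persisted with none of those per-port features configured.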

Abdulila Alhosaine · ‎07-15-2020

Check the DHCP server configuration. You need to check the lease time and the scope range. Do you have static routing to the default gateway, or do you use DHCP relay on the firewall?
Check if there is a duplicate IP address in your network.
Check the cable and the STP configuration.
Check the output of sh ip dhcp snooping binding.
Check the configuration for vlan 3.

Might Ncube · ‎08-13-2022

Hi all.

Had been searching everywhere for a solution for this issue. While the conditions discussed here are not 100% similar to mine, I am convinced that the issue is the same.

1) My environment runs Cisco 3850 access switch stacks with two vlans, Voice and Data only. They are running IOS XE 16.12.5b. DHCP servers are Windows 2016 and are tucked away in the datacenter.

2) We are using NAC supplied by Ivanti, formerly Juniper, Pulse Policy Server with dot1x and MAB.

3) We do not have snooping enabled.

The issue started on the 3rd day of NAC implementation. The client would send a Discover for an IP, the switch would successfully forward the message to the server, and the server would send back the Offer. For some odd reason, the Offer would not make it to the client, effectively cutting the DHCP process short.

Another indication that the issue was on the switch was that the ARP table had the IP-to-MAC mapping even though the client remained with an APIPA address. In the access-session table, the endpoints were MAB authorised, none on dot1x.
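That symptom can be checked mechanically. The sketch below is only an illustration with made-up names (`is_apipa`, `lease_incomplete`): it treats an address in 169.254.0.0/16 as APIPA and flags the case where the switch has learned a real address for a client that is still sitting on a self-assigned one.

```python
import ipaddress

# IPv4 link-local range used for APIPA self-assignment.
APIPA_NET = ipaddress.ip_network("169.254.0.0/16")


def is_apipa(addr):
    """True if addr is an IPv4 link-local (APIPA) address."""
    return ipaddress.ip_address(addr) in APIPA_NET


def lease_incomplete(client_addr, switch_arp_addr):
    """True when the client is self-assigned but the switch maps its MAC
    to a non-APIPA address, i.e. the Offer reached the switch but the
    client never completed the exchange."""
    return is_apipa(client_addr) and not is_apipa(switch_arp_addr)
```

For example, `lease_incomplete("169.254.10.20", "10.0.3.25")` is true, while a client that actually holds `10.0.3.25` gives false. The addresses here are invented for the example.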

We have been all over everything trying to understand why the switch would not forward the Offers to the clients. We focused on DHCP options without success, did a lot of packet captures in different places, and suspected rate limiting and cpp policies.

Now that we have found this bug disclosure, we would like to upgrade the code and observe. If that solves the issue, I will report back to this forum.

Dan Hoag · ‎04-19-2024

Hello. Just wondering if you were able to resolve this? We're on 16.12.10a with what sounds like the exact scenario.

Might Ncube · ‎04-19-2024

@Dan Hoag we did eventually get it fixed. We ran the commands below on all the switches and have not experienced any DHCP issues since.

I just copied from a production switch right now.

ip device tracking probe auto-source fallback 0.0.0.1 255.255.254.0
ip device tracking probe delay 10
ip dhcp relay information policy keep
ip dhcp relay information trust-all

Regards Might
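If you need to confirm that the four lines above are present across many switches, a small check over saved running-config text could look like the sketch below. It is only an illustration: `missing_commands` is a made-up helper, it assumes you already have the config as a string, and some releases may render the device tracking commands in a different form in the running config, in which case an exact text match would report them as missing.

```python
# The four global commands quoted in the post above.
REQUIRED = [
    "ip device tracking probe auto-source fallback 0.0.0.1 255.255.254.0",
    "ip device tracking probe delay 10",
    "ip dhcp relay information policy keep",
    "ip dhcp relay information trust-all",
]


def missing_commands(running_config, required=REQUIRED):
    """Return the required commands that do not appear as a line in the config.

    Whitespace is normalised so indentation and double spaces do not matter.
    """
    present = {" ".join(line.split()) for line in running_config.splitlines()}
    return [cmd for cmd in required if cmd not in present]
```

An empty list means all four lines were found. Anything returned is a candidate for a closer look on that switch rather than proof that the setting is absent.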