Re: C881/K9 DHCP server stops working for one pool only, after a couple days of operation

BostonAutomation · ‎07-04-2018

All,

I am an IT service provider with a number of Cisco ISR devices deployed for several of my customers. I've been rolling out Cisco IOS based solutions for over 20 years (since the early 90s) and generally I've been able to troubleshoot anything that's come up. But this time, I'm at a loss. I'm pretty sure I've run into a bug of some sort, except that updating the firmware a couple of times has not altered the problem one bit, and a pretty exhaustive search of both this site and the Internet at large hasn't turned up any other reports of this issue. I'm wondering if anybody here can see anything I'm missing or, alternatively, suggest a troubleshooting avenue that I haven't already explored.

Here's the story: a company with multiple sites, connected together with DMVPN (GRE tunnels, NHRP, etc.). The company headquarters site is running a 2900 series router (2951/K9, to be exact), with 800 series routers (primarily C881/K9) installed at the branch locations. I mention this just for context; the problem I'm having is not related to the VPN or any of the sites ... except one.

At this one site, there is a C881/K9 router, running C800 Software (C800-UNIVERSALK9-M), Version 15.3(3)M5, RELEASE SOFTWARE (fc3). This router includes two VLANs. The default VLAN (1) is for employee use and has a connection through the VPN to HQ and the other sites. There is also a second VLAN (99) which is used for Public Wi-Fi. The Wi-Fi at this site is provided through a series of Open Mesh access points, each of which serves VLAN1 (the private network) via one SSID and VLAN99 (the public network) via another SSID. As you would expect, the Wi-Fi for VLAN1 is secured, whereas the Wi-Fi attached to the public VLAN is open. Again, nothing too unusual or controversial here.

The problem I'm having has to do with DHCP. The C881/K9 router at this site provides DHCP service for both VLANs. And that DHCP service works fine, for a while. In fact, for VLAN1, it works all the time, with no problems. But for VLAN99, the DHCP server simply stops talking after a couple of days. DHCP requests go unanswered, and I cannot revive that part of the DHCP service except by reloading the router. Nothing else works. I can bring VLAN99 down and up, I can type various commands to clear the pool in question, I can delete and recreate the pool... nothing gets the router to start assigning addresses in the VLAN99/public pool except restarting the router. I've actually resorted to adding a kron event that reboots the router once a day in the wee hours, just to try to keep DHCP working. This is obviously not a preferred solution.

Here are the relevant configuration lines:

no ip dhcp conflict logging
ip dhcp excluded-address 192.168.176.1 192.168.176.99
ip dhcp excluded-address 192.168.176.200 192.168.176.254
ip dhcp excluded-address 192.168.99.1 192.168.99.10
ip dhcp excluded-address 192.168.99.251 192.168.99.254
ip dhcp ping timeout 200
!
ip dhcp pool MY_POOL
 import all
 network 192.168.176.0 255.255.255.0
 netbios-name-server xx.xx.xx.xx
 domain-name xxxx.local
 default-router 192.168.176.1 
 netbios-node-type h-node
 dns-server xx.xx.xx.xx yy.yy.yy.yy 
 lease 0 12
!
ip dhcp pool public-pool
 import all
 network 192.168.99.0 255.255.255.0
 default-router 192.168.99.1 
 dns-server yy.yy.yy.yy zz.zz.zz.zz 
 lease 0 3
!
ip dhcp pool static-xxxx
 host 192.168.176.102 255.255.255.0
 client-identifier xxxxx.xxxx.xxxx.xx
 domain-name xxxx.local
 default-router 192.168.176.1 
 dns-server yy.yy.yy.yy zz.zz.zz.zz 
...
...
...
class-map match-any p2p
 match protocol edonkey
 match protocol fasttrack
 match protocol gnutella
 match protocol kazaa2
 match protocol winmx
 match protocol bittorrent
class-map match-any AutoQoS-VoIP-RTP-Trust
 match ip dscp ef 
class-map match-any AutoQoS-VoIP-Control-Trust
 match ip dscp cs3 
 match ip dscp af31 
!
policy-map p2p-drop
 class p2p
  drop
policy-map AutoQoS-Policy-Trust
 class AutoQoS-VoIP-RTP-Trust
  priority percent 70
 class AutoQoS-VoIP-Control-Trust
  bandwidth percent 5 
 class class-default
  fair-queue
...
...
...
interface FastEthernet0
 switchport mode trunk
 no ip address
!
interface FastEthernet1
 switchport mode trunk
 no ip address
!
interface FastEthernet2
 switchport mode trunk
 no ip address
!
interface FastEthernet3
 switchport mode trunk
 no ip address
...
...
...
interface Vlan1
 ip address 192.168.176.1 255.255.255.0
 ip nbar protocol-discovery
 ip nat inside
 ip virtual-reassembly in
 auto qos voip trust 
 service-policy input p2p-drop
 service-policy output AutoQoS-Policy-Trust
!
interface Vlan99
 description Public WiFi
 ip address 192.168.99.1 255.255.255.0
 ip nbar protocol-discovery
 ip nat inside
 ip virtual-reassembly in
 ip tcp adjust-mss 1452
 service-policy input p2p-drop

Note that I include the class and policy map stuff because it's referenced in the interface definitions; I doubt they have anything to do with this, though, since the p2p-drop policy is applied to both VLAN interfaces and only VLAN99 is having DHCP problems.

I have tried a number of things to track this down, including turning on debug logging, but nothing has jumped out at me. I've also tried moving the public pool and VLAN to a different range of IP addresses (and a bigger one - I thought maybe the 254 addresses available in the current Class C block might not have been enough). But none of that helped. All I can see is that, at some point, without any particular triggering event I can identify, the router stops handing out public-pool addresses, and eventually (after all the leases expire), the pool ends up empty, showing all addresses available. But it won't assign them to anyone.
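
For readers retracing these steps, the DHCP-side debugging described above corresponds roughly to the IOS exec commands below. This is a sketch, not the exact sequence used in the original troubleshooting, and debug output can be heavy on a busy public Wi-Fi segment, so it is best enabled only briefly.

```
! Sketch only: run from exec mode while clients are failing to get a lease.
! Show debug output on a VTY (SSH/Telnet) session
terminal monitor
! DHCP messages as the server process sees them
debug ip dhcp server packet
! Address assignment and lease expiry events
debug ip dhcp server events
! Message counters for the DHCP server
show ip dhcp server statistics
! Turn all debugging off again
undebug all
```

If no DHCPDISCOVER messages show up in the debug output while clients are retrying, the requests are probably not reaching the DHCP server process at all, which would point below DHCP (VLAN or switching) rather than at the pool itself.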

Oh, and by the way, I mentioned there are other sites. This same configuration (or something damned close) is in place at four or more other locations, and not one of them has this problem.

I'm at my wit's end here. Any ideas, anyone?

Leo Laohoo · ‎07-04-2018

@BostonAutomation wrote:

There is also a second VLAN (99) which is used for Public Wi-Fi. The Wi-Fi at this site is provided through a series of Open Mesh access points, each of which serves VLAN1 (the private network) via one SSID and VLAN99 (the public network) via another SSID. As you would expect, the Wi-Fi for VLAN1 is secured, whereas the Wi-Fi attached to the public VLAN is open.

So VLAN 99 is a DHCP pool for wireless clients.

When this issue occurs, is the DHCP pool exhausted? For example, by someone deliberately exhausting the pool, or by some other form of DoS?

johnlloyd_13 · ‎07-04-2018

just to add on leo's post.

it could also be a rogue DHCP server in the wifi network, which is common.

a few questions:

did you check the DHCP statistics on the 881 router to see whether the address pool has enough free IPs?

did you ask the user for their ipconfig /all and trace the MAC address of its default gateway?

BostonAutomation · ‎07-05-2018

Thanks, folks, for the thoughts. Unfortunately, yes, I did check both of the things you suggested.

I did check to see if the pool was exhausted -- that was my first thought, in fact, but no, plenty of available IPs in the pool. In fact, as I said, when this happens, by the time I find out about it and check, usually the entire pool is available, because the lease time is set to only 3 hours. So not only does that fact in itself mean that the pool is not likely to be exhausted, but it also means that even if it got exhausted, things would free up pretty quickly. And usually, by the time I get wind of it (because someone has called me to complain that "nobody can get on the public Wi-Fi"), the pool statistics show 0% of the pool used and 100% available. Also, I changed the size of the pool by changing the public network from 192.168.99.0/24 to 172.30.0.0/16 (thus increasing the pool size from about 250 addresses to over 65,000) -- made no difference. Server stopped giving out addresses after more or less the same amount of time.
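
The pool check described here can be reproduced with the commands below. This is a minimal sketch; the pool name comes from the configuration earlier in the thread, and the exact output fields vary by IOS release.

```
! Sketch only: exec-mode checks of pool usage and current leases.
! Total and leased address counts for the public pool
show ip dhcp pool public-pool
! Current leases with client identifiers and expiry times
show ip dhcp binding
```

Zero leased addresses together with an empty binding list for the public subnet would match the symptom described above: every address is free, yet none is being assigned.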

As to a rogue DHCP server on the network, yeah, that was something like my second thought, and unfortunately, it doesn't check out either. When this happens, people trying to get IPs from the public Wi-Fi network get nothing -- they get DHCP timeouts, not some incompatible address from some other server. The vast majority of the connecting devices, by the way, are mobile phones, not computers, so ipconfig wouldn't be very helpful. But equivalent tools/apps on mobile devices reveal what I said -- no address has been assigned at all.

So thanks for both of these, but alas, no dice. :(

Any other ideas?

Leo Laohoo · ‎07-05-2018

@BostonAutomation wrote:

Server stopped giving out addresses after more or less the same amount of time.

So the DHCP stops dishing out IP addresses at approximately the same time of the day? And what time would this be?

What happens if you create a brand new pool? And this pool gets hosted by something else (other than the router)?

BostonAutomation · ‎07-05-2018

@Leo Laohoo wrote:
So the DHCP stops dishing out IP addresses at approximately the same time of the day? And what time would this be?

Not really the same time of day; I meant that the elapsed time between router startup and the time that the router stops handing out addresses in this pool always seems to be, in very broad terms, somewhere around, say, 36 to 48 hours. My point was that if I increase the size of the pool by a factor of 256, it doesn't appreciably affect this. If it were related to exhaustion of the pool, you'd expect the larger pool to make things go much longer. It doesn't.

@Leo Laohoo wrote:

What happens if you create a brand new pool? And this pool gets hosted by something else (other than the router)?

Well, that's going to be my next step. I actually have a little Raspberry Pi device here that I'm going to throw a DHCP server on and plug into the network on VLAN 99 only. And then I'll just get rid of the pool on the 881 and call it a day. I would expect that to work (though anything is possible). But if it did, it would basically mean sweeping this problem under the rug rather than solving it. And I'll do that if I have to, but man, I hate giving up like that...
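
If the Raspberry Pi takes over, the router's own pool for that subnet would need to be removed so the two servers do not compete. A minimal sketch, assuming the pool and exclusions shown in the original configuration and a DHCP server attached directly to VLAN 99 (so no `ip helper-address` is involved):

```
configure terminal
 ! Remove the router's pool and its exclusions for the public subnet
 no ip dhcp pool public-pool
 no ip dhcp excluded-address 192.168.99.1 192.168.99.10
 no ip dhcp excluded-address 192.168.99.251 192.168.99.254
end
```

The Vlan99 interface keeps 192.168.99.1 in this sketch, so the external server would still hand out that address as the default gateway.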

Leo Laohoo · ‎07-05-2018

Use the rPi to do the DHCP work for, say, 2 weeks. If this issue doesn't occur, then you're looking at a bug.

BostonAutomation · ‎07-05-2018

@Leo Laohoo wrote:
Use the rPi to do the DHCP work for, say, 2 weeks. If this issue doesn't occur, then you're looking at a bug.

Yeah, that was pretty much the conclusion I came to too, Leo. Was hoping someone would see something funky in my configuration that I've been missing, or know of an existing bug with some known workaround. I guess not. But thanks, at least, for confirming my analysis. At least I know I'm not crazy. :)

Leo Laohoo · ‎07-05-2018

Just to add, the issue could either be a bug (in the IOS) or a misconfiguration somewhere.
I remember a case where my original config was for a DHCP pool with a /24, and I changed it (about a year later) to a /23 but forgot to change the NAT table. What happened was that half the network could get to the net while the other half couldn't.
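
This kind of mismatch is worth checking against any pool that has been resized, including the move from 192.168.99.0/24 to 172.30.0.0/16 mentioned earlier in the thread. The NAT configuration is elided from the posted config, so the ACL number, addresses, and outside interface below are all hypothetical; the sketch only shows the shape of the fix, which is widening the wildcard mask so the NAT source list covers the whole /23.

```
! Hypothetical example: ACL 10, 10.10.0.0/23 and FastEthernet4 are placeholders,
! not values taken from the thread.
! Before: only 10.10.0.0/24 is translated
!   access-list 10 permit 10.10.0.0 0.0.0.255
! After: wildcard 0.0.1.255 covers 10.10.0.0 through 10.10.1.255
configure terminal
 no access-list 10
 access-list 10 permit 10.10.0.0 0.0.1.255
 ip nat inside source list 10 interface FastEthernet4 overload
end
```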

compendic · ‎08-13-2018

So... several weeks later, I have some new, potentially interesting information about this.

As I had suggested I might try, I did add a Raspberry Pi device into the network and gave it the task of handing out DHCP leases for the subnet that the router kept failing to serve.

And, indeed, it solved the problem. Except not really. What it did was reveal the real problem, which (not surprisingly, at this point) was not about DHCP at all. You see, once I added the new device to do DHCP for that subnet, DHCP assignments continued properly and did not stop. But devices in the subnet in question still were no longer able to reach the Internet after a certain amount of time elapsed. And, in fact, it turns out that, when this happens, devices in the affected subnet can no longer reach (ping) the router, either!

So, after a lot of troubleshooting, I discovered what the actual problem is. It appears to not be a DHCP issue at all; it is, instead, a VLAN problem. After operating for a period of time, the router stops responding to 802.1q tagged packets on its switchport interfaces (operating in trunk mode). I don't know if it's a bug in the 802.1q process or something else, but it's definitely VLAN related. Once this problem has occurred, the router cannot ping (or respond to pings from) any device on the tagged VLAN. The VLAN interface itself seems to be up, and it can ping itself at that IP address. But traffic no longer moves between the VLAN interface and the Ethernet switchport interfaces that are configured in trunk mode.
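
To narrow down where tagged traffic stops, checks along the following lines may help once the failure has occurred. This is a sketch: these are standard IOS commands, but their availability and output on the integrated switch ports of the 881 and 891F vary by platform and release, so treat the list as a starting point and not as a verified procedure.

```
! Sketch only: exec-mode checks after VLAN 99 has stopped passing traffic.
! Is the SVI up/up, and are its input/output counters still moving?
show interfaces Vlan99
! Are the ports still trunking, and is VLAN 99 allowed and active on them?
show interfaces trunk
show interfaces FastEthernet0 switchport
! Does VLAN 99 still exist in the switch VLAN database?
show vlan-switch
! Are client MAC addresses still being learned?
show mac-address-table
! Does the router still hold ARP entries for clients behind the SVI?
show ip arp Vlan99
```

Comparing this output between a healthy state (just after a reload) and the failed state is likely to be more informative than either snapshot alone.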

I'm considering starting a new topic here to ask about this newly discovered problem (which, by the way, I have now observed on both this 881 router and an 891F router as well). But before I do, is there anything in my configuration (refer back to the top of this thread) that seems wrong? I mean, it's pretty straightforward....

Georg Pauwen · ‎08-14-2018

Hello,

on a side note, what brand/model are the WiFi access points?

There is not much you can configure on the trunk ports of the 881 other than explicitly allowing only VLAN 99 (switchport trunk allowed vlan 99). I wonder whether adding that to the trunk ports makes a difference...
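
This suggestion could be tried roughly as follows. One caveat: the access points described earlier carry both VLAN 1 and VLAN 99, so restricting the trunk to VLAN 99 alone would cut off the private SSID, and the sketch therefore keeps VLAN 1. It also assumes, as is commonly the case on ISR integrated switch ports, that IOS expects VLANs 1 and 1002-1005 to stay in the allowed list.

```
configure terminal
 interface FastEthernet0
  ! Carry only the two VLANs in use, plus the default VLANs assumed to be required
  switchport trunk allowed vlan 1,99,1002-1005
end
! Verify the result
show interfaces FastEthernet0 switchport
```

The same change would be repeated on FastEthernet1 through FastEthernet3 if those ports also connect to access points.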