11-09-2016 05:29 PM - edited 03-08-2019 08:06 AM
OK Cisco peeps, here's a weird one for your consideration.
I have this network with a Catalyst 2960XR stack as the switching core. I have Aerohive APs and lots of wireless clients. I have lots of VLANs. 100 for infrastructure and 102 for WiFi guests are of interest here. I have trunk ports configured to permit VLANs 100 & 102 to the Aerohives. So far so good. But here's what's happening:
A guest shows up with a WiFi client, connects to a guest SSID on VLAN 102. Most of the time this works great. The device gets a VLAN 102 DHCP lease from the DHCP service on VLAN 100 with a helper IP address configured, and we're off to the races. BUT, every now and then, especially under a heavy load like when there's a meeting with lots of WiFi clients connected, some clients acquire a lease but can't browse. Depending on the operating system, there's usually some indication in the WiFi settings that it's connected, but there's no Internet, almost like there's a default gateway problem.
While this is happening, I can't ping the address of the WiFi device from the Catalyst stack AND I can't even ping the Catalyst from the edge firewall. This condition can persist for a few minutes up to a few hours, after which, suddenly the device can be pinged from the Catalyst stack. With no intervention on my part. It just starts working.
Someone suggested this might be an ARP caching problem, and it still might be. But I've checked the ARP caching capacity of the Aerohive AP, the Catalyst and even the edge firewall, and all are well under their ARP caching capacities. I don't have a port-security policy enabled on any of the Catalyst ports that serve the APs. The MAC of the WiFi client seems to successfully register right away in all the networking components in the chain when it acquires a DHCP lease.
It's almost like the switch stack can't resolve the MAC of the WiFi device, even though it's in all the right ARP tables. Or there's some limit I don't know about on the number of MACs that can be reached on a single port.
I've spent hours, days even, on the phone with support engineers from Aerohive and Checkpoint. Seems like their gear is OK, so now I'm looking at the Catalyst a bit more closely. If these symptoms mean anything to you, shoot me your ideas, no matter how crazy.
Dale.
11-09-2016 07:43 PM
At the time of the issue, have you checked the ARP table and the MAC address table on your 2960 stack? What do they say, and if you clear both tables, do they repopulate after you ping a WiFi client?
Also, do your wireless clients stay associated?
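One way to make that comparison less error-prone during an outage is to paste the output of 'show ip arp' and 'show mac address-table' into a small script and list the clients that have an ARP entry but no matching MAC address table entry. The following is only a rough Python sketch: the function names are made up for illustration, and it assumes the usual dotted MAC format (xxxx.xxxx.xxxx) that IOS prints, so adjust it if your output looks different.

```python
import re

# Assumption: IOS prints MACs as three dot-separated groups of four hex digits.
MAC_RE = re.compile(r"\b[0-9a-f]{4}\.[0-9a-f]{4}\.[0-9a-f]{4}\b", re.IGNORECASE)
IP_RE = re.compile(r"\b\d{1,3}(?:\.\d{1,3}){3}\b")


def parse_arp(text):
    """Map IP -> MAC for every line that contains both (headers are skipped)."""
    entries = {}
    for line in text.splitlines():
        ip = IP_RE.search(line)
        mac = MAC_RE.search(line)
        if ip and mac:
            entries[ip.group()] = mac.group().lower()
    return entries


def parse_mac_table(text):
    """Collect every MAC that appears in pasted 'show mac address-table' output."""
    return {m.group().lower() for m in MAC_RE.finditer(text)}


def arp_entries_missing_from_mac_table(arp_text, mac_text):
    """Return (ip, mac) pairs that are in the ARP table but not in the MAC table."""
    known_macs = parse_mac_table(mac_text)
    return sorted(
        (ip, mac) for ip, mac in parse_arp(arp_text).items() if mac not in known_macs
    )
```

Any pair it returns is a client the switch can resolve at layer 3 but has no layer 2 forwarding entry for, which would be worth looking at first.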
11-10-2016 10:55 AM
Hi Dennis. Yes, the MACs of the problem WiFi clients seem to register correctly in the ARP table of the Catalyst as soon as they acquire a DHCP lease, so that looks nominal at first glance. Nonetheless, I've tried clearing the dynamic entries using 'clear arp-cache', but either that command doesn't work or the dynamic entries repopulate in the time it takes to type 'sh arp', because the ARP table looks exactly the same before and after issuing that command.
The WiFi clients do stay associated. I've spent mountains of time on the phone with Aerohive adjusting radio strengths, monitoring clients in real time, fiddling with firewall policy, updating firmware, you name it, we've done it.
Weirder still, even when WiFi clients are in the non-browsing state, I can see their browser traffic passing in the edge firewall logging when they try to browse. HTTP requests go out the outside interface, replies come back, are NAT translated and sent back to the client. According to the edge firewall, everything is working fine. No denies, dropped packets or anything other than successful traffic. While this is going on though, a traceroute from the edge firewall inside interface to the WiFi client fails immediately at the Catalyst, which is the first hop. Then, after some period of time, the Catalyst decides it wants to play again and traffic starts moving on its own. Traceroute succeeds, clients can browse, the world is right again.
It really seems like an ARP problem with the Catalyst, but I can't see any reason why. I've turned on ARP debugging on the Catalyst, but I don't really know what that does. Probably just adds needless CPU usage at this point until I figure out how to use the debugging.
DG.
11-10-2016 11:33 AM
Hi Dale,
I'm assuming that the 2960-XR box is running IP-Lite feature set and is the default gateway for the VLANs.
First of all, this might be a bug. Which IOS version are you running on the switch?
How many clients are we talking about?
How is the CPU load on the switch during an outage?
You mention that the firewall loses its connection with the switch during this time. When you ping from the firewall, are you pinging the switch IP facing the firewall or the ones facing the clients?
Regards,
Sigurbjartur
11-10-2016 12:32 PM
Hi SH,
I _think_ the stack is running IP-Lite licensing. That rings a bell from when I was buying these switches. Yes, I have DG interfaces defined on the stack for all the VLANs.
IOS = Cisco IOS Software, C2960X Software (C2960X-UNIVERSALK9-M), Version 15.0(2)EX5, RELEASE SOFTWARE (fc1)
# of clients varies, but by unscientific observation, I'd say the problem gets worse as the number of clients increases. Typical load from daily staff devices is around 20 WiFi clients on VLAN102. That number can increase to 50 or 60 when there's a meeting, so we're not talking about big numbers.
CPU usage rarely touches 5%. This is a huge stack, probably overkill for the job it's doing. It has been running for more than a year, but I need this kind of uptime to support a 911 dispatch service in the building that runs 24x7x365. I do have some switching redundancy configured, but a software version change I assume would require a stack reboot. So I'll only do that if I'm convinced that'll work, but at this point I'm running out of ideas so that's not out of the question. Are you aware of a specific bug in IOS that fits these symptoms, or are you just floating the idea of a bug?
When I ICMP the Catalyst from the edge firewall inside interface (10.0.0.250), I'm trying to hit the DG address of the infrastructure VLAN 100 (10.0.0.254), which is the same VLAN the firewall inside interface lives on. 10.0.0.254 is the first hop in a traceroute from the firewall (10.0.0.250) to the Wifi clients (10.0.2.0/24) and that's the place where the tracert first fails when there's a problem. So weird, because even when this is going on, the edge firewall logs successful traffic from the Wifi clients, which should be impossible.
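To spell out why that matters: the ping from 10.0.0.250 to 10.0.0.254 never needs routing, while anything from the firewall to 10.0.2.x has to be routed by the Catalyst. A quick Python sketch of that check follows. Note the /24 mask on VLAN 100 is an assumption on my part (it isn't stated above), 10.0.2.37 is just an example client address, and the helper name is made up.

```python
import ipaddress


def needs_routing(src, dst, prefix_len):
    """True if dst lies outside the subnet that src sits in (so a gateway is needed)."""
    src_net = ipaddress.ip_interface(f"{src}/{prefix_len}").network
    return ipaddress.ip_address(dst) not in src_net


# Assumption: VLAN 100 uses a /24 mask; 10.0.2.37 is a hypothetical guest client.
firewall = "10.0.0.250"
print(needs_routing(firewall, "10.0.0.254", 24))  # firewall -> Catalyst VLAN 100 address
print(needs_routing(firewall, "10.0.2.37", 24))   # firewall -> WiFi guest client
```

If the first check says no routing is needed and that ping still fails, the failure is on the shared VLAN 100 segment itself (ARP or layer 2 forwarding between the firewall and the stack), not in inter-VLAN routing.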
Does any of that make sense to you?