03-27-2024 07:48 AM
Sorry for the long post. We're having a complicated issue I'm hoping someone can shed some light on.
We recently migrated uplinks for our (3) SSO 8540 WLCs (8.10.190.6) and (2) SSO 9800 WLCs (17.9.4a/APSP8) off a 6500 VSS pair and onto a 9500X stacked pair. The initial migration was just a Layer 2 migration, and it went well. The second part was moving routing off the 6500 and onto the 9500X. Next business day, all the controllers started randomly switching over due to no ping response from the gateway, we were losing pings from our NMS to some downstream switches and firewalls, clients had trouble connecting/had delays in starting to load Web pages, we saw lots of incomplete ARP entries in the 9500X ARP table, etc.
We realized that, due to control plane policing on 9500Xs limiting the number of ARP/SNMP/ping/etc. packets per second, ARP packets for the default gateway were getting dropped. We raised the limit from the default 1000 pps to 10000 pps, which stabilized the environment, but now the 9500X CPU is at 55% or so constantly (as was the 6500 before migration).
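To illustrate why raising the policer limit stopped the drops, here is a minimal sketch of a generic token-bucket policer. It is not the 9500X's actual CoPP implementation; the function name `police`, the one-second bucket depth, and the 2,000 pps offered load are all assumptions chosen for illustration.

```python
def police(arrivals_per_second, rate_pps, burst_packets=None):
    """Simulate a simple token-bucket policer, one step per second.

    arrivals_per_second: packet counts offered in each one-second interval.
    rate_pps: tokens (packets) added to the bucket every second.
    burst_packets: bucket depth; assumed to be one second of tokens if omitted.
    Returns (forwarded, dropped) totals.
    """
    bucket_depth = rate_pps if burst_packets is None else burst_packets
    tokens = bucket_depth  # start with a full bucket
    forwarded = dropped = 0
    for arrivals in arrivals_per_second:
        # Refill, capped at the bucket depth.
        tokens = min(bucket_depth, tokens + rate_pps)
        passed = min(arrivals, tokens)
        tokens -= passed
        forwarded += passed
        dropped += arrivals - passed
    return forwarded, dropped


# Hypothetical offered load: 2,000 punted packets per second for 10 seconds.
offered = [2000] * 10
low = police(offered, rate_pps=1000)
high = police(offered, rate_pps=10000)
print("1000 pps limit  -> forwarded, dropped:", low)
print("10000 pps limit -> forwarded, dropped:", high)
```

Under these assumptions the 1000 pps limit drops half of the offered packets, while the 10000 pps limit drops none. The cost is that every packet that is no longer dropped has to be handled by the CPU.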
We've observed disproportionate ARP traffic incoming to the 9500X; for every 5k packets that go out, 2 million come in. (There are around 50k wireless clients connected during the day.) We're trying to determine if this is expected/normal behavior or if there's some bug or misconfiguration. We have a TAC case open and took some packet captures, and they are going to (try to) reproduce the issue in their lab. They're focusing on a 15-second interval in which 1207 ARP requests left the 9500's SVI.
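For scale, the figures above work out as follows. This is plain arithmetic on the numbers quoted in this post; the variable names are only illustrative.

```python
# Figures quoted in the post above.
arp_in = 2_000_000   # ARP packets coming into the 9500X
arp_out = 5_000      # ARP packets going out over the same period
requests = 1207      # ARP requests that left the SVI
window_s = 15        # length of the interval TAC is focusing on

in_out_ratio = arp_in / arp_out
svi_request_rate = requests / window_s

print(f"inbound:outbound ARP ratio = {in_out_ratio:.0f}:1")
print(f"SVI ARP request rate = {svi_request_rate:.1f} pps")
```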
We're continuing to investigate on our own and have a packet sniffer attached to a trunk port on the 9500 that's mirroring the traffic incoming from a specific WLC. We're seeing ARP requests originating from our internal VLAN to our guest VLAN and vice versa. This only happens on the 8540 WLCs, not 9800 WLCs. (We have ARP proxy enabled on the 9800s.) Is this normal/expected, or is there some setting that I should look at? The only reference for ARP proxy I see on AireOS is related to passive clients (see below), but I can't find any information about setting its status (if that's even an option).
I was also troubleshooting a client-specific issue and took a packet capture from a 9800 WLC and observed that, when the client sent an ARP request for the default gateway, it got a couple dozen responses from Intel devices with IPs other than the gateway. Same thing after the client sent an ARP announcement a second later. I'm wondering if there's some bug with Intel NICs causing them to respond to packets they shouldn't be. But with ARP proxy, shouldn't the Intel devices never receive the ARP request from the client? Unless maybe they're associated to the same AP/radio as the client?
Thanks for any insight you may be able to give.
03-27-2024 09:00 AM
- There are too many elements here to provide a clear solution, so I can only give some advice:
ARP proxy should be handled by the controller: https://www.cisco.com/c/en/us/td/docs/wireless/controller/9800/17-3/config-guide/b_wl_17_3_cg/m_arp_proxy.html#config-address-res-prot-prox-cli
- As far as the move from the 6500 VSS pair onto the 9500X stacked pair is concerned, a simpler topology could be tried by adding an extra leaf switch: terminate the 9800 on that and connect the leaf switch to the above infrastructure(s).
- General advice that always applies: check the 9800 WLC configuration with the CLI command show tech wireless and feed the output to: Wireless Config Analyzer
For the 8540, use: WirelessAnalyzer input (procedure) for AireOS controllers
and feed the output into WirelessAnalyzer too.
M.
03-27-2024 09:16 AM
Did you push the default gateway IP to Wi-Fi clients via DHCP?
MHM
03-27-2024 11:59 AM
@marce1000 - Yes, we do have ARP proxy enabled on the 9800 controller. The weird inter-VLAN ARPing only seems to happen on the 8540 controllers, not 9800s. I haven't found any information on enabling/setting ARP proxy on AireOS (if it's an option).
As for topology, it is pretty simple as is. Below is what it looks like (diagram from here). The StackWise pair is the 9500X pair, which is configured as an L3 switch. It's the same for all 5 controllers; the only difference is that the 8540 RPs are copper, so they are not directly connected but are instead hooked up to a C9300 on each side that has L2 port channel uplinks to the 9500s.
@MHM Cisco World - Yes, the default gateway is provided by DHCP. That reminds me, our internal network is a single /16 subnet in a VLAN that is shared across all 5 controllers. Same with the guest network, though that's a multinetted SVI with three /19s.
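To put the size of those broadcast domains in numbers, here is a small sketch using Python's standard `ipaddress` module. The prefixes are hypothetical placeholders (the real address ranges are not given in this thread); only the prefix lengths come from the post.

```python
import ipaddress


def usable_hosts(prefix):
    """Usable IPv4 host addresses in a prefix (network and broadcast excluded)."""
    return ipaddress.ip_network(prefix).num_addresses - 2


# Hypothetical prefixes; only the /16 and /19 lengths come from the post.
internal = usable_hosts("10.0.0.0/16")
guest = sum(usable_hosts(p) for p in ("10.1.0.0/19", "10.1.32.0/19", "10.1.64.0/19"))

print("internal VLAN (/16):", internal)
print("guest VLAN (3 x /19):", guest)
```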
04-08-2024 12:33 AM
From your last answer I guess there are 5 WLCs not in the same Mobility domain, is that right?
As every WLC manages proxy ARP for the clients it knows that are under the same Flex profile, I guess this might have something to do with this topology, where clients connected to APs on one WLC respond to ARP requests from a client on another WLC. Maybe that is something the feature does not expect.
Please keep us updated on the findings.
04-08-2024 09:49 AM - edited 04-08-2024 09:51 AM
@JPavonM - Yes, the 5 WLCs (8540 and 9800) are in the same mobility group.
We don't use FlexConnect for local switching except for a handful of locations. That said, most of the APs on the 8540 WLCs are 1815Ws that are in FlexConnect mode, for RLANs for the wired ports so wired devices get switched locally (wireless clients still get switched centrally).
We have since learned that the amount of ARP traffic seen coming into the 9500 from the WLCs ranges from around 50 to 75pps on the 9800s to 1,500 to 2,000pps on the 8540s. All controllers have similar client counts. Also, the traffic coming into the 9500 from the 8540s is mostly unicast ARP responses with destinations of wireless clients. I'm not yet sure if the destination client is a client on a different controller or the same controller.
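For anyone wanting to break a capture down the same way (request vs. reply, unicast vs. broadcast, gratuitous or not), here is a minimal sketch that classifies raw Ethernet frames using only the standard library. The names `classify_arp` and `make_frame`, and the MAC/IP addresses in the sample frames, are hypothetical; reading frames out of an actual capture file is left out.

```python
import struct
from collections import Counter

ETHERTYPE_ARP = 0x0806
ETHERTYPE_VLAN = 0x8100  # 802.1Q tag, as seen when mirroring a trunk port
BROADCAST = b"\xff" * 6
ARP_FORMAT = "!HHBBH6s4s6s4s"  # 28-byte ARP payload for IPv4 over Ethernet


def classify_arp(frame):
    """Return a label such as 'unicast reply', or None if the frame is not ARP."""
    if len(frame) < 14:
        return None
    dst = frame[0:6]
    ethertype = struct.unpack("!H", frame[12:14])[0]
    offset = 14
    if ethertype == ETHERTYPE_VLAN:  # skip a single 802.1Q tag
        if len(frame) < 18:
            return None
        ethertype = struct.unpack("!H", frame[16:18])[0]
        offset = 18
    if ethertype != ETHERTYPE_ARP or len(frame) < offset + 28:
        return None
    htype, ptype, hlen, plen, oper, _sha, spa, _tha, tpa = struct.unpack(
        ARP_FORMAT, frame[offset:offset + 28]
    )
    if (htype, ptype, hlen, plen) != (1, 0x0800, 6, 4):
        return None
    kind = {1: "request", 2: "reply"}.get(oper)
    if kind is None:
        return None
    if spa == tpa:  # sender IP equals target IP: gratuitous ARP / announcement
        kind = "gratuitous " + kind
    cast = "broadcast" if dst == BROADCAST else "unicast"
    return f"{cast} {kind}"


def make_frame(dst, oper, spa, tpa, vlan_id=None):
    """Build a minimal sample frame (hypothetical addresses) for trying this out."""
    src = bytes.fromhex("020000000001")
    tag = b"" if vlan_id is None else struct.pack("!HH", ETHERTYPE_VLAN, vlan_id)
    eth = dst + src + tag + struct.pack("!H", ETHERTYPE_ARP)
    arp = struct.pack(ARP_FORMAT, 1, 0x0800, 6, 4, oper,
                      src, bytes(spa), b"\x00" * 6, bytes(tpa))
    return eth + arp


frames = [
    make_frame(BROADCAST, 1, (10, 0, 0, 5), (10, 0, 0, 1)),
    make_frame(bytes.fromhex("020000000002"), 2, (10, 0, 0, 7), (10, 0, 0, 5), vlan_id=100),
    make_frame(BROADCAST, 1, (10, 0, 0, 9), (10, 0, 0, 9)),
]
counts = Counter(classify_arp(f) for f in frames)
print(counts)
```

Dividing the per-label counts by the capture duration gives a pps figure for each category, which makes it easier to compare controllers.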
I also took a packet capture from a client connected to each controller (same SSID/same VLAN) and I got a ton of gratuitous ARPs and ARP announcements on the 8540 but only a single ARP request from the client to the default gateway and a corresponding response from the gateway. I checked that broadcast forwarding is disabled on the 8540s. I tried both on an 1815 and a 9130 on the 8540 and same thing.
We looked into our ARP and L2 cache timeouts and realized that the old 6500 router was set to 600 seconds for ARP timeout and 660 seconds for L2 cache (MAC address table) timeout. Both the 9500 switch and 9800 WLCs are using 4 hours for ARP timeout, with L2 cache not configured (I assume the default is 4 hours on that as well?). The 8540, however, only has 300 seconds configured for its ARP timeout. I'm not sure how to check/change the L2 cache timeout on an 8540.
TAC has advised us to make all the ARP timeout timers the same. As a first step, I'm leaning towards making the 8540 ARP timeout 4 hours as well then seeing how that changes things. Does anyone have recommendations for what they should be? Would DHCP lease time (2 hours), WLC/ISE session timeout (24 hours), etc. factor in what they should be? Keeping in mind we peak at around 40k clients, all client VLANs are shared across all 5 controllers, and most clients are on the VLAN with a single /16 subnet in it.
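As a rough way to compare the timeout values mentioned above, here is a back-of-the-envelope sketch. It assumes a crude steady-state model in which every cached entry is re-resolved once per ARP timeout; it ignores roaming, DHCP renewals, proxy ARP, and whatever the devices actually do to refresh entries, so treat the numbers as order-of-magnitude only. The function name `refresh_rate_pps` is hypothetical.

```python
def refresh_rate_pps(entries, timeout_s):
    """Average ARP refreshes per second if each entry is re-resolved once per timeout."""
    return entries / timeout_s


clients = 40_000  # peak client count quoted in the post
timeouts = {
    "8540 (300 s)": 300,
    "old 6500 (600 s)": 600,
    "9500/9800 (4 h)": 4 * 3600,
}

for label, timeout_s in timeouts.items():
    print(f"{label}: ~{refresh_rate_pps(clients, timeout_s):.1f} refreshes/s")
```

Under this model, a 300-second timeout implies roughly 133 refreshes per second across 40k entries, versus about 3 per second at 4 hours. That is still well short of the 1,500 to 2,000 pps seen from the 8540s, so the timeout mismatch alone would not obviously account for the difference.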
07-25-2024 04:37 PM - edited 08-14-2024 02:35 PM
TAC asked if we have our 8540s and 9800s in a mobility group, and if so, whether they share the same client VLANs. It's yes to both. They responded with this bug: https://bst.cloudapps.cisco.com/bugsearch/bug/CSCwc22005
They say the only way to avoid the behavior is to separate the client VLANs so the 8540s use one VLAN and the 9800s use another VLAN. I've inquired if splitting up the mobility group so the 8540s are in one group and the 9800s are in another group while the clients are still using the same VLAN for both would work.
@Leo Laohoo, if this is applicable, I wonder if it's contributing to any of the issues you've been having.
07-25-2024 08:23 PM
Thanks for sharing the Bug ID.
I'll need to do some digging around.
Our 8540 & 9800 are in the same Mobility Group and sharing the same VLAN but the number of clients doing inter-controller roaming is not in great numbers due to the geographical location of the APs.
08-14-2024 01:49 PM - edited 08-14-2024 02:36 PM
Update - TAC did say that splitting the mobility group could be another option. We attempted this by removing the 9800s from the 8540s' peer lists and vice versa without changing the mobility group name. Though it's at a much smaller scale due to being summertime at a university, we're still seeing the non-gateway/non-broadcast ARP replies coming out of the 8540s but not 9800s. So, it seems like that may not have fixed the issue. I'm waiting to hear back from TAC on next steps.