11-28-2022 05:22 AM - edited 11-28-2022 05:38 AM
Hello Cisco WLAN Community,
we are struggling with a new 9800-L-C HA anchor WLC running 17.3.4 that serves guests in our hospital.
The foreign WLCs are two 5520 WLCs and one 9800-80.
During peak times we see 3000+ users on this box. But the whole traffic can suddenly stop, and clients get stuck in the DHCP Discover phase. At that moment we see a lot of WLAN clients in the "IP Learn" state on this single point of failure.
We can clearly see from Wireshark traces taken on the adjacent switch that DHCP is working and replies are coming from the DHCP server. But the traces on the 9800-L-C itself only show DHCP Discover packets ?!?
See also the attached pictures showing low traffic on eduroam. By the way, this can also happen on @BayernWLAN, where you can still see high traffic and DHCP in both directions.
At the moment we recover from this by rebooting the 9800-L-C.
To offload traffic, I'm running two old 2504 guest WLCs in parallel as a workaround, bridging traffic coming from the 5520 WLCs.
But they can only host 1000 users, and 2504 WLCs can no longer talk mobility to the brand-new 9800-80.
A case is already open with Cisco TAC: SR 694582039, Helpdesk#14257100.
The 9800-L-C guest solution is built from two 9800-L-C boxes running in HA mode.
My question would be: how can I break the HA and go back to using the two physical boxes in parallel again?
I suspect that there is a performance problem on this box.
According to the datasheet, a performance license is available that raises the limits from 500 to 1000 APs and from 5000 to 10000 users. Do you think this would cure the problem?
In case of a license violation I would have expected an alarm, but the box starts misbehaving from one second to the next without warning.
Is anybody out there experiencing the same problem, or can anyone advise what to do?
Kind regards
Wini
11-28-2022 06:06 AM
This looks like an issue I recently had with stale entries in the internal DB. We are investigating it in depth with Cisco, as there is currently no workaround.
To see whether this is what is happening to you, check the output of "show wireless device-tracking database ip" for duplicated MAC addresses with different IPs (an Excel sheet helps to find them). The Cisco BU has shared a script with me that checks this along with some other outputs, but I prefer not to post it here.
If this is your case, I recommend opening a TAC case.
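For anyone who wants to do the duplicate check described above without Excel, here is a minimal sketch in Python. It is not the Cisco BU script. It assumes only that each relevant line of the saved CLI output contains a Cisco-style MAC (xxxx.xxxx.xxxx) and an IPv4 address somewhere on the line; the function name and the sample addresses are made up for illustration.

```python
import re
from collections import defaultdict

# Assumption: MACs appear in Cisco dotted notation and IPv4 addresses as
# dotted quads somewhere on the same line. Column positions are not assumed.
MAC_RE = re.compile(r"\b[0-9a-fA-F]{4}\.[0-9a-fA-F]{4}\.[0-9a-fA-F]{4}\b")
IPV4_RE = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")


def find_duplicate_macs(text):
    """Return {mac: [ipv4, ...]} for every MAC seen with more than one IPv4."""
    ips_by_mac = defaultdict(set)
    for line in text.splitlines():
        mac = MAC_RE.search(line)
        ip = IPV4_RE.search(line)
        if mac and ip:  # lines without both (headers, IPv6 rows) are skipped
            ips_by_mac[mac.group(0).lower()].add(ip.group(0))
    return {mac: sorted(ips) for mac, ips in ips_by_mac.items() if len(ips) > 1}


# Hypothetical input, NOT real controller output:
# print(find_duplicate_macs("10.0.0.5 aaaa.bbbb.cccc\n10.0.0.9 aaaa.bbbb.cccc"))
```

IPv6 rows are ignored on purpose, so a dual-stack client with one IPv4 and several IPv6 addresses is not reported as a duplicate.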
11-28-2022 06:15 AM
Just to add... if you want to break HA, you will have to remove the configuration, and both controllers will have to reboot. Once they come back online, you will need to configure the standby controller, as it will not have a configuration. Keep a console on both so you can see what is happening, just in case something breaks. I don't use HA in production; that is my preference from experience, and I only test HA in a lab environment. As for the performance license, it's always tricky with anchors, but there are also limits when it comes to client count vs. AP count. If you are not hitting that number, I would not worry about it, but if you are close, then yes, I would probably purchase that license.
11-29-2022 05:44 AM - edited 11-29-2022 05:45 AM
Hello Scott and JPavonM,
during troubleshooting today I also found "dirty VLANs" on the 9800-L ?!?
WLC-9800-Guest#show wireless vlan details
Vlan inforamation: <- Here is a typo in Your software by the way. Please correct.
-------------------------------------------------------------------
DirtyTime displays time remaining for dirty vlan to become non-dirty
Dirty-Counter displays the number of times a vlan has become dirty
-------------------------------------------------------------------
Process  Vlan  Dirty  Dirty-Counter  DirtyTime(mm:ss)
-------------------------------------------------------------------
0        760   Yes    1206           28:44
0        763   Yes    1186           29:57
0        764   No     18             0
0        4080  No     0              0
It looks like the system marks VLANs dirty and blocks DHCP for 30 minutes.
Here is the info from the Config Guide:
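To keep an eye on this over time, the rows of the output above can be picked apart with a few lines of Python. This is only a sketch: it assumes the row layout is exactly the five columns shown above (Process, Vlan, Dirty, Dirty-Counter, DirtyTime), and the function name is my own.

```python
import re

# One data row: process, vlan, Yes/No, dirty counter, remaining time (mm:ss or 0).
ROW_RE = re.compile(r"^\s*(\d+)\s+(\d+)\s+(Yes|No)\s+(\d+)\s+(\S+)\s*$")


def dirty_vlans(text):
    """Return one dict per VLAN that is currently marked dirty."""
    result = []
    for line in text.splitlines():
        m = ROW_RE.match(line)  # header and separator lines do not match
        if m and m.group(3) == "Yes":
            result.append({
                "vlan": int(m.group(2)),
                "dirty_count": int(m.group(4)),
                "remaining": m.group(5),
            })
    return result
```

Fed with the four rows above, it returns VLANs 760 and 763 together with their counters.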
Information About VLAN Groups
Whenever a client connects to a wireless network (WLAN), the client is placed in a VLAN that is associated with the policy profile mapped to the WLAN. In a large venue, such as an auditorium, a stadium, or a conference room where there are numerous wireless clients, having only a single WLAN to accommodate many clients might be a challenge.
The VLAN group feature uses a single policy profile that can support multiple VLANs. The clients can get assigned to one of the configured VLANs. This feature maps a policy profile to a single VLAN or multiple VLANs using the VLAN groups. When a wireless client associates to the WLAN, the VLAN is derived by an algorithm based on the MAC address of the wireless client. A VLAN is assigned to the client and the client gets the IP address from the assigned VLAN.
The system marks VLAN as Dirty for 30 minutes when the clients are unable to receive IP addresses using DHCP. The system might not clear the Dirty flag from the VLAN even after 30 minutes for a VLAN group. After 30 minutes, when the VLAN is marked non-dirty, new clients in the IP Learn state can get assigned with IP addresses from the VLAN if free IPs are available in the pool and DHCP scope is defined correctly. This is the expected behavior because the timestamp of each interface has to be checked to see if it is greater than 30 minutes, due to which there is a lag of 5 minutes for the global timer to expire.
The controller marks a VLAN as dirty when clients are unable to receive an IP address using DHCP. The VLAN interface is marked as dirty based on the Non-Aggressive method: only one failure is counted per association per client, and the controller marks the VLAN as a dirty interface only when three or more clients fail.
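To make the rule quoted above concrete, here is a toy model in Python. It is my own reading of the guide text, not the controller's actual implementation: the class name, the use of distinct client MACs as the counting unit, and the reset of the failure set once the VLAN turns dirty are all assumptions.

```python
from dataclasses import dataclass, field

DIRTY_THRESHOLD = 3      # "three or more clients fail" (from the guide text)
DIRTY_SECONDS = 30 * 60  # dirty for 30 minutes (from the guide text)


@dataclass
class VlanState:
    failed_clients: set = field(default_factory=set)
    dirty_until: float = 0.0

    def record_dhcp_failure(self, client_mac, now):
        # Non-aggressive: a client counts once, however often it retries.
        self.failed_clients.add(client_mac)
        if len(self.failed_clients) >= DIRTY_THRESHOLD:
            self.dirty_until = now + DIRTY_SECONDS
            self.failed_clients.clear()  # assumption: counting starts over

    def is_dirty(self, now):
        return now < self.dirty_until
```

In this model two clients that keep retrying never make the VLAN dirty; the third distinct failing client does, and the flag then holds for 30 minutes.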
The colleague who configured this box has already left us. He also configured VLAN groups; I don't know why.
But according to the Config Guide, the system might not clear the Dirty flag even after 30 minutes for a VLAN group.
My question:
How can I get rid of "dirty VLANs" on the 9800-L WLC?
Thank you for your help
Kind regards
Wini
11-29-2022 06:43 AM
Can you check whether the DHCP scope is full on those VLANs? I don't think you can really clear them, especially if devices are still failing here and there to get a DHCP address. Maybe it's the lease time set on the DHCP server; you can try to lower that, but not too low either. If the DHCP scope is getting full and 100% utilized, you will need to rethink and decide how to increase it. Maybe one big subnet for guests is a better choice.
11-30-2022 12:52 AM - edited 11-30-2022 03:07 AM
Hello Scott,
thank you for your remarks regarding DHCP ranges.
The internal ranges for "eduroam", used for worldwide roaming in universities and hospitals, are big enough.
The external DHCP range for "@BayernWLAN", a free Internet service of the Bavarian government, is also big enough.
What we see is trouble with mobiles using "Private MAC Addressing" to hide their identity. The same devices appear with different MACs during the day on our big campus and eat up several IPs while leaving and entering different buildings.
I can also see on the 9800-L that Apple clients moving between "@BayernWLAN" and "eduroam" try to renew their obsolete "@BayernWLAN" IP address while connecting to "eduroam" instead of asking for a new eduroam IP first.
What about the script that JPavonM mentioned together with the command "show wireless device-tracking database ip" to look for duplicates with different IPs?
Is this script generally available somewhere?
I used Notepad++ to clean up and sort the output instead.
I see a lot of duplicates on our 9800-L guest WLC !!! The same MAC shows up with IPv4 and IPv6 addresses. See attached picture.
It looks like Windows clients create this mess.
I haven't activated IPv6 on the 9800-L-C in the SVIs of those VLANs.
How can I get rid of the IPv6 entries in the 9800-L database to avoid hitting the 5000-user maximum?
Are these IPv6 addresses generated by Windows itself, or do they come from our DHCP server?
Kind regards
Wini
11-30-2022 05:03 AM
17.3.4 - consider 17.3.6 + APSP or 17.6.4 just to make sure you eliminate any known bugs which have already been fixed.
I suspect your dirty vlans are just a side effect of whatever goes wrong when clients stop getting IP addresses.
I wouldn't call the v6 entries duplicates though - it's perfectly valid for a device to have both v4 and (multiple) v6 addresses (dual stack).
Those are probably locally assigned addresses not from DHCP server (but I have not looked at the details).
Check that the load is evenly spread across your wncd (controller) processes? Your site tags are used to achieve this balancing - detailed in best practices guide. Use "show process cpu platform sorted | incl wncd" to check the wncd load.
If you have everything on a single wncd and it hits 100% then things will start to go wrong.
11-30-2022 05:15 AM - edited 11-30-2022 05:18 AM
Hello Richard,
thank you very much for your opinion on this, the nice command, and the signature with all the relevant links to 9800 material.
I used the command and found only one process:
WLC-9800-Guest#show process cpu platform sorted | incl wncd
24196 23606 4% 4% 4% S 421328 wncd_0
But I also tried
WLC-9800-Guest#show process cpu platform
and found high CPU load in one out of ten cycles:
CPU utilization for five seconds: 33%, one minute: 34%, five minutes: 33%
Core 0: CPU utilization for five seconds: 5%, one minute: 6%, five minutes: 8%
Core 1: CPU utilization for five seconds: 6%, one minute: 8%, five minutes: 8%
Core 2: CPU utilization for five seconds: 5%, one minute: 6%, five minutes: 8%
Core 3: CPU utilization for five seconds: 50%, one minute: 54%, five minutes: 51%
Core 4: CPU utilization for five seconds: 90%, one minute: 95%, five minutes: 94%
Core 5: CPU utilization for five seconds: 46%, one minute: 47%, five minutes: 43%
Core 6: CPU utilization for five seconds: 49%, one minute: 55%, five minutes: 52%
Core 7: CPU utilization for five seconds: 99%, one minute: 99%, five minutes: 99%
Core 8: CPU utilization for five seconds: 0%, one minute: 0%, five minutes: 0%
Core 9: CPU utilization for five seconds: 0%, one minute: 0%, five minutes: 0%
Core 10: CPU utilization for five seconds: 0%, one minute: 0%, five minutes: 0
Interesting also:
23886 23062 337% 356% 344% S 350332 ucode_pkt_PPE0
More than 100%. What does this mean?
Is this high load due to the bug "IOSXE SW-DP Router Shows High CPU in ucode_pkt_PPE0" (CSCve73211)?
Kind regards
Wini
11-30-2022 06:02 AM
Ah so 9800-L only runs a single wncd! Each wncd runs on a single CPU core. But at 4% it doesn't look like that's your problem.
I wouldn't worry about the >100% itself - that's a consequence of processes running across more than 1 core so they can effectively use more than 100% = 1 core.
But that ucode does look high and you have some cores running very busy too - I would ask TAC about that. That bug is saying it's normal for it to be higher (and is Terminated so not something that will change) but I don't know what the normal range would be.
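As a rough illustration of the arithmetic above: a per-process figure of 337% simply means about 3.4 cores' worth of CPU. The sketch below (Python, function names are my own) does that conversion and also lists the cores whose five-second value is at or above a threshold. It assumes the per-core lines look exactly like the ones pasted earlier in this thread.

```python
import re

# Matches lines like "Core 7: CPU utilization for five seconds: 99%, ..."
CORE_RE = re.compile(r"Core (\d+): CPU utilization for five seconds:\s*(\d+)%")


def busy_cores(text, threshold=90):
    """Return the core numbers whose five-second utilization is >= threshold."""
    return [int(core) for core, pct in CORE_RE.findall(text) if int(pct) >= threshold]


def cores_in_use(process_pct):
    """Convert a per-process CPU percentage into 'cores fully busy'."""
    return process_pct / 100.0
```

With the output pasted above, busy_cores() picks out cores 4 and 7, and cores_in_use(337) gives about 3.4.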
01-25-2023 05:43 AM
Hello Cisco-WLAN-experts,
our 9800-L-C used as guest WLC has been upgraded to version 17.3.5b in the meantime.
After some quiet days around year's end and some cosmetic changes such as deleting SVIs and VLAN groups, another crash happened last week at around 3200 users. The sudden drop in consumed bandwidth on the WAN link looks as if the box is no longer transferring packets at all in the failure situation, not only DHCP packets.
After a reload we are running with around 4200 users, and I can see around 100 "Stale" entries in the DB
using the command show wireless device-tracking database ip | incl Stale
The majority of these entries are "Stale IPv4 DHCP" entries. The rest are "Stale ARP" entries.
Does anyone know about a possible DB problem on the platform?
What are the reasons for "Stale" entries, and how can we reduce them?
And the big question: which commands should be used in a failure situation to isolate the problem?
Where are the big troubleshooters on the new Cisco 9800 platform?
Kind regards
Wini
01-25-2023 06:59 AM
I think your only option is TAC.
Hopefully over time they'll provide more detailed troubleshooting guides but the 9800 documentation overall seems woefully inadequate to me at the moment. Some seems to have been copied straight from AireOS and some is even missing info that was in the AireOS documentation.
03-19-2024 03:31 PM
CSCwh75934
CSCwh10989