Solved: Re: 9120 AP getting disconnected from WLC (9800-CL)

sroic · ‎09-08-2022

Hello,

we have an issue with our APs getting disconnected from WLC with this message:

The AP "AP-NAME" is CAPWAP disconnected from WLC and is no longer joined to it. The WLC has missed the AP’s CAPWAP heartbeat message.

They connect immediately back but it raises a lot of false positive alerts which makes it hard to recognize when something is actually down. We weren't even aware of this until we connected WLC to the DNA.

From what we see in AP log it seems this is correlated with "Bad Address" message. This one should be caused by "incompatible" DHCP and it's a familiar bug on WLC:
https://community.cisco.com/t5/wireless/catalyst-9800-cl-microsoft-dhcp-bad-address-entries/td-p/4528643
https://community.cisco.com/t5/cisco-bug-discussions/ap-or-controller-reload-causing-windows-dhcp-bad-address/td-p/4608249

Although we don't run windows DHCP but Forti firewall as DHCP server. And we have same setup on several locations but it seems it's happening only on one. Also, don't see the alternative to missing command "config ap dhcp release-override" on 9800-CL and we still don't understand why would that even cause an issue with hearbeat on the WLC.

We tried looking into increasing some timers/heartbeats alerts to avoid this issue but didn't find appropriate method. In AP join profile > Capwap > HA > there are is a "Heartbeat timer" set up to 30 sec by default. Was thinking about raising this but don't really understand how would it help, when the lease expires I assume it will trigger the same alert.

Anyone has same issues, some possible workaround?

Bad address log which correspondents to time when the alert comes in:

Sep 8 11:50:22 kernel: [*09/08/2022 11:50:22.3001] wl: Bad Address
Sep 8 11:50:22 kernel: [*09/08/2022 11:50:22.3401] DOT11_DRV[1]: DFS CAC timer enabled time 60
Sep 8 11:50:22 kernel: [*09/08/2022 11:50:22.3501] Enabling Channel and channel width Switch Announcement on current channel
Sep 8 11:50:22 kernel: [*09/08/2022 11:50:22.3581] DOT11_DRV[1]: set_dfs Channel set to 100/40, CSA count 6
Sep 8 11:50:22 kernel: [*09/08/2022 11:50:22.4901] set cleanair [slot0][band0] enabled
Sep 8 11:50:22 kernel: [*09/08/2022 11:50:22.5221] set cleanair [slot0][band1] enabled
Sep 8 11:50:22 kernel: [*09/08/2022 11:50:22.5861] set cleanair [slot1][band1] enabled
Sep 8 11:50:22 kernel: [*09/08/2022 11:50:22.7941] wl: Bad Address
Sep 8 11:50:22 kernel: [*09/08/2022 11:50:22.8591] wl: Bad Address
Sep 8 11:50:22 kernel: [*09/08/2022 11:50:22.8971] changed to DFS channel 100l, running CAC for 60 seconds.
Sep 8 11:50:23 kernel: [*09/08/2022 11:50:23.0801] wl: Bad Address
Sep 8 11:50:23 kernel: [*09/08/2022 11:50:23.3151] wl: Bad Address
Sep 8 11:50:23 kernel: [*09/08/2022 11:50:23.6881] wl: Bad Address
Sep 8 11:50:26 kernel: [*09/08/2022 11:50:26.7241] wl: Bad Address
Sep 8 11:50:27 kernel: [*09/08/2022 11:50:27.0571] wl: Bad Address
Sep 8 11:50:29 kernel: [*09/08/2022 11:50:29.6691] wl: Bad Address
Sep 8 11:50:29 kernel: [*09/08/2022 11:50:29.9301] wl: Bad Address

Rich R · ‎09-14-2022

Back to your original problem which you suspect might be to do with DHCP:
- get a debug (radioactive trace) on the AP connection to see exactly what's happening with the CAPWAP state
- get a packet capture of the CAPWAP at the same time on the WLC.
- get a packet capture of the DHCP on the AP port
That will allow you to see whether it's even related to DHCP or not and understand what the problem is then take it from there. If the problem is the way your DHCP server deals with the triple release then you'll need to use another server or fix the server. From the description I don't think it's related to the triple release problem and if it's only one site affected then it's more likely packet loss somewhere. To prove that get a packet capture of the CAPWAP at the AP port at the same time as you capture it on the WLC and you'll see if any packets are going astray between the AP and WLC when you see the problem happen. If there is packet loss then you'll need to trace the path and work out where that's happening and fix it.

ps. we recently had a site where DDOS "protection" was rate limiting the CAPWAP causing APs to drop off ... (thresholds set far too low by people not understanding what CAPWAP is).

------------------------------
Please click Helpful if this post helped you and Select as Solution (drop down menu at top right of this reply) if this answered your query.
------------------------------
TAC recommended codes for AireOS WLC's and TAC recommended codes for 9800 WLC's
Best Practices for AireOS WLC's, Best Practices for 9800 WLC's and Cisco Wireless compatibility matrix
Check your 9800 WLC config with Wireless Config Analyzer using "show tech wireless" output or "config paging disable" then "show run-config" output on AireOS and use Wireless Debug Analyzer to analyze your WLC client debugs
Field Notice: FN63942 APs and WLCs Fail to Create CAPWAP Connections Due to Certificate Expiration
Field Notice: FN72424 Later Versions of WiFi 6 APs Fail to Join WLC - Software Upgrade Required
Field Notice: FN72524 IOS APs stuck in downloading state after 4 Dec 2022 due to Certificate Expired
- Fixed in 8.10.196.0, latest 9800 releases, 8.5.182.12 (8.5.182.13 for 3504) and 8.5.182.109 (IRCM, 8.5.182.111 for 3504)
Field Notice: FN70479 AP Fails to Join or Joins with 1 Radio due to Country Mismatch, RMA needed
How to avoid boot loop due to corrupted image on Wave 2 and Catalyst 11ax Access Points (CSCvx32806)
Field Notice: FN74035 - Wave2 APs DFS May Not Detect Radar After Channel Availability Check Time
Leo's list of bugs affecting 2800/3800/4800/1560 APs

View solution in original post

marce1000 · ‎09-08-2022

- Start by reviewing the WLC (9800-CL) configuration with the CLI command : show tech wireless , have the output analyzed by https://cway.cisco.com/tools/WirelessAnalyzer/ , please note do not use classical show tech-support (short version) , use the command denoted in green for Wireless Analyzer. Checkout all advisories! (first)

M.

-- Each morning when I wake up and look into the mirror I always say ' Why am I so brilliant ? '
When the mirror will then always repond to me with ' The only thing that exceeds your brilliance is your beauty! '

Leo Laohoo · ‎09-08-2022

What firmware is the controller on?

sroic · ‎09-09-2022

@marce1000- I have run it, we have one error and couple of warnings from the APs that just deployed and not configured fully yet. All others are info messages from production APs with nothing really related to this issue. I can paste obscured output if needed.

@Leo Laohoo- we are currently at 17.6.3, we were on a 17.3.4 till few weeks ago. Same issue was on that version but we have hit a separate bug. That one also wasn't fixed upgrading fw and our TAC case suggested we upgrade again to 17.6.4. So soon we will do that also, but I'm pretty confident it won't solve this issue, maybe just the other one

Leo Laohoo · ‎09-09-2022

Ask TAC if the bug is fixed in 17.6.4. If it is, go to 17.6.4.

sroic · ‎09-09-2022

TAC said it will fix the other bug we have which is more serious because it crashes APs. So we have to go to 17.6.4 anyway but I doubt it will fix this issue. If it doesn't we will open a separate TAC, don't know what other options we have

sroic · ‎09-09-2022

@marce1000 I went to fix the issue from config analyzer by reconfiguring those new APs I though were causing the error message. But after I did and rerun the analyzer some warnings are gone but this error is still here:

20034

CAPWAP: Invalid configured controller IP.

APS: More than 10 APs affected, use standalone tool for more details

Action: The IP address configured in the AP controller list, does not match the address of the current controller. Possible invalid configuration that should be corrected

I thought my APs are missing controller names and IPs in this menu:

But it seems this message is about something else. What is exactly "AP controller list", how can I check the values there and update them? And what is this "standalone tool" they mention for more details. Didn't manage to find explanation how to actually fix this error

marce1000 · ‎09-09-2022

>...The IP address configured in the AP controller list,

I am not sure about that , this could be related to N+1 HA setups, if that is not needed check AP configuration in the controller and remove all controller addresses (for N+1 availability)

M.

-- Each morning when I wake up and look into the mirror I always say ' Why am I so brilliant ? '
When the mirror will then always repond to me with ' The only thing that exceeds your brilliance is your beauty! '

marce1000 · ‎09-09-2022

- FYI : https://bst.cloudapps.cisco.com/bugsearch/bug/CSCvy30144

M.

-- Each morning when I wake up and look into the mirror I always say ' Why am I so brilliant ? '
When the mirror will then always repond to me with ' The only thing that exceeds your brilliance is your beauty! '

sroic · ‎09-09-2022

We actually use only N+1 setup, no real (L2) HA between primary and secondary WLC, so we need this setting but it's a bit confusing what would be wrong in our setup here.

Regarding this bug, I have already stumbled upon it but it's only about wrong AP name when this alert comes in. Not really a bug with the alerts themselves. Which makes sense, DNA here is reporting correctly that the APs got disconnected just can't find out why.

Thanks for the feedback anyway, we will try to upgrade fw asap and then probably open a new TAC case for this issue

marce1000 · ‎09-09-2022

>....We actually use only N+1 setup,

In that case make sure that all access points have primary and secondary controller configured with correct ip addresses from the involved controllers

M.

-- Each morning when I wake up and look into the mirror I always say ' Why am I so brilliant ? '
When the mirror will then always repond to me with ' The only thing that exceeds your brilliance is your beauty! '

Rich R · ‎09-14-2022

Back to your original problem which you suspect might be to do with DHCP:
- get a debug (radioactive trace) on the AP connection to see exactly what's happening with the CAPWAP state
- get a packet capture of the CAPWAP at the same time on the WLC.
- get a packet capture of the DHCP on the AP port
That will allow you to see whether it's even related to DHCP or not and understand what the problem is then take it from there. If the problem is the way your DHCP server deals with the triple release then you'll need to use another server or fix the server. From the description I don't think it's related to the triple release problem and if it's only one site affected then it's more likely packet loss somewhere. To prove that get a packet capture of the CAPWAP at the AP port at the same time as you capture it on the WLC and you'll see if any packets are going astray between the AP and WLC when you see the problem happen. If there is packet loss then you'll need to trace the path and work out where that's happening and fix it.

ps. we recently had a site where DDOS "protection" was rate limiting the CAPWAP causing APs to drop off ... (thresholds set far too low by people not understanding what CAPWAP is).

------------------------------
Please click Helpful if this post helped you and Select as Solution (drop down menu at top right of this reply) if this answered your query.
------------------------------
TAC recommended codes for AireOS WLC's and TAC recommended codes for 9800 WLC's
Best Practices for AireOS WLC's, Best Practices for 9800 WLC's and Cisco Wireless compatibility matrix
Check your 9800 WLC config with Wireless Config Analyzer using "show tech wireless" output or "config paging disable" then "show run-config" output on AireOS and use Wireless Debug Analyzer to analyze your WLC client debugs
Field Notice: FN63942 APs and WLCs Fail to Create CAPWAP Connections Due to Certificate Expiration
Field Notice: FN72424 Later Versions of WiFi 6 APs Fail to Join WLC - Software Upgrade Required
Field Notice: FN72524 IOS APs stuck in downloading state after 4 Dec 2022 due to Certificate Expired
- Fixed in 8.10.196.0, latest 9800 releases, 8.5.182.12 (8.5.182.13 for 3504) and 8.5.182.109 (IRCM, 8.5.182.111 for 3504)
Field Notice: FN70479 AP Fails to Join or Joins with 1 Radio due to Country Mismatch, RMA needed
How to avoid boot loop due to corrupted image on Wave 2 and Catalyst 11ax Access Points (CSCvx32806)
Field Notice: FN74035 - Wave2 APs DFS May Not Detect Radar After Channel Availability Check Time
Leo's list of bugs affecting 2800/3800/4800/1560 APs

sroic · ‎09-15-2022

Hi rrudling, thank you for your feedback. Until we upgrade I would like to get those captures you recommended but would need some help.

- get a debug (radioactive trace) on the AP connection to see exactly what's happening with the CAPWAP state

So for radioactive trace I need to specify some MAC/IP address in order to record it. Do I specify radio MAC of some AP I'm interested?

- get a packet capture of the CAPWAP at the same time on the WLC.

This one I can trigger manually (Troubleshooting > Packet Capture option I assume) but I don't know when the issue will occur. It happens randomly on a set of 30 APs at the location.

- get a packet capture of the DHCP on the AP port

This one I didn't manage to find out how to do. Cannot do "#ap packet capture ..." command as found out online. Any advice would be appreciated.

Also in the WLC Troubleshooting sections I have some "Debug Bundle", did you meant to use that one maybe?

I managed to do a radioactive trace (used radio MAC of one AP) and here is a short part of log from one AP where I found heartbeat expiration. What is weird is that I have an alert in DNA from today at 3:35 AM saying the AP got disconnected and missed heartbeat message. But I did a trace for last 7 days and there is no events of this type except the one at 10th of September

2022/09/10 11:23:48.008334 {wncd_x_R0-0}{1}: [cleanair-airquality] [16599]: (ERR): 1006.edef.8320num of interferer count in aq:0 max:5
2022/09/10 11:38:48.139823 {wncd_x_R0-0}{1}: [cleanair-airquality] [16599]: (ERR): 1006.edef.8320num of interferer count in aq:0 max:5
2022/09/10 12:08:48.836230 {wncd_x_R0-0}{1}: [cleanair-airquality] [16599]: (ERR): 1006.edef.8320num of interferer count in aq:0 max:5
2022/09/10 12:23:48.225003 {wncd_x_R0-0}{1}: [cleanair-airquality] [16599]: (ERR): 1006.edef.8320num of interferer count in aq:0 max:5
2022/09/10 14:38:49.199809 {wncd_x_R0-0}{1}: [cleanair-airquality] [16599]: (ERR): 1006.edef.8320num of interferer count in aq:0 max:5
2022/09/10 16:08:48.928332 {wncd_x_R0-0}{1}: [cleanair-airquality] [16599]: (ERR): 1006.edef.8320num of interferer count in aq:0 max:5
2022/09/10 16:53:49.082408 {wncd_x_R0-0}{1}: [cleanair-airquality] [16599]: (ERR): 1006.edef.8320num of interferer count in aq:0 max:5
2022/09/10 17:20:12.642545 {wncd_x_R0-0}{1}: [ap-join-info-db] [16599]: (note): MAC: 1006.edef.8320 AP disconnect initiated. Reason: Heart beat timer expiry, Phase: Run
2022/09/10 17:20:12.642572 {wncd_x_R0-0}{1}: [ewlc-infra-evq] [16599]: (ERR): 1006.edef.8320 CAPWAPAC_HEARTBEAT_EXPIRY
2022/09/10 17:20:12.642573 {wncd_x_R0-0}{1}: [capwapac-smgr-sess-fsm] [16599]: (note): Mac: 1006.edef.8320 Session-IP: 172.17.1.62[5273] 10.99.100.8[5246] Last Control Packet received 91 seconds ago.
2022/09/10 17:20:12.642574 {wncd_x_R0-0}{1}: [capwapac-smgr-sess-fsm] [16599]: (note): Mac: 1006.edef.8320 Session-IP: 172.17.1.62[5273] 10.99.100.8[5246] Last Data Keep Alive Packet information not available. Data session was not established
2022/09/10 17:20:12.642792 {wncd_x_R0-0}{1}: [ewlc-dtls-sess] [16599]: (note): Remote Host: 172.17.1.62[5273] MAC: 1006.edef.8320 DTLS session destroy
2022/09/10 17:20:12.642946 {wncd_x_R0-0}{1}: [ewlc-dtls-sess] [16599]: (note): Remote Host: 172.17.1.62[5273] MAC: 1006.edef.8320 DTLS session destroy
2022/09/10 17:20:12.642987 {wncd_x_R0-0}{1}: [capwapac-smgr-sess] [16599]: (note): MAC: 1006.edef.8320 CAPWAP-IDB-Delete: AP Tunnel delete send success. IFID : 90000064
2022/09/10 17:20:12.643104 {wncd_x_R0-0}{1}: [capwapac-smgr-sess] [16599]: (note): MAC: 1006.edef.8320 FMAN CAPWAP AP Tunnel Delete Successful, IFID : 90000064
2022/09/10 17:20:12.643624 {wncd_x_R0-0}{1}: [capwapac-smgr-sess] [16599]: (note): Mac: 1006.edef.8320 Session-IP: 172.17.1.62[5273] 10.99.100.8[5246] AP DISCONNECT msg sent to loadbalancer
2022/09/10 17:20:12.643755 {wncd_x_R0-0}{1}: [apmgr-bssid] [16599]: (note): MAC: 1006.edef.8320 Delete bssid opctxt for Radio:0 BSSID idx:0, IFID : 0, is successful.
2022/09/10 17:20:12.643793 {wncd_x_R0-0}{1}: [apmgr-bssid] [16599]: (note): MAC: 1006.edef.8320 Delete bssid opctxt for Radio:0 BSSID idx:1, IFID : 0, is successful.
2022/09/10 17:20:12.643827 {wncd_x_R0-0}{1}: [apmgr-bssid] [16599]: (note): MAC: 1006.edef.8320 Delete bssid opctxt for Radio:0 BSSID idx:2, IFID : 0, is successful.
2022/09/10 17:20:12.643857 {wncd_x_R0-0}{1}: [apmgr-bssid] [16599]: (note): MAC: 1006.edef.8320 Delete bssid opctxt for Radio:0 BSSID idx:3, IFID : 0, is successful.
2022/09/10 17:20:12.643903 {wncd_x_R0-0}{1}: [apmgr-bssid] [16599]: (note): MAC: 1006.edef.8320 Delete bssid opctxt for Radio:1 BSSID idx:0, IFID : 0, is successful.
2022/09/10 17:20:12.643959 {wncd_x_R0-0}{1}: [apmgr-bssid] [16599]: (note): MAC: 1006.edef.8320 Delete bssid opctxt for Radio:1 BSSID idx:1, IFID : 0, is successful.
2022/09/10 17:20:12.643992 {wncd_x_R0-0}{1}: [apmgr-bssid] [16599]: (note): MAC: 1006.edef.8320 Delete bssid opctxt for Radio:1 BSSID idx:2, IFID : 0, is successful.
2022/09/10 17:20:12.644020 {wncd_x_R0-0}{1}: [apmgr-bssid] [16599]: (note): MAC: 1006.edef.8320 Delete bssid opctxt for Radio:1 BSSID idx:3, IFID : 0, is successful.
2022/09/10 17:20:12.644024 {wncd_x_R0-0}{1}: [sanet-shim-miscellaneous] [16599]: (ERR): capwap interface ifid is NULL. dropping the terminate callback:
2022/09/10 17:20:12.644390 {wncd_x_R0-0}{1}: [apmgr-ap-global] [16599]: (note): MAC: 1006.edef.8320 AP SM Purge. AP deregister complete
2022/09/10 17:20:12.652483 {wncd_x_R0-0}{1}: [capwapac-smgr-srvr] [16599]: (ERR): Failed to get capwap session from remote tuple
2022/09/10 17:20:12.653415 {wncmgrd_R0-0}{1}: [ewlc-infra-evq] [16194]: (note): Msg type :mesg->msgtype Priority:1
2022/09/10 17:26:24.636820 {wncmgrd_R0-0}{1}: [capwapac-discovery] [16194]: (note): MAC: 1006.edef.8320 IP:172.17.1.62[5276], Discovery Request received
2022/09/10 17:26:24.636843 {wncmgrd_R0-0}{1}: [capwapac-discovery] [16194]: (note): MAC: 1006.edef.8320 EWLC TAGS payload is not present in discovery request
2022/09/10 17:26:24.637076 {wncmgrd_R0-0}{1}: [capwapac-discovery] [16194]: (note): MAC: 1006.edef.8320 Public IP learnt is TRUE, public IP discovery is FALSE, private IP discovery is TRUE.
2022/09/10 17:26:24.637154 {wncmgrd_R0-0}{1}: [capwapac-discovery] [16194]: (note): MAC: 1006.edef.8320 IP:172.17.1.62[5276], Discovery Response sent
2022/09/10 17:26:24.639373 {wncmgrd_R0-0}{1}: [capwapac-discovery] [16194]: (note): MAC: 1006.edef.8320 IP:172.17.1.62[5276], Discovery Request received
2022/09/10 17:26:24.639386 {wncmgrd_R0-0}{1}: [capwapac-discovery] [16194]: (note): MAC: 1006.edef.8320 EWLC TAGS payload is not present in discovery request
2022/09/10 17:26:24.639455 {wncmgrd_R0-0}{1}: [capwapac-discovery] [16194]: (note): MAC: 1006.edef.8320 Public IP learnt is TRUE, public IP discovery is FALSE, private IP discovery is TRUE.
2022/09/10 17:26:24.639510 {wncmgrd_R0-0}{1}: [capwapac-discovery] [16194]: (note): MAC: 1006.edef.8320 IP:172.17.1.62[5276], Discovery Response sent
2022/09/10 17:26:34.064055 {wncd_x_R0-0}{1}: [capwapac-smgr-srvr] [16599]: (note): MAC: 1006.edef.8320 DTLS session create callback received.
2022/09/10 17:26:43.940461 {wncd_x_R0-0}{1}: [capwapac-smgr-sess] [16599]: (note): MAC: 1006.edef.8320 Received CAPWAP join request

Chad S · ‎09-16-2022

How are you doing your site tags? I believe it has been resolved in more recent versions of code, but older versions definitely had an issue where if too many APs or APs that had too many clients were all piled into the same site tag, then all of that processing would have to be done by the same WNCD process.

The command below can help determine if your WNCD processes are getting hammered:

show process cpu platform sorted | incl wncd

sroic · ‎09-20-2022

Hi Chad, I totally missed your reply, thanks for the advice. We are still rolling this out by offices so I wouldn't say we have reached those limits. Currently only around 40 APs running and the output of this command looks ok (early working hours here):
wlc-a#show processes cpu platform sorted | include wncd
16599 16264 1% 1% 1% S 480196 wncd_0

Meanwhile we have upgraded one WLC to 17.6.4 and moved some APs there to see if it helps fix the issue. Also we had an incident with our ISP at that troublesome location so we rerouted traffic a bit differently and I'm starting to think @Rich R was right that this doesn't have to do anything with DHCP but instead some packet loss was happening in their mpls network. Will update status after few days