Re: Corp users are getting stuck in 8021X_REQD

Noovi · ‎03-16-2023

Hello Team,

We have a Cisco 5520 WLC, which we upgraded to image version 8.10.183.0.

After the upgrade, a few regions started reporting connection problems, and our investigation found that users are getting stuck in the 8021X_REQD state.

When we checked the logs in ISE, we found errors like 'supplicant stopped responding to ISE'.

We have already worked with Cisco TAC on both the wireless and the ISE side, but they have no findings so far.

Has anyone seen similar issues?

marce1000 · ‎03-16-2023

- Below is the output of your attached debug file after processing it with https://cway.cisco.com/wireless-debug-analyzer/ (I used the Show All flag). I would look into things like disabling fast roaming settings on the WLAN, if applicable, and updating the client Wi-Fi (NIC) drivers if they are not the latest:

Mar 16 19:35:51.343	*Dot1x_NW_MsgTask_3	WLC/AP is sending EAP-Identity-Request to the client
Mar 16 19:35:51.382	*Dot1x_NW_MsgTask_3	Client sent EAP-Identity-Response to WLC/AP
Mar 16 19:35:51.382	*aaaQueueReader	Radius request with ID 150 sent to 172.28.139.138.
Mar 16 19:35:51.384	*radiusTransportThread	Radius request with ID 150 sent to 172.28.139.138.
Mar 16 19:35:51.427	*aaaQueueReader	Radius request with ID 151 sent to 172.28.139.138.
Mar 16 19:35:51.434	*radiusTransportThread	Radius request with ID 151 sent to 172.28.139.138.
Mar 16 19:35:51.502	*aaaQueueReader	Radius request with ID 152 sent to 172.28.139.138.
Mar 16 19:35:51.503	*radiusTransportThread	Radius request with ID 152 sent to 172.28.139.138.
Mar 16 19:35:51.563	*aaaQueueReader	Radius request with ID 153 sent to 172.28.139.138.
Mar 16 19:35:51.564	*radiusTransportThread	Radius request with ID 153 sent to 172.28.139.138.
Mar 16 19:35:51.623	*aaaQueueReader	Radius request with ID 154 sent to 172.28.139.138.
Mar 16 19:35:51.624	*radiusTransportThread	Radius request with ID 154 sent to 172.28.139.138.
Mar 16 19:35:51.686	*aaaQueueReader	Radius request with ID 155 sent to 172.28.139.138.
Mar 16 19:35:51.688	*radiusTransportThread	Radius request with ID 155 sent to 172.28.139.138.
Mar 16 19:35:51.731	*aaaQueueReader	Radius request with ID 156 sent to 172.28.139.138.
Mar 16 19:35:51.733	*radiusTransportThread	Radius request with ID 156 sent to 172.28.139.138.
Mar 16 19:35:51.864	*aaaQueueReader	Radius request with ID 157 sent to 172.28.139.138.
Mar 16 19:35:51.865	*radiusTransportThread	Radius request with ID 157 sent to 172.28.139.138.
Mar 16 19:35:51.910	*aaaQueueReader	Radius request with ID 158 sent to 172.28.139.138.
Mar 16 19:35:51.912	*radiusTransportThread	Radius request with ID 158 sent to 172.28.139.138.
Mar 16 19:35:52.041	*aaaQueueReader	Radius request with ID 159 sent to 172.28.139.138.
Mar 16 19:35:52.042	*radiusTransportThread	Radius request with ID 159 sent to 172.28.139.138.
Mar 16 19:35:52.102	*aaaQueueReader	Radius request with ID 160 sent to 172.28.139.138.
Mar 16 19:35:52.106	*radiusTransportThread	Radius request with ID 160 sent to 172.28.139.138.
Mar 16 19:35:52.153	*aaaQueueReader	Radius request with ID 161 sent to 172.28.139.138.
Mar 16 19:35:52.163	*Dot1x_NW_MsgTask_3	RADIUS Server permitted access
Mar 16 19:35:52.163	*Dot1x_NW_MsgTask_3	Client will be required to Reauthenticate in 43000 seconds
Mar 16 19:35:52.163	*Dot1x_NW_MsgTask_3	4-Way PTK Handshake, Sending M1
Mar 16 19:35:52.217	*Dot1x_NW_MsgTask_3	4-Way PTK Handshake, Received M2
Mar 16 19:35:52.217	*Dot1x_NW_MsgTask_3	4-Way PTK Handshake, Sending M3
Mar 16 19:35:52.269	*Dot1x_NW_MsgTask_3	4-Way PTK Handshake, Received M4
Mar 16 19:35:52.269	*Dot1x_NW_MsgTask_3	Client has completed PSK Dot1x or WEP authentication phase
Mar 16 19:35:52.269	*Dot1x_NW_MsgTask_3	Client has entered DHCP Required state
Mar 16 19:35:54.552	*emWeb	Client delete code: Multiple triggers That can be due to possible reasons: Received a CCX RM request from a client with CCX version lower than 2/ Radius server sent a disconnect request (RFC3576, etc)/ On some scenarios of client blacklist (administrator request)/ For HTTP profiling scenarios, after a vlan change, so policies can be reapplied, or when received policies have a different session timeout, from the client session timeout/ WLAN is deleted or disabledIn PMIPv6, MAG notified to delete the client/ Administrator request a client delete by CLI/GUI
Mar 16 19:35:54.552	*emWeb	Client expiration timer code set for 1 seconds. The reason: Dissasociation or deauthentication received from client, this is valid on 802.11w scenario. Also, generic termination clause, reason would be provided by pervious log message
Mar 16 19:35:55.398	*apfReceiveTask	Client session has timed out
Mar 16 19:35:55.398	*apfReceiveTask	Client disassociation event has occured. Possible reasons may be due to AP Radio Reset usually due to channel change or wlan was manually disabled or Client unable to get valid DHCP IP for WLAN using DHCP required
Mar 16 19:35:55.398	*apfReceiveTask	Client has been deauthenticated
Mar 16 19:35:55.398	*apfReceiveTask	Client session has timed out
Connection attempt #1
Mar 16 19:35:58.490	*apfMsConnTask_0	Client roamed to AP/BSSID BSSID 24:36:da:13:db:f6 AP CN-07928ap-04
Mar 16 19:35:58.490	*apfMsConnTask_0	The WLC/AP has found from client association request Information Element that claims PMKID Caching support
Mar 16 19:35:58.490	*apfMsConnTask_0	The Reassociation Request from the client comes with 1 PMKID
Mar 16 19:35:58.490	*apfMsConnTask_0	WLC cannot find a valid PMKID to match the one provided by the client. However, if the client performs OKC and not SKC, the WLC computes a new PMKID based on the information gathered (the cached PMK, the client MAC address, and the new AP MAC address)
Mar 16 19:35:58.490	*apfMsConnTask_0	Client is entering the 802.1x or PSK Authentication state
Mar 16 19:35:58.490	*apfMsConnTask_0	Client has successfully cleared AP association phase
Mar 16 19:35:58.490	*apfMsConnTask_0	WLC/AP is sending an Association Response to the client with status code 0 = Successful association
Mar 16 19:35:58.526	*Dot1x_NW_MsgTask_3	Client will be required to Reauthenticate in 43000 seconds
Mar 16 19:35:58.526	*Dot1x_NW_MsgTask_3	WLC/AP is sending EAP-Identity-Request to the client
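
To put numbers on the sequence above, the analyzer output can be parsed and the time between two events measured. This is only a minimal sketch: it assumes the three fields are tab-separated as in the paste, that the year is 2023 (the log itself carries no year), and the helper names `parse_line` and `seconds_between` are made up for illustration.

```python
from datetime import datetime

def parse_line(line, year=2023):
    """Split one analyzer line into (timestamp, task, message), or None if it does not fit."""
    parts = line.rstrip("\n").split("\t", 2)
    if len(parts) != 3:
        return None  # e.g. the "Connection attempt #1" separator line
    stamp, task, message = parts
    try:
        # The paste has no year, so the caller supplies one (assumption).
        # %b expects English month abbreviations such as "Mar".
        ts = datetime.strptime(f"{year} {stamp}", "%Y %b %d %H:%M:%S.%f")
    except ValueError:
        return None
    return ts, task.lstrip("*"), message

def seconds_between(lines, start_text, end_text):
    """Seconds from the first message containing start_text to the next one containing end_text."""
    start = None
    for line in lines:
        event = parse_line(line)
        if event is None:
            continue
        ts, _task, message = event
        if start is None:
            if start_text in message:
                start = ts
        elif end_text in message:
            return (ts - start).total_seconds()
    return None

# Four lines copied from the output above
sample = [
    "Mar 16 19:35:51.343\t*Dot1x_NW_MsgTask_3\tWLC/AP is sending EAP-Identity-Request to the client",
    "Mar 16 19:35:52.163\t*Dot1x_NW_MsgTask_3\tRADIUS Server permitted access",
    "Mar 16 19:35:55.398\t*apfReceiveTask\tClient has been deauthenticated",
    "Connection attempt #1",
]

print(seconds_between(sample, "EAP-Identity-Request", "RADIUS Server permitted access"))
print(seconds_between(sample, "RADIUS Server permitted access", "Client has been deauthenticated"))
```

On these lines the authentication completes in under a second, and the deauthentication follows roughly three seconds after the RADIUS accept.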

-- Each morning when I wake up and look into the mirror I always say ' Why am I so brilliant ? '
When the mirror will then always respond to me with ' The only thing that exceeds your brilliance is your beauty! '

Scott Fella · ‎03-16-2023

Just to add, you should always run a diff between the old configuration and the post-upgrade configuration. This will show you what was added, disabled, or set back to default. Hopefully you have a backup config that you can diff against.
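
If no diff tool is at hand, a few lines of Python's standard `difflib` will do. This is a rough sketch only: the two config fragments are made-up placeholders, not real WLC output, and in practice you would read the saved pre- and post-upgrade backups from files.

```python
import difflib

def diff_configs(before_text, after_text):
    """Return the unified-diff lines between two configuration dumps."""
    before = before_text.splitlines()
    after = after_text.splitlines()
    return list(difflib.unified_diff(
        before, after,
        fromfile="pre-upgrade", tofile="post-upgrade",
        lineterm="",
    ))

# Hypothetical placeholder configs, for illustration only
pre_upgrade = "wlan A\nsetting x enable\nsetting y 5\n"
post_upgrade = "wlan A\nsetting x disable\nsetting y 5\n"

for line in diff_configs(pre_upgrade, post_upgrade):
    print(line)
```

Lines starting with `-` exist only in the old config and lines starting with `+` only in the new one, so a setting that was reset shows up as a `-`/`+` pair.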

-Scott
*** Please rate helpful posts ***

Rich R · ‎03-16-2023

There is also a bug which @Leo Laohoo pointed out to me, https://bst.cloudapps.cisco.com/bugsearch/bug/CSCwe07802, which is fixed in the next maintenance release, due out in the next week or two. Ask TAC about that. (What AP model are you seeing this on?)

And of course, as others said, make sure your WiFi drivers are updated to the LATEST version. I say that because people quite often say "my drivers are up to date" (because Windows Update hasn't offered anything new) when the driver they're using is 2 years older than the one on the Intel website. So if it's Intel, look at https://www.intel.com/content/www/us/en/download/19351/windows-10-and-windows-11-wi-fi-drivers-for-intel-wireless-adapters.html for example. (The earlier versions of those drivers are riddled with bugs.)

------------------------------
Please click Helpful if this post helped you and Select as Solution (drop down menu at top right of this reply) if this answered your query.
------------------------------
TAC recommended codes for AireOS WLC's and TAC recommended codes for 9800 WLC's
Best Practices for AireOS WLC's, Best Practices for 9800 WLC's and Cisco Wireless compatibility matrix
Check your 9800 WLC config with Wireless Config Analyzer using "show tech wireless" output or "config paging disable" then "show run-config" output on AireOS and use Wireless Debug Analyzer to analyze your WLC client debugs
Field Notice: FN63942 APs and WLCs Fail to Create CAPWAP Connections Due to Certificate Expiration
Field Notice: FN72424 Later Versions of WiFi 6 APs Fail to Join WLC - Software Upgrade Required
Field Notice: FN72524 IOS APs stuck in downloading state after 4 Dec 2022 due to Certificate Expired
- Fixed in 8.10.196.0, latest 9800 releases, 8.5.182.12 (8.5.182.13 for 3504) and 8.5.182.109 (IRCM, 8.5.182.111 for 3504)
Field Notice: FN70479 AP Fails to Join or Joins with 1 Radio due to Country Mismatch, RMA needed
How to avoid boot loop due to corrupted image on Wave 2 and Catalyst 11ax Access Points (CSCvx32806)
Field Notice: FN74035 - Wave2 APs DFS May Not Detect Radar After Channel Availability Check Time
Leo's list of bugs affecting 2800/3800/4800/1560 APs
Default AP console baud rate from 17.12.x is 115200 - introduced by CSCwe88390

Noovi · ‎03-16-2023

I think this is the bug that is affecting us. The issue is intermittent, occurs randomly, and points to EAP authentication.

Let me work with TAC on the next fix release.

Rich R · ‎03-16-2023

TAC should be able to give you a copy of the latest beta if you're willing to test it.


Mikulasik · ‎03-16-2023

I have this bug too, seems to affect all devices, but is not consistent. Issue occurred over the weekend.

Scott Fella · ‎03-16-2023

My question would be: did you run into this issue because you upgraded, or did you only now notice that users were having issues? There will always be bugs, and the biggest takeaway is that if you upgrade and users tell you a few weeks or months later that wireless sucks, you should revert. Users tend to find their own fixes, or let's say workarounds, until it becomes a pain in their rear ends. I have done so many upgrades with testing, and you will always run into one upgrade that bites you in the butt. The best approach is not to wait for a fix and then upgrade only to find out it's still broken or another issue appears; revert and do further testing. At the end of the day, you can't blame the vendor for a bug, because management will always look at the person or team that made the change.

-Scott

Mikulasik · ‎03-16-2023

We were on 8.10.170 since it came out and ran into this issue on Monday (hmm, DST happened Sunday). We upgraded to 183 based on TAC advice, with no change. My debugs look exactly the same as the OP's, and the bug was logged this week. I have a 3504 controller with users hitting the same NPS server policy with no issues, but it runs 8.5. Why it would run fine for about a year and then screw up like this, I don't know, but at this point it must be a WLC bug.

Scott Fella · ‎03-16-2023

Things don't just break. You need to look at patches on the Windows devices, which can also break things. A NIC firmware upgrade can introduce issues too. So go back a month or so, see what was pushed, and try to isolate the issue. New devices can also make it look like something just broke, when in fact a bunch of users just got their laptops refreshed. It's best to gather data on the devices through a device management system that can help you correlate NIC model types and firmware with patches to see what might have caused the issue. In any case, take the time to reboot the controller or fail over to another controller to see if the issue goes away. Even though the controller seems okay, it might not be. I have seen that too many times, just like folks who never shut down their laptops until eventually they are slow, have issues connecting, etc.

-Scott

Mikulasik · ‎03-28-2023

The root cause was Azure fragmenting packets from the NPS server and delivering them out of order. This behavior is by design, so we needed to get Azure to enable UDP fragment reordering.

https://github.com/MicrosoftDocs/azure-docs/issues/69477
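
For anyone wondering why RADIUS gets fragmented at all: an EAP-TLS exchange can push a single RADIUS message (up to 4096 bytes per RFC 2865) past a 1500-byte MTU, at which point the UDP datagram is split into several IPv4 fragments. Here is a small sketch of that arithmetic. The 1500-byte MTU, the option-less 20-byte IP header, and the payload sizes are illustrative assumptions, not values taken from our captures.

```python
def ipv4_fragments(udp_payload_len, mtu=1500, ip_header=20, udp_header=8):
    """Return (offset, length) pairs for the IPv4 fragments of one UDP datagram.

    Offsets and lengths count the bytes carried after the IP header,
    so the 8-byte UDP header sits at the start of the first fragment only.
    """
    total = udp_header + udp_payload_len
    # Every fragment except the last must carry a multiple of 8 bytes.
    max_chunk = (mtu - ip_header) // 8 * 8
    fragments = []
    offset = 0
    while offset < total:
        length = min(max_chunk, total - offset)
        fragments.append((offset, length))
        offset += length
    return fragments

# Illustrative payload sizes (assumptions)
print(ipv4_fragments(1000))   # fits in a single packet
print(ipv4_fragments(1800))   # split into two fragments
```

Only the first fragment carries the UDP header, so the receiver cannot use a later fragment on its own; it has to collect and reorder all of them. If fragments that arrive out of order are dropped along the path, the whole RADIUS message is lost.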

Scott Fella · ‎03-28-2023

I ran into this too a few months back. Keep in mind that an Azure engineer enables this on an Azure virtual network for the subscription. If you have multiple resource groups and need this feature, you will need to request that they enable this flag. If you create a new virtual network gateway, you will need to open a ticket to have them enable this flag.

I saw the issue with ISE in Azure with only EAP-TLS and fragmentation when using an OTA capture.

https://community.cisco.com/t5/network-access-control/eap-tls-to-azure-ise-is-failing-but-not-with-an-ise-node-in-the/td-p/4739038

-Scott

JPavonM · ‎03-17-2023

Please look for clients whose OS and/or drivers have been upgraded, as @Scott Fella said. If something has been working consistently for the last few months and failures have appeared on all clients with a given set of specifications (Intel in this case), look for the problem on that side.

I'd recommend subscribing to the Intel communities, where you can post the errors and work with Intel engineers on tracking down the issue and possibly fixing it. In parallel, other wireless NIC vendors such as Realtek and Mediatek also have known connectivity and performance issues under Windows, so always look for the most up-to-date driver in the Microsoft Update Catalog. There are scripts that do this for you for drivers only; search for them on Google.

Mikulasik · ‎03-17-2023

I'd entertain assuming it was just Intel if it wasn't the same behavior on Apple and Android devices.

Leo Laohoo · ‎03-17-2023

We are currently investigating an Intel-related wireless NIC driver issue where the NIC drops its association if the SSID is configured for WPA2 Enterprise. Dropouts also occur with PSK, but not as frequently as with WPA2 Enterprise.

The matter was first observed after a large fleet of Chromebooks (CB) started having irregular dropouts. We raised the issue with Google, and Google tapped Intel. Intel confirmed an issue with the NIC drivers.

According to Google, the issue is due to GTK regeneration, which the driver is unable to handle.

We suspect all drivers up to 22.150.3 are affected.
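
If you want to sweep an inventory for suspect versions, compare the version numbers numerically rather than as strings. A minimal sketch follows, assuming the versions are plain dotted numbers and treating 22.150.3 as the last suspect release (that cut-off is only our suspicion, not a confirmed boundary). The function names and the sample version strings are hypothetical.

```python
def parse_version(text):
    """Turn a dotted version string such as '22.150.3' into a tuple of ints."""
    return tuple(int(part) for part in text.strip().split("."))

# Assumption: the last release we suspect is affected
LAST_SUSPECT = parse_version("22.150.3")

def is_suspect(installed):
    """True if the installed version is at or below the suspected cut-off."""
    # Compare only as many components as the cut-off has, so a build
    # suffix such as '22.150.3.1' still counts as 22.150.3.
    return parse_version(installed)[:len(LAST_SUSPECT)] <= LAST_SUSPECT

# Hypothetical version strings, for illustration only
for version in ["21.40.5", "22.90.0", "22.150.3.1", "22.160.0"]:
    print(version, is_suspect(version))
```

A plain string comparison would get "22.90.0" wrong, because "9" sorts after "1" even though 90 is lower than 150.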