Solved: 802.1X failures sporadically for Windows 10 clients

Bwilson13 · ‎11-14-2022

Hello all,

We've been experiencing random authentication failures after a scheduled weekly reboot script runs at 3 am EST on Sunday mornings. The Windows 10 clients use the native supplicant and are a combination of Lenovo small form factor desktops and X1 Carbon laptops (G6 - G9) with docking stations. The clients are running Win10 Enterprise 21H2. We have a two-node ISE 3.1 cluster with patch update 3, and the access layer switches are 3850s running IOS XE 16.12.5b.

The Win10 clients have machine certs used for EAP-TLS authentication. The one thing that I was able to find in the tcpdump from ISE is that a failed client wasn't sending the correct identity (it sent a MAC address as opposed to devicename$@domain).
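When sifting through a capture or an export of RADIUS requests, it can help to sort the identities into "looks like MAB" and "looks like a machine account". The following is only a minimal sketch: the function name is made up for illustration, and the `host/` prefix check is an assumption about how Windows commonly presents a computer identity (the post above only mentions the devicename$@domain form).

```python
import re

# Common MAC address spellings: 001122aabbcc, 00-11-22-AA-BB-CC,
# 00:11:22:aa:bb:cc, 0011.22aa.bbcc
_MAC_RE = re.compile(
    r"^(?:[0-9a-f]{12}"
    r"|[0-9a-f]{2}(?:[:-][0-9a-f]{2}){5}"
    r"|[0-9a-f]{4}(?:\.[0-9a-f]{4}){2})$",
    re.IGNORECASE,
)


def classify_identity(user_name: str) -> str:
    """Guess the authentication type from a RADIUS User-Name value (hypothetical helper)."""
    name = user_name.strip()
    if _MAC_RE.match(name):
        # A bare MAC address as the identity suggests the switch fell back to MAB.
        return "mab"
    # Assumption: machine identities contain '$' or start with 'host/'.
    if name.lower().startswith("host/") or "$" in name:
        return "machine"
    return "other"


for identity in ["00-11-22-AA-BB-CC", "PC01$@corp.example.com", "jdoe"]:
    print(identity, "->", classify_identity(identity))
```

The sample identities are invented. A "mab" result for a port that should be doing EAP-TLS is the symptom described above.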

The reason text for event ID 15514 in the Wired AutoConfig log says there's something wrong with the user account. I interpret that to mean the machine account being used.

Is there anyone in the community experiencing strange behavior like this? Can this be an issue with the Windows supplicant after a reboot?

Arne Bier · ‎11-17-2022

When it says "N/A" it means that there is no session timeout for that endpoint session. This is unique to wired NAC because there is sometimes no need for a re-auth: the device is physically connected to the switch and the link is up, so the switch knows that the endpoint is still there. It's different for devices that are daisy-chained to a deskphone - here the switch only sees the physical link to the deskphone, but not to the PC attached to the phone. Session timeout used to be a good mechanism to ensure that those types of endpoints were forced to re-auth to validate that they were still there. Desk phones from Cisco and other manufacturers will send a proxy logoff to the switch on behalf of the PC to avoid that situation. And of course in wireless NAC you cannot prove the client is still there - so session timeouts are used to maintain those sessions.

Bottom line - in wired NAC you don't need session timeouts and there is no default applied by the switch. The RADIUS server sends this optional timer value.


Arne Bier · ‎11-17-2022

I would go as far as saying that session timeouts for directly connected wired endpoints are a bad thing and not necessary. They can be a bad thing in the event that all connections to the RADIUS server are lost, and then the session counts down to zero. Unless you're using IBNS 2.0 you're gonna have a bad day. And IBNS 2.0 can only save you up to a point. Better still is to not use session timeouts on critical gear, so that you are never faced with a re-auth.

In theory, ISE should maintain the session (no matter how many months/years it exists for) by means of RADIUS Accounting updates. That tells ISE that the endpoint is still there. If this mechanism breaks down for whatever reason, then you have zombie sessions. I have seen those too. In that case, having a regular re-auth could avoid that, because you're forcing endpoints to re-establish those sessions.


Arne Bier · ‎11-14-2022

Is it the switches that are rebooted at 3AM?

Your screenshot didn't show the whole story, but I would guess that what you got there was a MAB request, and not an EAP (802.1X) request. That is quite normal in my experience, especially when IBNS 2.0 is used and MAB/802.1X is allowed simultaneously. You'd need to share more details about your Policy-Map (if using IBNS 2.0). Removing the MAB option from the initial event processing might reduce the chances of catching unwanted MAB events. Having said that, even in a sequential "802.1X first, then MAB" style processing, you can still catch a few MABs because a workstation might not respond to an EAP request from the switch in time, or it might be asleep (at 3AM) and not send anything until it's been woken up - and by then the 30 seconds of "EAP waiting" time is over, and the switch will process MAB again. It's a non-deterministic thing - you can only hope for the best.

Bwilson13 · ‎11-14-2022

Hi Arne,

Thanks for the response!

All of our Win10 clients are behind Cisco 8851 phones, and we're still using IBNS 1.0 on our switches. Priority is dot1x then MAB. To be clear, the reboots are happening to the PCs, not the switches.

Bwilson13 · ‎11-14-2022

Arne - Just to add some more context here. I didn't have another option earlier to capture the EAPoL traffic between the client and the switch; I could only get a tcpdump at ISE. However, I bounced the affected client's port twice and the identity in RADIUS still showed the MAC address.

Interface config:

interface GigabitEthernet1/0/21
 description Corp-User
 switchport access vlan 5
 switchport mode access
 switchport voice vlan 105
 device-tracking attach-policy IPDT_POLICY
 ip arp inspection limit rate 10
 authentication control-direction in
 authentication event fail action next-method
 authentication event server dead action reinitialize vlan 5
 authentication event server dead action authorize voice
 authentication event server alive action reinitialize
 authentication host-mode multi-auth
 authentication order dot1x mab
 authentication priority dot1x mab
 authentication port-control auto
 authentication periodic
 authentication timer reauthenticate server
 authentication timer inactivity server
 authentication violation restrict
 mab
 trust device cisco-phone
 snmp trap mac-notification change added
 dot1x pae authenticator
 dot1x timeout tx-period 10
 auto qos voip cisco-phone
 spanning-tree portfast
 spanning-tree bpduguard enable
 service-policy input AutoQos-4.0-CiscoPhone-Input-Policy
 service-policy output AutoQos-4.0-Output-Policy
 ip dhcp snooping limit rate 40
end

Arne Bier · ‎11-14-2022

One theory I have is that after a reboot, some workstations take longer to come up (and to start the wired supplicant service), perhaps because they are processing a Windows Update. After the Ethernet link comes up, the switch will start its session timer and then expect to conclude the EAP transaction within a specified timeframe - if the bootup is slower than usual, then the EAP timers expire, and then the switch will process Ethernet frames as MAB.
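To put a rough number on that window, here is a small back-of-the-envelope sketch. It assumes the usual legacy behaviour where the switch sends one EAP-Request/Identity plus `max-reauth-req` retries, waiting `tx-period` seconds after each, before moving on to the next method. The `tx-period 10` comes from the posted interface config; the `max-reauth-req` value of 2 is an assumption (it is not shown in the config, so it is taken to be at the default).

```python
def mab_fallback_seconds(tx_period: int, max_reauth_req: int = 2) -> int:
    """Approximate time the switch waits for an EAP response before falling back.

    One initial EAP-Request/Identity plus max_reauth_req retransmissions,
    each followed by a tx_period wait.
    """
    return tx_period * (max_reauth_req + 1)


# tx-period 10 from the posted config, max-reauth-req assumed to be 2
print(mab_fallback_seconds(10))  # 30
```

That lines up with the 30 seconds of "EAP waiting" mentioned earlier in the thread: a supplicant that needs longer than that after link-up would be seen as MAB first.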

Bwilson13 · ‎11-14-2022

Thanks, Arne!

My next step is to verify this by capturing the traffic between the client/supplicant and the switch along with the ISE capture. I'd like to be sure this is not what everyone deems an "ISE issue".

Arne Bier · ‎11-14-2022

What I don't understand, though, is why the MAC address you showed in your initial posting is from Barco Projection Systems (they make video projectors). I would have expected something relating to a PC or a dock.

Perform an endpoint debug in ISE as well - although they can be hard to read. Wireshark is possibly the best. You can also run debugs on IOS-XE - it's not as easy as it was in the old days, but here are the commands you would use:

IOS-XE split out the 802.1X session management into a separate Linux process called 'smd' (Session Manager Daemon).

Forget everything you have learned in the last 25 years of IOS debugging - it no longer works (the classic debug commands are still there, but they are defunct).

Set the debugs

===================

set platform software trace smd switch active R0 dot1x-all debug

set platform software trace smd switch active R0 radius-authen debug

set platform software trace smd switch active R0 aaa-authen debug

set platform software trace smd switch active R0 eap-all debug

set platform software trace smd switch active R0 auth-mgr-all debug

View the trace levels

=========================

show platform software trace level smd switch active R0

View the logs with

==============================

show platform software trace message smd switch active R0

After the test is complete, reset the debugs to normal again!

===========

set platform software trace smd switch active R0 dot1x-all notice

set platform software trace smd switch active R0 radius-authen notice

set platform software trace smd switch active R0 aaa-authen notice

set platform software trace smd switch active R0 eap-all notice

set platform software trace smd switch active R0 auth-mgr-all notice
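Since the enable and reset lists above differ only in the trailing level, one way to avoid forgetting a module when resetting is to generate both lists from the same source. This is just a convenience sketch for building text to paste into a session; the helper name is made up, and the module names and command syntax are copied from the commands above rather than verified against any particular release.

```python
from typing import List

# Trace modules taken from the command list above
SMD_MODULES = ["dot1x-all", "radius-authen", "aaa-authen", "eap-all", "auth-mgr-all"]


def smd_trace_commands(level: str, modules: List[str] = SMD_MODULES) -> List[str]:
    """Build one 'set platform software trace' line per module at the given level."""
    return [
        f"set platform software trace smd switch active R0 {module} {level}"
        for module in modules
    ]


print("\n".join(smd_trace_commands("debug")))   # before the test
print("\n".join(smd_trace_commands("notice")))  # after the test
```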

Bwilson13 · ‎11-14-2022

I know it's odd but that's the Thinkpad dock's MAC address. And thanks for providing the debugs!!

Arne Bier · ‎11-14-2022

I would not have expected that MAC OUI prefix from a dock. Anyway.

If you're using Lenovo Docking Stations and Lenovo laptops, and if you have the supported product mix, then you should enable MAC Address Passthrough. I found that in my Lenovo X1 it was disabled by default. The benefit of enabling this is that the MAC address of the dock is then no longer used - the MAC address in the laptop BIOS is passed through onto the wire. It's a much better experience because your users might hot-desk - or in general, it's nice to see the real laptop MAC address - we don't care about MAC addresses of docking stations.

Greg Gibbs · ‎11-14-2022

This sounds strikingly similar to another recent discussion on this community...

Docking Station Best Practice with 802.1x Authentication and Cisco ISE

Since legacy IBNS (1.0) was mentioned, this could be related to the expected FlexAuth behavior discussed there.

Bwilson13 · ‎11-14-2022

Thanks, Greg! Would FlexAuth be a factor here if the order and priority are both dot1x then mab? Because that's the configuration across all host interfaces on my 3850s.

This is a standard topology. The docking station isn't shown, but it sits between the X1 and the 8851.

Greg Gibbs · ‎11-14-2022

It's the same session manager authentication engine, so I'm not sure if the same behaviour might be seen with both the order and priority using 'dot1x mab'. Most of the customers I've worked with (with legacy IBNS) used 'order mab dot1x' and 'priority dot1x mab' to avoid the delays for MAB endpoints.
It still might be worth trying to add the 'termination-action-modifier=1' to see if it makes a difference. These types of issues can be difficult to troubleshoot, so if that AVpair does not make a difference, you would likely need to look at getting debugs, packet captures, and potentially engage TAC.

Bwilson13 · ‎11-15-2022

Thanks, Greg!

Bruce

Bwilson13 · ‎11-17-2022

Arne or Greg - There's something that I found strange in the "show auth session int details" output across my environment, and I'd like to confirm what it means.

It appears that for most Lenovo X1 laptops connected to a docking station the session timeout field is N/A, but for every Lenovo desktop I can see the session timeout timer.

Lenovo desktop:

Lenovo X1 laptop connected to docking station:

Could this impact the switch's ability to know when the endpoint is connected and its session has begun?

Thanks,

Bruce

Arne Bier · ‎11-17-2022

Session timeout will have been provided by ISE (Server) - if you look further into the show access-session output, you will see "Server Policies" - and perhaps something like the following

Server Policies:
      Session-Timeout: 28800 sec
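If you want to check this across many ports, a rough sketch like the one below could pull the server-supplied value out of saved command output. It assumes the output contains a line in the "Session-Timeout: N sec" form shown above; the helper name is made up, and the exact output format can differ between releases.

```python
import re
from typing import Optional

_TIMEOUT_RE = re.compile(r"Session-Timeout:\s*(\d+)\s*sec", re.IGNORECASE)


def server_session_timeout(show_output: str) -> Optional[int]:
    """Return the server-supplied Session-Timeout in seconds, or None if absent."""
    match = _TIMEOUT_RE.search(show_output)
    return int(match.group(1)) if match else None


sample = """Server Policies:
      Session-Timeout: 28800 sec
"""
timeout = server_session_timeout(sample)
print(timeout, "seconds =", timeout // 3600, "hours")  # 28800 seconds = 8 hours
```

A `None` result would correspond to the "N/A" case: ISE did not send a timeout for that session.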