
With ISE 3.3.4 and Win 11 PKI 802.1x EAP failure goes into blocking

wags
Level 1

When we cycled our ISE 3.3p4 deployment for the p4 updates, specifically when cycling the PSNs, our network took a big hit that took about a week to fully recover from. We are still in a bad way because what we found in our environment is still present, although tweaks to ISE have made things much better. This was happening to some extent prior to p4, but this cycle really hurt us.

So the first thing we found from Wireshark was that Windows 11 will stop responding to EAP (go into blocking) after a single failure response from the network. This seems wrong to me; machines should at least try 2 or 3 times before backing off into blocking.

Although we could not find a lot on this subject, this link describes what we feel is happening for Windows 11 (going into backoff/blocking).
https://learn.microsoft.com/en-us/answers/questions/636616/eap-failure-need-to-modify-length-of-block-timer-(

We see that after 20 minutes Win 11 returns to try again. Note that the link discusses Win 10 defaults, which also seem to be set in Win 11. A lot changed in Win 11 around EAP, especially around things that Win 10 ignored and Win 11 does not.

However, during those 20 minutes of blocking quiet time, the switch, acting as a proxy for ISE, tries EAP 3 times against the Win 11 machine that is in blocking. After 3 EAP failures, the switch/ISE seems to go into blocking as well. This presents a true chicken-and-egg situation with respect to blocking: both devices need to come back within a tight window of time, and with Win 11's one-strike-and-you're-out behavior, that is almost impossible.
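
On the switch side of this race, the number of EAP tries and the gap between them are governed by per-port dot1x timers. The fragment below is a minimal sketch using the legacy (IBNS 1.0) interface commands. The interface name and all values are placeholders for illustration, not recommendations and not taken from the deployment described here.

```text
interface GigabitEthernet1/0/1
 ! Seconds to wait for a reply before resending EAP-Request/Identity
 dot1x timeout tx-period 10
 ! Number of times the identity request is resent before 802.1X is given up on
 dot1x max-reauth-req 2
 ! Seconds before a port with a failed session restarts authentication
 authentication timer restart 60
```

With values like these, the switch gives up on 802.1X after roughly tx-period x (max-reauth-req + 1) seconds, which is far shorter than a 20-minute supplicant block, so tuning the switch alone is unlikely to close the window.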

I also found that, in our Win 11 with ISE 3.3p4 environment, the PCs were also going into rejected status in ISE. We changed the ISE configuration so that we pretty much do not reject at all, which has helped. A reject requires human intervention to "clean up ISE" so that the machine can process through 802.1x port security.

We believe that we have reduced the number of EAP failures through changes to ISE that improve the response time from the PSNs. However, we will never be able to eliminate the occasional failure packet caused by excess queueing during a PSN recycle. At least, we believe that is what happened to us.

We seem to have come to this point of instability with the Win 11 migration; however, we also acquired and migrated to new ISE hardware with associated new software at about the same time. In the past we could do PSN cycles one at a time (actually the entire deployment) with "no hits". That was and is really nice; however, more recently things have been nowhere near that smooth.


What we see as our problems, and would appreciate thoughts on:
Do we know if IOS-XE will translate an excessively long 802.1x/PKI cert/RADIUS response into an EAP failure? If so, what is that value and where is the "tuning knob"?

How do we change the apparent Win 11 default setting so that it does not go into EAP blocking after a single EAP failure packet?

Of course we network folks do not have access to the insides of Win 11, so is there anything specific we can ask our Win 11 folks?

Has anyone encountered anything like this, or can tell us where to tune Windows 11 and ISE to bring us back to resilience we've had in the past with ISE and our network?

Anyone with thoughts about the situation as I described above?

TIA

Further data:
-ISE 3.3p4, 4 PSNs split up as 2 active and 2 backup, as defined in the IOS configuration (server order). We easily handle the load during normal operations with 2 active servers, including multiple large 9k chassis switch cycles.
-About 75K end nodes, about half phones, half PCs.
-About 1100 switches mostly 9K IOS 17+, but some 4K, 3K and misc. larger and smaller devices.
-Win 11 with current patch levels.

 

6 Replies

Arne Bier
VIP

Do we know if IOS-XE will translate an excessively long 802.1x/PKI cert/RADIUS response into an EAP failure? If so, what is that value and where is the "tuning knob"?

The switch (Authenticator) sits in the middle of the EAP communication between the supplicant (Win11) and the Authentication Server (ISE). The authenticator does not interpret or interfere with the EAP packets - the EAP failure comes from ISE in all cases. The only exception to this is when ISE is not reachable/responding - IOS-XE has a feature that will send the EAP Success to the supplicant during Critical Auth.
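
For reference, Critical Auth with the legacy (IBNS 1.0) command set is typically configured along the lines of the sketch below. This is not taken from the configuration under discussion: the interface name, VLAN number, and timer values are placeholders, and an IBNS 2.0 deployment would express the same idea in its control policy instead.

```text
! Global: how the switch decides a RADIUS server is dead, and for how long
radius-server dead-criteria time 5 tries 3
radius-server deadtime 15
! Global: send EAP-Success to the supplicant when a port is critically authorized
dot1x critical eapol
!
interface GigabitEthernet1/0/1
 ! Placeholder VLAN: authorize data and voice when all servers are dead
 authentication event server dead action authorize vlan 100
 authentication event server dead action authorize voice
 ! Re-run authentication once a server is reachable again
 authentication event server alive action reinitialize
```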

How do we change the apparent Win 11 default setting so that it does not go into EAP blocking after a single EAP failure packet?

Not sure but I have not googled around for a solution. My question would be, why are endpoints receiving an EAP Failure at all?  What is the cause of these failures?  Are you doing certificate auth, or is there MSCHAPv2 involved?  With cert auth, you should not be getting "intermittent" failures - it either works or it doesn't at all.

Understand that we have gotten the problem slowed down enough, and people are "self fixing" enough, that we are having difficulty "catching" this. The following is based on an extremely limited sample.

-Why are endpoints receiving an EAP Failure at all: unsure; however, we have seen that ISE generally shows "Rejected per authorization profile" for the endpoint. We can power cycle the PC (multiple times) and may eventually get it to "take off". However, when users call, the switch display of the auth session is unknown unauth for the device on the port, and if there is a phone it is voice auth. On the switch we can clear the auth session for that session number (multiple times) and it comes back with a new session number in a state of unknown unauth. Based on reports for that endpoint, it does not seem to show up in ISE as having attempted to reauth (as though the switch/authenticator is stuck). However, if we shut the interface and then no shut, the session does everything correctly and everything is good to go. Generally, the escalation to us has taken some time, so it could be that our "fix" is an accident that is time dependent, and we don't have that metric.

-What is the cause of these failures: not sure, but ISE indicates "Rejected per authorization profile". Also see the answer immediately above for more detail.
-Are you doing certificate auth: yes, PKI, authentication order and priority dot1x mab

-I assume that the ISE logging may be messed up because we've seen that several times with our 3.3 deployment. I am personally uncomfortable trusting the ISE logs and displays in 3.3.
-We have seen some long ISE response time entries in monitoring dashboard analytics from time to time.
-We also see that the PSNs seem to be slowly increasing their memory use, as though there is a memory leak.

These intermittent authentication problems are amongst the most frustrating and difficult to troubleshoot. However, it's very hard to resolve these kinds of issues in these forums based on this level of information provided. My general recommendation in this case would be to really look into the Live Logs details and to see why the endpoint failed.

Are you sure it was an EAP Failure (i.e. something wrong with the 802.1X process), or was it a MAB failure? MAB failures are possibly normal and expected if you only want to authorize your Windows PCs using EAP-TLS, and for some reason, the PC network adapter sends out a non-EAPOL frame that ISE will process as a MAB. How should you treat these MAB events for such devices? I would NEVER Access-Reject them, because that breaks the Session on the switch. You should Access-Accept them always, and then return a restrictive dACL from ISE - allow DHCP and ping (for example) but block everything else. And if you are using IBNS 2.0 and have concurrent DOT1X and MAB authentication, then 50% of those auths will always fail, because you can't have both succeed. It's advisable to rather perform sequential auth - 802.1X first, then MAB etc. If at all possible.
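
A restrictive dACL of the kind described above might look like the following sketch. The entries are illustrative assumptions (DHCP and ping only); a real deployment would likely also need DNS or other services. The lines starting with "!" are comments for the reader and may need to be dropped when pasting the entries into the ISE dACL editor.

```text
! Allow the client to request an address via DHCP
permit udp any eq bootpc any eq bootps
! Allow ping in both directions for basic reachability checks
permit icmp any any echo
permit icmp any any echo-reply
! Block everything else
deny ip any any
```

Returning this with an Access-Accept keeps the session alive on the switch while still limiting what the endpoint can reach.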

I can tell you from having looked at this for many years, that the switch IOS is usually never the cause of any of these issues. It's more likely that the switch config may be sub-optimal - e.g. in IBNS 2.0 one could forget to include the inactivity timer policy, or the policy to clear sessions when inactivity timer has fired. Or any manner of logical errors.
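
As a rough illustration of the inactivity pieces mentioned above, an IBNS 2.0 configuration usually pairs an inactivity timer on the port with a policy event that clears the session when that timer fires. The sketch below assumes IBNS 2.0 is in use; the interface name, policy name, and 60-second value are hypothetical placeholders, and a real control policy would contain many more events than this one.

```text
! Interface (or interface template): age out sessions that go quiet.
! "probe" asks the switch to probe the endpoint before declaring it inactive.
interface GigabitEthernet1/0/1
 subscriber aging inactivity-timer 60 probe
!
! Control policy: clear the session when the inactivity timer fires.
policy-map type control subscriber EXAMPLE-DOT1X-POLICY
 event inactivity-timeout match-all
  10 class always do-until-failure
   10 clear-session
```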

More about troubleshooting 802.1X (EAP) issues

The most common cause of 802.1X issues is the endpoint itself. I am assuming that the certificate is valid, and that the supplicant is configured as needed (e.g. computer authentication, using EAP-TLS or whatever) - in the wired world, and Windows in particular, you'd be best to look at the following:

  • Windows Event Viewer Logs - locate the "Wired AutoConfig" Service that operates the 802.1X supplicant in Windows - see if/why the 802.1X is unhappy.
  • If the Windows PC is a laptop on a dock, then update the dock firmware and see if that improves things
  • Update the Ethernet device drivers (Windows Update usually tries to get the latest for your devices, but scrutinise the version details, and if Windows Update is way behind, then get the updates from the laptop vendor support page)
  • In ISE, ensure that you disable ALL Logging Suppression - you don't want to suppress any repeated success/failed auths - that will give a more reliable view of events in the Live Logs. It might also increase the amount of logging but for troubleshooting you need reliable information
  • For Desk phones, I would recommend to NOT send a Session Reauth in the ISE Authorization Profile - phones stay connected to the switch 24/7 and you can rely on Accounting updates for their status and ISE licensing. Re-auth for such devices adds no value. UNLESS - these phones are using 802.1X EAP-TLS and you want to keep an eye on their certificate lifespan (which you will see with every auth). In my opinion, that is the only value I can see of constantly re-authing a wired endpoint.

 

  

Arne Bier, thank you for all the input and suggestions. I must admit that trying to let the young pups take the reins has put me at a disadvantage with ISE 3.3. Retirement potentially on the radar after decades of CCNP and Ciscoisms and "turning over the reins" is very difficult for me on many levels in cases like this.

What do you make of the following information?

A failure that I worked on recently: the PC was "Rejected per authorization profile" and would not respond to a switch clear auth session. After an interface cycle the port/device did in fact authenticate with 802.1x/PKI certs (switch and ISE both agree on that). Within reason, "no PCs are in our MAB table". But... why was it working fine a day or so ago (weekend)? Why the heck did it work just fine with an interface cycle vs. clear auth? Maybe the phone, but the phone is not present in all cases (multiple issues?).

Valid cert in above case....it works after an interface cycle and the PC would not have gotten a new cert from AD since the port has been offline to that PC (and it was not a new cert). After interface cycle, session is dot1x, data, auth on switch and ISE shows same. 

And the most recent set of "good luck bad events": we had multiple Aruba APs, which use MAB, fall into failure on the same switch. I understand that cert and MAB are totally different, but....

The APs will reboot themselves every 5 or so minutes if they cannot access their "master controller" which is not the switch like Cisco, but a server in the data center. So, the interface is going down and then back up every 5 minutes or so. Log messages on the switch confirm this.

-We can issue sh auth sess int xxx detail and we never see the method transition past 802.1x/running.
-While the auth session is in a stuck condition we can clear auth sess sess ID and the session ID changes.
-No matter what we did on the switch (we did not cycle the interface) it would not "recover".
-We went to ISE and deleted the "MAB entry".  
-ISE never "repopulated" a profiled and rejected ISE entry. We left this for several AP reboots, which caused multiple interface down/up events.
-Leaving ISE alone, we did other commands on the switch to include editing out interface commands (someone left voice auth "stuff" on an AP port) on the device's interface which had no effect.
-During this time, the switch MAC address table has the MAC address in drop status as expected.
-Finally we shut/no shut the port and then.....
-ISE gets a profiled entry for the device on the port in question. This should have happened during the AP reboot caused interface recycles (as far as I am aware).
-ISE does not fully allow the AP to auth since we do not dynamically allow APs to do that.
-A quick click on the profiled entry and placement into an AP defined identity group and "magic". Things are running as expected for the device.
-All displays seem to agree MAB data auth.

Having read the above essay, my gut instinct says that it's either the CoA that's not working, or there is an issue with the IBNS 2.0 Policy (if you're using IBNS 2.0). NAC has many moving parts, and requires an awful lot of config to get it running smoothly.

I think it's time we got some technical facts on the table. Are you able to provide output of the following commands? Just obfuscate the names if you need to, but don't remove output:

show run | sec radius
show run | in aaa
show derived interface <example_interface>
show access-session interface <example_interface> detail

If you're using IBNS 2.0 then please also share your policy-map

Do you have Device-Tracking enabled on NAC interfaces?

Do you have Device Sensor Enabled on the switch, and does it populate (and do you rely on this for profiling in ISE?  If so, check that DHCP snooping is enabled on relevant VLAN(s) if you rely on DHCP probe data in ISE)

What does the relevant ISE Authentication Policy look like? (screenshot)

What does the relevant ISE Authorization Policy look like? (screenshot)

Can we get a Live Logs Details screenshot to see the processing?

I'm happy to take a look at that to see if I can spot an issue.

One of the best Cisco reference sources is the Wired Prescriptive Guide, which I check almost on a weekly basis to remind myself of these things. If you haven't already checked it out, take a look - it might also inspire you with some ideas.

I will see what we can do, and again thank you. Also, new data from yesterday's testing: if we delete the NAS from ISE (so there is no session with any PSN, since we do not use a default NAS), the switch will not correctly perform dead-server fallback on the interface that is "hung up". Note that the "hung" interface in this case is an Aruba AP using MAB, which reboots (interface down/up in sh log) every ~5 minutes. For other interfaces, deleting the auth session ID or just natural "aging timer pops" will "open" the interface to sessions in dead-server mode. My bias right now is an IOS-XE issue for these specific symptoms. I also feel we have multiple issues all adding up to an unstable environment. C9410R, Version 17.9.5.