Solved: Re: ISE Failures - MAB instead of 802.1x

adam85491 · ‎02-15-2018

Hello,

I've been struggling with an issue in our ISE deployment for months. Basically, we are trying to restrict wired network access for computers by looking for 802.1x and then authorizing if the CA issuer for the machine cert is our internal CA.

Here's what the Authentication Policy looks like:

802.1x: if Wired_802.1X & Allowed Protocols (EAP-TLS) & Default: Use 8021x_Seq

Authorization Policy:

Domain Computer: If 'Any' and EAP_TLS_CA_Issuer (our CA) then PERMIT_ALL_PROFILE

I've uploaded images of these policies as well.

What is happening is that, seemingly at random, Win7 and Win10 clients are not using dot1x authentication (which would use their PC name as the username) and are instead using their MAC address as the username and matching the MAB rule (which will fail). These PCs tend to do this in the morning, and after half an hour or so they start working again. I've noticed successful authentication, then the user shuts down or reboots and there are failures overnight and into the next business day. I've attached a copy of an authentication where you can see it bounce between MAB and dot1x.

What can be causing this? The interface config is below:

switchport
switchport trunk allowed vlan 1
switchport mode access
switchport access vlan xxx
switchport voice vlan yyy
authentication control-direction in
authentication event fail action next-method
authentication event server dead action authorize vlan xxx
authentication event server dead action authorize voice
authentication event server alive action reinitialize
authentication host-mode multi-auth
authentication order dot1x mab
authentication priority dot1x mab
authentication port-control auto
authentication periodic
authentication timer reauthenticate server
authentication violation restrict
mab
dot1x pae authenticator
dot1x timeout tx-period 7
storm-control broadcast level 0.50
storm-control multicast level 0.50
spanning-tree portfast edge

This is a 6880 switch running 15.2.1.SY5

I'm only starting to get familiar with ISE so this could be an incorrect config on the ISE or switch side, but we have 1000+ endpoints and only see this happening to a few people per week. It seems random and I haven't found anything in common as far as Windows versions go. It's affected HP desktops and laptops, but I haven't yet kept track of NIC driver versions to see if maybe something is going on there.

TAC has had me check the adapter settings in Windows, the GPOs, and for a valid certificate on the machine. Each time I do so, everything looks normal. We've gotten packet captures, but only of successful authentications. Our local resource has to reboot the PC to get a full capture, and by the time we do this the reboot seems to have fixed the issue (it doesn't always; usually the user reboots a few times before we get our resource to them and the issue persists... just bad luck on our end, it seems).

I may not have included all of the information needed to solve this. Please let me know if I need to add more. I'm searching everywhere and see suggestions like missing hotfixes for Win7 or machine password timeouts, but not sure that's my answer at this time.

I'd appreciate any help on this.

Adam

Octavian Szolga · ‎02-17-2018

Hi Adam,

Regarding auth failures overnight, I suspect that your PCs have Wake On LAN support enabled. If that's the case, even though the PC is shut down, the link will be brought up by WOL. Not sure about the MAC address, but I've seen similar cases.

I would pay attention to the reauthentication timers and the dot1x timeout. Yours seem a little bit aggressive. Maybe it takes longer for Windows to start the 802.1x service, or maybe the GPO is not correctly configured and you need Domain Controller connectivity to apply the GPO, including the wired config service. (I've seen this too.)

Do a test with a PC: power it on and off over several days with no network connectivity (so that no domain controller can be reached to apply the GPO) and check if the wired config service is running. If it's not running, it means that the dot1x settings pushed by GPO have to be refreshed each time and are not permanent.

Thanks,
Octavian

Adam · ‎02-19-2018

Thank you for your reply.

Regarding Wake on LAN - we had similar thoughts and are going to check to see if that setting is enabled. Most of our users are on laptops but this office serves as a call center with primarily desktops, so it is possible that this is a concern we overlooked.

I did get feedback from TAC that our authentication timers may be a little aggressive, so we have tested tuning some of them back.

I can also get a test PC off the network for a few days to verify the service and draw a conclusion about the GPO.

I'll report back with my findings.

Adam

Adam · ‎02-22-2018

My local resource showed me some screenshots of the PC settings and they look alright, meaning the GPO seems to be fine. We're awaiting the results of testing an offline PC still.

I see a client failing right now (after business hours) and I know it's a workstation based on the MAC. I've attached a screenshot of the failure.

Regarding timers, can you suggest which I should be looking at changing? I've been reading a deployment guide to get a better sense of best practices as well.

Additionally, I see that MAR is selected with a timeout of 5 hours. I am wondering about the impact of this and perhaps its purpose. I understand it can be used when one wants to do machine/user authentication with the native supplicant (we are not doing user authentication), but I am unsure whether it is interfering here or not.

Thank you

Octavian Szolga · ‎02-23-2018

Hi,

If it's after business hours, maybe he shut down his PC and the reauth timer just kicked in.

Can you please check if you have a reauth timer set from ISE for 802.1x?

Regarding timers, I usually use

dot1x timeout tx-period 10

For GPO, you could reload the PC without any network connection and check if the wired 802.1x service is running.

MAR is just a poor man's PC and user authentication. It should kick in only when you use the 'machine was authenticated' authorization condition.

Thanks,

Octavian

Adam · ‎02-23-2018

Thank you for your reply.

I followed up and asked for that GPO test again. I may just go out to the site myself next week to do some more troubleshooting.

I will extend the timeout to 10. I have extended the radius server deadtime to 30 seconds as 2 seconds was causing tons of failures in the logs. Both of these were set by a consultant.

Regarding ISE reauthentication timer, I see this:

Adam

Octavian Szolga · ‎02-23-2018

Hi Adam,

So basically you have an idle-timeout which is useful for scenarios in which the endpoint is not directly connected to the switch, i.e. using a VoIP phone. If the phone is not Cisco then you don't have any means to check if the PC is still connected. After the timeout expires, you have to reauth the device.

If your PCs are directly connected you can safely remove the timeout.

Radius dead-timer should be set somewhere around 15s with 2-3 retries. 2 seconds is too aggressive because ISE in turn waits for the AD. So it may not be an ISE fault but a backend/AD issue.

Thanks,

Octavian

Adam · ‎02-26-2018

This has been very helpful. I just want to make sure I'm understanding the differences between all of the timeouts so I can understand their potential impact and select the best option:

radius-server deadtime 2 --> I thought this was the number of seconds the server had to respond, but instead I'm thinking it's how long a server is considered dead before its health is checked again

radius-server dead-criteria time 15 tries 3 --> Does this relate to the time permitted during an actual authentication attempt?

dot1x timeout tx-period 7 --> This is on the individual interfaces. I'm not clear how this differs from the previous timer
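
As a rough sketch (not from the thread) of how these three timers differ, the Python below models them under the behaviour the IOS command references commonly describe: the port falls back from dot1x to the next method after roughly tx-period x (max-reauth-req + 1) seconds, a RADIUS server is marked dead only when both the dead-criteria time and tries conditions are met, and deadtime is given in minutes rather than seconds. The function names are hypothetical, and the default of 2 for max-reauth-req is an assumption that should be checked against the platform's documentation.

```python
"""Illustrative model of the switch timers discussed above (assumed IOS semantics)."""


def dot1x_fallback_seconds(tx_period: int, max_reauth_req: int = 2) -> int:
    """Approximate seconds a port waits for a supplicant before trying MAB.

    The switch sends one EAP-Request/Identity plus max_reauth_req retries,
    spaced tx_period seconds apart, before moving to the next method.
    """
    return tx_period * (max_reauth_req + 1)


def radius_marked_dead(seconds_since_last_reply: float,
                       consecutive_timeouts: int,
                       time_s: int = 15,
                       tries: int = 3) -> bool:
    """Model of 'radius-server dead-criteria time <time_s> tries <tries>'.

    Both conditions must hold: no valid reply for at least time_s seconds
    AND at least 'tries' consecutive request timeouts.
    """
    return seconds_since_last_reply >= time_s and consecutive_timeouts >= tries


def deadtime_seconds(deadtime_minutes: int) -> int:
    """'radius-server deadtime N' takes minutes: how long a dead server is skipped."""
    return deadtime_minutes * 60


if __name__ == "__main__":
    # tx-period 7 (the original interface config) versus the suggested 10
    print("fallback with tx-period 7: ", dot1x_fallback_seconds(7), "s")
    print("fallback with tx-period 10:", dot1x_fallback_seconds(10), "s")
    # A server that answered 20 s ago but has only timed out twice is not dead yet
    print("marked dead:", radius_marked_dead(20, 2))
    print("deadtime 2 skips the server for", deadtime_seconds(2), "s")
```

Under these assumptions, tx-period 7 gives a PC only about 21 seconds to start its supplicant before the switch tries MAB, which is one reason a slow-booting Windows client could show up with its MAC address as the username.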

I was seeing some RADIUS server dead/alive messages in the logs and am not sure why since they're staying up and responding to ping.

I can confirm the Wired LAN service stayed running without network connection and after a reboot. I can also confirm all Wake on LAN settings are disabled.

Thank you!

Adam

edondurguti · ‎02-26-2018

Were they connecting using dot1x instead of MAB after the power options were disabled?

Adam · ‎03-08-2018

I have not seen any difference after observing the power settings or testing an offline PC with the wired LAN service running.

What I do find interesting is the packet capture I took via SPAN session:

-The ISE logs indicate a MAB failure

-The SPAN session shows repeated "Request, Identity" from the authenticator (switch) followed by an eventual "Response, Identity" from the supplicant (PC)

-There are a number of EAP failures (code 4); I'm unsure why there would be more than 2 (my understanding is the default command is dot1x max-req 2 so I thought it would only try twice before failing)

I've attached a sanitized screenshot of a wireshark capture

Brian Taylor · ‎06-11-2018

Any conclusion?

adam85491 · ‎06-12-2018

Not much of an ending to this story. We essentially were in the process of migrating that office away from the desktop platform they were on and onto the notebook platform the rest of the company was on. Between this and the completion of the Windows 10 upgrade, we saw the noise reduce greatly. I never clued onto any Wake on LAN settings but I was interested to see if people were locking their PC or putting it to sleep instead of rebooting or shutting down. My local resource rarely made it in early enough for consistent troubleshooting.

One frustrating thing I saw was Identity, Response messages from the client. This was surprising when we would only see MAB attempts in the RADIUS Live Logs in ISE. My TAC guy tried telling me I was looking at MAB authentication in the packet capture despite how obvious it was that it was 802.1x (it even says so right in the packet). Eventually he agreed and then said I was out of luck since I did not have corresponding logs from the ISE side. This spiraled into a never-ending chase for info that never seemed to be enough (I'd collect all sorts of stuff, and then get asked for even more. I'd get that, then was told an additional debug on the switch was needed, and then again even more was required, etc.).

It was a frustrating experience. My suggestions would be to bypass suppression for a specific client (you can do this in the RADIUS Live Logs instead of globally) and to extend the timers. There may be some merit to playing with the authentication priority and order as well.

If you can repeat the problem, then get TAC on the phone and you will end up needing a capture from the switch and from ISE as well as debugs on both. This is after you prove the endpoint has a correctly configured supplicant and appropriate certificate.

Adam

anandelumalai · ‎08-07-2020

Hello Adam,

Did you find the root cause for this issue?

Awaiting your response.

Thanks,

Anand

Adam · ‎08-11-2020

Unfortunately I never really did. The problem faded as we got away from that model and got everyone onto Windows 10. I no longer have access to that environment to get further detail.

anandelumalai · ‎08-12-2020

Hello Adam,

Thanks for the update!

We are facing this issue on multiple Windows 10 PCs.