Re: Azure Cloud deployment - fragmentation issues

Jagermeister · ‎01-21-2025

Hi all,

I have my ISE cluster deployed in Microsoft Azure and I have spent over a week troubleshooting an issue. Unfortunately, it seems that Azure is causing this, and I'm wondering what everyone does to mitigate it.

The issue:

Access for 802.1X supplicants is very unreliable: sometimes it works, but other times 802.1X fails and the supplicants will no longer authenticate. The 802.1X method is PEAP with EAP-TLS as the inner method. The logs on the PAN show the following in these scenarios:

Failure Reason: 5440 Endpoint abandoned EAP Session and started new
Failure Reason: 5411 Supplicant stopped responding to ISE

I can reproduce this behavior by simply cycling the switch port a few times; it just 'breaks' again.

After taking several pcaps and analyzing them, it seems that the traffic is sometimes fragmented, which makes sense since I'm trying to do certificate-based authentication. I've read multiple sources, and it seems that Microsoft Azure drops out-of-order UDP fragments and that this can cause the two failure messages listed above.

I decided to test this by deploying a PSN node outside of Microsoft Azure and joining it to my cluster. All clients that I point towards this PSN authenticate perfectly fine. The PAN's alarm reports also show that the dropped RADIUS packets come only from my Azure PSN and NOT from the locally deployed PSN.

Now I have read about the following options:

- Get Microsoft to enable the 'allow out-of-order fragments' option

- Pin the subscription to ensure all instances within that subscription are deployed on hardware generation 7

The catch is that Microsoft seems to require that the Azure subscription be empty. I think this is quite a PITA, since almost no one has an empty subscription and creating a new one is not always an option.

I'm wondering if something can be done outside of Azure to improve this situation. Would adjusting the MTU of the Cisco ISE PSN itself help for example?

Mark Elsen · ‎01-21-2025

FYI: https://community.cisco.com/t5/network-access-control/eap-tls-to-azure-ise-is-failing-but-not-with-an-ise-node-in-the/m-p/4783440#M580104

M.

-- Let everything happen to you
   Beauty and terror
      Just keep going
     No feeling is final
Rainer Maria Rilke (1899)

Jagermeister · ‎01-22-2025

Thanks,

It seems that I cannot do much without Microsoft enabling that flag. Quite a bummer, since it's very inconvenient that the Azure subscription needs to be empty.

wifievangelist · ‎02-26-2025

This is not really a solution. The correct solution should come from Cisco, where they provide customers with the option to set the MTU size for the RADIUS Access-Request sent by the WLC to ISE. I am not sure why all the Cisco folks are pointing to Microsoft when there are certain things Cisco can address.

Greg Gibbs · ‎02-26-2025

See EAP Fragmentation Implementations and Behavior

There are multiple levels of fragmentation involved, and one of the problems is that the Windows native supplicant uses large EAP messages (1470 bytes), which forces IP fragmentation. This is a hardcoded setting that cannot be changed.
The result of the fragmentation is that the last fragment is smaller, so it is transmitted faster and can therefore be received out of sequence.

Cisco has no control over how the Windows supplicant behaves.
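
To make the arithmetic concrete, here is a minimal Python sketch (mine, not taken from the linked article) of how one UDP datagram is split into IPv4 fragments. The 1470-byte EAP message size is the figure quoted above; the 1500-byte MTU and the roughly 150 bytes of additional RADIUS header and attribute overhead are assumptions for illustration only, and `ipv4_fragments` is a hypothetical helper name.

```python
IP_HEADER = 20   # bytes, IPv4 header without options
UDP_HEADER = 8   # bytes

def ipv4_fragments(udp_payload_len, mtu=1500):
    """Return (offset_bytes, data_len, more_fragments) for each IPv4 fragment."""
    total = UDP_HEADER + udp_payload_len      # bytes carried as IP payload
    # Every fragment except the last must carry a multiple of 8 bytes.
    max_data = (mtu - IP_HEADER) // 8 * 8
    fragments = []
    offset = 0
    while offset < total:
        data_len = min(max_data, total - offset)
        more = offset + data_len < total      # "more fragments" flag
        fragments.append((offset, data_len, more))
        offset += data_len
    return fragments

# Assumption: 1470-byte EAP message plus ~150 bytes of other RADIUS content.
radius_len = 1470 + 150
for offset, data_len, more in ipv4_fragments(radius_len):
    print(f"offset={offset:5d} data={data_len:5d} more_fragments={more}")
```

Under these assumed numbers the datagram becomes one full-size fragment followed by a much smaller trailing fragment, which is the pattern described above. Lowering the MTU on the PSN does not change the size of what the NAD sends towards ISE.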

Greg Gibbs · ‎01-21-2025

There are multiple layers of fragmentation at play here. The main culprit is the large EAP messages used by the Windows supplicant. Changing the MTU on the PSNs will not make a difference for that.

See the discussion here for more details - https://community.cisco.com/t5/network-access-control/azure-packet-fragmentation/td-p/5205223

Jagermeister · ‎01-22-2025

Thanks for your answer,

So, fragmentation using a Windows supplicant is unavoidable if I understand correctly? I'm currently trying to get in touch with Microsoft support to see what they can do for me.

Another idea I had is to use RADIUS over TLS (RadSec). Unfortunately, it seems that ISE only supports RADIUS over DTLS, which probably would not solve this since it's also UDP. I guess I can perform another test by deploying a PSN in one of the two regions that already have the fix applied and see if it works properly.

Do other public cloud providers also drop out-of-order UDP fragments in your experience, or is this just Azure?

Greg Gibbs · ‎01-22-2025

I worked with a large global resources company to migrate their ISE cluster from on-prem to AWS, which does not have this issue. As far as I know, MS is the only supported cloud provider that has this issue.

For customers I speak with that are planning the move to public cloud and have multi-cloud environments, I strongly recommend deploying ISE in AWS instead of Azure for this reason.

danielecappelletti · ‎01-24-2025

We also have this problem... We read in the Microsoft Q&A link below that the parameter enable-udp-fragment-reordering can only be enabled on a new subscription without any resources deployed in it: https://learn.microsoft.com/en-us/answers/questions/996062/azure-drops-my-udp-fragmentated-packets-when-they

Could you update this thread if you find a way to do this?

Thanks

PSM · ‎01-24-2025

@danielecappelletti Microsoft can only do that for a limited number of SKUs, not all of them. They can't do it for the SKUs used by ISE machines.

ccieexpert · ‎01-25-2025

If you have only a few NADs, then you could build an IPsec tunnel all the way from a NAD to ISE as shown below:

https://www.cisco.com/c/en/us/support/docs/wireless/catalyst-9800-series-wireless-controllers/222720-configure-ipsec-tunnel-between-cisco-wlc.html

This is with a WLC, but switches like the 9300 also support IPsec tunnels. When you use an IPsec tunnel, the fragmented packets are encapsulated in the tunnel, so Azure doesn't see them (at least that is what I think), as the tunnel terminates on ISE. But this may not be scalable for 100s of NADs...

Also, you say it works sometimes and fails intermittently. I haven't looked into the details of the out-of-order problem. Have you actually taken packet captures on the supplicant/NAD and also in Azure to see whether the fragments are truly out of order? How do you connect to Azure? Do you have multiple paths that cause the reordering, or is it the fragmentation process on a device that causes a smaller fragment to be sent before a larger one? It would be good to understand where the packets go out of order and see if we can mitigate that.
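
As a rough way to answer the "truly out of order?" question from a capture, a small Python sketch like the one below may help. It assumes you have already exported, in arrival order, the IP identification and fragment offset of each packet (for example the Wireshark fields ip.id and ip.frag_offset). The function name and the sample values are hypothetical and are not taken from any capture in this thread.

```python
def out_of_order_ids(fragments):
    """fragments: iterable of (ip_id, offset_bytes) in arrival order.

    Returns the IP IDs for which a fragment arrived after another fragment
    of the same datagram that has a higher offset.
    """
    highest_seen = {}   # ip_id -> highest offset seen so far
    reordered = []
    for ip_id, offset in fragments:
        prev = highest_seen.get(ip_id)
        if prev is not None and offset < prev and ip_id not in reordered:
            reordered.append(ip_id)
        highest_seen[ip_id] = offset if prev is None else max(prev, offset)
    return reordered

# Hypothetical sample: datagram 0x0001 arrives in order, 0x0002 does not.
sample = [
    (0x0001, 0), (0x0001, 1480),
    (0x0002, 1480), (0x0002, 0),
]
print([hex(i) for i in out_of_order_ids(sample)])
```

This simplification keys on the IP ID alone. In a busy capture you would also want to key on source and destination address, since IP IDs are only unique per flow. Running it against the NAD-side capture and the Azure-side capture separately should show where the reordering is introduced.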

Jagermeister · ‎01-27-2025

Thanks for your reply,

Today I made some more captures, but I'm starting to get a bit lost, to be honest. I made two pcaps: one on the office SD-WAN appliance (LAN interface), and one on the ISE PSN.

SD-WAN:

PSN:

Taking RADIUS session ID 153 (frame identification 2e7a) as an example:

LAN SD-WAN:

ISE:

I think this suggests that the PSN in Azure actually receives the UDP traffic, and the Access-Challenge that is sent back also seems to be received on the LAN interface of the SD-WAN appliance. Yet in the ISE logs the client keeps failing with the following reason codes:

Failure Reason: 5440 Endpoint abandoned EAP Session and started new
Failure Reason: 5411 Supplicant stopped responding to ISE

In the pcaps I do not see any Access-Accept message until many attempts have been made:

These attempts are all from the same supplicant; the logs show over 50 failed attempts.

Any idea?

Scott Fella · ‎01-27-2025

You need to open a ticket with Azure engineering to enable that flag. I worked at MS, and when I moved ISE to Azure, I had to have Azure engineering enable UDP fragmentation. It's pretty quick and it's not service-impacting, but they have to enable it on a virtual gateway. I posted this a while back when I got things working.

https://community.cisco.com/t5/network-access-control/eap-tls-to-azure-ise-is-failing-but-not-with-an-ise-node-in-the/td-p/4739038

** Update **

Would adjusting the MTU of the Cisco ISE PSN itself help for example? No... I tried that also:)

-Scott
*** Please rate helpful posts ***

Jagermeister · ‎01-27-2025

Hi Scott,

Thanks, I'm still waiting for Microsoft to respond. In my case I have an ExpressRoute connection to Azure. Are you saying that I need an empty subscription AND that they have to enable it per ExpressRoute / VPN gateway?

Also, when you were having these issues, did it fail every single time or just regularly? In my case it does succeed sometimes, but it fails a lot.

Scott Fella · ‎01-27-2025

Well, this can only be done on a virtual gateway, in your case the VPN gateway. So if your ISE cube is deployed over multiple virtual gateways, then you need to have each one touched. We didn't have an empty subscription; we had a few hundred items in that subscription, which I was worried about, but once they enabled that flag (it took a few seconds) there was no glitch, no tickets, and things started to work.

The fragmentation was only on EAP-TLS; PEAP was working fine. I tested this with a VM on-prem vs. in Azure so I could at least have a baseline of what was working and packet captures to compare. EAP-TLS failed every time, no matter what device we tested with. We tested with PEAP just for this use case, but everything needed to use EAP-TLS to pass auth in production. We didn't allow PEAP at all. Keep in mind, once you get an engineer, they will need to escalate that to a tier 1 engineer to do the work.

Just reference this to the engineer:

https://learn.microsoft.com/en-us/answers/questions/996062/azure-drops-my-udp-fragmentated-packets-when-they

-Scott