Re: Azure Packet Fragmentation

InfraISE2020 · ‎10-08-2024

Hi all,

We deployed ISE in Azure back in March (version 3.3.0.430) with the following setup:

- 4 x ISE servers (PAN/PSNs)

- ExpressRoute from on-premise to Azure

- Meraki APs

We noticed we have hundreds of clients stopped responding errors every day and when we look further the main error is "12935 Supplicant stopped responding to ISE during EAP-TLS certificate exchange". Quite a few online posts suggest this is an MTU sizing issue where Azure drops fragmented packets.

The deployment guide suggests that this is a known issue with DMVPN and SD-WAN connections and the fix is to contact Microsoft support for them to allow "out-of-order fragments" option.

We logged this with Microsoft and apparently this isn't just a case of enabling a setting and the fix is to create a brand new subscription, use gen 7 VMs and route traffic via the internet!!! Obviously this isn't viable as our connection to Azure has to go via our Express Route circuit!

The guide suggests this has now been fixed in East Asia and West Central US however nothing has changed in UK South.

Has anyone else come across a similar issue and managed to get the issue resolved without the things they suggested to us?

Also is there anywhere in ISE where we can prove that Azure is dropping fragmented packets so we can go back to our account manager with evidence?

TIA.

Greg Gibbs · ‎10-08-2024

See EAP Fragmentation Implementations and Behavior

There are multiple levels of fragmentation involved and one of the problems is that the Windows native supplicant uses large EAP messages (1470 bytes), which forces the IP fragmentation. This is a hardcoded setting which cannot be changed.
The result of the fragmentation is that the last packet is smaller, leading to a faster transmit, and therefore received out-of-sequence.

I'm not sure I understand why MS is stating that the traffic has to be routed via the internet, but the only way to verify that the issue is due to dropped packets is to take a packet capture on each side of the connection (client and ISE) and compare them.

InfraISE2020 · ‎10-11-2024

Hi @Greg Gibbs thanks for the reply.

Are you aware of any other customers who are experiencing the same issues as it sounds like it's a common issue when deploying ISE in Azure? I noticed on the recent deployment guide that it refers to a fix by Microsoft in certain regions, do you know what the actual fix is? We've escalated this with our account manager at Microsoft but it would be good to understand if others are having the same issue as its a big problem for us and we cannot resolve it at the moment.

Cloud Deployment Guide

Due to this known issue, do one of the following:

Select regions where Azure Cloud has already implemented the fixes: East Asia (eastasia) and West Central US (westcentralus).
Cisco ISE customers should raise an Azure support ticket. Microsoft has agreed to take the following actions:
1. Pin the subscription to ensure all instances within that subscription are deployed on hardware generation 7.
2. Enable the "allow out-of-order fragments" option, which allows fragments to pass through to the destination instead of being dropped.

Greg Gibbs · ‎10-13-2024

My understanding is that pre-Gen 7 hardware is unable to reassemble the out-of-sequence fragments properly, but Microsoft would have to confirm that is the case.

Any customer running ISE nodes in Azure with EAP-TLS flows would have this issue. I've had customers with multi-cloud environments deploy ISE in AWS instead of Azure as AWS does not have this issue.

InfraISE2020 · ‎10-24-2024

Hi @Greg Gibbs ,

We aren't having much joy with Microsoft, there suggested fixes are not applicable and we are unable to migrated to AWS as all our infrastructure is in Azure.

I've seen articles suggest setting the MTU on the PSN interface to 1300 and other sites suggest adding the framed-mtu to the authz profiles but i'd like to get more information before we start making random changes.

I can see packet fragementation on our Azure inferface on our fortinet firewall but our networking team are telling us its normal to see fragmentation on L3 network and that there is nothing we can do to re-order the packets before sending to Azure to ensure they are in order and do not get dropped?

Its incredibly frustrating as Microsoft are saying its nothing to do with them and the Cisco guide implies some fixes have been applied in certain regions but it would be good to understand exactly what has been changed.

We migrated ISE to azure based on the Cisco deployment guide and it was suggested that its a simple fix from Microsoft when it doesn't appear to be that way.

Greg Gibbs · ‎10-24-2024

You could certainly try those options, but I'm not confident they will make any difference. None of them will change the fact that the Windows native supplicant uses large EAP messages (1470 bytes), which forces the fragmentation at the IP layer.

The combination of these two factors (expected fragmentation and dropping out of sequence fragments) results in the problem.

FWIW, the OSX supplicant appears to use 1270 byte EAP messages, so Apple appears to have a better grasp on basic networking than MS. I know this doesn't help the situation you're having though.

pritamCTC · ‎03-03-2025

@Greg Gibbs

https://learn.microsoft.com/en-us/azure/virtual-network/virtual-network-tcpip-performance-tuning#azure-and-fragmentation

Cisco Bug id updated on 3rd Feb, 2025

https://bst.cisco.com/quickview/bug/CSCwe82033

I am not sure under this bug why it's only mentioned Network Access Device (Access Point or Switch), when we are doing the EAP-TLS with central auth that time NAD device should be WLC. is not it? is this something incorrectly mentioned in the bug ID?

pritamCTC · ‎07-17-2025

Choose the regions where Microsoft Azure Cloud has already implemented the fixes: Central Canada (CanadaCentral), Central France (FranceCentral), Central India (CentralIndia), Central Poland (PolandCentral), Central Sweden (SwedenCentral), Central UAE (UAECentral), East Asia (eastasia), East Australia (australiaeast), East Canada (CanadaEast), East Japan (japaneast), East Norway (NorwayEast), East US (eastus), North Central US (northcentralus), North Germany (GermanyNorth), North EU (EUNorth), North Switzerland (SwitzerlandNorth), North UAE (UAENorth), South Africa North (SouthAfricaNorth), South Brazil (brazilsouth), South East Asia (southeastasia), South East Australia (australiasoutheast), South India (SouthIndia), South UK (uksouth), West Central US (westcentralus), West Central Germany (GermanyWestCentral), West UK (ukwest), and West US (westus).

Cisco Document updated on 15th July, 2025

https://www.cisco.com/c/en/us/td/docs/security/ise/ISE_on_Cloud/b_ISEonCloud/m_ISEonAzureServices.html

InfraISE2020 · ‎11-07-2024

Hi @Greg Gibbs ,

We have passed this information onto Microsoft and have also spoken to someone else at Cisco and they have said something similar regarding the hardware.

Are you aware of anyone currently using ISE in the East Asia and West Central US regions who don't experience fragmentation issues in Azure?

I don't suppose you have any more information on the supposed hardware "fix" do you?

Greg Gibbs · ‎11-07-2024

@InfraISE2020, I have only personally worked with customers to deploy ISE in AWS. I don't have visibility of any specific customers that have deployed ISE in Azure for these regions.

I'm not aware of any MS documentation that specifically states what is changed in the newer hardware that resolves this issue. It stands to reason that it would be something in the fragmentation reassembly code and/or hardware (like the ASIC).

pritamCTC · ‎02-26-2025

@InfraISE2020, @Greg Gibbs can you please share the latest update on this is this already resolved from MS side or still ongoing for many customers? do you think the customer with EAP-TTLS also face the same issue?

Greg Gibbs · ‎02-27-2025

I cannot speak to what MS is doing to resolve this. You would need to speak with them for current info.

The fragmentation and out-of-sequence packets are due to the large certificate payload. EAP-TTLS(PAP) does not use client side certificates, so it should not have the same issues. It is password-based, however, so there may be various other implications.

Jagermeister · ‎02-28-2025

This is still going on and I think it is not going to change. I'm facing the same issue and reached out to Microsoft about this, the basically said the following:

1. If out-of-order fragment reordering is needed, Azure can only enable this with the following limitations and requirements:

- The VM needs to be fully maintained and have all the applicable security patches in a timely manner

- The out-of-order fragments must originate from the internet to a public IP address attached directly to a VM.

- The out-of-order fragment reordering flag only supports specific VM SKUs, generally Dsv4, Ev4, Bv1 and earlier. The compute optimized FSv2, which Cisco recommends to use for a PSN, wasn't supported according to the Azure engineer

- Allowing out-of-order fragments reordering exclusively applies to public IPs attached to the VMs. It is not supported for load balancing, ExpressRoutes or VPN gateways.

- All the VMs MUST be deployed into a new empty subscription, which is pinned to compatible hardware clusters

- If VNets need to communicate across subscriptions you can use VNET peering, although VNet peering does NOT inherit the UDP fragment flag.

Also, Cisco is stating that two Azure regions have a 'fix' applied already and that these regions will allow out-of-order fragmented UDP. I've asked Microsoft about this and they claim that this is false. The Azure engineer told me that Azure East Asia and Azure West Central US also need this flag and will drop out of order fragments by default.

Note that it seems that ONLY Microsoft Azure is doing this. AWS, OCI and physical DCs that I've tried seem to not drop the fragmented traffic and my 802.1 eap-tls supplicants could authenticate fine if the PSN was hosted there.

elbertdue · ‎02-28-2025

We deployed ISE in Azure back in March (version 3.3.0.430) with 4 ISE servers (PAN/PSNs), using ExpressRoute from on-premise to Azure and Meraki APs. However, we're seeing hundreds of "Supplicant stopped responding to ISE during EAP-TLS certificate exchange" errors daily. From what I've read, this could be an MTU sizing issue since Azure might be dropping fragmented packets. Minecraft Pocket Edition

The deployment guide mentions this is a known problem with DMVPN and SD-WAN connections, and the solution is to ask Microsoft to enable the "out-of-order fragments" option. But when we reached out, they told us the fix would require setting up a new subscription, using Gen 7 VMs, and routing traffic over the internet – which isn’t feasible for us since we rely on ExpressRoute.

I’ve also read that this issue is supposedly fixed in East Asia and West Central US, but nothing’s changed in UK South. Has anyone else faced this and found a workaround that doesn't involve completely redoing the setup? Also, is there a way in ISE to confirm that Azure is dropping fragmented packets so we can provide evidence to our account manager?

Jagermeister · ‎02-28-2025

What you can do is pointing a supplicant to test with to a specific PSN node and then run packet captures across your network path. You can track them by using the identification field field in the IP header (e.g. wireshark: ip.id == 0x66c4). Look if there is a lot of fragmented traffic and try to track the fragmented access-requests. Check if you see certain sessions that aren't received by the ISE node and check if they egress your Express route interface on prem. If there's an intermediate device like a firewall in between you can check if that hop receives all the fragments or not. If it was egressing correctly at your on prem express route interface but not received on the PSN you at least know that its getting lost somewhere in between the express route and Azure.

To create a tcp dump on the PSN:

Go to Operations -> Diagnostic tools -> TCP dump and select the PSN node to create a pcap on the PSN VM itself.