Azure Cloud deployment - fragmentation issues

Jagermeister
Level 1

Hi all,

I have my ISE cluster deployed in Microsoft Azure and I have spent over a week troubleshooting an issue. Unfortunately it seems that Azure is causing it, and I'm wondering what everyone does to mitigate this.

The issue:

Access for 802.1X supplicants is very unreliable. Sometimes it works, but other times 802.1X fails and the supplicants refuse to authenticate anymore. The 802.1X method is PEAP with EAP-TLS as the inner method. In these scenarios the logs on the PAN show the following:

  1. Failure Reason: 5440 Endpoint abandoned EAP Session and started new
  2. Failure Reason: 5411 Supplicant stopped responding to ISE

I can reproduce this behavior by simply cycling the switch port a few times; it just 'breaks' again.

After taking several pcaps and analyzing them, it seems that the traffic is sometimes fragmented, which makes sense since I'm doing certificate-based authentication. I've read multiple sources and it seems that Microsoft Azure drops out-of-order fragmented UDP traffic, and that this can cause the two failure messages listed above.
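To illustrate why certificate-based EAP tends to fragment, here is a minimal Python sketch of the IPv4 fragmentation arithmetic. It assumes a 20-byte IPv4 header without options and an 8-byte UDP header; the function name `ipv4_fragments` and the payload sizes are made up for illustration and are not taken from the captures in this thread.

```python
def ipv4_fragments(udp_payload_len, mtu=1500, ip_header_len=20):
    """Return (offset_bytes, data_len, more_fragments) per IPv4 fragment.

    Assumes an IPv4 header without options and an 8-byte UDP header.
    A single tuple with more_fragments=False means no fragmentation.
    """
    udp_header_len = 8
    total = udp_payload_len + udp_header_len
    # Fragment data must be a multiple of 8 bytes (except the last one),
    # because the IP fragment offset field counts in 8-byte units.
    max_data = (mtu - ip_header_len) // 8 * 8
    fragments = []
    offset = 0
    while offset < total:
        data_len = min(max_data, total - offset)
        more_fragments = offset + data_len < total
        fragments.append((offset, data_len, more_fragments))
        offset += data_len
    return fragments


# A hypothetical 2000-byte RADIUS payload on a 1500-byte MTU path:
print(ipv4_fragments(2000))
# The same payload when the path MTU is 1300:
print(ipv4_fragments(2000, mtu=1300))
```

Any RADIUS packet whose UDP payload exceeds the path MTU minus 28 bytes ends up as two or more fragments, and those are what can arrive out of order.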

I decided to test this and deployed a PSN outside of Microsoft Azure and joined it to my cluster. All clients that I point towards that PSN authenticate perfectly fine. The PAN's alert reports also show dropped RADIUS packets only for my Azure PSN and NOT for the locally deployed PSN.

Now I have read about the following options:

- Get Microsoft to enable 'allow out-of-order fragments' option

- Pin the subscription to ensure all instances within that subscription are deployed on hardware generation 7

The thing is that Microsoft seems to require the Azure subscription to be empty. I think this is quite a PITA, since almost no one has an empty subscription and creating a new one is not always an option.

I'm wondering if something can be done outside of Azure to improve this situation. Would adjusting the MTU of the Cisco ISE PSN itself help for example?

24 Replies

The GitHub link no longer works :) but here is something from their docs:

https://learn.microsoft.com/en-us/azure/virtual-network/virtual-network-tcpip-performance-tuning#azure-and-fragmentation

-Scott
*** Please rate helpful posts ***

Scott Fella
Hall of Fame

@Jagermeister You might even want to look at this.  I was also testing with the 9800-CL in Azure. This doc has info on ISE in Azure also with packet captures.

https://www.cisco.com/c/en/us/support/docs/troubleshooting/222339-troubleshoot-fragmentation-issues-affec.html

-Scott

Thanks,

 

Today I did a little test and configured my supplicant to use EAP-TLS instead of PEAP-EAP-TLS. Oddly enough, it works fine with EAP-TLS but PEAP-EAP-TLS constantly fails. One thing I noticed in a pcap on the client is that the Meraki switch first seems to attempt EAP-TLS, then legacy NAKs are sent, and after that the requested EAP-PEAP is sent. Probably not really a problem, but it might delay the authentication a bit.

I'm still struggling to find 'evidence' in my pcaps that Azure is actually dropping the traffic. I can confirm that some RADIUS Access-Requests were transmitted from the client (e.g. id=18), but on the ISE PSN I see id 17 and 19 while 18 isn't present, meaning that it didn't arrive.
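The comparison described above can be sketched in a few lines of Python. This is only an illustration: it assumes the RADIUS identifiers have already been exported from each capture into plain lists, and the helper name `missing_radius_ids` is hypothetical. Because the identifier is an 8-bit value that wraps around, it only makes sense within a short capture window for a single client/server pair.

```python
def missing_radius_ids(sent_ids, received_ids):
    """Return identifiers seen at the sender but never at the receiver.

    Both arguments are lists of RADIUS identifiers (0-255) taken from a
    short capture window, in the order the packets were seen.
    """
    received = set(received_ids)
    return [radius_id for radius_id in sent_ids if radius_id not in received]


# Identifiers from the example in the post: 17, 18, 19 leave the client,
# but only 17 and 19 show up on the PSN.
print(missing_radius_ids([17, 18, 19], [17, 19]))
```

A missing identifier only shows that the request did not arrive at the PSN; it does not show where along the path it was lost.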

 

They document that they will drop any UDP packets that come in out of order. The reason I think EAP-TLS failed for me was the size of the certificate we were using. I still think you should have them enable the flag and go from there. The Azure engineer can also take a packet capture on their end to validate that they are dropping the packets; you will not have the ability to do that capture, but they can.

-Scott

Yes, I guess so. I actually just encountered the same behavior for EAP-TLS again, so false alarm I guess :(. Hopefully Microsoft will reply to me soon.

Keep us posted. Have them take a capture while you test so you have that info. They can verify that it's being dropped.

-Scott

Little update here: I am in touch with Microsoft now, but they claim that they can only enable this flag for traffic that originates from the internet to a public IP. We are using Azure ExpressRoute, so that is not the case, and they claim that they CANNOT enable this for ExpressRoute. What a nightmare!!

I'm not sure what I can do anymore. I simply do not understand how people are able to use Cisco ISE in combination with ExpressRoute then. Trying to reduce the payload to avoid fragmentation seems impossible, and sending RADIUS traffic over the internet is also not an option for me.

 

oh wow.... they should be able to capture once it comes into Azure.  Ask them to do that while you replicate the issue, that way you know what is happening on their end.  

@Jagermeister take a look at your subscription, do you have a virtual network gateway that your express route gateway is in?

https://learn.microsoft.com/en-us/azure/expressroute/expressroute-troubleshooting-expressroute-overview

-Scott

So, I have basically packet captured the whole path (where I could).

At the last hop of our infrastructure I captured the interface on which the ExpressRoute is terminated, and I also captured the next hop, which is our firewall in Azure.

Some interesting observations here:

1. A few reassembled packets (Access-Request) are received on the ISE PSN, which explains why it sometimes works (9 out of 10 times it does not).

2. Reversing the path and tracking the reassembled fragments, I see that they are always sent/received in order: incrementing offsets, starting from 0, and all fragments have the MF flag set except the last one.

3. Now looking at the packets that are not received, I can see that they are NOT in order at some point, mostly at the ExpressRoute interface. This can be identified because the 1st received fragment has an offset >0 and MF is NOT set, meaning this fragment is the last fragment. The second fragment has offset=0 and does have the MF flag set.

4. Looking at the ingress interface on the next hop, I do see SOME fragments arriving, but they all have the MF flag set, meaning more fragments are expected. Since the last fragment is never received, the packet cannot be reassembled and it never reaches the egress interface.
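The checks in the observations above can be expressed as a small Python sketch. It assumes the fragments of one IP datagram (same IP ID) have been listed as (offset, MF) pairs in arrival order; the function name `classify_fragments` and the offsets are illustrative, not values from the actual captures. Since it does not look at fragment lengths, it cannot detect a missing middle fragment.

```python
from typing import List, Tuple


def classify_fragments(frags: List[Tuple[int, bool]]) -> str:
    """Classify one datagram's fragments, given in arrival order.

    Each entry is (fragment_offset, more_fragments_flag).
    Returns "empty", "incomplete", "in_order" or "out_of_order".
    """
    if not frags:
        return "empty"
    has_first = any(offset == 0 for offset, _ in frags)
    has_last = any(not mf for _, mf in frags)
    if not (has_first and has_last):
        # Either the offset-0 fragment or the MF=0 fragment never showed up.
        return "incomplete"
    offsets = [offset for offset, _ in frags]
    last_arrived_is_final = not frags[-1][1]
    if offsets == sorted(offsets) and last_arrived_is_final:
        return "in_order"
    return "out_of_order"


# Observation 2: offsets increase from 0, only the last fragment has MF=0.
print(classify_fragments([(0, True), (1480, False)]))
# Observation 3: the MF=0 fragment with offset >0 arrives first.
print(classify_fragments([(1480, False), (0, True)]))
# Observation 4: only fragments with MF set arrive at the next hop.
print(classify_fragments([(0, True), (1480, True)]))
```

Running this per IP ID at each capture point would show at which hop a fragment train goes from in order to out of order to incomplete.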

Based on this information I have to conclude that some fragments are lost/dropped somewhere between the ExpressRoute interface at the edge of our infrastructure and the ingress interface of the firewall INSIDE the Azure infrastructure. I'm pretty sure that Azure drops them somewhere, since this is documented behavior by Microsoft and traffic to my other PSNs also passes this last hop, just not egressing on that interface.

The real question here is: what can you do about this when your traffic needs to ingress into Azure over an ER or VPN gateway instead of a public IP? Even if Microsoft allowed you to enable this on the ER gateway, you would still have great difficulty if you wanted to use something like a load balancer, since it will also drop fragmented out-of-order UDP.

Theoretically this can all be avoided by using RadSec, but Cisco ISE does not support this, only DTLS for secure RADIUS. Also, the Meraki implementation of RadSec is odd: only MR supports it, not MS...

SuperC
Level 1

Yep, happened to me. Azure said the same: they can only enable the flag if the subscription is empty. I moved my WLC and RADIUS server to AWS and all is gravy. I also configured a custom MTU of 1300 to mirror how I had it on AireOS, set under each wireless profile policy. Any RADIUS vendor in Azure will have this issue; it's not specific to Cisco. It all comes down to UDP fragmentation and out-of-order UDP packets being dropped by the Azure infrastructure.