Radius : fail / fallback - overview ?

Thomas Obbekaer Thomsen · ‎02-08-2023

I seem to have a little radius trouble.

I have two radius servers on an iPSK with radius SSID.

Everything has worked just fine.

But then things (hence this post) started, ehh, not working fine.

A bit of investigation (packet captures) shows that the AP sends Access-Request to the radius server, as expected, then nothing happens (aka no response) and the AP sends the request again (duplicate packet).

Apparently this continues, even though there are two radius servers configured on the SSID.

Why does it not switch to the secondary radius server?

At the same time, I have another SSID running standard dot1x, nothing fancy, using the same radius servers in the same priority order. This one seems to be using radius 2 and, my guess is, switched over at some point.

Does anyone know where I can see in the event log that the radius servers switched over? (I can't seem to find such an option.) And is there a warning anywhere that tells me: "Oh look, your primary radius server has stopped responding"? I'm guessing there is not 🙂

Thomas Obbekaer Thomsen · ‎02-08-2023

PS: https://documentation.meraki.com/MR/Access_Control/MR_Meraki_RADIUS_2.0

According to that documentation you can enable fallback: "If the fallback option is enabled, once the server with higher priority recovers, the AP will switch back to using that preferred (higher priority) server."

How does the AP know when the higher priority server is back? ICMP? Radius requests? Something else?

Thomas Obbekaer Thomsen · ‎02-08-2023

I think I will create a case on this. This is very unclear. I can see from packet captures that the AP tries to ping the secondary radius server (the one that is working) and does not get a reply (we don't allow ICMP, but I can see that we might have to). It uses the secondary, but it never tries to ping the primary, the one that is not working, to see if it's alive. Instead, at intervals, it sends a lot of radius requests that way (from clients, so it thinks the primary is alive without actually knowing?), never gets a reply, and then tries again (duplicate packet).

Fallback has not been enabled, so this behaviour seems strange to me.

aleabrahao · ‎02-08-2023

Retry Timing

The Dashboard uses a packet timeout of two (2) seconds. This means that after sending a RADIUS request packet, the Dashboard will wait for a reply for up to two seconds before giving up and trying the next server on the retry list.
The Dashboard will try the next server on the list if EITHER:

The timeout period is exceeded for the packet that was sent, OR
An error packet is received.

Error packets are generally ICMP "Destination Unreachable" packets that indicate either the connection was refused (e.g. no program is listening on the specified UDP port on the destination machine) or the host itself is unreachable (e.g. invalid IP address). If such a packet is received then the next server on the list is tried immediately since the Dashboard knows that it will not receive a reply packet from that server.

The packet timeout is needed because RADIUS servers that are overloaded, or that are behind a firewall that drops incoming request packets, may not send any error packets in response to authentication requests.
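As a rough illustration of the retry rule quoted above (a sketch only, not Meraki's actual implementation), the logic can be modelled like this. The function names `authenticate` and `send_request` and the labels `R1`/`R2` are made up for the example; only the two-second timeout and the "timeout OR error packet" rule come from the documentation.

```python
from typing import Callable, Optional, Sequence, Tuple

PACKET_TIMEOUT_S = 2.0  # packet timeout stated in the quoted documentation


def authenticate(
    servers: Sequence[str],
    send_request: Callable[[str, float], str],
) -> Optional[Tuple[str, str]]:
    """Try each server in priority order; return (server, reply) or None.

    `send_request` is a hypothetical callable that sends one RADIUS request
    and returns the reply, raising TimeoutError when nothing comes back in
    time, or another OSError (e.g. ConnectionRefusedError) when an ICMP
    "Destination Unreachable" style error is received.
    """
    for server in servers:
        try:
            reply = send_request(server, PACKET_TIMEOUT_S)
        except TimeoutError:
            continue  # timeout exceeded: move on to the next server
        except OSError:
            continue  # error packet: move on immediately
        return server, reply
    return None  # every server on the list failed


def fake_send(server: str, timeout: float) -> str:
    """Stand-in transport for the demo: R1 is silent, R2 answers."""
    if server == "R1":
        raise TimeoutError(f"no reply from {server} within {timeout}s")
    return "Access-Accept"


print(authenticate(["R1", "R2"], fake_send))
```

Note that in this model a server behind a firewall that silently drops packets can only ever be detected through the timeout branch, which is the case the documentation calls out.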

I am not a Cisco employee. My suggestions are based on documentation of Meraki best practices and day-to-day experience.

Please, if this post was useful, leave your kudos and mark it as solved.

Thomas Obbekaer Thomsen · ‎02-08-2023

That error part makes me a little confused.

Does it send a UDP packet to port 1812 to see if it responds? Or does it send an ICMP (protocol 1) packet to the host to see if it's alive?

And if both radius servers (or all three, which is the maximum you can configure) do not reply to ICMP, then what happens? I think this "then what happens" is where I'm at right now, so perhaps the documentation should explicitly state whether ICMP is required for failover?

aleabrahao · ‎02-08-2023

https://documentation.meraki.com/MR/Encryption_and_Authentication/RADIUS_Failover_and_Retry_Details

If that's not enough, it might be better to open a support case.

Thomas Obbekaer Thomsen · ‎02-08-2023

PS: The document referenced there seems to be for splash pages (Guest), not dot1x or iPSK with radius. Are we sure the same applies to these?

https://documentation.meraki.com/MR/Encryption_and_Authentication/RADIUS_Failover_and_Retry_Details

Raphael_L · ‎02-08-2023

Pretty sure it will rely on RADIUS testing. If it gets any legitimate response from the RADIUS server, it will flag that server as 'up'.

Thomas Obbekaer Thomsen · ‎02-08-2023

Well, I do see that, all of a sudden, the AP sends a lot of radius packets (unfortunately from client authentications) to the primary radius server. These fail / time out, and are also retransmitted by the AP because there was no answer. I do not see any "meraki test" radius messages. I have a bad feeling about this.

Raphael_L · ‎02-08-2023

Is it possible to share a screenshot of that pcap? You can blur all the info or change the IPs / MACs if needed.

I tested RADIUS failover a couple of days ago and it was working as expected with MR 29.5.

Thomas Obbekaer Thomsen · ‎02-08-2023

Sure, no worries, I will post one tomorrow.

Thomas Obbekaer Thomsen · ‎02-10-2023

So here is the output from one AP, .142, that is only running the dot1x SSID. .101 is Radius 1 and .102 is Radius 2.

Everything seems fine: the AP has switched to R2 (because R1 does not respond to radius messages). But halfway down, "kinda" highlighted, it sends an Access-Request to R1 (for some reason). This is a normal Access-Request; I can see the "client information" inside that packet. It also sends Accounting to R1. None of these packets are answered, so why did it all of a sudden try this, for a real client, against R1? Then once in a while ICMP is also sent, but for the entirety of this capture it is always to R2, never R1 (and as you can tell, ICMP is not allowed on this network). The output here "repeats", in the sense that all of a sudden AAA messages are sent to R1. Why does the AP do this with real client AAA traffic? Why does it not use something else? I think this is broken.
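For anyone wanting to quantify this from an exported capture, here is a rough sketch that counts request transmissions that never got a response, per server. It assumes the capture has been reduced to simple `(src, dst, code, identifier)` rows, which is a made-up format for the example, and it ignores UDP ports and identifier reuse for brevity. The RADIUS codes are the standard ones (1 Access-Request, 2 Access-Accept, 3 Access-Reject, 4 Accounting-Request, 5 Accounting-Response, 11 Access-Challenge).

```python
from collections import Counter
from typing import Dict, Iterable, Tuple

REQUEST_CODES = {1, 4}        # Access-Request, Accounting-Request
RESPONSE_CODES = {2, 3, 5, 11}  # Accept, Reject, Accounting-Response, Challenge

Packet = Tuple[str, str, int, int]  # (src, dst, RADIUS code, identifier)


def unanswered_by_server(packets: Iterable[Packet]) -> Dict[str, int]:
    """Count request transmissions (including retransmits) with no response."""
    pending: Dict[Tuple[str, str, int], int] = {}
    for src, dst, code, ident in packets:
        if code in REQUEST_CODES:
            # A retransmit reuses the same identifier, so it lands on the same key.
            key = (src, dst, ident)
            pending[key] = pending.get(key, 0) + 1
        elif code in RESPONSE_CODES:
            # A response travels server -> AP, so swap src/dst to find its request.
            pending.pop((dst, src, ident), None)
    totals: Counter = Counter()
    for (_ap, server, _ident), sends in pending.items():
        totals[server] += sends
    return dict(totals)


# Hypothetical rows loosely modelled on the capture described above.
capture = [
    (".142", ".102", 1, 8),   # Access-Request to R2
    (".102", ".142", 2, 8),   # Access-Accept from R2
    (".142", ".101", 1, 7),   # Access-Request to R1, never answered
    (".142", ".101", 1, 7),   # retransmit of the same request
    (".142", ".101", 4, 9),   # Accounting-Request to R1, never answered
]
print(unanswered_by_server(capture))
```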

Thomas Obbekaer Thomsen · ‎02-10-2023

The iPSK with Radius SSID has been a little more difficult to capture (because on this specific network there is only one "PSK" client), and when the client does not get authenticated, it selects another AP close by and tries again on that AP. So the output below is a bit of a copy-paste of several captures, but it is what I see.

There is NOT a lot to go on here. I mean, sure, I can see the AAA packets being sent to R2 for the dot1x SSID (I filtered those out of the pic), but whenever AAA is being sent for the iPSK SSID, it ONLY tries R1, and fails (and by fails, I mean times out). The second I manually switched that iPSK SSID to R2 as primary, everything started to work on the iPSK network.

Thomas Obbekaer Thomsen · ‎02-10-2023

Meanwhile on the Meraki dashboard, when looking at the info for this client (that cannot connect because the radius server is failing), it looks like this:

And the timeline part for this client looks like this (below; sorry for cutting this pic). The entries are all "successful" for that iPSK SSID, but on a few different APs, because the client switches to another AP when it has not had a proper connection.

I mean, you can clearly see something is going on. But since there are NO other warnings in the dashboard about AAA not working, you kinda have no clue where to start.

I mean, there was not even a warning about having switched to R2 on the dot1x SSID.

aleabrahao · ‎02-10-2023

Did you ask Meraki support?