Solved: External RADIUS Server timeout

pnavratil · ‎03-26-2020

I am testing a scenario that uses Duo Security for two-factor authentication on our VPN.

The VPN terminates on an ASA. The ASA sends requests to an ISE server acting as a RADIUS proxy, which forwards them to the Duo Authentication Proxy. I found a useful guide for this scenario and was able to get it working.

But I ran into a general problem with the RADIUS proxy function on ISE.

If the external RADIUS server (Duo in my case) does not answer for any reason, the ASA receives no response and, once its configured RADIUS timeout expires, reports the ISE RADIUS server as dead.

So any malfunction on the external RADIUS server (which may not even be under my administration) causes the ASA to stop sending requests to ISE because of the dead timer.

I did not find any systematic solution for this.

It would be useful to be able to configure ISE to send an Access-Reject to the RADIUS client (ASA, switch, ...) when the external RADIUS server times out. With such a setup, subsequent requests would still be sent to ISE, and the dead-server detection on the ASA would kick in only when ISE itself is really unavailable. But there is no such option.

Has anybody encountered this kind of problem and found a systematic way to solve it?

Regards

Pavel

Cristian Matei · ‎03-27-2020

Hi,

In my opinion, when ISE is integrated with any external server or database (RADIUS, AD) and the authentication process fails from the ISE point of view (meaning the external server is not responding), ISE should always drop the request, regardless of whether you configure it to REJECT or DROP, and thus send no ACCESS-REJECT back to the NAD. I understand that you may see this as a major drawback, but in the big picture it is for the greater good. I agree that this action could, and arguably should, be configurable (so you can choose between REJECT and DROP and make the call yourself), but DROP is, from my point of view, the correct decision.

Imagine that you have many NADs in your network, all pointing to at least two RADIUS servers, and both servers authenticate against the same backend AD, or one of them against AD and the other against something like an LDAP directory. If the primary RADIUS server used by your NADs lost its connectivity to AD or LDAP, so that the process fails, and it sent ACCESS-REJECT back to the NAD, the NAD would never fail over to the second configured RADIUS server; it would keep sending ACCESS-REQUEST packets to the primary. In that case I see no possible way for the NAD to fail over to another RADIUS server, other than redesigning RADIUS with additional message types that would let the NAD make a different decision, such as failover. That is far from where we stand now: it would mean extending RADIUS not through VSAs but through a complete redesign, which is not easy and needs meaningful reasons. If it were easy, then instead of coming up with the RADIUS CoA RFC, which essentially just swaps the roles in RADIUS, they would have changed how RADIUS itself works to achieve the same thing. So ISE behaving as if it were dead, as if it were not there, is a good choice, because it lets the NAD fail over to another configured RADIUS server. In the end, the failover needs to happen on the NAD, not on the RADIUS server.
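The failover argument above can be illustrated with a minimal Python sketch. It is a simplified model, not how any particular NAD is implemented: the function names and the `None`-means-no-reply convention are assumptions made for illustration.

```python
from typing import Callable, Optional, Sequence

# Hypothetical reply values; None models a dropped request (no reply at all).
ACCEPT, REJECT = "Access-Accept", "Access-Reject"


def authenticate(servers: Sequence[Callable[[str], Optional[str]]], user: str) -> str:
    """Try each configured RADIUS server in order, as a simplified NAD would."""
    for server in servers:
        reply = server(user)
        if reply is not None:
            # Any reply, including Access-Reject, is final: no failover happens.
            return reply
        # No reply within the timeout: fall through to the next server.
    return "no server responded"


def primary_rejects(user: str) -> Optional[str]:
    return REJECT  # backend is down, but the server answers anyway


def primary_drops(user: str) -> Optional[str]:
    return None  # backend is down, and the server stays silent


def secondary(user: str) -> Optional[str]:
    return ACCEPT  # healthy server with a working backend


print(authenticate([primary_rejects, secondary], "alice"))  # Access-Reject
print(authenticate([primary_drops, secondary], "alice"))    # Access-Accept
```

In this model a reject from the primary ends the attempt, while silence lets the request reach the secondary.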

However, there is a workaround: tune the ASA RADIUS server settings so that the ASA marks ISE as dead quickly enough and marks it alive again quickly enough (or not so quickly, depending on what you want to achieve). If you have no backup RADIUS server, all you care about is the ASA reactivating its RADIUS server fast. With the settings below, the ASA should mark ISE as dead after 6 seconds, keep it dead for 1 minute, and then reactivate it.

aaa-server XXX protocol radius
 ! minutes after being marked dead before the server is reactivated
 reactivation-mode depletion deadtime 1
 ! number of failed, unanswered attempts before the server is marked dead
 max-failed-attempts 2
aaa-server XXX (NAMEIF) host a.a.a.a
 ! seconds to wait for a RADIUS reply
 timeout 3
 ! seconds to wait before retrying an unanswered RADIUS request
 retry-interval 3
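The timer arithmetic behind these values can be sketched as follows. This is a rough model under two assumptions that are not confirmed for every ASA release: each failed attempt waits out one full `timeout`, and the group contains a single server, so the deadtime starts as soon as that server is marked dead. The function names are hypothetical.

```python
def seconds_until_dead(timeout_s: int, max_failed_attempts: int) -> int:
    """Assumes every failed attempt waits out one full timeout before it counts."""
    return timeout_s * max_failed_attempts


def outage_window(timeout_s: int, max_failed_attempts: int, deadtime_min: int):
    """Return (seconds until marked dead, seconds until reactivated)."""
    detect = seconds_until_dead(timeout_s, max_failed_attempts)
    return detect, detect + deadtime_min * 60


# Values from the configuration above: timeout 3, max-failed-attempts 2, deadtime 1.
print(outage_window(3, 2, 1))  # (6, 66)
```

Under these assumptions the server is marked dead after 6 seconds and is back in service about 66 seconds after the first unanswered request.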

Regards,

Cristian Matei.

dacabrer · ‎03-26-2020

Pavel,

You can configure a timeout on ISE for the external RADIUS server. That way ISE will reply with an Access-Reject after that time, just before the timeout expires on the ASA.

Regards,

Daniel

pnavratil · ‎03-26-2020

Unfortunately, that is not the case.

I just tested it with:

ASA: RADIUS timeout: 50 seconds

ISE: External RADIUS Timeout: 10 seconds

ISE detected that the external RADIUS server was dead, and since I have only one configured, there was no other server to send the request to. Even so, after 50 seconds the ASA marked its RADIUS server (ISE) as dead:

Severity  Date         Time      Syslog ID  Message
6         Mar 27 2020  00:25:58  113014     AAA authentication server not accessible : server = 172.22.1.205 : user = *****
2         Mar 27 2020  00:25:58  113022     AAA Marking RADIUS server ise.xxxxxxxx.cz in aaa-server group ISE-RAD as FAILED

So even though the timeout on ISE was much shorter than on the ASA, the problem still occurred. Simply put, ISE does not send an Access-Reject when the external server times out.

I use ISE 2.6 patch 5.

dacabrer · ‎03-26-2020

Hi,

Try changing the advanced option in your authentication policy to reject instead of drop.

Regards,

Daniel

pnavratil · ‎03-27-2020

I just tested it. It does not work either; again, no Access-Reject is sent from ISE.

pnavratil · ‎03-30-2020

OK, I get your point that this is important for NAD failover, but AD authentication is still a little different from a RADIUS proxy.

Imagine this.

A customer has a big ISE deployment, with 5 PSNs in the network for HA. They configured WiFi access to authenticate against the eduroam network.

eduroam (education roaming) is the secure, world-wide roaming access service developed for the international research and education community.

So ISE is configured as a RADIUS proxy: it forwards all requests to a RADIUS server that serves as the entry point to the eduroam authentication network. Be aware that there are thousands of RADIUS servers in this network.

Now imagine the situation: a user tries to authenticate to the customer's WiFi (the same applies to a wired network) with eduroam credentials, but the remote RADIUS server in the eduroam network that should authenticate the request is not responding (for whatever reason). As a result, the customer's NAD marks all of the ISE PSNs as dead, one by one, so the customer's whole investment in an HA ISE deployment is ruined by one malfunctioning remote RADIUS server.

This is not a hypothetical example. It really happened, and from the ISE configuration side there is no way out.

The customer solved the problem by implementing the functionality I am asking for on ISE on the RADIUS server in eduroam instead (Radiator, to be exact), so that it sends an ACCESS-REJECT after the timeout.
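The cascade described above can be sketched as a small Python simulation. It is a deliberately simplified model: the `Nad` class, the `tries` threshold, and the one-attempt-per-server behavior are assumptions for illustration, and real NADs usually retransmit to the same server before failing over.

```python
from dataclasses import dataclass, field


@dataclass
class Nad:
    """Simplified NAD: a server is marked dead after `tries` unanswered requests."""
    servers: list
    tries: int = 3
    misses: dict = field(default_factory=dict)
    dead: set = field(default_factory=set)

    def send(self, realm_ok: bool, reject_on_timeout: bool) -> str:
        for name in self.servers:
            if name in self.dead:
                continue
            if realm_ok:
                self.misses[name] = 0
                return "Access-Accept"
            if reject_on_timeout:
                # The proxy answers even though the remote realm is silent.
                self.misses[name] = 0
                return "Access-Reject"
            # The proxy drops the request, so the NAD only sees silence.
            self.misses[name] = self.misses.get(name, 0) + 1
            if self.misses[name] >= self.tries:
                self.dead.add(name)
        return "no live server"


psns = [f"psn{i}" for i in range(1, 6)]

nad = Nad(servers=psns)
for _ in range(3):  # one user from an unresponsive realm retries three times
    nad.send(realm_ok=False, reject_on_timeout=False)
print(sorted(nad.dead))  # all five PSNs are now marked dead
print(nad.send(realm_ok=True, reject_on_timeout=False))  # no live server
```

If the same loop is run with `reject_on_timeout=True`, every request is answered with Access-Reject and no PSN is marked dead in this model.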

Cristian Matei · ‎03-30-2020

Hi,

If HA is deployed properly, you will end up being authenticated as long as one RADIUS server is up and running. It depends on where you want to put the complexity:

1. The NAD has a single RADIUS server configured (ISE, for example), and ISE integrates with several other user databases for HA (other RADIUS servers, AD, etc.). In this case the complexity is on ISE: it needs to detect an unresponsive server and converge to the responsive ones. If no backend database is reachable to authenticate the user, nothing else really matters; the NAD marks the RADIUS server as dead, and you configure a deadtime in order to start over.

2. The NAD has multiple servers configured for redundancy, and maybe these servers integrate further with other databases/servers. In this case, each RADIUS server converges between its externally defined servers, and if none are available, the NAD converges to the next configured RADIUS server.

From my point of view, if you deploy HA and at least one server is available to validate the credentials, authentication will succeed in the end; it is just a matter of understanding your design and configuring appropriate timers for failover. So in your case, why is there a single RADIUS server defined in ISE when there are many? Just as you have multiple ISE PSNs for redundancy, you have to integrate ISE with multiple external RADIUS servers for redundancy. And you have to accept that if the remote eduroam RADIUS servers are not reachable and you have no alternative mechanism to authenticate the users, there is nothing to be done. I don't really see where the problem is: how do you expect authentication to work if none of the servers owning the credentials are reachable?

If you could better explain your desired outcome and current implementation, we'll find a solution.

Regards,

Cristian Matei.

jasonm002 · ‎10-03-2024

What most people don't understand about eduroam is it involves proxying an authentication to a set of proxies. So what happens is:

User shows up at a university campus using credentials not of that institution, i.e. their credentials are user@some-other-institution.edu

ISE running at that campus then uses an external RADIUS server sequence to do proxy authentication for that user

The external RADIUS server sequence is almost always responsive, but the problem here is this external RADIUS server sequence is a set of two RADIUS proxy servers itself, so this set of servers takes the request from user@some-other-institution.edu and forwards it to the RADIUS servers responsible for some-other-institution.edu.

If the servers responsible for some-other-institution.edu don't respond at all, then the timeouts for both servers in the local institution's ISE external RADIUS server sequence will be hit, which causes ISE to drop the request.

So the end result of this is if you have a user show up randomly at your campus from some-other-institution.edu whose home RADIUS infrastructure is not responding at all, you will potentially hit the dead criteria on your NADs (e.g. Cisco WLCs) even though all ISE PSNs are up locally and all of the RADIUS servers in your external RADIUS server sequence list are responding.

If you're using Cisco WLCs, one workaround is to remove the dead time in the RADIUS config on your WLCs and then configure automated probing of dead RADIUS servers with "automate-tester probe-on" in the RADIUS server definitions in the WLC config. Make sure the WLCs are on at least 17.9.5. See the documentation for the automate-tester probe-on configuration; it requires ISE (or whatever NAC you use) to be set up to handle the probes correctly.

I've observed the behavior in such a setup and the behavior you should actually get in such a scenario (at least in 17.9.5 or later) is: if ISE server is marked dead, WLC sends a RADIUS probe to the RADIUS server (e.g. ISE node) and confirms it's still alive, then immediately marks it alive once it gets the probe response. If the RADIUS server/ISE node is marked dead and it does not respond to the RADIUS probe from the WLC then it stays dead regardless of the dead time being unconfigured, so you still get failover locally. If no dead time is configured then automate tester config should try to probe the RADIUS server marked dead every 60 sec (see https://community.cisco.com/t5/network-access-control/not-able-to-configure-automate-tester-with-idle-time-and-probe/td-p/3791310).
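The behavior described above can be modeled as a tiny state timeline in Python. This is only a sketch of the description, not of the WLC implementation: the function name is hypothetical, and it assumes one probe per 60-second interval, counted from the moment the server is marked dead.

```python
def probe_timeline(probe_answers, probe_interval_s=60):
    """Return (seconds since marked dead, state) pairs for a server marked dead at t=0.

    probe_answers: one bool per probe, True if the server answered that probe.
    """
    timeline = [(0, "dead")]
    elapsed = 0
    for answered in probe_answers:
        elapsed += probe_interval_s
        if answered:
            # A probe response marks the server alive immediately.
            timeline.append((elapsed, "alive"))
            break
        # No response: the server stays dead until a later probe succeeds.
        timeline.append((elapsed, "dead"))
    return timeline


print(probe_timeline([True]))
# [(0, 'dead'), (60, 'alive')]
print(probe_timeline([False, False, True]))
# [(0, 'dead'), (60, 'dead'), (120, 'dead'), (180, 'alive')]
```

A healthy ISE node that was marked dead only because of a silent remote realm recovers at the first answered probe, while a node that is really down stays dead.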

I am working with TAC on this at the moment. In my opinion, if you have an authentication policy set in ISE that depends on an external RADIUS sequence and the authentication option for process failure is set to "REJECT", then ISE should send an access-reject if all servers in the external RADIUS sequence in that policy set time out.

chrisnoon11 · ‎01-03-2025

Jason,

May I ask what the end result of your TAC case was? I am having a similar situation with an ASA using ISE for external RADIUS, and the servers are being marked FAILED despite having REJECT set in the auth options in the appropriate external-ID sequence. They eventually recover according to the reactivation timer, but this results in a period of time where none of the servers are ACTIVE. Unexpected behavior from ISE in this scenario would go a long way toward explaining my headache.