Solved: Radius Server Health Check in distributed ISE deployment

floriandarras · ‎01-04-2024

Hello Everyone,

We are working on an ISE deployment with 2 PAN/MNT nodes (located in the cloud) and 5 PSNs located across different physical sites. We are searching for a way to enhance the detection of a failing PSN onsite by the switches to switch to another one more efficiently, but I am not able to find the information I am looking for online. I have even opened a TAC case, but I am not really confident in the answer so I thought I would give it a shot here.

What we are trying to achieve is the following (requirements):

1. Each switch is configured with 3 RADIUS servers, the primary being the onsite PSN and the 2 others being PSNs located on other sites.
2. In normal conditions, all RADIUS requests are forwarded to the local PSN (this is to make sure latency is as low as possible during authentication).
3. Upon failure of the local PSN (downtime, service crash, etc.), the switch proactively detects it and begins forwarding the requests to another configured PSN until the server is detected alive again.

From what I understood from the documentation and the exchanges with the TAC, there is an automated-tester functionality, but it is only available when using RADIUS server load balancing, which would not meet our second requirement (all requests go to the local PSN in normal conditions).

There is also the radius-server deadtime command, but it only monitors actual authentications from clients, cannot detect that the server is alive again, and relies solely on a timer. This means the switch cannot detect an issue before a user tries to authenticate, so a user might time out or run into authentication issues.

We are working with another brand of switches that provides an active health-check by probing the server with a dummy username/password request and as long as it receives access-reject response it considers the server as alive. If the server stops responding, the switch knows the server is down and switches to another server or local authentication. We are trying to achieve something similar with our Cisco switches.

Does anyone here have a similar setup and has found a way to achieve what we are trying to do?

For reference, all our switches are catalyst 9000 (either 9200, 9300 or 9500).

MHM Cisco World · ‎01-08-2024

If you add the command automate-tester username <dummy username> probe-on to the RADIUS server configuration section, test RADIUS authentications (using the dummy username you entered) are sent to the RADIUS server only when it is marked dead to see if it is back alive. If you configure automate-tester username <dummy user> idle-time <minutes>, the controller sends the test authentication every “idle-time” period even when the server is alive (which can be useful to detect whether it goes dead when there are no authentications ongoing). The automate tester considers the server to be alive if it receives any reply from the server; the tester does not need to receive a successful authentication result (especially because no password was configured). Just make sure the RADIUS server does not ignore such a plaintext PAP authentication, which can sometimes be the case with a default configuration.


floriandarras · ‎01-12-2024

Hello Everyone,

I tested this extensively in a lab to try and comprehend which command does what and what timers are used.

From what I tested, here are my conclusions:

automate-tester username <user> ignore-acct-port idle-time <minutes>
- Tests the RADIUS server every idle-time minutes to check whether it is alive. This makes it possible to detect a RADIUS server going down or coming back up without any actual authentication on the switch. It works even if radius-server deadtime is not configured. This is the option we were searching for.
radius-server deadtime <minutes>
- Allows a RADIUS server to be marked as down if the dead criteria are met (not covered here). The mechanism needs an actual authentication to be triggered; the switch will not proactively detect that a server is down. Without further configuration, the server is marked up again when the timer expires, even if it is still down, and another actual authentication is needed to detect that it is still down.
automate-tester username Dummy ignore-acct-port probe-on
- Can be configured in addition to radius-server deadtime to prevent a server from being marked up again while it is actually still down. Once the server has been marked down by the dead criteria and the deadtime expires, the automate-tester sends a request to the server. If the server does not answer, it remains marked as down; if an answer is received, it is marked as up again.
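
Putting these conclusions together, a minimal sketch of a configuration aimed at proactive detection could look like the following. It is an illustration under assumptions, not a validated template: the server names, addresses, and test username are reused from the lab examples in this thread, the 1-minute idle-time is an arbitrary value chosen for the example, and the shared secret (key) is omitted as in the original posts.

```
! Global dead-detection settings (values taken from the lab posts in this thread)
radius-server dead-criteria time 10 tries 3
radius-server deadtime 15
!
! Local PSN, listed first so it is preferred.
! idle-time 1 is an assumed value: probe every minute, even while the server is UP.
radius server server1
 address ipv4 1.1.1.1 auth-port 1812 acct-port 1813
 automate-tester username testuser ignore-acct-port idle-time 1
!
! Remote PSN used as fallback, probed the same way
radius server server2
 address ipv4 1.1.1.2 auth-port 1812 acct-port 1813
 automate-tester username testuser ignore-acct-port idle-time 1
!
aaa group server radius radius_servers
 server name server1
 server name server2
```

A shorter idle-time means faster detection but more test authentications hitting the PSNs, so the value is worth tuning per deployment.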

I hope this summary is clear and might be useful to anyone searching on here to achieve what we wanted to achieve ourselves.

Thank you very much to everyone who participated in this discussion, much appreciated!

Special thanks to @MHM Cisco World for the last comment, that made things click and I was able to test everything in my lab.
However, I cannot stress this enough, and I join @Arne Bier in saying that proper documentation with a flow diagram would be much clearer and appreciated. I remain a bit frustrated that these functionalities are not better documented in Cisco's own official documentation.


PSM · ‎01-04-2024

@floriandarras automate-tester under radius server will do the job you want. It does exactly what you described for the other brand of switches. Here is a sample config:

---------------------------------------------
radius server server1
 address ipv4 1.1.1.1 auth-port 1812 acct-port 1813
 automate-tester username testuser ignore-acct-port probe-on
!
radius server server2
 address ipv4 1.1.1.2 auth-port 1812 acct-port 1813
 automate-tester username testuser ignore-acct-port probe-on
!
aaa group server radius radius_servers
 server name server1
 server name server2
 deadtime 5
-------------------------------------------------

In this scenario, the switch will always send authentication requests to server1 until server1 becomes unavailable.

With the automate-tester configuration, the switch sends synthetic RADIUS authentication requests, and as long as it gets a RADIUS reply back from server1 (accept or reject, it doesn't matter), it keeps using server1. If the switch doesn't get a reply, it marks server1 dead and starts using server2.

MHM Cisco World · ‎01-04-2024

In normal conditions, all radius requests are forwarded to the local PSN (this is to make sure latency is as low as possible during authentication).

If you configure the local PSN as the first server, the switch always sends to it, and the automated tester checks it first.
The key point here is to configure the PSNs under the RADIUS server group.

https://www.cisco.com/c/en/us/support/docs/security-vpn/remote-authentication-dial-user-service-radius/200403-AAA-Server-Priority-explained-with-new-R.html

MHM

thomas · ‎01-04-2024

@PSM 's suggestion is also documented in our ISE Secure Wired Access Prescriptive Deployment Guide under Best Practice Global Settings for Switch > RADIUS Server Failure Detection

floriandarras · ‎01-05-2024

Hello everyone,

Thank you @PSM and @thomas for your feedback, much appreciated! Too bad that TAC information was not on point.

I have just replicated this in a lab setup to confirm. According to my tests, setting only the automate-tester doesn't work; you also need to define the radius-server dead-criteria and radius-server deadtime (deadtime is not mandatory, as it already has a default value that you can see in show run all).

So in the end my config is just a tad different than the one suggested by @PSM :

radius-server dead-criteria time 10 tries 3
radius-server deadtime 15

radius server server1
address ipv4 1.1.1.1 auth-port 1812 acct-port 1813
automate-tester username testuser ignore-acct-port probe-on
!
radius server server2
address ipv4 1.1.1.2 auth-port 1812 acct-port 1813
automate-tester username testuser ignore-acct-port probe-on

aaa group server radius radius_servers
server name server1
server name server2

However, doing some more testing, I saw that on a switch with nothing connected to it, and without attempting any authentication, the server status never changes to dead.

To make the server unreachable, I am applying the following access-list on the vlan interface of my switch (ip access-group 1 in):

Standard IP access list 1
10 deny 1.1.1.1
15 deny 1.1.1.2
20 permit any log (1643 matches)

Upon application of the access-list, if I do not attempt any authentication, the status of the server always remains UP (the second server is always down at the moment in my lab and I am reverting to local authentication for the ssh session on the switch to test):

SW#show aaa servers | incl State:
State: current UP, duration 1755s, previous duration 900s
State: current DEAD, duration 5145s, previous duration 2030s

From my understanding, the automate-tester should automatically test in the background and change the status of the server if it is no longer reachable, isn't that correct? But maybe my show command is not the most appropriate one (which I doubt, because authenticating to the switch still tries the first server first, and as soon as the local login is accepted the server is marked as dead)?

adamscottmaster2013 · ‎01-05-2024

There is a much easier solution than this. Just set up open-source Nagios to check RADIUS authentication on all the PSN nodes. You can set the time interval as low as 5 seconds. You will get an email or text message if RADIUS authentication fails. That's what I do for my ISE environment.

floriandarras · ‎01-05-2024

Hello adamscottmaster2013,

This would indeed let us know if the servers are up or down, which is something we will implement, but does not automatically force the switch to use one server or the other as authentication source nor optimize the authentication time if the first server in the list is down.

adamscottmaster2013 · ‎01-05-2024

hi @floriandarras: I see your point. In that case, just put your PSN nodes behind an F5 LTM and that will solve your issue.

floriandarras · ‎01-05-2024

@thomas Reading the ISE Secure Wired Access Prescriptive Deployment Guide again, I see that the automate-tester does not behave as I thought it would. Is its only purpose to bring the server back up again before the deadtime is reached? Would it not proactively detect that the server is down?

In my testing, I saw that the server seemed to come back after 10 minutes. I cannot find the frequency of the tests performed by the automate-tester anywhere. Is that documented somewhere? Is there a way to modify and customize it?

thomas · ‎01-05-2024

An internet search for cisco ios automate-test command returned Command Reference, Cisco IOS XE Bengaluru 17.4.x (Catalyst 9600 ... with the two commands:

Did you turn the probe-on?

floriandarras · ‎01-08-2024

@thomas Yes, I used the probe-on, you can see the configuration I used in my lab in my earlier post:

radius server server1
address ipv4 1.1.1.1 auth-port 1812 acct-port 1813
automate-tester username testuser ignore-acct-port probe-on

Arne Bier · ‎01-05-2024

I have also run into this issue in the past and the results have been hit and miss. My experience was that the server stayed "dead" until the next endpoint authentication was triggered, or if I triggered a "test aaa" command manually. I always use the probe command too. Perhaps I just got a buggy IOS version. I will keep an eye on this.

Either way, this is a commonly asked question, and a flow diagram would be a nice way to understand the timers and triggers involved.
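
For anyone reproducing this kind of test, the check-and-trigger steps mentioned in this thread can be sketched as the following exec commands. The group name and test username come from the sample configs earlier in the thread; the password is a hypothetical placeholder, since the thread does not define one.

```
! Show the current state (UP or DEAD) of each configured RADIUS server
show aaa servers | include State:
!
! Manually send one authentication to the server group.
! "dummy-password" is a placeholder; a reject still shows that the server answers.
test aaa group radius_servers testuser dummy-password new-code
!
! Check the state again to see whether the manual test changed it
show aaa servers | include State:
```

Comparing the state before and after the manual test helps separate what the automate-tester does on its own from what is triggered by a real or manual authentication.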

floriandarras · ‎01-08-2024

In my lab, the server came back up as "alive" after 10 minutes, with no "real" authentication going on, so the probe seemed to be working to mark the server as alive when it answers again after being marked dead.

But I thought the automate-tester would be able to detect that the server is dead without needing an endpoint or administrative authentication.

MHM Cisco World · ‎01-08-2024

@floriandarras wrote:

In my lab, the server came back up as "alive" after 10 minutes, with no "real" authentication going on, so the probe seemed to be working to mark the server as alive when it answers again after being marked dead.

But I thought the automate-tester would be able to detect that the server is dead without needing an endpoint or administrative authentication.

Automate-tester and probe-on: both need the switch to send messages to the AAA server in order to check whether it is available.

If there is no new user authentication, the switch stays silent and does not send anything.

So we need to make the switch send messages to the AAA server regularly:

1- Make the reauthentication timeout small. This creates other issues and adds more work for the switch.

2- Configure the AAA server for both authentication and accounting, and configure the switch to send periodic accounting.

This way the switch always sends messages to the server, and the automate-tester works with probe-on.

I hope my suggestion works.
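
As a rough sketch of option 2, periodic accounting could be enabled along these lines. This assumes aaa new-model is already enabled and reuses the radius_servers group from earlier in the thread; the 5-minute interval is an arbitrary value for illustration. Interim updates are only sent while sessions are active, so this does not generate traffic on a switch with no authenticated endpoints.

```
! Send 802.1X accounting records to the same RADIUS server group
aaa accounting dot1x default start-stop group radius_servers
!
! Send interim accounting updates for active sessions.
! The 5-minute period is an assumed value, not a recommendation.
aaa accounting update newinfo periodic 5
```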

MHM
