Solved: RADIUS Server Health Check in distributed ISE deployment

floriandarras · 01-04-2024

Hello Everyone,

We are working on an ISE deployment with 2 PAN/MNT nodes (located in the cloud) and 5 PSNs located across different physical sites. We are looking for a way for the switches to detect a failing onsite PSN sooner and fail over to another PSN more efficiently, but I have not been able to find the information I need online. I have even opened a TAC case, but I am not really confident in the answer, so I thought I would give it a shot here.

What we are trying to achieve is the following (requirements):

1. Each switch is configured with 3 RADIUS servers, the primary being the onsite PSN and the 2 others being PSNs located at other sites.
2. Under normal conditions, all RADIUS requests are forwarded to the local PSN (this keeps latency as low as possible during authentication).
3. Upon failure of the local PSN (downtime, service crash, etc.), the switch proactively detects it and forwards requests to another configured PSN until the local server is detected as alive again.
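
As a rough illustration of requirements 1 and 2, an ordered server group without load balancing could look like the sketch below. This is not the actual configuration from this deployment: the server names, the group name ISE-PSN, the IP addresses (documentation ranges) and the shared secret are all placeholder assumptions.

```text
! Placeholder names, addresses and key: replace with your own values.
aaa new-model
!
radius server PSN-LOCAL
 address ipv4 192.0.2.10 auth-port 1812 acct-port 1813
 key EXAMPLE-SHARED-SECRET
!
radius server PSN-REMOTE-1
 address ipv4 198.51.100.10 auth-port 1812 acct-port 1813
 key EXAMPLE-SHARED-SECRET
!
radius server PSN-REMOTE-2
 address ipv4 203.0.113.10 auth-port 1812 acct-port 1813
 key EXAMPLE-SHARED-SECRET
!
! Servers are listed in order of preference. No load-balance method is
! configured, so requests go to the first server that is not marked dead.
aaa group server radius ISE-PSN
 server name PSN-LOCAL
 server name PSN-REMOTE-1
 server name PSN-REMOTE-2
!
aaa authentication dot1x default group ISE-PSN
aaa authorization network default group ISE-PSN
```

On its own, a configuration like this only covers the server ordering; how quickly the switch notices that PSN-LOCAL is down (requirement 3) depends on the dead-detection settings discussed in the rest of this thread.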

From the documentation and my exchanges with TAC, I understood that there is an automated tester feature, but that it is only available when using RADIUS server load balancing, which would not meet our requirement 2.

There is also the radius-server deadtime command, but it only monitors actual authentications from clients. It cannot detect that the server is alive again and relies solely on a timer. This means that the switch cannot detect an issue before a user tries to authenticate, so a user might time out or run into authentication issues.

We also work with another brand of switches that provides an active health check: it probes the server with a dummy username/password request and, as long as it receives an Access-Reject response, it considers the server alive. If the server stops responding, the switch knows the server is down and fails over to another server or to local authentication. We are trying to achieve something similar with our Cisco switches.

Does anyone here have a similar setup and has found a way to achieve what we are trying to do?

For reference, all our switches are Catalyst 9000 (either 9200, 9300 or 9500).

floriandarras · 01-12-2024

Hello Everyone,

I tested this extensively in a lab to understand which command does what and which timers are used.

From what I tested, here are my conclusions:

automate-tester username <user> ignore-acct-port idle-time <minutes>
- Tests the RADIUS server every idle-time interval to check whether it is alive. This makes it possible to detect a RADIUS server going down or coming back up without needing actual authentications on the switch. It works even if radius-server deadtime is not configured. This is the option we were looking for.
radius-server deadtime <minutes>
- Allows a RADIUS server to be marked as down if the dead criteria are met (not covered here). An actual authentication is needed to trigger the mechanism; the switch will not proactively detect that a server is down. Without further configuration, the server is marked up again once the timer expires, even if it is still down, and another actual authentication is needed to detect that it is still down.
automate-tester username Dummy ignore-acct-port probe-on
- Can be configured in addition to radius-server deadtime to prevent a server from being marked up again while it is actually still down. When the server has been marked down by the dead criteria and the deadtime expires, the automate-tester sends a request to the server. If the server does not answer, it remains marked as down; if an answer is received, it is marked as up again.
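
Put together, the commands above could be combined roughly as in the sketch below. This is a minimal example under stated assumptions, not the exact lab configuration: the server name, address, key, test username and all timer values are placeholders, and the dead-criteria line is included only to make the sketch complete (dead criteria are not covered in this summary).

```text
! Placeholder values throughout; adjust names, addresses and timers.
!
! Reactive detection: mark a server dead after 5 seconds without a
! response and 3 consecutive timeouts, then keep it dead for 10 minutes.
radius-server dead-criteria time 5 tries 3
radius-server deadtime 10
!
! Proactive detection: probe the server with a test user after 1 minute
! of idle time. To probe only when the deadtime expires, the idle-time
! form can be replaced with the probe-on form shown in the list above.
radius server PSN-LOCAL
 address ipv4 192.0.2.10 auth-port 1812 acct-port 1813
 key EXAMPLE-SHARED-SECRET
 automate-tester username dummy-probe ignore-acct-port idle-time 1
```

The test user should not need to exist on ISE, since any response, including an Access-Reject, indicates that the server is alive. While shutting down a PSN in a lab, `show aaa servers` can be used to watch the server state change between UP and DEAD.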

I hope this summary is clear and useful to anyone searching here for a way to achieve the same thing.

Thank you very much to everyone who participated in this discussion, much appreciated!

Special thanks to @MHM Cisco World for the last comment, which made things click and allowed me to test everything in my lab.
However, I join @Arne Bier in stressing that proper documentation with a flow diagram would be much clearer and much appreciated. I remain a bit frustrated that these features are not better covered in Cisco's own official documentation.