Better RADIUS server dead detection?

bricrock · ‎06-30-2017

This may be a question better suited for the EN team (or maybe I've missed some documentation), but how can we achieve a more accurate definition of RADIUS server availability beyond listening on 1812/1645 and 1813/1646? That is, if ISE is using an external identity store (AD, SQL) to authenticate and authorize users/devices, just having the PSN online isn't sufficient; it also needs connectivity to that external store and must be able to perform lookups against it.

Much like a web server that is still listening on 8080 while the Tomcat process is hung to the point of not being able to render a web page, port availability does not constitute service/application functionality. It would seem that we could achieve this kind of check with something like F5 LTM health checks, but it would be nice to have this available directly in IOS.

Thoughts?

Thank you,

Brian

paul · ‎06-30-2017

Dead server detection is already built into IOS:

New style command structure:

radius server <radius server name>
 address ipv4 <IP address> auth-port 1812 acct-port 1813
 key 0 <RADIUS KEY>
 automate-tester username <radius test username> ignore-acct-port idle-time 5

Legacy command structure:

radius-server host <IP address> auth-port 1812 acct-port 1813 test username <radius test username> ignore-acct-port idle-time 5 key <RADIUS Key>

I typically don't put a username/password on the switch, which means all I am testing is ISE's ability to process the RADIUS transaction. If you wanted to test all the way to AD, the test username could be an AD service account.

bricrock · ‎07-05-2017

Thanks, Paul.

I'm familiar with the "automate-tester" construct, but, from the documentation: "With this practice, the switch sends periodic test authentication messages to the RADIUS server. It looks for a RADIUS response from the server. A success message is not necessary - a failed authentication will suffice, because it shows that the server is alive."

I'm looking for a way for the RADIUS server to be marked as "dead" when the automated test fails. For example, if I specify an AD user for the test, and that user cannot be successfully authenticated because AD cannot be reached for some reason, mark the RADIUS server as "dead" for whatever duration I've configured.

I realize this isn't an RFC-level requirement, as you don't have to use an external ID store. But for all the customers I interact with, AD is that identity source, and it would be helpful for the interface-level "authentication event server dead" mechanisms to engage as soon as AD is unreachable.
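To make the distinction concrete, here is a minimal Python sketch of the two liveness rules being contrasted. It is a simplified model, not IOS's actual implementation: `Reply`, `alive_any_response`, and `alive_accept_only` are hypothetical names, and a probe timeout is modeled as `None`.

```python
from enum import Enum
from typing import Optional


class Reply(Enum):
    ACCESS_ACCEPT = "Access-Accept"
    ACCESS_REJECT = "Access-Reject"


def alive_any_response(reply: Optional[Reply]) -> bool:
    """Rule described in the quoted documentation: any reply means alive."""
    return reply is not None


def alive_accept_only(reply: Optional[Reply]) -> bool:
    """Hypothetical stricter rule: only a successful authentication counts."""
    return reply is Reply.ACCESS_ACCEPT


# None models a test probe that timed out with no reply at all.
for reply in (Reply.ACCESS_ACCEPT, Reply.ACCESS_REJECT, None):
    label = reply.value if reply is not None else "timeout"
    print(
        f"{label:14} any-response={alive_any_response(reply)} "
        f"accept-only={alive_accept_only(reply)}"
    )
```

The two rules differ only in the Access-Reject case, which is exactly the case where the PSN is up but cannot authenticate the test user.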

Or is there a better approach?

paul · ‎07-05-2017

Brian,

There are three failure conditions in the ISE authentication phase. I believe that if your authentication policy is tied to AD only and the PSN can't connect to AD properly, that would be the "process failed" condition and should result in a drop. I haven't tested that to confirm.
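As a rough illustration of why the "process failed" action matters to the switch, the Python sketch below models the three failure conditions and the action taken for each. The names are hypothetical, and the mapping in `POLICY` is an assumption (what I understand the usual defaults to be), so check it against your own authentication policy.

```python
from enum import Enum


class Outcome(Enum):
    AUTH_FAILED = "authentication failed"
    USER_NOT_FOUND = "user not found"
    PROCESS_FAILED = "process failed"


class Action(Enum):
    REJECT = "send Access-Reject"
    DROP = "send nothing"
    CONTINUE = "continue to authorization"


# Assumed policy for illustration only; verify on your deployment.
POLICY = {
    Outcome.AUTH_FAILED: Action.REJECT,
    Outcome.USER_NOT_FOUND: Action.REJECT,
    Outcome.PROCESS_FAILED: Action.DROP,
}


def nad_sees_response(outcome: Outcome, policy: dict = POLICY) -> bool:
    """True when the NAD receives some RADIUS reply for this outcome.

    In this model, Continue always ends in a reply from authorization,
    so only Drop leaves the NAD waiting until its timeout expires.
    """
    return policy[outcome] is not Action.DROP


for outcome in Outcome:
    print(f"{outcome.value:22} -> NAD gets a reply: {nad_sees_response(outcome)}")
```

Only an outcome mapped to Drop leaves the NAD without a reply, which is the precondition for its dead-server logic to engage. If "process failed" were mapped to Reject instead, the NAD would keep treating the PSN as alive.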

Paul Haferman

umahar · ‎07-05-2017

You brought up a valid point, Brian.

I had a customer who accidentally deleted the ISE PSN service accounts for one location.

I would have expected that, since the PSN could not query AD, no response would be sent to the switch, and the switch would mark the PSN dead and fail over to the remote PSN. Instead, the PSN sent a RADIUS Reject and was not marked down.

As a result, all endpoints were denied access and the whole site went down.

hslai · ‎07-08-2017

That looks related to CSCva32914.

bricrock · ‎07-10-2017

I appreciate the contributions to this thread.

@Hsing-Tsu, that bug is listed as fixed in ISE 2.1 (which the customer is running).

Given the observed behavior at my customer, in conjunction with that seen at @Utkarsh's, it would seem there is something critically missing in the reachability check of a NAD to ISE to AD for the purposes of RADIUS AuthC and AuthZ. We are telling our customers to put their trust in ISE for all network access; we are positioning ISE at the center of "The Network. Intuitive"; yet, we cannot provide a robust mechanism for identity store failure.

Some environments may be ok with an inability to authenticate to the network when there is a problem; but a healthcare or manufacturing customer needs to be able to have a proper business continuity configuration when/if ISE or its external identity store is (verifiably) unreachable.

If we can't do this natively, are there any other ways to get the NAD to properly mark a PSN as down when AD is not available? TCL script?

umahar · ‎07-10-2017

Just to clarify, I saw this issue a year ago on ISE 1.4.

I guess you should test it on ISE 2.1.

scarabaus · ‎05-13-2019

Hello bricrock,

Are you aware of a suitable solution for the lack of an external identity store (AD) check on a NAD that does RADIUS against Cisco ISE?

hslai · ‎05-17-2019

Make sure the authentication policy is configured to drop in case of process failure. If that still does not help, please open a TAC case to investigate and troubleshoot further.