Solved: ISE: possibility to temporarily disable answering RADIUS Requests

ffischer · ‎01-17-2024

Hello,

consider a deployment where endpoints authenticate with EAP-TLS
and ISE is using AD integration for retrieving and checking group membership of authenticating hosts.

While registering new (or re-imaged) nodes to such ISE deployments
we always run into 2 problems:

Just after the registration, restart and sync,
the new node knows our NADs and begins to answer RADIUS requests.
But the node is not yet joined to AD.
Without AD group membership, it denies the clients trying to authenticate.
Additionally, in the beginning, it uses its self-signed certificate for EAP, which clients don't trust.
This confuses some supplicants to the point that they need a cold boot.

Sometimes, we can use firewalls in place between NADs and ISE nodes
to block RADIUS packets and make the ISE appear unavailable to NADs to avoid such issues.
But often, those firewalls are not available or not easily changeable for us.

Now I would like to be able to specify an optional
"disable answering of RADIUS Requests"
during node registration.

This would permit us to join the new node to the ADs and provision the node certificate from the private PKI
without clients being rejected erroneously.

To avoid TAC requests from admins not understanding that option, it could be off by default...
And/or it could be implemented as "disable answering of RADIUS requests after restart for ____ minutes"

Or is there any other solution?

MHM Cisco World · ‎01-17-2024

Don't configure the dot1x system-auth-control command in global configuration mode yet.
Without it, the NAD does not perform any 802.1X. Once the NAD is configured in ISE, add the command.
MHM

ffischer · ‎01-17-2024

Thanks.
Well, maybe you could do that in a small or lab environment with a low number of switches.
Maybe I forgot to mention that this is a production environment
with nearly 10,000 endpoints distributed across several hundred NADs (switches/WLCs)
over 14 ISE PSN nodes in 7 sites...

MHM Cisco World · ‎01-17-2024

Maybe a critical VLAN can solve your issue. Did you check this solution?
If the PSN returns a failure to the NAD, the NAD will use the critical VLAN and the endpoint still gets access.
Then, once the PSN is integrated with AD, the PSN returns success and assigns the VLAN dynamically.
MHM

ffischer · ‎01-17-2024

Thanks again for your hints.
We have 2 PSNs in every location. The design goal was 100% redundancy.
We have configured the NADs to load balance their requests to both local ISE nodes intentionally.
As soon as one of the ISE answers the requests, the answers must be valid.
If one ISE node is down (= not answering RADIUS requests), the switch fails over to the other.
Wrong "denies" from the ISE, e.g. caused by missing AD connectivity, will cause outages on the endpoints.

I'd suggest not trying to find a 90% workaround in the switch configuration ...
if the issue could be solved 100% at the root...
see the other answers...

MHM Cisco World · ‎01-17-2024

OK, now I get you.
There are two PSNs. If the switch sends to the first one, which is NOT yet integrated with AD, it returns a failure to the switch and the endpoint is disconnected.
So you need a way for a PSN that is not yet fully integrated with AD to not respond to requests, so that the switch tries the other PSN.
Thanks for clarifying.
have a nice day
MHM

Marvin Rhoads · ‎01-17-2024

When you add the new node, don't select the Policy Service persona for it. Only do so later - after you have joined it to AD and applied the desired trusted certificate.

ffischer · ‎01-17-2024

Hello Marvin,
this is indeed an approach I had overlooked...
It should work, it only needs a slight modification:
To start registration I have to enable at least one persona out of "ADM", "MnT", "PSN" or "pxGrid".
(Would be nice if I could register without any and select later what should run...)
Now, ADM or MnT cannot be used as there are already 2 ADM and 2 MnT in the deployment.
So I could try to select only the pxGrid persona,
or only the PSN persona,
but then disable the session services, which should prevent the RADIUS service from listening,
and enable only profiling or PIC services temporarily to get it registered.

Then register the node, join AD, deploy the certificate and switch services...

Let's hope it does not restart the ISE services completely again... every restart takes 15-20 min...

thomas · ‎01-17-2024

An ISE node has the RADIUS service enabled by default because that's its job. The TACACS+ Device Administration service is disabled by default. So really what you need to do when [re-]provisioning an ISE node is turn off the services (Administration > System > Deployment > node_name) until you are ready to bring the node back into service.

To minimize the window of opportunity for a network device that has been configured to use this ISE node to re-discover it and attempt to use it, you could use the ISE Deployment APIs or the respective node_deployment Ansible module to do this for you. We have done many ISE Webinars on ISE REST APIs and automation which are posted to our CiscoISE YouTube channel:

▷ Upgrading ISE in the Cloud with Automation 2023-11-07
▷ Cloud Load Balancing with ISE 2023-06-15
▷ ISE in a Hybrid Cloud Environment 2022-12-06
▷ Automated ISE Provisioning and Patching 2022-11-03
▷ Practical ISE Automation with Ansible 2022-10-06
▷ ISE REST APIs Introduction 2022-10-04
▷ What's New in ISE 3.2 - Part 1 2022-06-02
▷ Automated ISE Setup with Infrastructure as Code Tools 2021-12-07
▷ ISE 3.1 APIs, Ansible, and Automation 2021-07-06
▷ ISE REST APIs 2021-04-06

We have also posted our GitHub repositories with code examples and these should be of most help to you:
https://github.com/1homas/20221004_ISE_REST_APIs_Introduction
https://github.com/1homas/ISE_Provisioning_and_Patching
https://github.com/1homas/ISE_Ansible_Sandbox
https://github.com/ISEDemoLab/Upgrade_ISE_in_Hybrid_Cloud
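
For readers who want a feel for what such an API call could look like, here is a minimal Python sketch using only the standard library. It only builds the request, it does not send it. The endpoint path `/api/v1/deployment/node/{hostname}`, the `roles`/`services` body fields and the service names `Session` and `Profiler` are assumptions based on the ISE OpenAPI documentation; the host names and credentials are made up. Please check the API documentation of your ISE version before relying on any of it.

```python
import base64
import json
import urllib.request


def build_node_update(pan_host, node_hostname, username, password, roles, services):
    """Build (but do not send) a PUT request that sets a node's roles and services."""
    # Assumed endpoint path; verify it against the OpenAPI spec of your ISE version.
    url = f"https://{pan_host}/api/v1/deployment/node/{node_hostname}"
    # Assumed body layout: a list of roles and a list of services.
    body = json.dumps({"roles": list(roles), "services": list(services)}).encode("utf-8")
    token = base64.b64encode(f"{username}:{password}".encode("utf-8")).decode("ascii")
    return urllib.request.Request(
        url,
        data=body,
        method="PUT",
        headers={
            "Authorization": f"Basic {token}",
            "Content-Type": "application/json",
            "Accept": "application/json",
        },
    )


# "Parked" node: no Session service, so it should not serve RADIUS sessions.
parked = build_node_update(
    "pan.example.org", "psn-new", "admin", "secret", roles=[], services=["Profiler"]
)

# Later, after the AD join and certificate import, put Session back.
in_service = build_node_update(
    "pan.example.org", "psn-new", "admin", "secret", roles=[], services=["Session", "Profiler"]
)
```

Sending would then be a matter of `urllib.request.urlopen(parked)` against the PAN, with proper certificate validation in place.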

ffischer · ‎01-17-2024

Hi Thomas,

thanks... I will add the next node with session services disabled... see my answer to Marvin above.

And thanks as well for pointing me to the REST API again...
I discovered that calls for manipulating the nodes' system certificates and trust store have finally been added!
And I am quite confident that I will not perform the next system certificate renewal on 18 nodes in the ISE GUI.

Arne Bier · ‎01-17-2024

I applaud your request because this is also the bane of my existence during an ISE deployment rebuild in a production environment. It won't be an issue for folks with load balancers, because they can exclude the new PSN from the load balancer until the PSN is fully built up. But for the rest of us, we have potentially hundreds or thousands of NADs that could send a Request to an IP address of a PSN that is half-baked. And using a FW to block UDP/1812 and UDP/1813 is often not viable.

Here is my solution. It's not 100% bulletproof, but it buys you some time. As soon as the newly built PSN's Admin GUI becomes active (monitor it closely with "show application status ise" and also have the https page open to show the message that ISE is almost ready), log in as quickly as you can, then head over to the RADIUS protocol section and change the UDP ports from 1645/1646/1812/1813 to a higher value - I just put the digits "10" in front of each value and then click save. That will make the PSN go deaf (drop requests) to any genuine attempts from NAD devices. A dropped request informs the NAD to try another RADIUS server instead.

And there is the other missing piece to make this work: in your NAD devices, configure a dead timer with a hold-down value that is sufficiently large - e.g. 60 minutes. That tells the NAD not to try again until 60 minutes have passed. That gives you enough time to join the AD and get certs installed. Or at least, it gives you an hourly window in which a NAD won't bother the PSN. And remember, one NAD might have 10, 100 or 1000 attached clients - it only takes one client on that NAD to trigger the dead timer. That means that on NAD devices with many clients, only one client may feel a slight delay because there was no response from the RADIUS server. The other clients will benefit from that experience for at least one hour.
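
To illustrate why a dropped request helps while a reject does not, here is a toy Python model of a NAD with an ordered server list and a dead timer. It is a simplified sketch and not how any particular switch implements it: real NADs retransmit and wait for timeouts before marking a server dead, and the class, server names and the 3600-second hold-down are made up for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Nad:
    """Toy model of a NAD with an ordered RADIUS server list and a dead timer."""

    servers: list
    deadtime_s: int = 3600
    dead_until: dict = field(default_factory=dict)

    def authenticate(self, now_s, responses):
        """Return (server, answer) for one request at time now_s.

        responses maps a server name to "accept", "reject" or None (dropped).
        """
        for server in self.servers:
            if now_s < self.dead_until.get(server, 0):
                continue  # still held down: skip this server without waiting
            answer = responses.get(server)
            if answer is None:
                # No reply at all: mark the server dead and fail over.
                self.dead_until[server] = now_s + self.deadtime_s
                continue
            # Any reply, including a reject, is a valid answer: no failover.
            return server, answer
        return None, None


half_baked_rejects = {"psn-new": "reject", "psn-old": "accept"}
half_baked_drops = {"psn-new": None, "psn-old": "accept"}

nad = Nad(servers=["psn-new", "psn-old"])
```

With `half_baked_rejects` the client is rejected by the new PSN and the NAD never tries the other one. With `half_baked_drops` the NAD fails over to `psn-old` and leaves `psn-new` alone until the hold-down expires.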

But my solution is not perfect - it works only as long as the new PSN has not completed its registration sync with the PAN. During the sync-up, the UDP ports on the new PSN will be programmed to the correct values again - when exactly this happens is hard to say.

I have been asking for a kind of "Maintenance Mode" option in ISE to take a PSN out of operation. You see this feature in products like VMware ESXi, and a new ISE node should NOT be enabled (in my opinion) by default to listen on those UDP ports until the operator decides it is OK to do so.

My 2c worth.

MarcusFLey · ‎01-26-2024

What I like to do in this scenario is to prepare a new node with a temporary IP address (but all other settings including Hostname as they should be).

I use this IP address to bring the node up to snuff, e.g. install patches and import certificates. Then, the critical time starts when you configure this node to use the productive IP address.

At this point I see some options:

Block RADIUS towards the node (Firewall or Port-ACL). This is my favorite approach, because it keeps the node isolated as long as it is needed, but it is not always viable.

If that is not possible, you can do a few things to stop the node from answering RADIUS requests as long as possible. In all cases, this will only be effective until it is registered to the deployment.

Change the RADIUS ports, as Arne suggested. I never tried it, but it should work at least until the node is registered to the deployment. It will most likely get the "real" ports synchronized then, I assume.
If you restored a Backup on the node: Configure a RADIUS authentication policy for the internal databases (user and endpoint). All requests should fail, as they are empty at this point. You then set the Advanced Options for this rule to "Drop" for all failures instead of "Reject". The node should then silently Drop all requests and appear "Dead" to the network devices.
If the node is fully "fresh" its Network Device database is empty, so it should "Drop" all Requests as they come in from an unknown NAD (see attached Log entry for this case).

Overall, I would really wish a maintenance mode where the node can be registered to a deployment and prepared for duty. It should not answer RADIUS or TACACS requests in this state.

ffischer · ‎01-29-2024

Thanks for all the ideas!

I thought about it again and possibly found a substitute for the missing "maintenance mode":

Configure a new policy set on top of all other policy sets.

Entry condition for the PS will be "request is processed by the ISE node under maintenance".
The PS will have only the default authentication rule with default result: drop.

Now the following happens:

As long as the PSN is not part of the deployment,
it has neither the NAD tables nor any PS and
therefore drops all requests as coming from an "unknown NAD".

As soon as the node is registered in the deployment
and NADs, policy sets etc. are synced,
it would start to answer requests incorrectly until certificates are installed and it has been joined to the AD.

Here the new policy set and its AuthC rule described above kick in and drop all requests arriving at this node!
And all NADs will continue to use their other configured RADIUS servers...
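
The evaluation order described above can be sketched in a few lines of Python. This is only a toy model of top-down policy set matching, not ISE's actual engine. The request fields (`nad`, `ise_node`, `ad_joined`), the policy set names and the `MAINTENANCE_NODES` set are hypothetical and only there for illustration.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class PolicySet:
    name: str
    matches: Callable[[dict], bool]       # entry condition of the policy set
    authenticate: Callable[[dict], str]   # returns "ACCEPT", "REJECT" or "DROP"


def evaluate(policy_sets, known_nads, request):
    """Evaluate one request the way a node would: NAD check, then first matching PS."""
    if request["nad"] not in known_nads:
        return "DROP"  # unknown NAD: the request is silently discarded
    for ps in policy_sets:
        if ps.matches(request):
            return ps.authenticate(request)
    return "REJECT"


# Hypothetical list of nodes currently being prepared.
MAINTENANCE_NODES = {"psn-new"}

policy_sets = [
    # On top of all other policy sets: drop everything handled by a node under maintenance.
    PolicySet("Maintenance", lambda r: r["ise_node"] in MAINTENANCE_NODES, lambda r: "DROP"),
    # Stand-in for the regular policy: without the AD join the lookup fails and the client is rejected.
    PolicySet("Wired 802.1X", lambda r: True, lambda r: "ACCEPT" if r["ad_joined"] else "REJECT"),
]

request = {"nad": "switch-01", "ise_node": "psn-new", "ad_joined": False}
```

Before the sync the node has no NAD table (`known_nads` is empty) and drops the request. After the sync the "Maintenance" set catches it and still drops it. Once the node is removed from `MAINTENANCE_NODES`, the regular policy set applies.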

MarcusFLey · ‎01-29-2024

Very good thought!

That way, you would have a node-specific policy that is part of the universal policy set. So after registration, it would still apply as an "umbrella" to cover the node from yet unwanted requests.

But you should make sure that the Authentication Result "DenyAccess" does not return an Access-Reject. You have to make it "Drop", which can be applied for one of the three options under "Advanced" Options.