We are in the process of migrating from ACS v5.8 to ISE v2.4 for a large international customer. The customer has a fully distributed deployment operating across 3 regions/continents (EMEA, APAC, AM). ISE is used for 802.1X and VPN authentication, with AD integration.
In terms of ISE, we have:
In terms of AD, a Microsoft multi-domain forest is in use, with two-way trusts between all domains in use. There are basically 4 domains:
AD servers are located in each region for each subdomain, and servers are at the same time DCs (Domain Controller), GCs (Global Catalog) and DNS servers:
We use Microsoft Sites and Services, as recommended, for all locations. All PSN nodes are in the correct Sites, and there is practically no latency between, e.g., AM PSN nodes and AM DCs (they are on the same L3 device).
After a successful configuration migration, we joined the PAN, MnT and PSN nodes to emea.mydomain.com (as it is the main data center). As there is a two-way trust, we can successfully pull AD groups from all subdomains, and we can authenticate users cross-domain. Thanks to this approach, we can use a single JP (Join Point) as a reference in our policies. Everything works ideally, however...
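For anyone who wants to verify the site mapping, site-aware DC discovery in AD relies on per-site DNS SRV records. Below is a minimal Python sketch that only builds the record names one could then query from each region. The site name `AM-Site` is a hypothetical placeholder, not a name from this deployment; only `emea.mydomain.com` comes from the description here.

```python
def site_dc_srv_name(site: str, domain: str, service: str = "ldap") -> str:
    """Build the site-specific SRV record name that AD registers for DCs.

    Pattern: _<service>._tcp.<site>._sites.dc._msdcs.<domain>
    """
    return f"_{service}._tcp.{site}._sites.dc._msdcs.{domain}"


def domain_dc_srv_name(domain: str, service: str = "ldap") -> str:
    """Build the domain-wide (not site-specific) SRV record name for DCs."""
    return f"_{service}._tcp.dc._msdcs.{domain}"


if __name__ == "__main__":
    # "AM-Site" is a hypothetical site name used purely for illustration.
    print(site_dc_srv_name("AM-Site", "emea.mydomain.com"))
    print(site_dc_srv_name("AM-Site", "emea.mydomain.com", service="kerberos"))
    print(domain_dc_srv_name("emea.mydomain.com"))
```

Each resulting name can be checked with `nslookup -type=SRV <name>` from a host that uses the same DNS servers as the PSN. If the site-specific record for a domain returns nothing in a given site, clients in that site would have to fall back to the domain-wide record, which may point at DCs in another region.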
From time to time, especially during peak hours for a given region, we receive "High authentication latency" alarms. As the threshold for this alarm is 10s, I'm a bit worried about this one. We do have high-speed WAN links between regions, but there may still be peaks in utilization. Also, given the architecture and the use of Sites and Services, I would expect minimal cross-domain communication from the ISE standpoint (I'm aware that there must be some, e.g. when an EMEA user roams to AM and authenticates against AM PSNs).
I took a packet capture, and I can confirm that the AM PSN talks to the AM DC for the captured RADIUS authentication. There is some communication from the AM PSN back to the EMEA DC, but this is to be expected as it is joined to emea.mydomain.com. I can see high latency for multiple ISE services and scenarios, e.g.:
All of the alarms are raised for the APAC and AM regions, but never for EMEA, which makes me question the design of the AD integration. Also, the alarms are not raised for all authentications, nor all the time, so there is no obvious regularity.
I have already gone through tons of documentation and Live sessions, but there is no document describing how a system should be designed/deployed with a multi-domain forest, in terms of which nodes to join to which domain/subdomain, how to build policies based on that approach, etc.
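To look for a pattern in the alarms, one option is to bucket authentication latencies by region and hour of day. The following is a rough Python sketch under stated assumptions: the `(region, hour, latency_ms)` input format is hypothetical (it is not an ISE export format), and the sample values are made up purely to show the call.

```python
from collections import defaultdict

ALARM_THRESHOLD_MS = 10_000  # the 10 s alarm threshold mentioned above


def slow_by_region_hour(samples, threshold_ms=ALARM_THRESHOLD_MS):
    """Group latency samples by (region, hour).

    samples: iterable of (region, hour, latency_ms) tuples (hypothetical format).
    Returns {(region, hour): {"total": n, "slow": n_over_threshold, "max_ms": max}}.
    """
    stats = defaultdict(lambda: {"total": 0, "slow": 0, "max_ms": 0})
    for region, hour, latency_ms in samples:
        bucket = stats[(region, hour)]
        bucket["total"] += 1
        if latency_ms > threshold_ms:
            bucket["slow"] += 1
        bucket["max_ms"] = max(bucket["max_ms"], latency_ms)
    return dict(stats)


if __name__ == "__main__":
    # Made-up illustrative values, not measurements from this deployment.
    demo = [("AM", 9, 12_000), ("AM", 9, 300), ("EMEA", 9, 250)]
    for key, value in sorted(slow_by_region_hour(demo).items()):
        print(key, value)
```

Lowering `threshold_ms` (for example to 2000) would also surface authentications that are slow but stay below the alarm threshold.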
Could you please shed some light on this matter? Any experiences and recommendations with deployments like these?
As you mentioned "especially in peak hours", it's likely due to load.
In case you have not come across it, you might want to start with What's new in ISE Active Directory connector - BRKSEC-2132
Slides 97 and 98 of the reference presentation from Advanced ISE Services, Tips and Tricks - BRKSEC-3697 give advice on auth policy optimization.
I have already watched both sessions but, unfortunately, neither of them covers the design perspective of ISE-AD integration in more complex environments. The one I described is definitely a more complex scenario, but hardly unique, so I'm kinda disappointed that there is no official material dealing with this topic. I can only conclude (and a couple of Cisco engineers I was in touch with regarding this topic agree with me) that this topic is poorly documented.
The issue we are facing is not caused by peak traffic, as it repeats every day. I must stress that with ACS we had no such issue. The reason is that ACS was able to join each node to its own domain, and inside policies you always reference it as AD1, regardless of what is behind it.
What we are doing now, after many sessions with TAC, is creating 4 join points, one for each subdomain, and then using the PSN ID to direct requests to the appropriate domains. This effectively means that we'll multiply the policy sets by 4, which I very much dislike. However, we did a PoC with one subdomain, and I must say it gives promising results for now (latency dropped from over 2s to less than 400ms, which is very good, but we'll work towards reducing it further).
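The routing idea behind this approach can be modelled as a simple lookup from the PSN that received the request to the join point that policy should use. This Python sketch is not ISE configuration; the PSN hostnames and join point names are hypothetical, the mapping is deliberately incomplete, and falling back to the EMEA join point is an assumption.

```python
# Hypothetical PSN hostnames and join point names, for illustration only.
PSN_TO_JOIN_POINT = {
    "psn-emea-01": "JP-EMEA",
    "psn-apac-01": "JP-APAC",
    "psn-am-01": "JP-AM",
}


def select_join_point(psn_hostname, mapping=PSN_TO_JOIN_POINT, default="JP-EMEA"):
    """Return the join point for the PSN that received the request.

    Unknown PSNs fall back to `default` (an assumption: the original
    single join point in the main data center).
    """
    return mapping.get(psn_hostname.lower(), default)


if __name__ == "__main__":
    for psn in ("psn-am-01", "PSN-APAC-01", "psn-unknown-99"):
        print(psn, "->", select_join_point(psn))
```

In ISE itself, the equivalent decision is expressed as a policy condition on the PSN, which is why the number of policy sets grows with the number of join points.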
Once we finish with migration and have some good results, I'll write one more post to describe how my policy looks now.
It would be great if you could write it up and contribute it to this community. I would also suggest sending it as feedback to the ISE product management team so that they can consider it as an enhancement.
Please share your TAC case number so I can communicate it to our product teams. Otherwise, see How to Submit an ISE Feature or Enhancement Request