ISE AAA load-balancing issues

I opened a TAC case for this issue, and their recommended workarounds leave a few things to be desired, so I thought I would throw this out there.

On our 3850 switches, running 03.07.04E, we have:

aaa group server radius ISE
 server name authnad-w2
 server name authnad-b2
 server name authnad-w1
 server name authnad-b1
 ip radius source-interface Vlan255
 load-balance method least-outstanding

aaa server radius dynamic-author
 client 10.3.14.239
 client 10.9.14.241
 client 10.3.14.240
 client 10.9.14.239
 server-key 7 <secret>

We also have an extensive set of ISE rules, including rules for multi-factor authentication and reauthentication. One of our rule flows accepts EAP chaining: the user SUI and machine credentials are authenticated in a first pass, and then a second pass authenticates against an RSA-type token and a guest portal.

What we found with v1.3 of ISE was that when a particular client had a CoA event, the session ID could get transferred to a different PSN, and the second PSN, having no record of that session ID, would start over.

From a user perspective, it looked like the RSA authentication had failed (no reason given), and they were presented with a second portal login screen.

From the troubleshooting done in the TAC case, we could see that the session ID stayed constant while the servers changed. Cisco's solution was to remove load-balancing altogether, or to use something like an F5. We initially went with the former.

The problem, as it appears to me, is that the switch checks for the least busy server when it does a CoA. That check should not be made at that point, because moving to a new PSN means the existing session ID is unknown to it and gets ignored.

With load balancing removed, we found the solution to be quite stable, until the lead server in the ISE group became unavailable. While that server was unavailable, users were still attempting to authenticate to it.

Setting aside that "unavailable" can mean one of several states (say, during an upgrade from v1.3 of ISE to v2.1), we discovered exactly what no load balancing means, and we can see how less-than-optimal that solution is.

We started down the path of looking at the F5 solution, but there are aspects of it that don't seem practical.

So, I'm of the opinion that the problem we hit when RADIUS load-balancing is enabled lies with the switches (and WLCs), and that this is a bug in the IOS XE 03.07.04E code.

We appear to be having the same issue with ISE v2.1 as well. We've tried load-balancing both with and without the ignore-preferred-server parameter. Without seems to logically be the right choice, but it doesn't work.
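
For readers following along, the two variants mentioned above would look roughly like this inside the server group. This is only a sketch that reuses the group name ISE from the configuration earlier in the post; the server name lines are omitted, and only one of the two load-balance lines would be configured at a time.

```
aaa group server radius ISE
 ! Variant 1: the switch tries to keep a session on its preferred server
 load-balance method least-outstanding
 !
 ! Variant 2: the preferred-server hint is ignored
 ! load-balance method least-outstanding ignore-preferred-server
```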

Just curious if anybody else has had similar issues.  We'll probably look at node groups eventually, but I think this is a bug.


It appears that CSCuy94702 affects 03.07.04E, and not just 03.06.03E.

Hi David,

I am following this post as we are planning to implement the F5 solution in our wireless environment. Could you please be more specific about the "workaround" suggested by Cisco? I would like to raise this question with them.

thanks

There is a rather involved write-up for this here.

We're also considering the F5 as an option.

In the meantime, I've determined that almost all of the Cisco background on 802.1X was written prior to CoA, and that the TAC still considers a batch size of 50 to be "large".

If you look at the command on an actual 3850 switch, you'll see that the maximum allowed batch size is 2,147,483,647, so you should have plenty of leeway.

I took the maximum number of transactions seen in a 5-minute period (high load, around 8:30 AM) and came up with 1800 as a batch size. I'm not 100% sure that a session is equivalent to a transaction, but this seems to work for us. My goal was to minimize the possibility of a least-busy server transition within a 5-minute period, and I believe I have succeeded. To be honest, I couldn't really see much of a downside if the number was too large. We have 4 PSNs in two locations, 8 servers in all.
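
As a rough illustration of the batch-size change described above, the server group would end up looking something like the sketch below. The group name ISE and the least-outstanding method come from the original post; the server name lines are omitted, and 1800 is the value chosen for this particular environment, not a general recommendation.

```
aaa group server radius ISE
 ! server name entries unchanged from the original configuration
 load-balance method least-outstanding batch-size 1800
```

The idea is that the switch assigns transactions to a server in batches and only re-evaluates which server is least busy between batches, so a larger batch should mean fewer server transitions in a given window.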

Thanks David, for your quick reply.

David,

Thank you for sharing this information. I'm about to use the switch load-balancing feature for 4 servers split across two datacenters.

In addition, if you decide to go with F5, I have some thoughts from experience with a different deployment:

- Session persistence on the F5 is critical: for a given endpoint, the authentication, accounting, and profiling updates should all go to the same PSN. I would therefore advise using the same VIP for authentication, accounting, and DHCP (in case you use the DHCP profiler).

- We discovered a bug that affected the persistence behavior; it was fixed in 11.6.1 (ID 554774).

Thanks 

Have any of you seen the following error message regarding accounting on ISE?

In fact, the deployment guide is missing something: the shared secret for the F5 VIP entry in the WLC must be the same value as the one configured for the WLC in the ISE Network Device tab.

The guide does cover using the same shared secret, but only under the health monitoring topic, not for the actual authentication.

I am doing more research on this issue and I will post any result.

thanks

Abraham,

I might have misunderstood your statement, but the VIP in the controller is mapped to the ISE PSN servers, so the shared secret configured in the WLC for that VIP must be the same as the one configured in the ISE server for the WLC IP.

For the monitoring on F5, you need to create a network device in the ISE server with the F5 IP address and use the same shared secret as in the monitoring configuration on the F5.

thanks

Hi Amadou,

Let me clarify my point above: the VIP is another entry in the WLC AAA list. The SSID, let's say for PEAP, uses that VIP IP as its AAA server, so the shared secret in the WLC AAA entry for the VIP must be the same as the one in the WLC entry configured in the Primary PAN network device list. Otherwise, the authentication does not work. We tested on iPad, iPhone, Windows, a Samsung tablet, and a Chromebook.

The information above is not included in the F5 implementation guide. It covers only the health monitoring configuration, for which we create an internal ISE user and add the F5 internal interface (PSN subnet IP) to the network device list. The shared secret on ISE for that F5 device is the same one we configure on the F5 for the authentication health monitor (accounting does not need it; F5 has another profile for ACCT).

The problem is that I am getting an accounting error (6 entries), and we do not know where the entries are coming from or whether they could affect the load-balancing operation. Any ideas? See the attached file in the previous reply.

In the ISE logs, is the PSN that authenticated the request the same as the one that generated the accounting error message?

I would also validate if the IP of the WLC is the same between the authentication request and the accounting request.

I will post the captures of Authc and Acct traffic from the internal and external F5 interfaces here so you can see the requests and replies being exchanged.

The answer is YES. But let me try something: I will remove one of the PSNs from the F5 pool and run the whole test with only one PSN. That is not actual load balancing, but it lets me narrow down the issue. I will keep you posted.

We have exactly the same problem with 3.6.8. Your workaround of increasing the batch size also worked for us. However, as you mentioned, there is always a possibility that reauth might fail. Have you found any 3.x.x release that is not affected?