I have an ACS 126.96.36.199 patch 3 and a Novell NetWare LDAP server separated by a firewall. The firewall's default TCP session timeout is 3600 seconds.
When no LDAP request is made for over an hour, the firewall drops the connection from its state table. The problem is that the ACS server thinks the connection is still open. When it then tries to send an LDAP query, this results in retransmissions and finally a RST. On the user side, the authentication attempt fails with a timeout.
I tried enabling TCP keepalives on the Windows server side, but this has no effect on the LDAP connections used by ACS.
Is there any way to enable keepalives in ACS?
Thanks in advance for any help!
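For readers wondering why an OS-level setting may not help: as far as I know, TCP keepalive is opt-in per socket, so the application itself has to request it, and system-wide settings generally only tune the timers for sockets that asked for it. The sketch below shows what that request looks like in a generic client. It is only an illustration, not something ACS exposes; the function name and the timer values are assumptions chosen so that the first probe fires well before a 3600-second firewall timeout.

```python
import socket

def enable_keepalive(sock, idle=600, interval=60, count=5):
    """Turn on TCP keepalive for one socket and tune its timers where possible."""
    # The application must ask for keepalive on each socket it owns.
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    # The per-socket timer options are platform-specific, so each one is
    # applied only if this Python build defines it.
    timers = (("TCP_KEEPIDLE", idle), ("TCP_KEEPINTVL", interval), ("TCP_KEEPCNT", count))
    for name, value in timers:
        opt = getattr(socket, name, None)
        if opt is not None:
            sock.setsockopt(socket.IPPROTO_TCP, opt, value)
    return sock

sock = enable_keepalive(socket.socket(socket.AF_INET, socket.SOCK_STREAM))
keepalive_on = sock.getsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE) != 0
sock.close()
```

If the application never makes this call on its LDAP sockets, changing the server's global keepalive timers would not be expected to generate any probes on those connections.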
You are seeing the effects of bug CSCti03338 which I filed a few months ago, though it is supposed to be fixed on 4.2.1(15) patch 3. Please open a TAC case so we can look into this in detail.
Apparently this bug has reappeared in ACS 5.2 (188.8.131.52). ACS reuses stale TCP connections many hours after the last TCP packet was sent.
It also uses different TCP connections for LDAP search queries and the subsequent authentication bind requests. As a result, sometimes the search query fails and sometimes the bind request fails, because the TCP connection timed out long ago on all the network devices (stateful firewalls, IDS/IPS, load balancers) between the ACS and the LDAP servers.
Furthermore, ACS fails to detect stale TCP connections and reports bogus authentication failures back to the NAS.
A new ticket will be filed with TAC today.
I'm seeing this issue too on 184.108.40.206.1, running LDAP auth through an F5 load balancer to a pair of Sun directory servers.
Did you make any progress with your TAC case?
Without using the root patch, this command is useful for finding out what is going on (it's just netstat):
# show tech-support | i ldap | i tcp
ldaps 636/tcp # LDAP over SSL
tcp 0 0 exc2-acscor-1401:53892 acs.ldapunix.co:ldap ESTABLISHED
tcp 0 0 exc2-acscor-1401:53893 acs.ldapunix.co:ldap ESTABLISHED
tcp 0 0 exc2-acscor-1401:53890 acs.ldapunix.co:ldap ESTABLISHED
tcp 0 0 exc2-acscor-1401:53891 acs.ldapunix.co:ldap ESTABLISHED
tcp 0 0 exc2-acscor-1401:53889 acs.ldapunix.co:ldap ESTABLISHED
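If you capture that output regularly, a small script can pull out just the LDAP sessions. This is a rough sketch that assumes the six-column netstat layout shown above; the function name is made up for illustration.

```python
def ldap_connections(netstat_text):
    """Return (local, remote, state) for each TCP row whose remote port is LDAP."""
    rows = []
    for line in netstat_text.splitlines():
        fields = line.split()
        # Expected columns: proto, recv-q, send-q, local, remote, state
        if len(fields) != 6 or fields[0] != "tcp":
            continue
        local, remote, state = fields[3], fields[4], fields[5]
        if remote.rsplit(":", 1)[-1] in ("ldap", "ldaps", "389", "636"):
            rows.append((local, remote, state))
    return rows

sample = """\
ldaps 636/tcp # LDAP over SSL
tcp 0 0 exc2-acscor-1401:53892 acs.ldapunix.co:ldap ESTABLISHED
tcp 0 0 exc2-acscor-1401:53893 acs.ldapunix.co:ldap ESTABLISHED
"""
conns = ldap_connections(sample)
established = sum(1 for _, _, state in conns if state == "ESTABLISHED")
```

Keep in mind that ESTABLISHED is only the local view of the session. A firewall in the path may have dropped its state entry long ago, which is exactly the situation in this thread.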
LDAP Connection Management
ACS 5.1 supports multiple concurrent LDAP connections. Connections are opened on demand at the time of the first LDAP authentication. The maximum number of connections is configured for each LDAP server. Opening connections in advance shortens the authentication time. You can set the maximum number of connections to use for concurrent binding connections. The number of opened connections can be different for each LDAP server (primary or secondary) and is determined according to the maximum number of administration connections configured for each server.
ACS retains a list of open LDAP connections (including the bind information) for each LDAP server that is configured in ACS. During the authentication process, the connection manager attempts to find an open connection from the pool. If an open connection does not exist, a new one is opened.
If the LDAP server closed the connection, the connection manager reports an error during the first call to search the directory, and tries to renew the connection.
After the authentication process is complete, the connection manager releases the connection to the connection manager.
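To make the documented behaviour easier to picture, here is a minimal sketch of a pool that reuses idle connections, opens one on demand, and renews a connection when the first search on it fails. It is not ACS code: `Pool`, `StaleConnection`, `FakeConn` and the factory are hypothetical names, and the fake connection only simulates a session that a firewall has silently dropped.

```python
class StaleConnection(Exception):
    """Raised when the peer, or a device in the path, has dropped the session."""

class Pool:
    def __init__(self, open_conn, max_idle=5):
        self._open = open_conn   # hypothetical factory returning a new connection
        self._idle = []
        self._max_idle = max_idle

    def acquire(self):
        # Reuse an idle connection if one exists, otherwise open on demand.
        return self._idle.pop() if self._idle else self._open()

    def release(self, conn):
        if len(self._idle) < self._max_idle:
            self._idle.append(conn)

    def search(self, query):
        conn = self.acquire()
        try:
            result = conn.search(query)
        except StaleConnection:
            # The first call failed: discard that connection and renew once.
            conn = self._open()
            result = conn.search(query)
        self.release(conn)
        return result

class FakeConn:
    def __init__(self):
        self.alive = True
    def search(self, query):
        if not self.alive:
            raise StaleConnection(query)
        return f"result for {query}"

opened = []
def factory():
    conn = FakeConn()
    opened.append(conn)
    return conn

pool = Pool(factory)
first = pool.search("uid=alice")
opened[0].alive = False          # simulate the firewall dropping the idle session
second = pool.search("uid=alice")
```

In this toy version the failure is reported immediately. On a real network the client only learns the connection is dead after its retransmissions give up, so the renewal step can take a long time from the user's point of view.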
Here is some information I can share from the TAC case I opened (which will be closed by now):
- The bug mentioned above (CSCti03338) seems to be fixed: ACS now opens a new connection if the old one fails, which it apparently didn't do before the fix. Before the fix, ACS was unable to open a new connection and therefore couldn't handle any new requests once the connection had been dropped by a firewall.
- Now ACS is able to open a new connection when another fails. This has nothing to do with a keepalive mechanism; it is only the ability to react to a dropped connection (multiple retransmits, finally a FIN, and then a new connection is opened).
- This process of detecting the dropped connection and opening a new one takes over 20 seconds.
- This is standard behaviour for any ACS version later than 4.2 (including 5.x versions).
- I was informed that Cisco is internally discussing whether a feature request should be filed for a keepalive mechanism in future versions of ACS.
Therefore I don't see any solution to this on the ACS side for now.
The only option for now is to increase the TCP timeout value on your firewall or load balancer to something that will never be reached.
Hi, thanks for the reply.
I can add some information here - we have found a workaround.
- The behaviour I am seeing is that when the ACS tries to re-use a dropped connection (dropped by the firewall/load balancer), the runtime process actually crashes and needs to be restarted! The main problem with this is that the user's authentication fails and they are dropped out of their session, be it TACACS+ or RADIUS. This is unacceptable behaviour. This is only for connections between the ACS units and the LDAP server.
Example of runtime process crashing:
Feb 16 08:32:01 exc2-acscor-1402 monit: 'runtime' process is not running
Feb 16 08:32:01 exc2-acscor-1402 monit: 'runtime' trying to restart
Feb 16 08:32:01 exc2-acscor-1402 monit: 'runtime' start: /opt/CSCOacs/bin/exec_wrapper.sh
Feb 16 08:33:01 exc2-acscor-1402 monit: 'runtime' process is running with pid 17676
Thanks for your info!
I totally agree that increasing the timeout value is not an acceptable solution for this problem.
We implemented a workaround with this nice perl script: https://www.monitoringexchange.org/inventory/Check-Plugins/Network/check_radius-pl
We use the script to check every x minutes whether a login is possible. As you mentioned, this gives the bonus of being informed when ACS fails. The main goal, not letting the TCP session time out, is achieved by issuing a login at a shorter interval than the TCP timeout on the firewall.
I think it's important that everyone facing this problem contacts the TAC or at least posts here. That way Cisco will eventually recognize that the missing keepalive mechanism is a real problem. I really hope Cisco implements a keepalive mechanism in future releases.
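The timing logic behind this workaround can be sketched as follows. This is not the check_radius script itself, just an illustration of picking a probe interval below the firewall's idle timeout. The function names, the 0.5 safety factor, and the `probe` callable (which would wrap whatever test login you use) are all assumptions.

```python
import time

def probe_interval(idle_timeout_s, safety_factor=0.5):
    """Seconds between probes so that a probe lands before the idle timeout."""
    if not 0 < safety_factor < 1:
        raise ValueError("safety_factor must be between 0 and 1")
    return idle_timeout_s * safety_factor

def run_probes(probe, idle_timeout_s, rounds, sleep=time.sleep):
    """Call probe() `rounds` times, sleeping between calls; return the results."""
    interval = probe_interval(idle_timeout_s)
    results = []
    for i in range(rounds):
        results.append(probe())
        if i < rounds - 1:
            sleep(interval)
    return results

# Demo with a stand-in probe and a recorded (not real) sleep.
slept = []
results = run_probes(lambda: True, idle_timeout_s=3600, rounds=3, sleep=slept.append)
```

One caveat, based on the netstat output earlier in the thread: ACS can hold several LDAP connections open at once, and a single test login presumably exercises only some of them. If that assumption holds, one probe per interval may not refresh every pooled connection, so it can be worth probing more often or sending several logins per round.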