Also ACS 5.2 (was: ACS 4.2 - LDAP TCP Keepalive)

GIBBinformatik_2 · 12-23-2010

Hello

I have an ACS 4.2.1.15 patch 3 server and a Novell NetWare LDAP server separated by a firewall. The firewall's default TCP session timeout is 3600 seconds.

When no LDAP request is made for over one hour, the firewall drops the connection from its table. The problem is that the ACS server thinks the connection is still open. When it tries to send an LDAP query, this results in retransmissions and finally a RST. On the user side, the authentication attempt fails (timeout).

I tried enabling TCP keepalives on the Windows server side, but this has no effect on the LDAP connections used by ACS.
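For context: TCP keepalive is opt-in per socket. The operating system's keepalive timers only apply to sockets on which the application itself has set SO_KEEPALIVE, which would be consistent with a system-wide setting having no visible effect here. Below is a minimal Python sketch of what an application has to do. It is not ACS code; the function name and the timer values are illustrative assumptions, and the TCP_KEEP* options exist only on some platforms, hence the hasattr checks.

```python
import socket

def enable_keepalive(sock, idle_s=600, interval_s=60, probes=5):
    """Turn on TCP keepalive for one socket; timer values are illustrative."""
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    # Per-socket timer overrides are only exposed on some platforms.
    if hasattr(socket, "TCP_KEEPIDLE"):
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, idle_s)
    if hasattr(socket, "TCP_KEEPINTVL"):
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, interval_s)
    if hasattr(socket, "TCP_KEEPCNT"):
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, probes)
    return sock

sock = enable_keepalive(socket.socket(socket.AF_INET, socket.SOCK_STREAM))
keepalive_on = sock.getsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE) != 0
sock.close()
```

The idle time in the sketch (600 seconds) is deliberately far below the 3600-second firewall timeout, so a probe would cross the firewall well before its session entry expires.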

Is there any way to enable keepalives in ACS?

Thanks in advance for any help!

Javier Henderson · 12-28-2010

You are seeing the effects of bug CSCti03338, which I filed a few months ago, though it is supposed to be fixed in 4.2.1(15) patch 3. Please open a TAC case so we can look into this in detail.

Juergen Meier · 01-17-2011

Apparently this bug has reappeared in ACS 5.2 (5.2.0.26). ACS re-uses stale TCP connections many hours after the last TCP packet was sent.

It also uses different TCP connections for LDAP search queries and the subsequent authentication bind requests, so sometimes the search query fails and sometimes the bind request fails, because the TCP connection timed out long ago on the network devices (stateful firewalls, IDS/IPS, load balancers) between ACS and the LDAP servers.

Furthermore, ACS fails to detect stale TCP connections and reports bogus authentication failures back to the NAS.

A new ticket will be filed with TAC today.

rob.schieron · 02-14-2011

I'm seeing this issue too on 5.2.0.26.1, running LDAP auth through an F5 load balancer to a pair of Sun directory servers.

Did you make any progress with your TAC case?

Without using the root patch, this command is useful for finding out what is going on (it's just netstat):

# show tech-support | i ldap | i tcp
ldap 389/tcp
ldaps 636/tcp # LDAP over SSL
tcp 0 0 exc2-acscor-1401:53892 acs.ldapunix.co:ldap ESTABLISHED
tcp 0 0 exc2-acscor-1401:53893 acs.ldapunix.co:ldap ESTABLISHED
tcp 0 0 exc2-acscor-1401:53890 acs.ldapunix.co:ldap ESTABLISHED
tcp 0 0 exc2-acscor-1401:53891 acs.ldapunix.co:ldap ESTABLISHED
tcp 0 0 exc2-acscor-1401:53889 acs.ldapunix.co:ldap ESTABLISHED
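To watch the pool over time, the same output can be counted per connection state. This is a small Python sketch that assumes the netstat-style column layout shown above; the sample lines are copied from this post and the function name is made up.

```python
from collections import Counter

SAMPLE = """\
ldap 389/tcp
ldaps 636/tcp # LDAP over SSL
tcp 0 0 exc2-acscor-1401:53892 acs.ldapunix.co:ldap ESTABLISHED
tcp 0 0 exc2-acscor-1401:53893 acs.ldapunix.co:ldap ESTABLISHED
tcp 0 0 exc2-acscor-1401:53890 acs.ldapunix.co:ldap ESTABLISHED
"""

def count_ldap_states(text):
    """Count TCP connections to a remote ldap/ldaps port, grouped by state."""
    states = Counter()
    for line in text.splitlines():
        fields = line.split()
        # Connection rows have six columns: proto recv-q send-q local remote state
        if len(fields) == 6 and fields[0] == "tcp":
            remote_port = fields[4].rsplit(":", 1)[-1]
            if remote_port in ("ldap", "ldaps", "389", "636"):
                states[fields[5]] += 1
    return states

counts = count_ldap_states(SAMPLE)
print(dict(counts))
```

Keep in mind that ESTABLISHED only reflects the ACS host's own view. As described earlier in the thread, a firewall or load balancer in between may already have dropped the session.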

Also try adjusting "Max. Admin Connections" for LDAP.

From the admin guide:

LDAP Connection Management

ACS 5.1 supports multiple concurrent LDAP connections. Connections are opened on demand at the time of the first LDAP authentication. The maximum number of connections is configured for each LDAP server. Opening connections in advance shortens the authentication time. You can set the maximum number of connections to use for concurrent binding connections. The number of opened connections can be different for each LDAP server (primary or secondary) and is determined according to the maximum number of administration connections configured for each server.

ACS retains a list of open LDAP connections (including the bind information) for each LDAP server that is configured in ACS. During the authentication process, the connection manager attempts to find an open connection from the pool. If an open connection does not exist, a new one is opened.

If the LDAP server closed the connection, the connection manager reports an error during the first call to search the directory, and tries to renew the connection.

After the authentication process is complete, the connection manager releases the connection to the connection manager.
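One way a pool like the one described could avoid handing out connections that a firewall has already forgotten is to remember when each connection was last used and discard any that have been idle longer than the shortest timeout on the path. The Python sketch below illustrates that idea only. It is not how ACS is implemented, and IdleAwarePool, open_conn and max_idle_s are hypothetical names.

```python
import time

class IdleAwarePool:
    """Connection pool that refuses to reuse connections idle for too long."""

    def __init__(self, open_conn, max_idle_s, clock=time.monotonic):
        self._open_conn = open_conn    # hypothetical factory, e.g. an LDAP bind
        self._max_idle_s = max_idle_s  # keep below the firewall idle timeout
        self._clock = clock
        self._idle = []                # list of (connection, last_used)

    def acquire(self):
        now = self._clock()
        while self._idle:
            conn, last_used = self._idle.pop()
            if now - last_used < self._max_idle_s:
                return conn
            # Too old: a middlebox may have dropped its state, so discard it.
            close = getattr(conn, "close", None)
            if close:
                close()
        return self._open_conn()

    def release(self, conn):
        self._idle.append((conn, self._clock()))

# Demonstration with a fake clock and stand-in connection objects.
fake_now = [0.0]
opened = []

def open_conn():
    opened.append(object())
    return opened[-1]

pool = IdleAwarePool(open_conn, max_idle_s=3000, clock=lambda: fake_now[0])
first = pool.acquire()
pool.release(first)
fake_now[0] = 100
reused = pool.acquire()   # still fresh, so the same connection comes back
pool.release(reused)
fake_now[0] = 100 + 3600
fresh = pool.acquire()    # idle for an hour, so a new connection is opened
```

The design choice here is to pay the cost of a new connection up front rather than discover a dead one through retransmissions during a user's authentication.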

I'd be interested to hear if you have fixed your issue, or if anyone else is facing similar problems load balancing LDAP servers for the ACS.

Cheers

R.

GIBBinformatik_2 · 02-17-2011

Here is some information I can share from the TAC case I opened (which has probably been closed by now):

- The bug mentioned above (CSCti03338) seems to be fixed, because ACS now opens a new connection if the old one fails, which it apparently didn't do before the fix. Before the fix, ACS was unable to open a new connection and therefore couldn't handle any new requests once the connection had been dropped by a firewall.

- ACS is now able to open a new connection when another fails. This has nothing to do with a keepalive mechanism; it is only the ability to react to a dropped connection (multiple retransmits, finally a FIN, and then a new connection is opened).

- This process of detecting the dropped connection and opening a new one takes over 20 seconds.

- This is standard behaviour for any ACS version later than 4.2 (including 5.x versions).

- I was informed that Cisco is internally discussing whether a feature request should be filed for a keepalive mechanism in future versions of ACS.
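To give a feel for why detection through retransmits takes tens of seconds: when each unanswered send waits twice as long as the previous one, the total adds up quickly. The Python sketch below is purely illustrative. The initial timeout and the number of attempts are assumed values, not parameters measured on ACS or stated in the TAC case.

```python
def time_to_give_up(initial_rto_s, attempts):
    """Total wait if each unanswered send doubles the previous timeout."""
    return sum(initial_rto_s * 2 ** i for i in range(attempts))

# Assumed values, not measured on ACS: 3 s initial timeout, 3 unanswered sends.
total = time_to_give_up(3, 3)
print(total)  # 3 + 6 + 12 = 21
```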

Therefore I don't see any solution to this on the ACS side for now.

The only option for now is to increase the TCP timeout value on your firewall or load balancer to something that will never be reached.

rob.schieron · 02-17-2011

Hi, thanks for the reply.

I can add some information here - we have found a workaround.

- The behaviour I am seeing is that when ACS tries to re-use a connection dropped by the firewall/load balancer, the runtime process actually crashes and needs to be restarted! The main problem is that the user's authentication fails and they are dropped out of their session, be it TACACS+ or RADIUS. This is unacceptable behaviour. It only affects connections between the ACS units and the LDAP server.

Example of runtime process crashing:

Feb 16 08:32:01 exc2-acscor-1402 monit[4905]: 'runtime' process is not running
Feb 16 08:32:01 exc2-acscor-1402 monit[4905]: 'runtime' trying to restart
Feb 16 08:32:01 exc2-acscor-1402 monit[4905]: 'runtime' start: /opt/CSCOacs/bin/exec_wrapper.sh
Feb 16 08:33:01 exc2-acscor-1402 monit[4905]: 'runtime' process is running with pid 17676

- Increasing the timeout value on the load balancer to something that will never be reached, e.g. infinite, is not an acceptable solution.

- Here's my workaround: we implemented a RADIUS monitor on the F5 load balancer. This monitor/health check does a full RADIUS login every 30 seconds, so the LDAP TCP connections never time out on the load balancer (its TCP timeout is 1 hour by default). The bonus is that if an ACS fails, the node is marked as down, so overall reliability is increased anyway.

- The only problem with this solution is the volume of logs that show up in ACS View. However, you can use "collection filters" in the ACS View settings to ensure that the health checks are never logged.
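For anyone without a load balancer that can run such a monitor, the probe itself is small. The Python sketch below builds a RADIUS Access-Request with a PAP password as described in RFC 2865 and sends it over UDP. It is not the F5 monitor. The host, shared secret and account are placeholders you would supply, a dedicated test account is assumed, and the reply is only checked for its code, not for a valid response authenticator.

```python
import hashlib
import os
import socket
import struct

def encrypt_password(password, secret, authenticator):
    """Hide a PAP password per RFC 2865 section 5.2 (MD5-chained XOR)."""
    # Pad with NULs to a multiple of 16 bytes (at least one block).
    padded = password.ljust(max(16, -(-len(password) // 16) * 16), b"\x00")
    out, prev = b"", authenticator
    for i in range(0, len(padded), 16):
        digest = hashlib.md5(secret + prev).digest()
        block = bytes(p ^ d for p, d in zip(padded[i:i + 16], digest))
        out += block
        prev = block
    return out

def build_access_request(user, password, secret, ident=1, authenticator=None):
    """Return (packet, authenticator) for an Access-Request (code 1)."""
    authenticator = authenticator or os.urandom(16)
    hidden = encrypt_password(password, secret, authenticator)
    attrs = bytes([1, 2 + len(user)]) + user        # User-Name
    attrs += bytes([2, 2 + len(hidden)]) + hidden   # User-Password
    header = struct.pack("!BBH", 1, ident, 20 + len(attrs))
    return header + authenticator + attrs, authenticator

def probe(host, user, password, secret, port=1812, timeout_s=5.0):
    """Send one Access-Request; True if the reply is an Access-Accept (code 2)."""
    packet, _ = build_access_request(user, password, secret)
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout_s)
        sock.sendto(packet, (host, port))
        try:
            reply, _ = sock.recvfrom(4096)
        except socket.timeout:
            return False
    return len(reply) >= 20 and reply[0] == 2
```

All arguments are bytes, e.g. probe("acs.example.net", b"healthcheck", b"password", b"sharedsecret"), where every value shown is a placeholder. A full login is what matters here, because only a real authentication makes ACS send traffic over its LDAP connections.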

hope this helps,

Regards,

Rob

GIBBinformatik_2 · 02-21-2011

Hi Rob,

Thanks for your info!

I totally agree that increasing the timeout value is not an acceptable solution for this problem.

We implemented a workaround with this nice perl script: https://www.monitoringexchange.org/inventory/Check-Plugins/Network/check_radius-pl

We use the script to check every x minutes whether a login is possible. This gives, as you mentioned, the bonus of being informed when ACS fails. The main goal, not letting the TCP session time out, is achieved by issuing a login at a shorter interval than the TCP timeout on the firewall.

I think it's important that everyone who is facing this problem contacts the TAC or at least posts here. That way Cisco will eventually recognize that the missing keepalive mechanism is really a problem. I really hope that Cisco will implement a keepalive mechanism in future releases...
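The timing constraint can be made explicit in whatever scheduler runs the check. Here is a Python sketch, assuming a probe callable such as a wrapper around the script above; run_probes and its parameters are hypothetical names, and the 300-second interval is just an example value.

```python
import time

def run_probes(probe, interval_s, idle_timeout_s, rounds, sleep=time.sleep):
    """Call probe() every interval_s seconds and return the list of results.

    Refuses to start if the interval would let the firewall's idle
    timer (idle_timeout_s) expire between two probes.
    """
    if interval_s >= idle_timeout_s:
        raise ValueError("probe interval must be shorter than the idle timeout")
    results = []
    for _ in range(rounds):
        results.append(bool(probe()))
        sleep(interval_s)
    return results

# Example with a stand-in probe and no real waiting.
results = run_probes(lambda: True, interval_s=300, idle_timeout_s=3600,
                     rounds=3, sleep=lambda s: None)
```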

Regards, Juerg