
ACE idle connections timeout

AlexandrKry
Level 1

Hi everyone,

We're facing a problem with connections being dropped due to the inactivity timeout, and we can't find the root cause of this behaviour.

We created a parameter-map and set the connection inactivity timeout to 86400 seconds (24 hours), as described in this document:

http://www.cisco.com/en/US/docs/app_ntwk_services/data_center_app_services/ace_appliances/vA1_7_/configuration/security/guide/tcpipnrm.html#wp1060403

Here are the relevant config lines:

access-list TCP_ANY line 8 extended permit ip any any

class-map match-any TCP_CLASS
  2 match destination-address 10.0.100.10 255.255.255.255
  3 match destination-address 10.0.101.128 255.255.255.240
  4 match access-list TCP_ANY

policy-map multi-match TCPIP_POLICY
  class TCP_CLASS
    connection advanced-options TCPIP_PARAM_MAP

service-policy input TCPIP_POLICY
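
For completeness, the parameter-map referenced above isn't pasted here; following the syntax from the linked document, it looks like this (TCPIP_PARAM_MAP is the name referenced by the policy-map):

parameter-map type connection TCPIP_PARAM_MAP
  set timeout inactivity 86400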

Here's the first strange behaviour that we have encountered:

Before adding the TCP_ANY access-list to the TCP_CLASS class-map, we matched the VIP and the servers' network directly. Here's how it looked:

class-map match-any TCP_CLASS
  2 match destination-address 10.0.100.10 255.255.255.255
  3 match destination-address 10.0.101.128 255.255.255.240

With this configuration it seemed to be working.

An hour later it turned out that we need the servers (in 10.0.101.128/28) to make outgoing connections, and we don't want the ACE to tear those down after the default timeout.

So we added the "match access-list TCP_ANY" line to the TCP_CLASS class-map (which is the configuration shown above).

As soon as we did that, all the connections that had been idle for longer than the default timeout (3600 seconds) were torn down.

As a temporary workaround we decided to disable normalization (no normalization) on both the client-side and server-side interfaces.
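
Concretely, that change looks like this (the VLAN numbers here are just placeholders for our client- and server-side interfaces):

interface vlan 100
  no normalization
interface vlan 200
  no normalization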

That seemed to help and everything was working, until we got complaints that idle connections were still being dropped after an hour (which suggests the ACE is still applying the default inactivity timeout).

Currently some of the connections in "sh conn detail" have been idle for longer than 1 hour, whereas others are apparently getting dropped.

The "Total Connections Timed-out:" in sh stats connection is increasing.

Also, when I started the "capture" command on the ACE and it began flooding the CLI, we lost another thousand or so concurrent connections. I don't see a correlation between the two and it may be just a coincidence, but we can't find another explanation.

We're seeing really strange behaviour, and it appears we may be hitting a bug.

We have two ACE-20 6k modules, with the following software version:

Software
  loader:    Version 12.2[123]
  system:    Version A2(1.6a) [build 3.0(0)A2(1.6a) adbuild_08:46:04-2009/10/16_/auto/adbu-rel4/rel_a2_1_6_throttle/REL_3_0_0_A2_1_6A]

2 Replies

ohynderi
Level 1

Even if normalization is disabled, the idle timeout still applies. The only difference (if I'm not mistaken) is that, with normalization disabled, the ACE doesn't send a reset for a connection that times out. Besides this, the embryonic and half-closed timeouts don't apply when normalization is disabled.

Another remark: a change to the idle timeout is only taken into account for new connections. Existing connections keep using the old timeout.
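
If you need the new timeout to take effect right away, the existing connections have to be rebuilt; clearing them manually (disruptive, of course, and assuming your release supports the clear conn exec command) would force that:

clear conn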

For the rest, if you suspect a software bug, I would open a TAC case.

Thanks,

Olivier

AlexandrKry
Level 1

Thanks for your remark about timed-out connections with normalization disabled.

We removed the "service-policy input TCPIP_POLICY" configuration and reloaded the ACE. Connections now disappear from "sh conn" after an hour of being idle, but the TCP sessions don't actually close, and they appear in "sh conn" again as soon as there is activity. So that more or less solved our problem.
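
For the record, the removal itself was just this (assuming the policy was applied globally rather than per interface):

no service-policy input TCPIP_POLICY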

We had another issue after the reload. A few hours before the reload we had set weight 90 on one of our servers (Server1), leaving the other three with the default weight of 8, so that all active connections would land on that server and could easily be terminated. After the reload we changed the weight back to the default, but all new connections (about 50) kept going to Server1 until we set a conn-limit on it. We tried changing the other servers' weights and lowering Server1's weight below everyone else's, but nothing seemed to work until we set the conn-limit. Why is that?
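
For clarity, this is roughly what we did; the serverfarm name and the conn-limit values below are illustrative, not our exact numbers. Before the reload (to attract connections to Server1):

serverfarm host WEB_FARM
  rserver Server1
    weight 90
    inservice

And afterwards, the conn-limit that finally pushed connections off it:

serverfarm host WEB_FARM
  rserver Server1
    weight 8
    conn-limit max 10 min 5
    inservice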

Also, the distribution of connections among the servers is unequal during normal operation: about 60-70 connections on Server1 and Server2, but only 30-40 on Server3 and Server4. The load-balancing predictor we currently use is the default (round-robin). Should we switch to the leastconns predictor to achieve a more even distribution?
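
If leastconns is the way to go, I assume the change would just be this (WEB_FARM again being a placeholder name):

serverfarm host WEB_FARM
  predictor leastconns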

We're planning to upgrade to the latest software version, A2(3.5), tonight. Are there any issues we should be aware of when upgrading from A2(1.6a) to A2(3.5)? I read the release notes, but there are quite a few of them on the way from 1.x to 3.x (skipping 2.x). We haven't seen any conflicting configuration commands so far. Did we miss something important?
