CSS 11501 not dropping flows if service "down"

christopher.nemec · ‎10-15-2012

Dear community,

Some misconfiguration (?) may be the reason for an undesired behaviour we are experiencing with our Cisco CSS 11501s. The balancing mechanisms work fine; however, if a service transitions to the "down" state, the corresponding flows remain "alive", leading to a temporary outage of our service. Subsequent client requests are still sent to the "down" frontend, which is unresponsive.

Here is part of my config:

Version: 08.20.6.01

content oelg-oelg-oelgWettschein
    add service oelg-oelgWettschein-prod-rw-fe1
    add service oelg-oelgWettschein-prod-rw-fe2
    add service oelg-oelgWettschein-prod-mc-fe1
    add service oelg-oelgWettschein-prod-mc-fe2
    vip address 192.168.194.2
    port 7001
    protocol tcp
    flow-srvdown-reset
    add service oelg-oelgWettschein-prod-rw-fe3
    add service oelg-oelgWettschein-prod-rw-fe4
    add service oelg-oelgWettschein-prod-mc-fe3
    add service oelg-oelgWettschein-prod-mc-fe4
    balance leastconn
    flow-timeout-multiplier 57
    active

Port 7001 is running an SSL server. Since the clients are connected via WAN, we want to minimize protocol overhead and keep SSL connections persistent, without the need to renegotiate SSL, exchange server/client certificates, etc. for every single request. Therefore the flow-timeout-multiplier is set to "57" (57 * 16 seconds = 912 seconds, approx. 15 minutes). Both the server and the client also have an SSL timeout of 15 minutes; this feature works as designed.
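For reference, the multiplier arithmetic can be checked with a few lines of Python. This is only a sketch: the 16-second unit is taken from the figures in this post, and the function names are made up for illustration.

```python
# Seconds per multiplier step, as stated in the post (assumption, not read from the CSS).
FLOW_TIMEOUT_UNIT_S = 16


def flow_timeout_seconds(multiplier: int) -> int:
    """Idle timeout in seconds that a given flow-timeout-multiplier yields."""
    return multiplier * FLOW_TIMEOUT_UNIT_S


def multiplier_for(target_seconds: int) -> int:
    """Smallest multiplier whose timeout is at least target_seconds."""
    return -(-target_seconds // FLOW_TIMEOUT_UNIT_S)  # ceiling division


print(flow_timeout_seconds(57))  # 912
print(multiplier_for(15 * 60))   # 57
```

So 57 is the smallest value that keeps an idle flow for at least the 15-minute SSL timeout.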

However, this is the picture when a service goes down:

oelg-oelgWettschein-prod-mc-fe1 Alive       347      1     2           18
oelg-oelgWettschein-prod-mc-fe2 Alive       347      1   128           18
oelg-oelgWettschein-prod-mc-fe3 Alive       347      1   128           18
oelg-oelgWettschein-prod-mc-fe4 Alive       347      1     2           14
oelg-oelgWettschein-prod-rw-fe1 Alive       347      1     2           20
oelg-oelgWettschein-prod-rw-fe2 Down        320      1   255           22
oelg-oelgWettschein-prod-rw-fe3 Alive       346      1     2           14
oelg-oelgWettschein-prod-rw-fe4 Alive       346      1     2           12

For some reason (I believe misconfiguration on my part) the flows involving "oelg-oelgWettschein-prod-rw-fe2" (the service marked as "down") are not being torn down. I have been diligently reviewing the config, but could not find the cause of this behaviour. It is, however, reproducible.

Does anyone have any input on what I might be overlooking?

The desired behaviour in case of a service transition to the "down" state is as follows:

a) drop all flows involving the failed service

b) send TCP-RST to all clients which have active connections to the failed service

c) rebalance subsequent requests among the remaining "Alive" services

Thank you,

Christopher

Kanwaljeet Singh · ‎10-15-2012

Hi Christopher,

I remember seeing this while working on a case where the issue was due to a service being put in "suspend" mode.

I just double checked the configuration guide and here is what it says:

The state of the service. The State field displays the service as Alive, Dying, Down, or Suspended. The Dying state reports that a service is failing according to the parameters configured in the following service mode commands: keepalive retryperiod, keepalive frequency, and keepalive maxfailure.

When a service enters the Down state, the CSS does not forward any new connections to it (the service is removed from the load-balancing rotation for the content rule). However, the CSS keeps all existing connections to the service (that is, connections to that service are not “torn down”).

This behavior changes if you are using sticky. Please see the configuration guide for sticky-serverdown-failover. Maybe you would like to configure sticky based on src ip and dst port (which basically means that as long as the src ip and dst port remain the same, the request is sent to the same server); that will change the behavior. Have a look below:

Configuring Sticky Serverdown Failover

The sticky failover default method is for the CSS to use the configured load-balancing method. Use the sticky-serverdown-failover command to define what will happen if a sticky string is found but the associated service has failed or is suspended.

The syntax and options for this content mode command are:

• sticky-serverdown-failover balance - Sets the failover method to use a service based on the configured load-balancing method.

• sticky-serverdown-failover redirect - Sets the failover method to use the redirect string configured on a content rule. This command supports a 252-character redirect string (URL). If you do not configure a redirect string on a content rule, the load-balancing method is used.

• sticky-serverdown-failover reject - Rejects the content request.

• sticky-serverdown-failover sticky-srcip - Sets the failover method to use a service based on the client source IP address.

• sticky-serverdown-failover sticky-srcip-dstport - Sets the failover method to use a service based on the client source IP address and the server destination port.

Try this out if you can and let me know if that helps.

Regards,
Kanwal

christopher.nemec · ‎10-25-2012

I double-checked that I have "sticky-serverdown-failover balance" in my content rule. It is the default value anyway, and the load balancer should rebalance if a service goes down. However, the active connections are still not being dropped and keep being forwarded to the "down" service. Are there any troubleshooting/debugging commands I can use to investigate my problem further?

Cesar Roque · ‎10-25-2012

Hi Christopher,

The explanation for this is that when a service is suspended or marked as down, this just prevents the service from being used for new flows; existing flows will continue to use the service. For TCP flows, the CSS has the command flow-reset-reject, which allows the CSS to send a TCP RST when a flow is mapped to a destination IP address that is no longer reachable.

Test with flow-reset-reject in the Content Rule.

---------------------
Cesar R
ANS Team


christopher.nemec · ‎10-29-2012

Thank you for your input. The content rule is configured to send a "RST" if a frontend service becomes unavailable.

Please let me know, if my conclusions are correct:

a) the content rule needs to be configured with the "flow-reset-reject" command, which will cause the CSS to send a "RST" on all TCP connections that are currently connected to the "down" frontend

b) more important: the clients need to "obey" the TCP RST; if they ignore it, they will not be rebalanced and will thus be stuck on the dead frontend

c) clients need to create new flows via a three-way TCP handshake in order to be rebalanced to a service which is "alive"

Point b) may seem a bit strange, but the client software has been custom designed and may not be 100% RFC compliant. Responding properly to a TCP RST seems to be the key issue here.
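To illustrate points b) and c), here is a minimal Python sketch of how a client could react to a reset on a persistent connection: drop the socket and open a new one, so the next handshake can be balanced to a live service. It is not the actual client software; the class name, the `connect` hook and the retry count are assumptions for illustration, it uses plain TCP rather than SSL, and resending a request is only safe if the request is idempotent.

```python
import socket


class ReconnectingClient:
    """Keeps one persistent TCP connection and reopens it after a reset."""

    def __init__(self, address, connect=socket.create_connection):
        self.address = address      # (host, port) of the VIP
        self._connect = connect     # injectable so the logic can be tested offline
        self.sock = None
        self.connects = 0           # number of connections opened so far

    def close(self):
        if self.sock is not None:
            self.sock.close()
            self.sock = None

    def request(self, payload, retries=1):
        """Send payload and return the first reply chunk, reconnecting on RST."""
        for attempt in range(retries + 1):
            try:
                if self.sock is None:
                    # New three-way handshake: the load balancer can now pick
                    # a service that is alive.
                    self.sock = self._connect(self.address)
                    self.connects += 1
                self.sock.sendall(payload)
                reply = self.sock.recv(4096)
                if not reply:
                    # Empty read means the peer closed; handle it like a reset.
                    raise ConnectionResetError("connection closed by peer")
                return reply
            except (ConnectionResetError, BrokenPipeError):
                # Obey the RST: drop the dead socket instead of reusing it.
                self.close()
                if attempt == retries:
                    raise
```

A client that instead keeps writing to the old socket after the reset never creates a new flow, which would match the "stuck on the dead frontend" symptom.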