Solved: CSM : Strange Round-Robin behaviour

yves.haemmerli · ‎08-28-2006

Hi,

During validation tests, I am observing strange server selection behaviour on independent GET requests (no cookie included). When a connection request arrives at the CSM within about 20 seconds of the previous request, the CSM correctly selects the next real server in the farm (round-robin). However, if the second connection arrives more than 20 seconds after the previous request, the CSM selects the same server as for the previous one. It looks as if the round-robin algorithm were "reset" after this period of time.

Is this normal behaviour?

By the way, how is the server list organized in the CSM RR algorithm?

Thank you

Yves Haemmerli

Gilles Dufour · ‎08-28-2006

Yves,

Be aware that the default weight is 8, so you may see a few connections going to one server and then a few connections going to the other server.

So, with a small number of connections, you may see irregular load balancing; in the long term, however, the number of connections going to each server should be more or less the same.

You may try to set the weight for each real to 1 for your own tests and then set it back to the default.

Verify also that you are looking at connections and not requests.

The CSM loadbalances connections. So if you have many requests per connection, they will go to the same server.
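A minimal sketch of why a weight of 8 can look uneven over a handful of connections. It assumes a simple weighted round-robin that sends `weight` consecutive connections to one real before moving to the next; that model is an assumption made for illustration, not a confirmed description of the CSM internals, and the server names are hypothetical.

```python
from collections import Counter
from itertools import islice


def weighted_round_robin(servers, weight):
    """Yield server names, sending `weight` consecutive connections to each
    server before moving on (an assumed model, not confirmed CSM behaviour)."""
    while True:
        for server in servers:
            for _ in range(weight):
                yield server


def distribution(servers, weight, connections):
    """Count how many of the first `connections` picks land on each server."""
    picks = islice(weighted_round_robin(servers, weight), connections)
    return Counter(picks)


if __name__ == "__main__":
    reals = ["real-a", "real-b"]  # hypothetical server names
    for weight in (8, 1):
        for total in (10, 1600):
            counts = distribution(reals, weight, total)
            print(f"weight={weight} connections={total} -> {dict(counts)}")
```

Under this model, 10 connections with weight 8 split 8 to 2, while the same 10 connections with weight 1 split 5 to 5; over 1600 connections both weights end up at 800 each.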

Gilles.

yves.haemmerli · ‎08-28-2006

Gilles,

I did the test with weight = 1 on each of the eight servers in the farm, but the same symptom occurs. To investigate, I wrote a TCL script on my PC that opens new connections to the CSM and sends an HTTP GET, as a client station would. Once a TCP session is opened, I keep it open for one hour. Note that I don't send any cookie in the GET requests. With this script, I can generate multiple sessions. If I tell the script to generate 32 sessions, they are established one after the other with a small delay between them; in this case, the CSM distributes them perfectly across all servers. But if the script generates only one session and I start the script every five seconds, the distribution is quite uneven:

Note: a weight of 1 is configured:

DESIT520  160.213.139.163  inService  18
DESIT519  160.213.139.164  inService  15
DESIT518  160.213.139.165  inService   8
DESIT517  160.213.139.166  inService   5
DESIT020  160.213.139.171  inService   2
DESIT019  160.213.139.172  inService   1
DESIT018  160.213.139.173  inService   0
DESIT017  160.213.139.174  inService   0

The CSM Code is 2.1(2a)
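For anyone wanting to reproduce the test, a session generator of the kind described above could be sketched as follows. This is a Python approximation, not the original TCL script; the function names, the `delay` and `path` parameters, and the exact request headers are assumptions made for illustration.

```python
import socket
import time


def open_sessions(host, port, count, delay=0.0, path="/"):
    """Open `count` TCP connections, send one cookie-less HTTP GET on each,
    and return the still-open sockets so the caller can hold them open."""
    request = (
        f"GET {path} HTTP/1.1\r\n"
        f"Host: {host}\r\n"
        "Connection: keep-alive\r\n"
        "\r\n"
    ).encode("ascii")
    sessions = []
    for i in range(count):
        sock = socket.create_connection((host, port), timeout=10)
        sock.sendall(request)  # no Cookie header is sent
        sessions.append(sock)
        if delay and i < count - 1:
            time.sleep(delay)  # spacing between consecutive sessions
    return sessions


def close_sessions(sessions):
    """Close every socket returned by open_sessions."""
    for sock in sessions:
        sock.close()
```

Calling `open_sessions(vip, port, count=32)` would mimic the burst case, while repeated calls with `count=1` a few seconds apart would mimic the slow case; the sockets stay open until `close_sessions` is called.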

I really don't see where the problem could be. Is there a timer that would reset the round-robin selection?

Yves Haemmerli

Gilles Dufour · ‎08-28-2006

Could you send the part of the config involved in this test?

Thanks,

Gilles.

yves.haemmerli · ‎08-29-2006

Hi Gilles,

I attach the relevant part of the CSM configuration. Notice that I configured weight = 1 as you suggested for the test.

I also tried to reload the CSM, but the problem remains.

Thank you for your help

Yves

Gilles Dufour · ‎08-29-2006

I just did the test and everything is ok for me.

gdufour-cat6k-2#sho mod csm 3 serv name linux1-all det
LINUX1-ALL, type = SLB, predictor = RoundRobin
nat = SERVER, CLIENT(RTSP)
virtuals inservice = 3, reals = 8, bind id = 0, fail action = none
inband health config:
retcode map =
Real servers:
L1:80, weight = 1, OPERATIONAL, conns = 1
L1:81, weight = 1, OPERATIONAL, conns = 1
L1:82, weight = 1, OPERATIONAL, conns = 1
L1:83, weight = 1, OPERATIONAL, conns = 1
L1:84, weight = 1, OPERATIONAL, conns = 1
L1:85, weight = 1, OPERATIONAL, conns = 1
L1:86, weight = 1, OPERATIONAL, conns = 1
L1:87, weight = 1, OPERATIONAL, conns = 1
Total connections = 8
gdufour-cat6k-2#

Could you sniff the CSM portchannel and run your test?

Capture the same show command and send me everything.

Gilles.

yves.haemmerli · ‎08-29-2006

Hi Gilles,

OK, I will set up a trace. It will take some time, as the impacted portal is in a remote data center. As soon as I have the information, I will send it to you in a reply on this forum.

Thanks again for your support,

Yves Haemmerli

yves.haemmerli · ‎08-30-2006

Gilles,

As you requested, I am sending you the traces showing that when sessions are established with a long delay between them, the round-robin load balancing is not consistent.

Here is some important information:

- Client IP address is 141.122.142.197

- VIP address is 160.213.139.14

- We NAT the client with the VIP address

- Servers addresses are :

-> 160.213.139.163

-> 160.213.139.164

-> 160.213.139.165

-> 160.213.139.166

-> 160.213.139.171

-> 160.213.139.172

-> 160.213.139.173

-> 160.213.139.174

I ran two tests. For each of them I am sending you one trace showing the frames (HTTP on TCP port 26000) between the client and the CSM, and a second trace showing the frames between the CSM and the servers. In the traces, please ignore the HTTP code 401 returned by the servers. Also, note that the sessions are kept open by my session generator for the purpose of the test.

In the first test, I sent 16 sessions in a row without delay between them. Load balancing is perfect: each of the eight servers receives 2 sessions.

Then I sent 16 sessions, one after the other, with several seconds between them. As you can see, the load balancing is uneven in this case.

I can't understand the behaviour as the GET requests in both tests are exactly the same...

Thank you for your help,

Yves Haemmerli

Gilles Dufour · ‎08-31-2006

Yves,

I just do not see this behavior with the same setup and sending the same GET request and having my server returning a 401 as well.

The only difference is that your server closes the connection with a FIN and the FIN packet contains data. My server sends the FIN separately.

Anyway, from what you are doing, it seems like you are more interested in 'predictor leastconn' than roundrobin.

The command output you captured shows the total active connections. The roundrobin algorithm does not look at active connections but at the total established connections.

So you should look at the command 'sho mod csm 1 real sfarm PORTAL-PROD det'

Could you change the predictor algorithm to leastconn and see if it's what you need?
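To illustrate the distinction, here is a toy simulation of the two predictors. It is a sketch under stated assumptions, not the CSM implementation: the `simulate` function, its parameters, the lowest-index tie-breaking rule, and the pattern of connections that close early are all made up for the example.

```python
def simulate(predictor, num_servers, arrivals, closes_early):
    """Return active connections per server after `arrivals` connections.

    Connection i is closed right after being assigned when closes_early(i)
    is true. Toy model only; not the CSM implementation.
    """
    active = [0] * num_servers
    next_index = 0  # round-robin pointer, advanced on every new connection
    for i in range(arrivals):
        if predictor == "roundrobin":
            # Ignores how many connections are currently active.
            chosen = next_index
            next_index = (next_index + 1) % num_servers
        elif predictor == "leastconn":
            # Picks the server with the fewest active connections
            # (lowest index wins ties, an assumption of this sketch).
            chosen = min(range(num_servers), key=lambda s: active[s])
        else:
            raise ValueError(f"unknown predictor: {predictor}")
        active[chosen] += 1
        if closes_early(i):
            active[chosen] -= 1
    return active


if __name__ == "__main__":
    every_other = lambda i: i % 2 == 0  # hypothetical close pattern
    for name in ("roundrobin", "leastconn"):
        print(name, simulate(name, 8, 16, every_other))
```

With every other connection closing early, the round-robin model still hands out connections evenly, yet the active counts end up at 0 or 2 per server; the leastconn model ends with one active connection on each of the eight servers.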

Gilles.

yves.haemmerli · ‎08-31-2006

Gilles,

You are right, predictor leastconns may be closer to what we want. I tested it yesterday and load balancing is now consistent, even with long delays between connections, so I can say that it solved our problem. We now have to test the slow-start behaviour of the CSM for a final validation of the solution.

Thanks again Gilles for your assistance

Yves