CSS - Keepalives and Reboot

andrew.thomson · 11-13-2003

I have a problem with recognising a service as "Alive" rather than "Dead" on first keepalive attempt after a reboot.

The root cause appears to be that the first keepalive executed for the "Service" is attempted before the physical port has become active.

This means that the first keepalive fails and the "Service" is left in its initial "Dead" status until the next keepalive is triggered at the keepalive frequency.

We have a requirement that if a service is noticed to be "Dead" it is taken out of Load Balancing algorithms.....and stays out!

The only way we have achieved this so far is to run a script under cmd-sched every minute that greps the output from a show service command for "State:" and tests the next field for "Dead". If it finds a Dead status, the script then removes the service from the rule (after adding a failover service).
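To make the check described above concrete, here is a minimal sketch of the same logic in Python. It is not CSS script syntax, and the layout of the sample "show service" output is an assumption made for illustration; only the idea of finding "State:" and testing the next field comes from the description above. The function name dead_services and the service names are hypothetical.

```python
def dead_services(show_service_output):
    """Return the names of services whose State field reads Dead.

    Assumes (hypothetically) that each service is introduced by a
    'Name: <name>' field and that its status appears as 'State: <value>'.
    """
    dead = []
    name = None
    for line in show_service_output.splitlines():
        fields = line.split()
        # Look at every field except the last, so fields[i + 1] always exists.
        for i, field in enumerate(fields[:-1]):
            if field == "Name:":
                name = fields[i + 1]
            elif field == "State:" and fields[i + 1] == "Dead":
                dead.append(name)
    return dead


# Hypothetical sample output, for illustration only.
sample = """\
Name: ldap1   Index: 1
  Type: Local   State: Dead
Name: ldap2   Index: 2
  Type: Local   State: Alive
"""

print(dead_services(sample))
```

In the scheduled script, each name returned here would be the service to remove from the rule once the failover service has been added.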

*** First question: Is there a failover method that will take a service out of content rule and leave it out?

In the meantime, our method works fine....until....the system is rebooted (perhaps without operator intervention as in a power cycle).

When the system is rebooted the services all start in "Dead" status. Thus if our script runs before the first successful keepalive the service is taken out (prematurely) and manual intervention is then required to re-insert the service if it is really "alive". We have alerts for service transitions and our "operations" people can do this if required but it would be best to avoid it if possible.

One method I have been trying to configure is to use a "first time after boot" flag to insert a pause in either the keepalive scripts or the cmd-sched scripts. I can't find one!

*** Second Question: Is there a way for a script to determine whether this is its first run since boot?

*** Third Question: Is there a method to set a variable at boot (and reset it via a script)?

*** Fourth question: Is there a method to carry a variable (which would not exist after a reboot) across iterations of a script run from cmd-sched?

peuchert · 11-13-2003

Hi,

my question is: Why do you want to take a service out of a content rule when it is marked as 'dead'? Maybe there is another way of providing that functionality.

-alex

andrew.thomson · 11-13-2003

Alex,

Any service that fails may need remedial action before it is returned to the balancing rule.

Most of our services will be usable as soon as their ports are open and they respond to keepalives.

However, we have multiple LDAP suppliers (used for writes) but only balance client access to one of them: a rule with a single service.

When that single service fails, we remove it and add a failover service. We wish to ensure the failed service stays out of client reach while we do remedial work, making sure it has received all updates from the new master by replication on the standard LDAP listening port before we return it to service.

The load balancer is not actually balancing anything, just providing the ability to switch from one service to another. We would like this switch to be automatic, but without failing back as would happen with a sorry-server config.

peucale · 11-19-2003

Hi,

sorry for not responding immediately.

How about taking the output of 'show uptime' and comparing it against a startup threshold such as 2 minutes? You could return 'alive' for any uptime below that threshold...
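As a rough sketch of that idea (in Python rather than CSS script, purely to show the arithmetic), the uptime text can be turned into seconds and compared with the threshold. The uptime format handled here, an optional "N days" followed by HH:MM:SS, is an assumption, and the names uptime_seconds, in_boot_grace_period and GRACE_SECONDS are hypothetical.

```python
import re

GRACE_SECONDS = 120  # the 2-minute startup threshold suggested above


def uptime_seconds(text):
    """Convert uptime text such as '0 days 00:01:30' into seconds.

    The format (optional 'N days', then HH:MM:SS) is an assumption.
    """
    match = re.search(r"(?:(\d+)\s+days?\s+)?(\d+):(\d+):(\d+)", text)
    if match is None:
        raise ValueError("no uptime found in: %r" % text)
    # A missing 'days' group comes back as None, so treat it as 0.
    days, hours, minutes, seconds = (int(g or 0) for g in match.groups())
    return ((days * 24 + hours) * 60 + minutes) * 60 + seconds


def in_boot_grace_period(text, grace=GRACE_SECONDS):
    """True while the uptime is still below the startup threshold."""
    return uptime_seconds(text) < grace


print(in_boot_grace_period("Uptime: 0 days 00:01:30"))
print(in_boot_grace_period("Uptime: 1 days 00:00:05"))
```

While in_boot_grace_period is true, the scheduled script would skip the "Dead" check (or the keepalive would report 'alive'), so a service is not removed before its first real keepalive has had a chance to succeed.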

-alex

andrew.thomson · 11-19-2003

alex,

sorry, I should have said that I have tried doing this....it seemed the most obvious at the time....but...

I am having trouble with the output of show uptime and comparisons. The output can be stored in a variable, but I cannot convert it into a number for valid comparisons using the limited script capabilities on the CSS. Am I missing something?

Of course I could do it by monitoring from outside the CSS on a UNIX box but this is something I do not want to do.

Does anyone know if this is something that HSE might help with?