Solved: Tidal Master and Agent connection

miguel.aliaga · ‎02-23-2012

Is there any way to extend the following communication timeout between the Tidal Master and Agent?

The Tidal Master pings the Tidal Agent every 20 seconds, and if there is no response after 6 pings, the master triggers a "Tidal Agent Down" event.

John Laird · ‎02-23-2012

Yes, you will have to add/set the following parameters in the master.props file which can be found on the master server under the config directory.

AgentHeartBeatInt=20

AgentHeartbeatFailureCount=6

The default values for AgentHeartBeatInt and AgentHeartbeatFailureCount in master.props are the ones shown above (20 and 6). You can raise these parameters to lengthen the recovery interval before Tidal raises the agent down event. It works as follows:

By default, the master pings every 20 seconds and tries up to 6 times, which means it waits 120 seconds (20 × 6) before raising an agent down event. As an example, you can add these parameters to master.props as 25 and 7 to override the defaults, so that the master waits 175 seconds (25 × 7) before raising an agent down event. For a 175-second wait, the parameters would be set as below.

AgentHeartBeatInt=25

AgentHeartbeatFailureCount=7

In a fault tolerant environment these parameters should be added to both the PM and BM, and just like any other master.props setting, they don’t take effect until the master is restarted.
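To sanity-check a pair of values before editing master.props, the arithmetic above can be sketched in a few lines of Python. This is only an illustration of the interval × count rule described in this post. The function names `agent_down_window` and `render_props` are made up for the example and are not part of Tidal.

```python
def agent_down_window(heartbeat_interval_s: int, failure_count: int) -> int:
    """Seconds without a response before the agent down event is raised.

    Assumes the simple rule from this thread: interval multiplied by
    the number of failed pings.
    """
    if heartbeat_interval_s <= 0 or failure_count <= 0:
        raise ValueError("both values must be positive integers")
    return heartbeat_interval_s * failure_count


def render_props(heartbeat_interval_s: int, failure_count: int) -> str:
    """Return the two master.props lines for the chosen values (hypothetical helper)."""
    return (
        f"AgentHeartBeatInt={heartbeat_interval_s}\n"
        f"AgentHeartbeatFailureCount={failure_count}"
    )


print(agent_down_window(20, 6))  # defaults: 120
print(agent_down_window(25, 7))  # example above: 175
print(render_props(25, 7))
```

The helper only produces the text of the two lines. They still have to be added to master.props by hand, and the master restarted, as described above.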

John

Tidal TAC CSE

miguel.aliaga · ‎02-23-2012

Thanks John for your support and help.

Miguel Aliaga

Lead Data Analyst | Data Management | IT & Operations | Lloyds International

E: Miguel.Aliaga@lloydsinternational.com.au

miguel.aliaga · ‎03-21-2012

Hi,

We are getting Tidal Agent drop-out alerts very frequently.

An alert is triggered when the connection between the Tidal Master and an Agent is lost, and the connection is then re-established automatically within the same minute, for instance:

22/03/2012 12:11 AM Error Lost connection to agent: SAS

22/03/2012 12:11 AM Established connection with agent: AlertII_App_Server

We tried many things, including monitoring the network (which is a little complex: VMs, third-party vendor), but couldn't find any clear reason why the drop-outs are happening, especially if you consider that other applications and processes run at the same time without similar disruption.

It's a little annoying, especially since the connection is only dropped after the Master has pinged the agent 6 times at 20-second intervals, that is, after 2 minutes without a response, which is a huge amount of time from a network perspective.

The drop-outs rarely impact jobs; when they do, Tidal puts the job in Orphaned status. Usually, if a job is running when the alert is triggered, nothing happens to the job process, which makes us think the problem lies in how the Tidal Master and Agents communicate and connect.

We are still using TES 5.3

Does anyone have a similar scenario? We would appreciate it if you could share your experience.

Thanks.

Carolanne Fougerat · ‎09-10-2013

We had lots of false alerts too, and sometimes patching bounces that we really didn't care to know about.

VMs bounce really fast too, so by the time we check, the connection is already green.

From within the GUI - On System Configuration (Other Tab) - there is a setting called

wait ### seconds for connection to recover before alerting in 'Lost Connection to Agent/Adapter'

We changed that value from 0 to 300 seconds so that only after something is down for 5 minutes do we get alerted.

That falls within our Service Level agreement with our users so we increased it to that.
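For anyone wondering what a grace period like this does to alert volume, here is a rough Python sketch. It only illustrates the idea of suppressing outages shorter than the wait value. It is not how Tidal implements the setting, the agent names and durations are invented sample data, and whether Tidal treats an outage of exactly 300 seconds as alertable is an assumption.

```python
from typing import List, Tuple

Outage = Tuple[str, int]  # (agent name, outage duration in seconds)


def outages_to_alert(outages: List[Outage], grace_seconds: int = 300) -> List[Outage]:
    """Keep only outages that lasted at least grace_seconds.

    Assumption: an outage of exactly grace_seconds still alerts.
    """
    return [(agent, secs) for agent, secs in outages if secs >= grace_seconds]


# Invented sample data: two quick bounces and one real outage.
sample = [("agent_a", 15), ("agent_b", 420), ("agent_c", 45)]

print(outages_to_alert(sample, grace_seconds=0))    # wait = 0: every drop alerts
print(outages_to_alert(sample, grace_seconds=300))  # wait = 300: only agent_b alerts
```

With the wait at 0, every short bounce produces an alert. At 300 seconds, only the outage that outlasts the window gets through.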

Normally though, on a patching outage, we disable the agent and re-enable it at the end of the window to keep from getting alerts. That was because there was a bug in 5.3 where we weren't getting alerted when an agent went red after an outage.

This appears to work in 6.1 now, so we have started using outages again on our 6.1 instance.

amit_vasdev · ‎05-18-2015

Hi,

We are using TES 6.1 and we are facing the same issue as Miguel.

Lost connection to agent :agent name

and within seconds it comes back, but our jobs go to Orphaned status, which is creating a problem for us.

We would appreciate it if somebody could advise on this.