Jobs hanging in "Launched" status on some agents

Carol Kelpin · 10-01-2013

I have a ticket in with Cisco but no response yet so thought I'd check here.

I have Tidal 6.0.3, which has been relatively stable for 2 months now. I have some Windows servers with 3.1.0.9 agents that have been running for several months, and now jobs hang in "Launched" status on these.

When this has happened in the past, restarting the agent or rebooting the agent host resolved it and jobs could run again. This time it's not working. I even restarted all Tidal services to see if that cleared anything up, with the same results.

No changes or updates were made to the servers. Has anyone seen this before? Can someone point me to a resolution to look at? My Scheduler group is starting to hound me about getting these agents going again.

Marc Clasby · 10-01-2013

Launched is the state before a job goes active.

Job lifecycle from help:

Waits in the production schedule for its dependencies to be met.

Enters a queue and waits for an execution slot to become available.

Launches on its designated agent.

Starts execution successfully on its designated agent.

Completes normally.

So the agent should have been assigned the job by the master, and the job is getting ready for execution (going active).

It should be getting a PID.

Does the job status tab have an External ID? And does that ID/PID exist on the Tidal Agent as an active process?

(Tidal External ID = server PID)

Remote to the agent, open Task Manager, go to the Processes tab, and choose View > Select Columns > PID (Process Identifier). Make sure the "Show processes from all users" box is checked.

Look for the External ID in the PID column.
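
If you would rather check from a command prompt than from Task Manager, here is a minimal sketch of the same lookup in Python. It assumes Python is installed on the agent host and uses the standard Windows `tasklist` command with a PID filter. The function names are mine, not anything from Tidal, and the External ID has to be copied by hand from the job status tab.

```python
import csv
import io
import subprocess
from typing import List, Optional


def parse_tasklist_csv(output: str) -> List[dict]:
    """Parse `tasklist /FO CSV /NH` output into dicts with image name and PID."""
    rows = []
    for fields in csv.reader(io.StringIO(output)):
        # Process rows carry the image name first and the PID second.
        # The informational "no tasks" message is a single field and is skipped.
        if len(fields) >= 2 and fields[1].strip().isdigit():
            rows.append({"image": fields[0], "pid": int(fields[1])})
    return rows


def find_process(external_id: int) -> Optional[dict]:
    """Return the process whose PID equals the job's External ID, or None.

    Runs only on Windows, because it shells out to tasklist.
    """
    result = subprocess.run(
        ["tasklist", "/FI", f"PID eq {external_id}", "/FO", "CSV", "/NH"],
        capture_output=True,
        text=True,
        check=False,
    )
    for row in parse_tasklist_csv(result.stdout):
        if row["pid"] == external_id:
            return row
    return None
```

Calling `find_process(4321)` on the agent host (with 4321 replaced by the job's External ID) returns the image name and PID if the process exists, and `None` if it does not.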

If it is there (probably using no CPU/memory), then the problem is likely on the agent side and could be the code itself.

If it is not there (more likely), then the master was unable to communicate with the agent, and you can investigate the master logs (check the agent communication port, increase the logging level to high debug, get Cisco to assist, check the network, etc.).
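
For the communication port check, a quick TCP connect test run from the master host can rule out a firewall or network block. This is only a sketch: the function name is hypothetical, and the host and port must come from your own agent connection definition. A successful connect shows the port is reachable, not that the agent is processing requests.

```python
import socket


def agent_port_reachable(host: str, port: int, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False
```

For example, `agent_port_reachable("agent-host.example.com", 12345)` with the placeholder host and port replaced by the values configured for the agent connection.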

Hope this helps

Tracy Donmoyer · 10-01-2013

"Launched" status means the master sent a request to the agent, but the agent did not process it. Since a reboot did not resolve the problem, I recommend you try deleting the agent working directory.

From the Tidal Client, disable the agent (Administration, Connections)
Log on to the Windows server
Stop the agent service
Go to the Tidal Agent directory - \Program Files\TIDAL\Agent
Delete the TIDAL_AGENT_1 directory.
Restart the agent service; this will recreate the TIDAL_AGENT_1 directory
From the Tidal Client, enable the agent
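
If you have to repeat this on several agents, the delete step alone can be sketched in Python as below. The function name and the `dry_run` flag are my own additions. Disabling the agent and stopping and starting the service are left as manual steps, since the service name is not given here. Run it only after the agent service is stopped.

```python
import shutil
from pathlib import Path


def remove_agent_working_dir(agent_home: str,
                             instance: str = "TIDAL_AGENT_1",
                             dry_run: bool = True) -> bool:
    """Delete the agent instance working directory under agent_home.

    Returns True if the directory existed, False otherwise.
    With dry_run=True (the default) nothing is deleted; the path is only printed.
    """
    target = Path(agent_home) / instance
    if not target.is_dir():
        return False
    if dry_run:
        print(f"Would delete {target}")
    else:
        shutil.rmtree(target)
    return True
```

Call it first with the default `dry_run=True` to confirm the path, then again with `dry_run=False`, passing the agent directory from the steps above as `agent_home`.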

Good luck.

Carolanne Fougerat · 10-03-2013

In 5.3, when all else failed (including the suggestions above), I used sacmd to force the status of the jobs to Completed Normally or Completed Abnormally, depending on what the users needed.

Obviously, if the underlying problem is not fixed (e.g. a network issue), jobs will continue to get stuck in Launched.

In 6.1, the stuck-in-Launched jobs I have seen seem to recover with a failover (or master bounce), like you mentioned.

Carol Kelpin · 10-03-2013

It looks like a corrupted or bad file event (although these file events have been running for several months or longer). I spent 2 hours with Support, and although only 14 file event jobs are associated with this agent, one of them was the culprit that hung up the response. We disabled all file events and restarted the agent, and jobs worked; we then re-enabled the file events, and jobs still work.

Carolanne Fougerat · 10-03-2013

Thanks for updating us. I always find it helpful to know different things to look for. Did they say this was bug related or a weird anomaly? We are on a different version but always something good to watch out for.