10-01-2013 01:41 PM - last edited on 03-25-2019 01:31 PM by ciscomoderator
I have a ticket in with Cisco but no response yet so thought I'd check here.
I have Tidal 6.0.3, which has been relatively stable for 2 months now. I have some Windows servers with agents 3.1.0.9 that have been running for several months, and now jobs hang in "Launched" status on these.
When this has happened in the past, a restart of the agent or a reboot of the agent host resolved it and jobs could run again. This time it's not working. I even restarted all Tidal services to see if that cleared anything up, with the same results.
No changes/updates to the servers. Has anyone had this before? Can someone point out a resolution to look at? My Scheduler group is starting to hound me about getting these agents going again.
10-01-2013 01:59 PM
Launched is the state before a job goes active.
Job lifecycle from help:
Waits in the production schedule for its dependencies to be met.
Enters a queue and waits for an execution slot to become available.
Launches on its designated agent.
Starts execution successfully on its designated agent.
Completes normally.
So the agent should have been assigned the job by the master, and the job is getting ready for execution (goes active).
It should be getting a PID.
Does the job status tab have an External ID? And does that ID/PID exist on the Tidal Agent as being active?
(Tidal External ID= Server PID)
Remote to the agent, open Task Manager, go to the Processes tab, select the menu item View > Select Columns, and choose PID (Process Identifier). Make sure the check box is checked for [x] Show processes from all users.
Look for the External ID in the PID column.
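If you would rather script the Task Manager check, here is a minimal Python sketch. It assumes Python is available on the agent host and relies on the standard Windows `tasklist /FO CSV /NH` command; the function names, the sample output, and the PID 4312 are made up for illustration and are not part of Tidal.

```python
import csv
import io
import subprocess


def pid_in_tasklist(tasklist_csv: str, pid: int) -> bool:
    """Return True if `pid` appears in `tasklist /FO CSV /NH` output."""
    for row in csv.reader(io.StringIO(tasklist_csv)):
        # Columns: Image Name, PID, Session Name, Session#, Mem Usage
        if len(row) >= 2 and row[1].strip() == str(pid):
            return True
    return False


def agent_has_pid(pid: int) -> bool:
    """Run tasklist on the agent host (Windows only) and look for the PID."""
    result = subprocess.run(
        ["tasklist", "/FO", "CSV", "/NH"],
        capture_output=True, text=True, check=True,
    )
    return pid_in_tasklist(result.stdout, pid)


# Hypothetical sample of tasklist output, used to show the parsing step.
SAMPLE = (
    '"System Idle Process","0","Services","0","8 K"\n'
    '"cmd.exe","4312","Console","1","3,120 K"\n'
)
found = pid_in_tasklist(SAMPLE, 4312)  # 4312 stands in for the External ID
```

On the agent itself you would call `agent_has_pid(<External ID>)` from an elevated prompt so that processes from all users are listed.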
If it is there (probably using no CPU/memory), then the problem is likely on the agent side and could be the code itself.
If it is not there (more likely), then the master was unable to communicate with the agent and you can investigate the master logs (check the agent communication port, increase the logging level to high debug, get Cisco to assist, check the network, etc.).
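For the communication port check, a rough Python sketch run from the master host could look like the following. The host name and port are placeholders you would take from your own agent connection definition, and `port_reachable` is just an illustrative helper, not a Tidal tool.

```python
import socket


def port_reachable(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution errors.
        return False


# Example (placeholders, substitute your agent host and its configured port):
# port_reachable("agent-host.example.com", 12345)
```

A successful connect only shows that something is listening on that port and the network path is open. It does not prove the agent is actually processing requests, so the master logs are still worth a look.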
Hope this helps
10-01-2013 03:08 PM
"Launched" status means the master sent a request to the agent, but the agent did not process it. Since a reboot did not resolve the problem, I recommend you try deleting the agent working directory.
Good luck.
10-03-2013 11:27 AM
In 5.3, when all else failed (the suggestions above), I used sacmd to force the status of the jobs to completed normally or completed abnormally depending on what users need.
Obviously, if the underlying problem is not fixed (e.g. a network issue), the stuck-in-Launched condition will continue.
In 6.1, the stuck-in-Launched jobs I have seen seem to recover with a failover (or master bounce) like you mentioned.
10-03-2013 12:50 PM
Looks like a corrupted or bad file event (although these file events have been running for several months or longer). Spent 2 hours with Support, and although there are only 14 file event jobs associated with this agent, one of them was the culprit hanging up the response. Disabled all file events, restarted the agent, and jobs worked; enabled the file events again, and jobs still work.
10-03-2013 03:01 PM
Thanks for updating us. I always find it helpful to know different things to look for. Did they say this was bug related or a weird anomaly? We are on a different version but always something good to watch out for.