I have a ticket in with Cisco but no response yet so thought I'd check here.
I have Tidal 6.0.3, which has been relatively stable for 2 months now. I have some Windows servers with agents 18.104.22.168 that have been running for several months, and now jobs hang in "Launched" status on these.
When this has happened in the past, a restart of the agent or a reboot of the agent host resolved it and jobs could run again. This time it's not working. I even restarted all Tidal services to see if that cleared anything up, with the same results.
No changes/updates to the servers. Has anyone had this before? Can someone point me to a resolution to look at? My Scheduler group is starting to hound me about getting these agents going again.
Launched is the state before a job goes active.
Job lifecycle from help:
Waits in the production schedule for its dependencies to be met.
Enters a queue and waits for an execution slot to become available.
Launches on its designated agent.
Starts execution successfully on its designated agent.
So the agent should have been assigned the job by the master, and the job is getting ready for execution (goes active).
It should be getting a PID.
Does the job status tab have an External ID? And does that ID/PID exist on the Tidal agent as an active process?
(Tidal External ID= Server PID)
Remote to the agent, open Task Manager, go to the Processes tab, select View > Select Columns, and choose PID (Process Identifier). Make sure the [x] Show processes from all users check box is checked.
Look for the External ID in the PID column.
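If you'd rather not click through Task Manager, the same lookup can be scripted. This is only a rough sketch, assuming Python is available on the agent host and that the External ID really is the Windows PID as described above. It calls the standard Windows `tasklist` command; the function names are made up for illustration and are not part of Tidal.

```python
import csv
import io
import subprocess


def pid_in_tasklist_csv(csv_text: str, pid: int) -> bool:
    """Return True if `pid` appears in the PID column of tasklist CSV output."""
    for row in csv.reader(io.StringIO(csv_text)):
        # Expected columns: Image Name, PID, Session Name, Session#, Mem Usage.
        # Informational lines (e.g. "INFO: No tasks ...") have a single column.
        if len(row) >= 2 and row[1].strip() == str(pid):
            return True
    return False


def pid_is_running(pid: int) -> bool:
    """Ask Windows whether a process with this PID exists (Windows only)."""
    result = subprocess.run(
        ["tasklist", "/FI", f"PID eq {pid}", "/FO", "CSV", "/NH"],
        capture_output=True,
        text=True,
        check=False,
    )
    return pid_in_tasklist_csv(result.stdout, pid)
```

On the agent host you would call `pid_is_running(external_id)` with the External ID from the job status tab.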
If it is there (probably using no CPU/memory), then the problem is likely on the agent side and could be the code itself.
If it is not there (more likely), then the master was unable to communicate with the agent, and you can investigate the master logs (check the agent communication port, increase the logging level to high debug, get Cisco to assist, check the network, etc.).
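For the "check agent communication port" part, a quick TCP connect test run from the master host can at least rule out a blocked or dead port. A minimal sketch, assuming Python is available on the master; the host name and port are placeholders, so use whatever is configured in the agent's connection definition. A successful connect only shows that something is listening there, not that the agent is healthy.

```python
import socket


def port_is_reachable(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False
```

Example: `port_is_reachable("agent-host.example.com", agent_port)`, where both values are hypothetical stand-ins for your own agent settings.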
Hope this helps
"Launched" status means the master sent a request to the agent, but the agent did not process it. Since a reboot did not resolve the problem, I recommend you try deleting the agent working directory.
In 5.3, when all else failed (the suggestions above), I used sacmd to force the status of the jobs to completed normally or completed abnormally depending on what users need.
Obviously, if the underlying problem is not fixed (e.g. a network issue), jobs will continue to get stuck in Launched.
In 6.1, the stuck-in-Launched jobs I have seen seem to recover with a failover (or master bounce), like you mentioned.
Looks like a corrupted or bad file event (although these file events have been running for several months or longer). I spent 2 hours with Support, and although only 14 file event jobs are associated with this agent, one of them was the culprit hanging up the response. We disabled all file events, restarted the agent, and jobs worked; then we enabled the file events again, and jobs still work.
Thanks for updating us. I always find it helpful to know different things to look for. Did they say this was bug related or a weird anomaly? We are on a different version but always something good to watch out for.