
Farm Machine Not Picking Up Jobs

Posted: Tue Oct 20, 2020 9:11 am
by ernie1901
Hi all,

I have FFAStrans installed on a network share. All the PCs in my farm can access this network share. Most of the time this works great.
However, I've seen that at random times a PC can stop taking jobs from the queue. It's still online: I can ping the API on that machine and it will still accept API job submissions or return the JSON of the job history. It just won't process anything new from the queue.

A Windows reboot fixes it and the machine takes jobs again. There seems to be no pattern to this; the machine can stop processing new jobs within hours or days of the reboot. It's not the CPU limit either, as the PC isn't busy when it stops receiving new jobs.
It seems to render the machine pretty useless other than for API access, as even a right-click manual submit on a workflow doesn't make the machine pick up the job.

Have you seen this behaviour before or do you have an idea of what could be causing it? I'm running Windows 10.

Thanks.

Re: Farm Machine Not Picking Up Jobs

Posted: Tue Oct 20, 2020 10:17 am
by momocampo
Hello Ernie,
OK, so first: is it always the same host (the same PC that stops taking jobs)?
When the issue occurs, have you checked that the "FFAStrans rest-api service" is still "running"? (Manage/Services)
I have sometimes seen strange behaviour, for example the REST service stopping for no reason.
;)
B.

Re: Farm Machine Not Picking Up Jobs

Posted: Tue Oct 20, 2020 10:31 am
by ernie1901
Thanks for the quick reply @momocampo.

I've had it occur on 3 of 6 hosts so far.
The service is still running; I can even still send it a GET/POST request and it responds. It just doesn't do any jobs, it's really strange!
If I try to stop the host's API service and restart it, the Services window crashes and the service stalls in the 'Stopping' state. I have to reboot the machine completely.

Re: Farm Machine Not Picking Up Jobs

Posted: Tue Oct 20, 2020 10:42 am
by emcodem
Hey ernie,

I guess you are running version 1.1?

Maybe you can check out the "job ticket management" described here, check if the exe_manager is running and also check out the filesystem /tickets folder as described?
http://ffastrans.com/wiki/doku.php?id=system:processes

The interesting part is whether there are ticket files and, if so, in which folder: the temp folder, running, etc.?
If I am correct, only one server in the farm stops working. When that's the case, log on to that server using the same username/password that your FFAStrans service is set to run as, and check out the tickets on the central quorum location. Maybe you'll see an error message from Windows Explorer when you try to access the quorum share.

[EDIT] Sorry, I need to correct myself: if only one machine in the farm stops picking up tickets, then it does not matter in which folders you see tickets. Only the second question matters, so access to the quorum share from the machine that stopped working is what is really interesting. If it can still access the quorum location, please try to write a file to the /db folder too.
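
The write test can also be scripted, which makes it easy to run under the same account the service uses. This is only a minimal sketch in Python and assumes nothing about FFAStrans itself: `can_write` is a hypothetical helper name, and the folder you pass in is whatever your own quorum /db path is.

```python
import os
import uuid


def can_write(folder):
    """Try to create, read back and delete a small probe file in `folder`.

    Returns (True, None) on success or (False, error_message) on failure.
    Hypothetical helper, not part of FFAStrans.
    """
    probe = os.path.join(folder, "write_probe_%s.tmp" % uuid.uuid4().hex)
    try:
        with open(probe, "w") as handle:
            handle.write("probe")
        with open(probe) as handle:
            if handle.read() != "probe":
                return False, "read back unexpected content"
        os.remove(probe)
        return True, None
    except OSError as error:
        # Covers a missing folder, denied access and a dropped share.
        return False, str(error)
```

Calling `can_write(r"<path to your quorum db folder>")` on the machine that stopped working should return `(True, None)` if the share is reachable and writable for that user. The path is a placeholder, so fill in your own.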

Re: Farm Machine Not Picking Up Jobs

Posted: Tue Oct 20, 2020 12:22 pm
by ernie1901
Yes V1.1. Some further observations:

The machine currently not picking up jobs has multiple instances of 'exe_manager' in Windows Task Manager, but this looks like expected behaviour from the documentation you linked to.
The machine currently not responding to jobs can read and write from the db folder with no issue in Windows Explorer.

The txt file in db/cache/exe_log with the hostname was last modified yesterday. The remainder of the working hosts have a txt file modified today.
The bad host's JSON in db/configs/hosts has a last heartbeat time of over 24 hours ago, whereas the working hosts show about 10 minutes ago. That also suggests the bad host can't write back to this file for some reason, but I can open and edit it fine.
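
A stale host could be spotted automatically by watching file modification times, as a rough stand-in for the manual check above. This is a minimal sketch in Python: `stale_files` is a hypothetical helper, it only looks at the file system timestamp (it does not parse the heartbeat value inside the JSON, whose field name I don't want to guess), and the one-hour threshold in the usage note is an arbitrary assumption.

```python
import os
import time


def stale_files(folder, max_age_seconds, now=None):
    """Return the names of files in `folder` not modified within `max_age_seconds`.

    Hypothetical helper: compares each file's modification time with `now`
    (defaults to the current time) and returns the stale names, sorted.
    """
    now = time.time() if now is None else now
    stale = []
    for name in sorted(os.listdir(folder)):
        path = os.path.join(folder, name)
        if os.path.isfile(path) and now - os.path.getmtime(path) > max_age_seconds:
            stale.append(name)
    return stale
```

For example, `stale_files(r"<path to db/cache/exe_log>", 3600)` would list the per-host log files that have not been touched in the last hour, which in my case would have named the bad host.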

Re: Farm Machine Not Picking Up Jobs

Posted: Tue Oct 20, 2020 12:44 pm
by emcodem
That's great info, but unfortunately I still don't have a clue what could be causing these faults.

Would it be possible for you to zip and send us the whole cache directory, including all logs and such? If uploading here doesn't work, maybe you can use WeTransfer and send us the link in a PM?
\Processors\db\cache

Re: Farm Machine Not Picking Up Jobs

Posted: Thu Oct 22, 2020 9:11 am
by ernie1901
Hey @emcodem,

Confirming that so far (24 hours or so) this seems to be fixed on 1.1.1.0.
I will monitor and report back if that changes.

Re: Farm Machine Not Picking Up Jobs

Posted: Thu Oct 22, 2020 10:39 am
by admin
Great, but you need to be aware that 1.1.1.0 is not released yet and may be prone to other bugs. I guess Frank informed you about that ;-)

-steinar

Re: Farm Machine Not Picking Up Jobs

Posted: Tue Dec 29, 2020 10:06 am
by Den
I have a similar problem. I'm still on 0.9.4 because of the issue ernie1901 described. When I moved to 1.1.0.2, jobs were only taken by 2 machines while the other 4 machines were ignored. Almost the same HW & SW... Hope this gets addressed in the next update. It's blocking me from upgrading. Thanks

Re: Farm Machine Not Picking Up Jobs

Posted: Wed Dec 30, 2020 12:24 pm
by ernie1901
Den wrote: Tue Dec 29, 2020 10:06 am I have a similar problem. I'm still on 0.9.4 because of the issue ernie1901 described. When I moved to 1.1.0.2, jobs were only taken by 2 machines while the other 4 machines were ignored. Almost the same HW & SW... Hope this gets addressed in the next update. It's blocking me from upgrading. Thanks
What do you have set as your Windows service recovery settings?
Mine was set to ignore any failures. You can change it to restart the service.

Yes, it shouldn't fail, but this at least restarts the Windows services if they fall over whilst the issue is investigated.
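
If you would rather script the recovery setting than click through the Services console on every host, Windows' `sc.exe failure` command can set it. Below is a minimal sketch in Python that only builds the command line. `build_recovery_command` is a hypothetical helper, the 60 second restart delay and one day reset period are arbitrary assumptions, and you need to look up the actual service name on your own machine (it is not necessarily the display name shown in the Services window).

```python
def build_recovery_command(service_name, restart_delay_ms=60000, reset_seconds=86400):
    """Build the argument list for `sc.exe failure` with restart-on-failure.

    Hypothetical helper: sets "restart the service" for the first, second
    and subsequent failures. Nothing is executed here; the list is returned
    so it can be inspected before being run.
    """
    if not service_name:
        raise ValueError("service_name must not be empty")
    # Three actions: first failure, second failure, subsequent failures.
    actions = "/".join(["restart/%d" % restart_delay_ms] * 3)
    # sc.exe expects the value as a separate token after "reset=" and "actions=".
    return ["sc.exe", "failure", service_name,
            "reset=", str(reset_seconds),
            "actions=", actions]
```

From an elevated prompt the list can be passed to `subprocess.run(build_recovery_command("<your service name>"), check=True)`. Note that recovery actions only fire when the service process actually exits unexpectedly, so this would not help with a service that is still running but no longer picking up jobs.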