Consolidated Job Polling

Open • ssadedin opened this issue 8 years ago • 1 comment

Bpipe currently polls each queuing system to determine if a job is still running, has finished successfully or exited with an error code.

There has been quite a bit of discussion about the pros and cons of this approach (see #211, #35), in particular the load that such frequent polling places on the queuing system, especially when many jobs run concurrently.

One solution that has been suggested is to use blocking jobs. However, blocking jobs also place load on the server, because the qsub (or equivalent) process stays running for each job. A pipeline with hundreds of parallel jobs will quickly hit system resource limits (e.g. file handles), so blocking jobs are not a solution.

Another solution is that Bpipe should not poll the queuing system at all. Bpipe currently wraps all jobs with shell code that writes the exit code of the command to a file. If the file exists, Bpipe knows the job has finished, so Bpipe could simply poll the file system.

The problem is that non-existence of the file doesn't guarantee that the job is still running. The job could have finished in a way that caused a hard exit of the job script, preventing it from writing an exit code file. How and when jobs get hard killed depends on the queuing system. In that scenario, Bpipe could be left waiting forever unless it also polls for the job status.

Perhaps Bpipe could poll for job status very infrequently to cover this case. A drawback is that by the time Bpipe asks, the queuing system will almost certainly have purged the job, so Bpipe will not be able to retrieve any specific information about why the job failed. This error information is often important, because it allows the user to adjust job parameters and correct the problem.
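To make the trade-off concrete, here is a minimal Python sketch of the "exit code file first, scheduler only occasionally" idea. It is not Bpipe's actual implementation: the function names, the `fallback_every` parameter, and the `scheduler_knows_job` callback (a stand-in for whatever status command a given queuing system offers) are all hypothetical.

```python
from typing import Callable, Optional, Tuple


def read_exit_code(exit_file: str) -> Optional[int]:
    """Return the exit code stored in exit_file, or None if it is not available yet."""
    try:
        with open(exit_file) as handle:
            text = handle.read().strip()
    except FileNotFoundError:
        return None
    # An empty file may mean the wrapper has not finished writing yet.
    return int(text) if text else None


def check_job(
    exit_file: str,
    poll_count: int,
    scheduler_knows_job: Callable[[], bool],  # hypothetical scheduler query
    fallback_every: int = 20,                 # assumed ratio, not a Bpipe setting
) -> Tuple[str, Optional[int]]:
    """Return ("finished", code), ("running", None) or ("lost", None)."""
    code = read_exit_code(exit_file)
    if code is not None:
        return ("finished", code)

    # Cheap path failed: only every Nth check do we bother the queuing system.
    if poll_count > 0 and poll_count % fallback_every == 0:
        if not scheduler_knows_job():
            # Re-read once in case the job finished between the two checks.
            code = read_exit_code(exit_file)
            if code is not None:
                return ("finished", code)
            # Hard-killed or purged: no exit code and no scheduler record.
            return ("lost", None)

    return ("running", None)
```

The `"lost"` result is exactly the drawback described above: Bpipe would know the job is gone, but not why.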

Another possibility is that Bpipe could rely on native features of the queuing system to get a callback when a job finishes. This is a great idea, but because it depends on features specific to each queuing system it can't be implemented as a general mechanism. In the first instance, something more general seems the most important thing to achieve.

Weighing all this up, the most practical solution that could be implemented for all queuing systems seems to be some kind of consolidated job polling. That is, if a pipeline has 100 concurrent jobs and each is polled every 5 seconds, Bpipe should not be issuing 100 / 5 = 20 status queries per second. It should consolidate these calls so that it issues one query every 5 seconds covering all the commands.
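A rough sketch of what consolidation could look like, in Python for illustration only. The class name, the state strings, and the `query_all` callback are assumptions; `query_all` stands in for a single scheduler-specific call that reports on many jobs at once. With 100 jobs on a 5 second interval, this brings the load from 20 queries per second down to 1 / 5 = 0.2 queries per second.

```python
from typing import Callable, Dict, Set

# Assumed state names; real queuing systems each use their own vocabulary.
ACTIVE_STATES = {"queued", "running"}


class ConsolidatedPoller:
    """Tracks many jobs but issues a single status query per polling round."""

    def __init__(self, query_all: Callable[[Set[str]], Dict[str, str]]):
        self.query_all = query_all          # hypothetical batch status query
        self.pending: Set[str] = set()      # jobs still being watched
        self.states: Dict[str, str] = {}    # last known state per job
        self.queries = 0                    # scheduler calls made so far

    def watch(self, job_id: str) -> None:
        self.pending.add(job_id)

    def poll_once(self) -> None:
        """One round: a single scheduler call covering every pending job."""
        if not self.pending:
            return
        self.queries += 1
        result = self.query_all(set(self.pending))
        for job_id in list(self.pending):
            # Jobs the scheduler no longer reports are marked "unknown" so the
            # caller can fall back to other evidence, e.g. the exit code file.
            state = result.get(job_id, "unknown")
            self.states[job_id] = state
            if state not in ACTIVE_STATES:
                self.pending.discard(job_id)
```

A driver loop would call `poll_once()` once per interval (for example every 5 seconds) regardless of how many jobs are being watched, and each command would read its own status from `states` instead of querying the scheduler itself.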

May 26 '17 23:05 ssadedin

Sadly not all job systems have blocking.

SGE has -sync, but PBS has no equivalent

May 27 '17 14:05 gdevenyi