Slurm srun blocking of resources

Open stevenayoub opened this issue 3 years ago • 2 comments

Hello,

We are currently using the Toil Python API to run a series of molecular simulations (within a Python script) on a Slurm cluster. The Python script is submitted through sbatch, which sets the maximum number of tasks per node to 12. Toil seems to utilize all 12 cores throughout the workflow if we execute our parallel simulations using mpirun. However, if we execute our parallel simulations with srun, there seems to be a blocking of resources and jobs, which results in: srun: Job 142324 step creation still disabled, retrying (Requested nodes are busy). Toil continues with the workflow, but only a single parallel simulation runs at a time, while there are cores not being used for the workflow.

If this is a Toil issue and you have any idea why this may be occurring, it would be great to hear back. Thank you.

┆Issue is synchronized with this Jira Story ┆friendlyId: TOIL-1225

stevenayoub, Sep 29 '22 00:09

@sja58429 Do you have a minimal script that can reproduce this behaviour?

Hexotical, Sep 29 '22 21:09

Hello @Hexotical, here's a minimal example of the problem. I ran an interactive session and allocated 6 cores; however, the executable launched with srun -n 2 only runs a single job at a time.

srun_hello.py.zip
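
The attachment is not reproduced in this thread, so the following is only a rough sketch of the pattern described above, not the contents of srun_hello.py. It assumes that each Toil job boils down to launching one parallel step from inside the existing allocation; the helper names `build_step_command` and `run_steps` and the `./hello` executable are hypothetical, and plain threads stand in for Toil jobs so the sketch needs only the standard library.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor


def build_step_command(launcher, ntasks, executable):
    """Build the argv for one parallel step.

    launcher is "srun" or "mpirun"; both accept "-n <ntasks>".
    executable is a list such as ["./hello"] (hypothetical program).
    """
    return [launcher, "-n", str(ntasks)] + list(executable)


def run_steps(commands, max_parallel):
    """Start up to max_parallel commands at once from the current shell.

    Returns one (returncode, stdout) tuple per command, in input order.
    Threads are used here only as a stand-in for concurrently running
    Toil jobs; this is an assumption, not how Toil schedules work.
    """
    def run_one(cmd):
        proc = subprocess.run(cmd, capture_output=True, text=True)
        return proc.returncode, proc.stdout

    with ThreadPoolExecutor(max_workers=max_parallel) as pool:
        return list(pool.map(run_one, commands))
```

Inside a 6-core allocation, calling `run_steps([build_step_command("srun", 2, ["./hello"])] * 3, max_parallel=3)` would attempt to start three 2-task steps at the same time, which is the situation where the "step creation still disabled" message appears. Swapping `"srun"` for `"mpirun"` gives the variant that reportedly uses all cores.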

stevenayoub, Sep 29 '22 22:09