dask-jobqueue

Potential for SLURMCluster log directory creation failing

Open • tjgalvin opened this issue 6 months ago • 1 comment

Describe the issue:

I am using dask_jobqueue.SLURMCluster in my workflows and run about 15 workflows at a time. Occasionally two separate workflows appear to try to create the output log directory at the same moment, and one of them fails:

  File "/datasets/work/jones-storage/work/miniconda/miniforge3/envs/flint_main/lib/python3.12/site-packages/dask_jobqueue/core.py", line 661, in __init__
    self._dummy_job  # trigger property to ensure that the job is valid
    ^^^^^^^^^^^^^^^
  File "/datasets/work/jones-storage/work/miniconda/miniforge3/envs/flint_main/lib/python3.12/site-packages/dask_jobqueue/core.py", line 690, in _dummy_job
    return self.job_cls(
           ^^^^^^^^^^^^^
  File "/datasets/work/jones-storage/work/miniconda/miniforge3/envs/flint_main/lib/python3.12/site-packages/dask_jobqueue/slurm.py", line 37, in __init__
    super().__init__(
  File "/datasets/work/jones-storage/work/miniconda/miniforge3/envs/flint_main/lib/python3.12/site-packages/dask_jobqueue/core.py", line 375, in __init__
    os.makedirs(self.log_directory)
  File "<frozen os>", line 225, in makedirs
FileExistsError: [Errno 17] File exists: 'flint_logs'

Minimal Complete Verifiable Example:

Because this is a race condition, I am not able to give a concise, reproducible example. My guess is that the two workflows both get past the check at:

https://github.com/dask/dask-jobqueue/blob/d562e6c8c0560f8ab84d12edb70726f536fd551f/dask_jobqueue/core.py#L379

and then only one of them succeeds in creating the directory.
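The suspected failure mode can be sketched deterministically. This is only an illustration, assuming the library code follows a check-then-create pattern (`os.path.exists` followed by `os.makedirs`); the helper `check_then_create` and its `between` hook are hypothetical and stand in for a second workflow creating the directory in the gap between the check and the creation.

```python
import os
import tempfile


def check_then_create(path, between=None):
    """Hypothetical check-then-create helper (not dask-jobqueue code)."""
    if not os.path.exists(path):
        if between is not None:
            # Stand-in for another workflow winning the race here.
            between()
        # Without exist_ok=True this raises if the directory now exists.
        os.makedirs(path)


with tempfile.TemporaryDirectory() as tmp:
    log_dir = os.path.join(tmp, "flint_logs")
    try:
        check_then_create(log_dir, between=lambda: os.mkdir(log_dir))
        raced = False
    except FileExistsError:
        raced = True

print("FileExistsError raised:", raced)
```

The existence check passes for both callers, so the guard does not protect the later `os.makedirs` call.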

Would it make sense to use pathlib.Path here? Something like

from pathlib import Path

if self.log_directory:
    log_path = Path(self.log_directory)
    # exist_ok=True tolerates another process creating the directory first
    log_path.mkdir(exist_ok=True, parents=True)

Anything else we need to know?:

Environment:

  • Dask version:
  • Python version: Python 3.12.8
  • Operating System: SLES 15.5
  • Install method (conda, pip, source): pip
  • dask_jobqueue: 0.9.0

tjgalvin, Jul 28 '25 01:07

Hi @tjgalvin, thanks for raising this!

os.makedirs also has an exist_ok kwarg that you can set to True (False by default). So I guess you could just use this, but Path should be alright too.
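A minimal sketch of the `exist_ok` approach, not the actual patch: the helper name `ensure_log_directory` is hypothetical, and threads are used here only as a convenient stand-in for the separate workflows that all request the same log directory at once.

```python
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor


def ensure_log_directory(log_directory):
    """Hypothetical helper: create the directory, tolerating concurrent creators."""
    # exist_ok=True suppresses FileExistsError when the directory already exists.
    os.makedirs(log_directory, exist_ok=True)
    return log_directory


with tempfile.TemporaryDirectory() as tmp:
    log_dir = os.path.join(tmp, "flint_logs")
    # Many callers ask for the same directory at roughly the same time.
    with ThreadPoolExecutor(max_workers=8) as pool:
        results = list(pool.map(ensure_log_directory, [log_dir] * 32))
    created = os.path.isdir(log_dir)

print("directory exists:", created)
```

With `exist_ok=True` no separate existence check is needed, so there is no gap between check and creation for another process to fall into. `os.makedirs` still raises if the path exists but is not a directory.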

Would you open a PR to fix this?

guillaumeeb, Sep 26 '25 16:09