Potential for SLURMCluster log directory creation failing
Describe the issue:
I am using dask_jobqueue.SLURMCluster in my workflows and run ~15 workflows at a time. Occasionally two separate workflows appear to try to create the output log directory at the same moment, and one of them fails with:
  File "/datasets/work/jones-storage/work/miniconda/miniforge3/envs/flint_main/lib/python3.12/site-packages/dask_jobqueue/core.py", line 661, in __init__
    self._dummy_job # trigger property to ensure that the job is valid
    ^^^^^^^^^^^^^^^
  File "/datasets/work/jones-storage/work/miniconda/miniforge3/envs/flint_main/lib/python3.12/site-packages/dask_jobqueue/core.py", line 690, in _dummy_job
    return self.job_cls(
           ^^^^^^^^^^^^^
  File "/datasets/work/jones-storage/work/miniconda/miniforge3/envs/flint_main/lib/python3.12/site-packages/dask_jobqueue/slurm.py", line 37, in __init__
    super().__init__(
  File "/datasets/work/jones-storage/work/miniconda/miniforge3/envs/flint_main/lib/python3.12/site-packages/dask_jobqueue/core.py", line 375, in __init__
    os.makedirs(self.log_directory)
  File "<frozen os>", line 225, in makedirs
FileExistsError: [Errno 17] File exists: 'flint_logs'
Minimal Complete Verifiable Example:
Because a race condition is involved, I cannot give a concise, reproducible example. My guess is that both workflows get past this point before the directory exists:
https://github.com/dask/dask-jobqueue/blob/d562e6c8c0560f8ab84d12edb70726f536fd551f/dask_jobqueue/core.py#L379
and only one of them then succeeds in creating the directory.
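Would it make sense to use pathlib.Path here? Something like
from pathlib import Path  # at module level

if self.log_directory:
    log_path = Path(self.log_directory)
    # exist_ok=True makes a second, concurrent creation attempt a no-op
    log_path.mkdir(exist_ok=True, parents=True)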
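To illustrate the suspected failure without depending on timing, here is a minimal sketch that mimics a stale existence check. The helpers `create_unsafe` and `create_safe` are hypothetical names made up for this example and are not part of dask_jobqueue; the check-then-create pattern is only an assumption about what the linked code does.

```python
import os
import tempfile
from pathlib import Path


def create_unsafe(log_directory: str, exists_at_check_time: bool) -> None:
    """Hypothetical check-then-create helper.

    The result of the existence check is passed in explicitly so that a
    stale check (another process created the directory in between) can be
    simulated deterministically.
    """
    if not exists_at_check_time:
        os.makedirs(log_directory)  # raises if the directory now exists


def create_safe(log_directory: str) -> None:
    """Idempotent creation: an already existing directory is not an error."""
    Path(log_directory).mkdir(parents=True, exist_ok=True)


with tempfile.TemporaryDirectory() as tmp:
    log_dir = os.path.join(tmp, "flint_logs")

    # Workflow A: saw no directory and created it.
    create_unsafe(log_dir, exists_at_check_time=False)

    # Workflow B: also saw no directory, but A has created it since.
    try:
        create_unsafe(log_dir, exists_at_check_time=False)
        unsafe_error = None
    except FileExistsError as exc:
        unsafe_error = exc

    # The pathlib variant can be called repeatedly without failing.
    create_safe(log_dir)
    create_safe(log_dir)
    safe_ok = os.path.isdir(log_dir)
```

Note that `exist_ok=True` only tolerates an existing directory; if the path exists as a regular file, `mkdir` still raises `FileExistsError`.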
Anything else we need to know?:
Environment:
- Dask version:
- Python version: Python 3.12.8
- Operating System: SLES 15.5
- Install method (conda, pip, source): pip
- dask_jobqueue: 0.9.0
Hi @tjgalvin, thanks for raising this!
os.makedirs also has an exist_ok kwarg that you can set to True (False by default). So I guess you could just use this, but Path should be alright too.
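As a rough sketch of the `os.makedirs` variant, the snippet below has 15 threads create the same directory concurrently, loosely standing in for the ~15 workflows mentioned above. The function name `ensure_log_directory` is hypothetical and chosen for illustration only; threads are used here in place of separate processes to keep the example self-contained.

```python
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor


def ensure_log_directory(log_directory: str) -> str:
    """Create the directory if needed; tolerate it already existing."""
    os.makedirs(log_directory, exist_ok=True)
    return log_directory


with tempfile.TemporaryDirectory() as tmp:
    target = os.path.join(tmp, "flint_logs")

    # All callers race to create the same directory. pool.map re-raises
    # any exception from a worker, so reaching the next line means every
    # call succeeded.
    with ThreadPoolExecutor(max_workers=15) as pool:
        results = list(pool.map(ensure_log_directory, [target] * 15))

    all_succeeded = len(results) == 15
    directory_exists = os.path.isdir(target)
```

With the default `exist_ok=False`, any caller that loses the race would raise `FileExistsError`, which matches the traceback in the report.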
Would you open a PR to fix this?