exp-workers failing without logs
# Bug Report

DVC exp workers dying: running multiple workers results in failed experiments and no logs.

## Description

Launching `dvc queue start` with `-j` greater than 1 causes some experiments that shouldn't fail to fail, and those experiments have no logs. Furthermore, the exp-worker itself sometimes dies along with the failed experiments.
## Reproduce

Example `params.yaml`:

```yaml
value: 1
```

Example `dvc.yaml`:

```yaml
stages:
  experiment_candles:
    cmd: sleep 5 ; echo DONE
    params:
      - params.yaml:
```
1. `git init`
2. `dvc init`
3. Copy `dvc.yaml`
4. Copy `params.yaml`
5. `git add *.yaml`
6. `git commit -m "initial commit"`
7. Queue experiments:

   ```console
   dvc exp run \
     --queue \
     --set-param "value=0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,60,61,62,63,64,65,66,67,68,69,70,71,72,73,74,75,76,77,78,79,80,81,82,83,84,85,86,87,88,89,90,91,92,93,94,95,96,97,98,99"
   ```

8. Start 20 jobs: `dvc queue start -j 20`
9. Check for failed jobs: `dvc queue status | grep Failed`
10. Check the logs of failed jobs: `dvc queue logs ...`

Note that it doesn't always fail, so you may have to iterate starting at step 7.
Output sample
## Environment information

**Output of `dvc doctor`:**

```console
$ dvc doctor
DVC version: 3.59.0 (brew)
--------------------------
Platform: Python 3.13.1 on macOS-15.2-arm64-arm-64bit-Mach-O
Subprojects:
dvc_data = 3.16.7
dvc_objects = 5.1.0
dvc_render = 1.0.2
dvc_task = 0.40.2
scmrepo = 3.3.9
Supports:
azure (adlfs = 2024.12.0, knack = 0.12.0, azure-identity = 1.19.0),
gdrive (pydrive2 = 1.21.3),
gs (gcsfs = 2024.12.0),
hdfs (fsspec = 2024.12.0, pyarrow = 18.1.0),
http (aiohttp = 3.11.11, aiohttp-retry = 2.9.1),
https (aiohttp = 3.11.11, aiohttp-retry = 2.9.1),
oss (ossfs = 2023.12.0),
s3 (s3fs = 2024.12.0, boto3 = 1.35.93),
ssh (sshfs = 2024.9.0),
webdav (webdav4 = 0.10.0),
webdavs (webdav4 = 0.10.0),
webhdfs (fsspec = 2024.12.0)
Config:
Global: /Users/pichurri/Library/Application Support/dvc
System: /Users/pichurri/homebrew/share/dvc
Cache types: <https://error.dvc.org/no-dvc-cache>
Caches: local
Remotes: None
Workspace directory: apfs on /dev/disk3s3s1
Repo: dvc, git
Repo.site_cache_dir: /Users/pichurri/homebrew/var/cache/dvc/repo/7b5c17002f7a7963a4dc1afee2b961e2
```
I am experiencing the same issue. The only pattern I'm noticing is that more jobs fail as I increase the number of parallel jobs.

```console
DVC version: 3.59.1 (pip)
-------------------------
Platform: Python 3.10.12 on Linux-5.15.0-124-generic-x86_64-with-glibc2.35
Subprojects:
dvc_data = 3.16.9
dvc_objects = 5.1.0
dvc_render = 1.0.2
dvc_task = 0.40.2
scmrepo = 3.3.10
Supports:
http (aiohttp = 3.11.16, aiohttp-retry = 2.9.1),
https (aiohttp = 3.11.16, aiohttp-retry = 2.9.1),
s3 (s3fs = 2024.12.0, boto3 = 1.37.1)
Config:
Global: /home/mlbecke/.config/dvc
System: /etc/xdg/dvc
Cache types: symlink
Cache directory: ext4 on /dev/nvme0n1p1
Caches: local
Remotes: s3
Workspace directory: ext4 on /dev/sdb4
Repo: dvc, git
Repo.site_cache_dir: /var/tmp/dvc/repo/fecca1071a5dc271668d751b3e90d0d8
```
Could you please check the files in the `.dvc/tmp/exps/celery` directory? Do you see any error logs there?
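For anyone following along, here is a small, hypothetical helper (not part of DVC) that scans that directory for error lines; the exact file layout under `.dvc/tmp/exps/celery` may vary by DVC version:

```python
# Hypothetical helper (not part of DVC): scan the celery queue directory for
# error lines. Assumes the default location .dvc/tmp/exps/celery mentioned above.
from pathlib import Path

for log in Path(".dvc/tmp/exps/celery").rglob("*"):
    if not log.is_file():
        continue
    try:
        text = log.read_text(errors="replace")
    except OSError:
        continue
    for line in text.splitlines():
        if "ERROR" in line or "Traceback" in line:
            print(f"{log}: {line}")
```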
@skshetry I replayed @nv-pipo's demonstration of the issue, and I can see the following error in the various celery worker logs for each failed job:

```
[2025-08-12 09:03:12,827: ERROR/MainProcess] Task dvc.repo.experiments.queue.tasks.run_exp[8228debd-a0a8-4c34-a2b2-fe7f91b45751] raised unexpected: FileLocked(b'/Users/mlbecke/dvc-10673/.git/packed-refs', b'/Users/mlbecke/dvc-10673/.git/packed-refs.lock')
Traceback (most recent call last):
File "/Users/mlbecke/dvc-10673/.venv/lib/python3.12/site-packages/dulwich/file.py", line 144, in __init__
fd = os.open(
^^^^^^^^
FileExistsError: [Errno 17] File exists: b'/Users/mlbecke/dvc-10673/.git/packed-refs.lock'
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/Users/mlbecke/dvc-10673/.venv/lib/python3.12/site-packages/celery/app/trace.py", line 453, in trace_task
R = retval = fun(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^
File "/Users/mlbecke/dvc-10673/.venv/lib/python3.12/site-packages/celery/app/trace.py", line 736, in __protected_call__
return self.run(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/mlbecke/dvc-10673/.venv/lib/python3.12/site-packages/dvc/repo/experiments/queue/tasks.py", line 112, in run_exp
executor = setup_exp.s(entry_dict)()
^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/mlbecke/dvc-10673/.venv/lib/python3.12/site-packages/celery/canvas.py", line 349, in __call__
return self.type(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/mlbecke/dvc-10673/.venv/lib/python3.12/site-packages/celery/app/trace.py", line 737, in __protected_call__
return orig(self, *args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/mlbecke/dvc-10673/.venv/lib/python3.12/site-packages/celery/app/task.py", line 411, in __call__
return self.run(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/mlbecke/dvc-10673/.venv/lib/python3.12/site-packages/dvc/repo/experiments/queue/tasks.py", line 34, in setup_exp
executor = BaseStashQueue.init_executor(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/mlbecke/dvc-10673/.venv/lib/python3.12/site-packages/dvc/repo/experiments/queue/base.py", line 518, in init_executor
executor.init_git(
File "/Users/mlbecke/dvc-10673/.venv/lib/python3.12/site-packages/funcy/decorators.py", line 47, in wrapper
return deco(call, *dargs, **dkwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/mlbecke/dvc-10673/.venv/lib/python3.12/site-packages/funcy/flow.py", line 99, in retry
return call()
^^^^^^
File "/Users/mlbecke/dvc-10673/.venv/lib/python3.12/site-packages/funcy/decorators.py", line 68, in __call__
return self._func(*self._args, **self._kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/mlbecke/dvc-10673/.venv/lib/python3.12/site-packages/dvc/repo/experiments/executor/local.py", line 101, in init_git
self.set_temp_refs(scm, temp_ref_dict),
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/mlbecke/.local/share/uv/python/cpython-3.12.9-macos-aarch64-none/lib/python3.12/contextlib.py", line 144, in __exit__
next(self.gen)
File "/Users/mlbecke/dvc-10673/.venv/lib/python3.12/site-packages/dvc/repo/experiments/executor/base.py", line 825, in set_temp_refs
scm.remove_ref(ref)
File "/Users/mlbecke/dvc-10673/.venv/lib/python3.12/site-packages/scmrepo/git/__init__.py", line 308, in _backend_func
result = func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/Users/mlbecke/dvc-10673/.venv/lib/python3.12/site-packages/scmrepo/git/backend/dulwich/__init__.py", line 573, in remove_ref
if not self.repo.refs.remove_if_equals(name_b, old_ref_b):
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/mlbecke/dvc-10673/.venv/lib/python3.12/site-packages/dulwich/refs.py", line 1072, in remove_if_equals
self._remove_packed_ref(name)
File "/Users/mlbecke/dvc-10673/.venv/lib/python3.12/site-packages/dulwich/refs.py", line 841, in _remove_packed_ref
f = GitFile(filename, "wb")
^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/mlbecke/dvc-10673/.venv/lib/python3.12/site-packages/dulwich/file.py", line 85, in GitFile
return _GitFile(filename, mode, bufsize, mask)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/mlbecke/dvc-10673/.venv/lib/python3.12/site-packages/dulwich/file.py", line 150, in __init__
raise FileLocked(filename, self._lockfilename) from exc
dulwich.file.FileLocked: (b'/Users/mlbecke/dvc-10673/.git/packed-refs', b'/Users/mlbecke/dvc-10673/.git/packed-refs.lock')
```
@mattlbeck @nv-pipo, I have opened a PR: #10840. Could you test it out and see if it resolves the issue?
Locally, I am still seeing FileNotFoundError and OSError when I increase the number of jobs and the concurrency. The errors are also quite confusing:

```
File "/Users/user/projects/dvcorg/dvc/.venv/lib/python3.13/site-packages/dulwich/file.py", line 85, in GitFile
return _GitFile(filename, mode, bufsize, mask)
File "/Users/user/projects/dvcorg/dvc/.venv/lib/python3.13/site-packages/dulwich/file.py", line 144, in __init__
fd = os.open(
self._lockfilename,
os.O_RDWR | os.O_CREAT | os.O_EXCL | getattr(os, "O_BINARY", 0),
mask,
)
OSError: [Errno 22] Invalid argument: b'/private/var/folders/3g/1vds4g8d4p3909hrwr65j6300000gn/T/tmp.hMGmSKWkAJ/.git/refs/exps/temp/head-aX7recLVy4xhzRcGVPL4w9.lock'
```

```
File "/Users/user/projects/dvcorg/dvc/.venv/lib/python3.13/site-packages/dulwich/file.py", line 85, in GitFile
return _GitFile(filename, mode, bufsize, mask)
File "/Users/user/projects/dvcorg/dvc/.venv/lib/python3.13/site-packages/dulwich/file.py", line 144, in __init__
fd = os.open(
self._lockfilename,
os.O_RDWR | os.O_CREAT | os.O_EXCL | getattr(os, "O_BINARY", 0),
mask,
)
FileNotFoundError: [Errno 2] No such file or directory: b'/private/var/folders/3g/1vds4g8d4p3909hrwr65j6300000gn/T/tmp.hMGmSKWkAJ/.git/refs/exps/temp/head-ABtmsLg4wJvDGBeXVsSMyP.lock'
```

EDIT: it looks like this is happening because dulwich removes parent refs directories while another process may be trying to write to the same directory.
The issue is in Git reference management:

- Each experiment creates temporary Git references (`refs/exps/temp/...`).
- Multiple workers compete for access to Git's internal files.
- Git uses file locking to prevent corruption, but DVC isn't handling lock conflicts properly.
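To make the failure mode concrete, here is a minimal, self-contained sketch (not DVC or dulwich code) of the lock-file protocol visible in the traceback above: the lock file is created with `O_CREAT | O_EXCL`, so a second worker touching `packed-refs` at the same moment gets `FileExistsError`, which dulwich surfaces as `FileLocked`:

```python
# Minimal sketch of the lock-file race; mirrors the os.open flags shown in the
# dulwich traceback above, but is not DVC or dulwich code.
import os
import tempfile
import threading

lock_path = os.path.join(tempfile.mkdtemp(), "packed-refs.lock")

def take_lock(name: str) -> None:
    try:
        # Same idea as dulwich's _GitFile: create "<file>.lock" exclusively.
        fd = os.open(lock_path, os.O_RDWR | os.O_CREAT | os.O_EXCL)
    except FileExistsError:
        # This is the race: another worker already holds packed-refs.lock.
        print(f"{name}: lock already held (dulwich would raise FileLocked)")
        return
    try:
        print(f"{name}: acquired {lock_path}")
        # ... rewrite packed-refs here, then atomically rename over it ...
    finally:
        os.close(fd)
        # The lock file is deliberately left in place so the collision is
        # visible even if the two workers do not overlap exactly.

workers = [threading.Thread(target=take_lock, args=(f"worker-{i}",)) for i in range(2)]
for t in workers:
    t.start()
for t in workers:
    t.join()
```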
> DVC isn't handling lock conflicts properly
Well, we try to. These changes are done behind an rwlock. Here, those refs are packed, which makes that no longer safe.
The rwlock protection breaks down because packed refs bypass the individual file-locking mechanism. Possible ways forward:

- Detect when refs get packed and fall back to repo-level locking; or
- Move to isolated workspaces per worker, so packing doesn't affect other workers; or
- Better still, move away from shared Git state entirely: each worker gets its own Git workspace, eliminating the race condition at the source rather than trying to coordinate around it (see the sketch after this list). This scales better and is more robust than trying to perfect the locking dance around Git's internal optimizations.
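As a rough illustration of that last option only (this is not DVC's implementation, and the function name and flow are hypothetical), each worker could clone the repo into its own temporary directory so temp experiment refs and `packed-refs` are never shared between workers:

```python
# Hypothetical sketch of "one isolated Git workspace per worker"; not DVC code.
import subprocess
import tempfile
from pathlib import Path

def run_in_isolated_workspace(repo_path: str, rev: str, cmd: list[str]) -> int:
    """Clone the repo into a private directory and run the experiment there."""
    workdir = Path(tempfile.mkdtemp(prefix="exp-worker-"))
    # Each worker owns its own .git directory, so ref packing or a
    # packed-refs.lock held by one worker cannot block or crash another.
    subprocess.run(["git", "clone", repo_path, str(workdir)], check=True)
    subprocess.run(["git", "-C", str(workdir), "checkout", rev], check=True)
    # Run the experiment command inside the private copy; results would then be
    # collected back into the main repo (e.g. pushed as a ref) after the run.
    return subprocess.run(cmd, cwd=workdir).returncode

# Example usage (paths and revision are placeholders):
# run_in_isolated_workspace(".", "HEAD", ["dvc", "exp", "run"])
```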