dvc icon indicating copy to clipboard operation
dvc copied to clipboard

exp-workers failing without logs

Open nv-pipo opened this issue 1 year ago • 1 comments

Bug Report

DVC EXP workers dying

Running multiple workers results in failed experiments and no logs

Description

Launching dvc queue start with parameter -j greater than 1 fails some experiments that shouldn't fail and these experiments will have no logs. Furthermore, sometimes the exp-worker dies with the failed experiments.

Reproduce

Example:

params.yaml

value: 1

dvc.yaml

stages:
  experiment_candles:
    cmd: sleep 5 ; echo DONE
    params:
      - params.yaml:
  1. git init
  2. dvc init
  3. Copy dvc.yaml
  4. Copy params.yaml
  5. git add *.yaml
  6. git commit -m "initial commit"
  7. Queue experiments
dvc exp run \
    --queue \
    --set-param "value=0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,60,61,62,63,64,65,66,67,68,69,70,71,72,73,74,75,76,77,78,79,80,81,82,83,84,85,86,87,88,89,90,91,92,93,94,95,96,97,98,99"
  1. Start 20 jobs
dvc queue start -j 20
  1. Check for failed jobs
dvc queue status | grep Failed
  1. Check logs of failed jobs
dvc queue logs ...

Note: that it doesn't always fail, so maybe you have to iterate starting at step 7.

Output sample

Image

Environment information

Output of dvc doctor:

$ dvc doctor
DVC version: 3.59.0 (brew)
--------------------------
Platform: Python 3.13.1 on macOS-15.2-arm64-arm-64bit-Mach-O
Subprojects:
        dvc_data = 3.16.7
        dvc_objects = 5.1.0
        dvc_render = 1.0.2
        dvc_task = 0.40.2
        scmrepo = 3.3.9
Supports:
        azure (adlfs = 2024.12.0, knack = 0.12.0, azure-identity = 1.19.0),
        gdrive (pydrive2 = 1.21.3),
        gs (gcsfs = 2024.12.0),
        hdfs (fsspec = 2024.12.0, pyarrow = 18.1.0),
        http (aiohttp = 3.11.11, aiohttp-retry = 2.9.1),
        https (aiohttp = 3.11.11, aiohttp-retry = 2.9.1),
        oss (ossfs = 2023.12.0),
        s3 (s3fs = 2024.12.0, boto3 = 1.35.93),
        ssh (sshfs = 2024.9.0),
        webdav (webdav4 = 0.10.0),
        webdavs (webdav4 = 0.10.0),
        webhdfs (fsspec = 2024.12.0)
Config:
        Global: /Users/pichurri/Library/Application Support/dvc
        System: /Users/pichurri/homebrew/share/dvc
Cache types: <https://error.dvc.org/no-dvc-cache>
Caches: local
Remotes: None
Workspace directory: apfs on /dev/disk3s3s1
Repo: dvc, git
Repo.site_cache_dir: /Users/pichurri/homebrew/var/cache/dvc/repo/7b5c17002f7a7963a4dc1afee2b961e2

nv-pipo avatar Jan 16 '25 12:01 nv-pipo

I am experiencing the same issue. I think the only pattern I am noticing is that more jobs fail as I increase the number of parallel jobs

DVC version: 3.59.1 (pip)
-------------------------
Platform: Python 3.10.12 on Linux-5.15.0-124-generic-x86_64-with-glibc2.35
Subprojects:
        dvc_data = 3.16.9
        dvc_objects = 5.1.0
        dvc_render = 1.0.2
        dvc_task = 0.40.2
        scmrepo = 3.3.10
Supports:
        http (aiohttp = 3.11.16, aiohttp-retry = 2.9.1),
        https (aiohttp = 3.11.16, aiohttp-retry = 2.9.1),
        s3 (s3fs = 2024.12.0, boto3 = 1.37.1)
Config:
        Global: /home/mlbecke/.config/dvc
        System: /etc/xdg/dvc
Cache types: symlink
Cache directory: ext4 on /dev/nvme0n1p1
Caches: local
Remotes: s3
Workspace directory: ext4 on /dev/sdb4
Repo: dvc, git
Repo.site_cache_dir: /var/tmp/dvc/repo/fecca1071a5dc271668d751b3e90d0d8

mattlbeck avatar May 07 '25 15:05 mattlbeck

Could you please check the files in .dvc/tmp/exps/celery directory? Do you see any error logs there?

skshetry avatar Aug 08 '25 05:08 skshetry

@skshetry I replayed @nv-pipo's demonstration of the issue, and I can see the following error in the various celery workers logs for each failed job:

[2025-08-12 09:03:12,827: ERROR/MainProcess] Task dvc.repo.experiments.queue.tasks.run_exp[8228debd-a0a8-4c34-a2b2-fe7f91b45751] raised unexpected: FileLocked(b'/Users/mlbecke/dvc-10673/.git/packed-refs', b'/Users/mlbecke/dvc-10673/.git/packed-refs.lock')
Traceback (most recent call last):
  File "/Users/mlbecke/dvc-10673/.venv/lib/python3.12/site-packages/dulwich/file.py", line 144, in __init__
    fd = os.open(
         ^^^^^^^^
FileExistsError: [Errno 17] File exists: b'/Users/mlbecke/dvc-10673/.git/packed-refs.lock'

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/Users/mlbecke/dvc-10673/.venv/lib/python3.12/site-packages/celery/app/trace.py", line 453, in trace_task
    R = retval = fun(*args, **kwargs)
                 ^^^^^^^^^^^^^^^^^^^^
  File "/Users/mlbecke/dvc-10673/.venv/lib/python3.12/site-packages/celery/app/trace.py", line 736, in __protected_call__
    return self.run(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/mlbecke/dvc-10673/.venv/lib/python3.12/site-packages/dvc/repo/experiments/queue/tasks.py", line 112, in run_exp
    executor = setup_exp.s(entry_dict)()
               ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/mlbecke/dvc-10673/.venv/lib/python3.12/site-packages/celery/canvas.py", line 349, in __call__
    return self.type(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/mlbecke/dvc-10673/.venv/lib/python3.12/site-packages/celery/app/trace.py", line 737, in __protected_call__
    return orig(self, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/mlbecke/dvc-10673/.venv/lib/python3.12/site-packages/celery/app/task.py", line 411, in __call__
    return self.run(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/mlbecke/dvc-10673/.venv/lib/python3.12/site-packages/dvc/repo/experiments/queue/tasks.py", line 34, in setup_exp
    executor = BaseStashQueue.init_executor(
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/mlbecke/dvc-10673/.venv/lib/python3.12/site-packages/dvc/repo/experiments/queue/base.py", line 518, in init_executor
    executor.init_git(
  File "/Users/mlbecke/dvc-10673/.venv/lib/python3.12/site-packages/funcy/decorators.py", line 47, in wrapper
    return deco(call, *dargs, **dkwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/mlbecke/dvc-10673/.venv/lib/python3.12/site-packages/funcy/flow.py", line 99, in retry
    return call()
           ^^^^^^
  File "/Users/mlbecke/dvc-10673/.venv/lib/python3.12/site-packages/funcy/decorators.py", line 68, in __call__
    return self._func(*self._args, **self._kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/mlbecke/dvc-10673/.venv/lib/python3.12/site-packages/dvc/repo/experiments/executor/local.py", line 101, in init_git
    self.set_temp_refs(scm, temp_ref_dict),
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/mlbecke/.local/share/uv/python/cpython-3.12.9-macos-aarch64-none/lib/python3.12/contextlib.py", line 144, in __exit__
    next(self.gen)
  File "/Users/mlbecke/dvc-10673/.venv/lib/python3.12/site-packages/dvc/repo/experiments/executor/base.py", line 825, in set_temp_refs
    scm.remove_ref(ref)
  File "/Users/mlbecke/dvc-10673/.venv/lib/python3.12/site-packages/scmrepo/git/__init__.py", line 308, in _backend_func
    result = func(*args, **kwargs)
             ^^^^^^^^^^^^^^^^^^^^^
  File "/Users/mlbecke/dvc-10673/.venv/lib/python3.12/site-packages/scmrepo/git/backend/dulwich/__init__.py", line 573, in remove_ref
    if not self.repo.refs.remove_if_equals(name_b, old_ref_b):
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/mlbecke/dvc-10673/.venv/lib/python3.12/site-packages/dulwich/refs.py", line 1072, in remove_if_equals
    self._remove_packed_ref(name)
  File "/Users/mlbecke/dvc-10673/.venv/lib/python3.12/site-packages/dulwich/refs.py", line 841, in _remove_packed_ref
    f = GitFile(filename, "wb")
        ^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/mlbecke/dvc-10673/.venv/lib/python3.12/site-packages/dulwich/file.py", line 85, in GitFile
    return _GitFile(filename, mode, bufsize, mask)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/mlbecke/dvc-10673/.venv/lib/python3.12/site-packages/dulwich/file.py", line 150, in __init__
    raise FileLocked(filename, self._lockfilename) from exc
dulwich.file.FileLocked: (b'/Users/mlbecke/dvc-10673/.git/packed-refs', b'/Users/mlbecke/dvc-10673/.git/packed-refs.lock')

mattlbeck avatar Aug 12 '25 08:08 mattlbeck

@mattlbeck @nv-pipo, I have opened a PR: #10840. Could you test it out and see if it resolves the issue?

Locally, I am still seeing FileNotFoundError and OSError when I increase the amount of jobs and concurrency. The errors also appear to be quite confusing.

File "/Users/user/projects/dvcorg/dvc/.venv/lib/python3.13/site-packages/dulwich/file.py", line 85, in GitFile
    return _GitFile(filename, mode, bufsize, mask)
  File "/Users/user/projects/dvcorg/dvc/.venv/lib/python3.13/site-packages/dulwich/file.py", line 144, in __init__
    fd = os.open(
        self._lockfilename,
        os.O_RDWR | os.O_CREAT | os.O_EXCL | getattr(os, "O_BINARY", 0),
        mask,
    )
OSError: [Errno 22] Invalid argument: b'/private/var/folders/3g/1vds4g8d4p3909hrwr65j6300000gn/T/tmp.hMGmSKWkAJ/.git/refs/exps/temp/head-aX7recLVy4xhzRcGVPL4w9.lock'
  File "/Users/user/projects/dvcorg/dvc/.venv/lib/python3.13/site-packages/dulwich/file.py", line 85, in GitFile
    return _GitFile(filename, mode, bufsize, mask)
  File "/Users/user/projects/dvcorg/dvc/.venv/lib/python3.13/site-packages/dulwich/file.py", line 144, in __init__
    fd = os.open(
        self._lockfilename,
        os.O_RDWR | os.O_CREAT | os.O_EXCL | getattr(os, "O_BINARY", 0),
        mask,
    )
FileNotFoundError: [Errno 2] No such file or directory: b'/private/var/folders/3g/1vds4g8d4p3909hrwr65j6300000gn/T/tmp.hMGmSKWkAJ/.git/refs/exps/temp/head-ABtmsLg4wJvDGBeXVsSMyP.lock'

EDIT: looks like this is happening because dulwich removes parent refs directories while another process maybe trying to write to the same directory.

skshetry avatar Aug 14 '25 12:08 skshetry

The issue is in Git reference management:

Each experiment creates temporary Git references (refs/exps/temp/...) Multiple workers compete for access to Git's internal files Git uses file locking to prevent corruption, but DVC isn't handling lock conflicts properly

ricaxvi avatar Aug 20 '25 12:08 ricaxvi

DVC isn't handling lock conflicts properly

Well, we try too. These changes are done behind an rwlock. Here those refs are packed, making that no longer safe.

skshetry avatar Aug 20 '25 12:08 skshetry

The rwlock protection breaks down because packed refs bypass the individual file locking mechanism

ricaxvi avatar Aug 20 '25 13:08 ricaxvi

Detect when refs get packed and fall back to repo-level locking

ricaxvi avatar Aug 20 '25 13:08 ricaxvi

or Move to isolated workspaces per worker so packing doesn't affect other workers

ricaxvi avatar Aug 20 '25 13:08 ricaxvi

better still Move away from shared Git state entirely. Each worker gets its own Git workspace, eliminating the race condition at the source rather than trying to coordinate around it. This scales better and is more robust than trying to perfect the locking dance around Git's internal optimizations.

ricaxvi avatar Aug 20 '25 13:08 ricaxvi