`dvc exp run --run-all`: only runs one experiment and then hangs
Bug Report
Description
When running dvc exp run --run-all, only one experiment gets run fully, and then the next experiment never starts.
Reproduce
- Add multiple experiments to the queue with dvc exp run --queue
- dvc exp run --run-all
Expected
All experiments should run sequentially.
Environment information
$ dvc doctor
DVC version: 2.18.1 (pip)
---------------------------------
Platform: Python 3.10.4 on Linux-5.4.0-80-generic-x86_64-with-glibc2.31
Supports:
http (aiohttp = 3.8.1, aiohttp-retry = 2.5.5),
https (aiohttp = 3.8.1, aiohttp-retry = 2.5.5),
webhdfs (fsspec = 2022.7.1)
Cache types: hardlink, symlink
Cache directory: nfs4 on <REDACTED>
Caches: local
Remotes: None
Workspace directory: nfs4 on <REDACTED>
Repo: dvc, git
Additional information
There seem to be no active workers, even though there should be. This is the output after starting the queue following the failed exp run --run-all mentioned above:
$ dvc queue start
Started '1' new experiments task queue worker.
$ dvc queue status
Task Name Created Status
c79d124 Aug 22, 2022 Running
eb3633a Aug 22, 2022 Queued
3dcf13c Aug 22, 2022 Queued
11a6ea4 Aug 22, 2022 Queued
a2f2386 Aug 22, 2022 Queued
a49a3bf Aug 22, 2022 Queued
5344ab1 Aug 22, 2022 Queued
4ca8ed7 Aug 22, 2022 Queued
8b0ba0b Aug 22, 2022 Queued
cb7d0d0 exp-03827 Aug 22, 2022 Success
Worker status: 0 active, 0 idle
Hi @werthen, could you please provide more detailed logs?
The Celery worker logs should be in .dvc/tmp/exps/celery/dvc-exp-worker-1.out
I can confirm that I have the same bug (except I used queue start):

In my case .dvc/tmp/exps/celery/dvc-exp-worker-1.out shows:
[2022-08-27 13:11:04,246: WARNING/MainProcess] No hostname was supplied. Reverting to default 'localhost'
-------------- dvc-exp-ccd556-1@localhost v5.3.0a1 (dawn-chorus)
--- ***** -----
-- ******* ---- Linux-5.4.0-124-generic-x86_64-with-glibc2.17 2022-08-27 13:11:04
- *** --- * ---
- ** ---------- [config]
- ** ---------- .> app: dvc-exp-local:0x7ff0f8aa4d90
- ** ---------- .> transport: filesystem://localhost//
- ** ---------- .> results: file:///m/home/home0/07/porkhom1/data/Desktop/case_similarity/.dvc/tmp/exps/celery/result
- *** --- * --- .> concurrency: 1 (thread)
-- ******* ---- .> task events: OFF (enable -E to monitor tasks in this worker)
--- ***** -----
-------------- [queues]
.> celery exchange=celery(direct) key=celery
[tasks]
. dvc.repo.experiments.queue.tasks.cleanup_exp
. dvc.repo.experiments.queue.tasks.collect_exp
. dvc.repo.experiments.queue.tasks.run_exp
. dvc.repo.experiments.queue.tasks.setup_exp
. dvc_task.proc.tasks.run
[2022-08-27 13:11:04,331: WARNING/MainProcess] /m/home/home0/07/porkhom1/data/Desktop/case_similarity/thesis-venv/lib/python3.8/site-packages/celery/worker/consumer/consumer.py:491: CPendingDeprecationWarning: The broker_connection_retry configuration setting will no longer determine
whether broker connection retries are made during startup in Celery 6.0 and above.
If you wish to retain the existing behavior for retrying connections on startup,
you should set broker_connection_retry_on_startup to True.
warnings.warn(
[2022-08-27 13:11:04,332: WARNING/MainProcess] No hostname was supplied. Reverting to default 'localhost'
[2022-08-27 13:11:04,332: INFO/MainProcess] Connected to filesystem://localhost//
[2022-08-27 13:11:04,348: INFO/MainProcess] dvc-exp-ccd556-1@localhost ready.
[2022-08-27 13:11:04,354: INFO/MainProcess] Task dvc.repo.experiments.queue.tasks.run_exp[8e142599-089c-44ee-b10a-ce77c72ab735] received
[2022-08-27 13:11:06,350: INFO/MainProcess] monitor: watching celery worker 'dvc-exp-ccd556-1@localhost'
[2022-08-27 13:38:57,495: INFO/MainProcess] monitor: shutting down due to empty queue.
[2022-08-27 13:38:57,500: INFO/MainProcess] monitor: done
[2022-08-27 13:38:58,494: WARNING/MainProcess] Got shutdown from remote
[2022-08-27 15:10:44,707: INFO/MainProcess] Task dvc.repo.experiments.queue.tasks.run_exp[8e142599-089c-44ee-b10a-ce77c72ab735] succeeded in 7180.3525715400465s: None
[2022-08-27 16:58:02,229: WARNING/MainProcess] No hostname was supplied. Reverting to default 'localhost'
-------------- dvc-exp-ccd556-1@localhost v5.3.0a1 (dawn-chorus)
--- ***** -----
-- ******* ---- Linux-5.4.0-124-generic-x86_64-with-glibc2.17 2022-08-27 16:58:02
- *** --- * ---
- ** ---------- [config]
- ** ---------- .> app: dvc-exp-local:0x7fd7b6256d90
- ** ---------- .> transport: filesystem://localhost//
- ** ---------- .> results: file:///m/home/home0/07/porkhom1/data/Desktop/case_similarity/.dvc/tmp/exps/celery/result
- *** --- * --- .> concurrency: 1 (thread)
-- ******* ---- .> task events: OFF (enable -E to monitor tasks in this worker)
--- ***** -----
-------------- [queues]
.> celery exchange=celery(direct) key=celery
[tasks]
. dvc.repo.experiments.queue.tasks.cleanup_exp
. dvc.repo.experiments.queue.tasks.collect_exp
. dvc.repo.experiments.queue.tasks.run_exp
. dvc.repo.experiments.queue.tasks.setup_exp
. dvc_task.proc.tasks.run
[2022-08-27 16:58:02,307: WARNING/MainProcess] /m/home/home0/07/porkhom1/data/Desktop/case_similarity/thesis-venv/lib/python3.8/site-packages/celery/worker/consumer/consumer.py:491: CPendingDeprecationWarning: The broker_connection_retry configuration setting will no longer determine
whether broker connection retries are made during startup in Celery 6.0 and above.
If you wish to retain the existing behavior for retrying connections on startup,
you should set broker_connection_retry_on_startup to True.
warnings.warn(
[2022-08-27 16:58:02,307: WARNING/MainProcess] No hostname was supplied. Reverting to default 'localhost'
[2022-08-27 16:58:02,307: INFO/MainProcess] Connected to filesystem://localhost//
[2022-08-27 16:58:02,323: INFO/MainProcess] dvc-exp-ccd556-1@localhost ready.
[2022-08-27 16:58:02,329: INFO/MainProcess] Task dvc.repo.experiments.queue.tasks.run_exp[2da39147-c38a-47cc-affe-7e775cac673b] received
[2022-08-27 16:58:02,557: INFO/MainProcess] monitor: watching celery worker 'dvc-exp-ccd556-1@localhost'
[2022-08-27 18:37:15,731: INFO/MainProcess] monitor: shutting down due to empty queue.
[2022-08-27 18:37:15,736: INFO/MainProcess] monitor: done
[2022-08-27 18:37:16,712: WARNING/MainProcess] Got shutdown from remote
[2022-08-27 18:57:30,737: INFO/MainProcess] Task dvc.repo.experiments.queue.tasks.run_exp[2da39147-c38a-47cc-affe-7e775cac673b] succeeded in 7168.40772138699s: None
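In both runs logged above, the monitor reports an empty queue and shuts down after run_exp is received but before it is reported as succeeded. A minimal sketch for spotting that pattern in a worker log follows. The regular expressions are assumptions based only on the line format shown here, and find_early_shutdowns is a hypothetical helper, not part of DVC or Celery.

```python
import re

# Assumed log line shape, taken from the output above:
# [2022-08-27 13:11:04,354: INFO/MainProcess] <message>
LINE_RE = re.compile(r"^\[[^\]]+: \w+/\w+\] (.*)$")
TASK_RE = re.compile(r"Task \S+\[([0-9a-f-]+)\] (received|succeeded)")
SHUTDOWN_MARKER = "monitor: shutting down due to empty queue"


def find_early_shutdowns(log_text):
    """Return ids of tasks that were still running when the monitor shut down."""
    running = set()   # tasks received but not yet reported as succeeded
    flagged = []      # tasks that were running at a monitor shutdown
    for line in log_text.splitlines():
        match = LINE_RE.match(line)
        if not match:
            continue  # banner lines, wrapped warnings, etc.
        message = match.group(1)
        task = TASK_RE.search(message)
        if task:
            task_id, event = task.groups()
            if event == "received":
                running.add(task_id)
            else:
                running.discard(task_id)
        elif SHUTDOWN_MARKER in message:
            for task_id in sorted(running):
                if task_id not in flagged:
                    flagged.append(task_id)
    return flagged


if __name__ == "__main__":
    # Path as given earlier in this thread; adjust for your repository.
    path = ".dvc/tmp/exps/celery/dvc-exp-worker-1.out"
    try:
        with open(path, encoding="utf-8") as handle:
            print(find_early_shutdowns(handle.read()))
    except FileNotFoundError:
        print(f"no worker log found at {path}")
```

A non-empty result only says that the monitor stopped while a task was in progress. It does not by itself identify the cause.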
We had a related issue before (see https://github.com/iterative/dvc-task/pull/78), but it should have been fixed in 2.17.0. Could you please tell me which version of dvc-task you have installed (pip list | grep dvc-task)?
Hey @karajan1001 thanks for the quick response.
I get 0.1.0. I will try to update and see if this resolves the issue - thanks!
Yeah, the bug was fixed in version 0.1.2.
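For anyone who wants to check their environment without pip list, here is a minimal sketch that compares the installed dvc-task version with the 0.1.2 release mentioned above. It relies only on the standard library. parse_version and has_fix are hypothetical helpers written for this example, and the comparison is deliberately simplified: pre-release suffixes such as rc1 are ignored.

```python
from importlib.metadata import PackageNotFoundError, version

# Per this thread, the fix is in dvc-task 0.1.2.
FIXED_IN = (0, 1, 2)


def parse_version(text):
    """Turn '0.1.0' into (0, 1, 0); any non-numeric suffix is ignored."""
    parts = []
    for piece in text.split(".")[:3]:
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        parts.append(int(digits or 0))
    while len(parts) < 3:
        parts.append(0)
    return tuple(parts)


def has_fix(installed):
    """True if the given version string is at least 0.1.2."""
    return parse_version(installed) >= FIXED_IN


if __name__ == "__main__":
    try:
        installed = version("dvc-task")
    except PackageNotFoundError:
        print("dvc-task is not installed in this environment")
    else:
        status = "includes" if has_fix(installed) else "predates"
        print(f"dvc-task {installed} {status} the 0.1.2 fix")
```

If the script reports an older version, upgrading dvc (which pulls in a newer dvc-task) is what resolved the problem for the reporter in this thread.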
@karajan1001 I updated dvc and now everything is running like a charm! Thanks
Closing as stale (resolved?).