`dvc exp run --run-all`: only runs one experiment and then hangs

Open werthen opened this issue 3 years ago • 6 comments

Bug Report

Description

When running dvc exp run --run-all, only one experiment gets run fully, and then the next experiment never starts.

Reproduce

  1. Add multiple experiments to the queue with dvc exp run --queue
  2. dvc exp run --run-all

Expected

All experiments should run sequentially.

Environment information

$ dvc doctor
DVC version: 2.18.1 (pip)
---------------------------------
Platform: Python 3.10.4 on Linux-5.4.0-80-generic-x86_64-with-glibc2.31
Supports:
        http (aiohttp = 3.8.1, aiohttp-retry = 2.5.5),
        https (aiohttp = 3.8.1, aiohttp-retry = 2.5.5),
        webhdfs (fsspec = 2022.7.1)
Cache types: hardlink, symlink
Cache directory: nfs4 on <REDACTED>
Caches: local
Remotes: None
Workspace directory: nfs4 on <REDACTED>
Repo: dvc, git

Additional information

There seem to be no workers active even though there should be. This is the output after starting the queue following the failed exp run --run-all mentioned above:

$ dvc queue start
Started '1' new experiments task queue worker.
$ dvc queue status
Task     Name       Created       Status
c79d124             Aug 22, 2022  Running
eb3633a             Aug 22, 2022  Queued
3dcf13c             Aug 22, 2022  Queued
11a6ea4             Aug 22, 2022  Queued
a2f2386             Aug 22, 2022  Queued
a49a3bf             Aug 22, 2022  Queued
5344ab1             Aug 22, 2022  Queued
4ca8ed7             Aug 22, 2022  Queued
8b0ba0b             Aug 22, 2022  Queued
cb7d0d0  exp-03827  Aug 22, 2022  Success

Worker status: 0 active, 0 idle
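
The state shown above (tasks still Queued or Running while no worker is active or idle) can be detected mechanically. The following is a minimal sketch that parses the text of `dvc queue status` exactly as it appears in this report. The function names `summarize` and `looks_stalled` are hypothetical, and the text layout is assumed from the output above rather than from any documented, stable format.

```python
import re

# Output copied from the `dvc queue status` listing in this report.
SAMPLE = """\
Task     Name       Created       Status
c79d124             Aug 22, 2022  Running
eb3633a             Aug 22, 2022  Queued
3dcf13c             Aug 22, 2022  Queued
11a6ea4             Aug 22, 2022  Queued
a2f2386             Aug 22, 2022  Queued
a49a3bf             Aug 22, 2022  Queued
5344ab1             Aug 22, 2022  Queued
4ca8ed7             Aug 22, 2022  Queued
8b0ba0b             Aug 22, 2022  Queued
cb7d0d0  exp-03827  Aug 22, 2022  Success

Worker status: 0 active, 0 idle
"""


def summarize(status_text):
    """Return (status -> count, total workers) parsed from the status text."""
    counts = {}
    workers = None
    for line in status_text.splitlines():
        m = re.match(r"Worker status: (\d+) active, (\d+) idle", line)
        if m:
            workers = int(m.group(1)) + int(m.group(2))
            continue
        parts = line.split()
        # Assumption: task rows start with a 7-character hex hash
        # and end with the status word.
        if parts and re.fullmatch(r"[0-9a-f]{7}", parts[0]):
            counts[parts[-1]] = counts.get(parts[-1], 0) + 1
    return counts, workers


def looks_stalled(status_text):
    """True when work is pending but no worker is active or idle."""
    counts, workers = summarize(status_text)
    pending = counts.get("Queued", 0) + counts.get("Running", 0)
    return workers == 0 and pending > 0


if __name__ == "__main__":
    print(summarize(SAMPLE))
    print("stalled:", looks_stalled(SAMPLE))
```

For the listing in this report, the sketch counts one Running task, eight Queued tasks, and zero workers, which matches the symptom described.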

werthen avatar Aug 23 '22 08:08 werthen

Hi @werthen, could you please provide more detailed logs? The Celery worker logs should be in .dvc/tmp/exps/celery/dvc-exp-worker-1.out

karajan1001 avatar Aug 25 '22 09:08 karajan1001

I can confirm that I have the same bug (except I used queue start): [Screenshot 2022-08-27 at 17 01 27]

In my case .dvc/tmp/exps/celery/dvc-exp-worker-1.out shows:

[2022-08-27 13:11:04,246: WARNING/MainProcess] No hostname was supplied. Reverting to default 'localhost'

 -------------- dvc-exp-ccd556-1@localhost v5.3.0a1 (dawn-chorus)
--- ***** -----
-- ******* ---- Linux-5.4.0-124-generic-x86_64-with-glibc2.17 2022-08-27 13:11:04
- *** --- * ---
- ** ---------- [config]
- ** ---------- .> app:         dvc-exp-local:0x7ff0f8aa4d90
- ** ---------- .> transport:   filesystem://localhost//
- ** ---------- .> results:     file:///m/home/home0/07/porkhom1/data/Desktop/case_similarity/.dvc/tmp/exps/celery/result
- *** --- * --- .> concurrency: 1 (thread)
-- ******* ---- .> task events: OFF (enable -E to monitor tasks in this worker)
--- ***** -----
 -------------- [queues]
                .> celery           exchange=celery(direct) key=celery


[tasks]
  . dvc.repo.experiments.queue.tasks.cleanup_exp
  . dvc.repo.experiments.queue.tasks.collect_exp
  . dvc.repo.experiments.queue.tasks.run_exp
  . dvc.repo.experiments.queue.tasks.setup_exp
  . dvc_task.proc.tasks.run

[2022-08-27 13:11:04,331: WARNING/MainProcess] /m/home/home0/07/porkhom1/data/Desktop/case_similarity/thesis-venv/lib/python3.8/site-packages/celery/worker/consumer/consumer.py:491: CPendingDeprecationWarning: The broker_connection_retry configuration setting will no longer determine
whether broker connection retries are made during startup in Celery 6.0 and above.
If you wish to retain the existing behavior for retrying connections on startup,
you should set broker_connection_retry_on_startup to True.
  warnings.warn(

[2022-08-27 13:11:04,332: WARNING/MainProcess] No hostname was supplied. Reverting to default 'localhost'
[2022-08-27 13:11:04,332: INFO/MainProcess] Connected to filesystem://localhost//
[2022-08-27 13:11:04,348: INFO/MainProcess] dvc-exp-ccd556-1@localhost ready.
[2022-08-27 13:11:04,354: INFO/MainProcess] Task dvc.repo.experiments.queue.tasks.run_exp[8e142599-089c-44ee-b10a-ce77c72ab735] received
[2022-08-27 13:11:06,350: INFO/MainProcess] monitor: watching celery worker 'dvc-exp-ccd556-1@localhost'
[2022-08-27 13:38:57,495: INFO/MainProcess] monitor: shutting down due to empty queue.
[2022-08-27 13:38:57,500: INFO/MainProcess] monitor: done
[2022-08-27 13:38:58,494: WARNING/MainProcess] Got shutdown from remote
[2022-08-27 15:10:44,707: INFO/MainProcess] Task dvc.repo.experiments.queue.tasks.run_exp[8e142599-089c-44ee-b10a-ce77c72ab735] succeeded in 7180.3525715400465s: None
[2022-08-27 16:58:02,229: WARNING/MainProcess] No hostname was supplied. Reverting to default 'localhost'

 -------------- dvc-exp-ccd556-1@localhost v5.3.0a1 (dawn-chorus)
--- ***** -----
-- ******* ---- Linux-5.4.0-124-generic-x86_64-with-glibc2.17 2022-08-27 16:58:02
- *** --- * ---
- ** ---------- [config]
- ** ---------- .> app:         dvc-exp-local:0x7fd7b6256d90
- ** ---------- .> transport:   filesystem://localhost//
- ** ---------- .> results:     file:///m/home/home0/07/porkhom1/data/Desktop/case_similarity/.dvc/tmp/exps/celery/result
- *** --- * --- .> concurrency: 1 (thread)
-- ******* ---- .> task events: OFF (enable -E to monitor tasks in this worker)
--- ***** -----
 -------------- [queues]
                .> celery           exchange=celery(direct) key=celery

[tasks]
  . dvc.repo.experiments.queue.tasks.cleanup_exp
  . dvc.repo.experiments.queue.tasks.collect_exp
  . dvc.repo.experiments.queue.tasks.run_exp
  . dvc.repo.experiments.queue.tasks.setup_exp
  . dvc_task.proc.tasks.run

[2022-08-27 16:58:02,307: WARNING/MainProcess] /m/home/home0/07/porkhom1/data/Desktop/case_similarity/thesis-venv/lib/python3.8/site-packages/celery/worker/consumer/consumer.py:491: CPendingDeprecationWarning: The broker_connection_retry configuration setting will no longer determine
whether broker connection retries are made during startup in Celery 6.0 and above.
If you wish to retain the existing behavior for retrying connections on startup,
you should set broker_connection_retry_on_startup to True.
  warnings.warn(

[2022-08-27 16:58:02,307: WARNING/MainProcess] No hostname was supplied. Reverting to default 'localhost'
[2022-08-27 16:58:02,307: INFO/MainProcess] Connected to filesystem://localhost//
[2022-08-27 16:58:02,323: INFO/MainProcess] dvc-exp-ccd556-1@localhost ready.
[2022-08-27 16:58:02,329: INFO/MainProcess] Task dvc.repo.experiments.queue.tasks.run_exp[2da39147-c38a-47cc-affe-7e775cac673b] received
[2022-08-27 16:58:02,557: INFO/MainProcess] monitor: watching celery worker 'dvc-exp-ccd556-1@localhost'
[2022-08-27 18:37:15,731: INFO/MainProcess] monitor: shutting down due to empty queue.
[2022-08-27 18:37:15,736: INFO/MainProcess] monitor: done
[2022-08-27 18:37:16,712: WARNING/MainProcess] Got shutdown from remote
[2022-08-27 18:57:30,737: INFO/MainProcess] Task dvc.repo.experiments.queue.tasks.run_exp[2da39147-c38a-47cc-affe-7e775cac673b] succeeded in 7168.40772138699s: None
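
One reading of the log above: in both worker runs the monitor logs "shutting down due to empty queue" while a `run_exp` task has been received but not yet logged as succeeded (13:38:57 versus 15:10:44 in the first run). The sketch below flags that pattern in a worker log. It is an illustration only. The function name `premature_shutdowns` is hypothetical, the regular expressions are fitted to the lines in this report, and only the "received" and "succeeded" task events seen here are handled.

```python
import re
from datetime import datetime

# Three lines copied from the first worker run in the log above.
SAMPLE = """\
[2022-08-27 13:11:04,354: INFO/MainProcess] Task dvc.repo.experiments.queue.tasks.run_exp[8e142599-089c-44ee-b10a-ce77c72ab735] received
[2022-08-27 13:38:57,495: INFO/MainProcess] monitor: shutting down due to empty queue.
[2022-08-27 15:10:44,707: INFO/MainProcess] Task dvc.repo.experiments.queue.tasks.run_exp[8e142599-089c-44ee-b10a-ce77c72ab735] succeeded in 7180.3525715400465s: None
"""

# "[<date> <time>,<ms>: LEVEL/Process] message"
LINE = re.compile(r"\[(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}),\d+: \w+/\w+\] (.*)")
# "Task <dotted.name>[<uuid>] received|succeeded ..."
TASK = re.compile(r"Task [\w.]+\[([0-9a-f-]+)\] (received|succeeded)")


def premature_shutdowns(log_text):
    """Return (timestamp, in-flight task ids) for each monitor shutdown
    logged while at least one task was still running."""
    running = set()
    hits = []
    for raw in log_text.splitlines():
        m = LINE.match(raw)
        if not m:
            continue  # banner lines, warnings continuations, blanks
        when = datetime.strptime(m.group(1), "%Y-%m-%d %H:%M:%S")
        msg = m.group(2)
        t = TASK.match(msg)
        if t:
            if t.group(2) == "received":
                running.add(t.group(1))
            else:
                running.discard(t.group(1))
        elif "shutting down due to empty queue" in msg and running:
            hits.append((when, sorted(running)))
    return hits


if __name__ == "__main__":
    for when, tasks in premature_shutdowns(SAMPLE):
        print(when, "shutdown while running:", tasks)
```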

JohnTheDeere avatar Aug 27 '22 16:08 JohnTheDeere

We previously had a related issue (see https://github.com/iterative/dvc-task/pull/78), but it should have been fixed in 2.17.0. Could you please tell me which version of dvc-task you have installed (pip list | grep dvc-task)?

karajan1001 avatar Aug 28 '22 01:08 karajan1001

Hey @karajan1001 thanks for the quick response.

I get 0.1.0. I will try to update and see if this resolves the issue - thanks!

JohnTheDeere avatar Aug 28 '22 08:08 JohnTheDeere

> Hey @karajan1001 thanks for the quick response.
>
> I get 0.1.0. I will try to update and see if this resolves the issue - thanks!

Yeah, the bug was fixed in version 0.1.2.
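
As an alternative to `pip list | grep dvc-task`, the installed version can be compared against 0.1.2 from Python. This is a rough sketch: the helper names are hypothetical, the version parsing is deliberately simplistic (it keeps only the leading numeric components and ignores pre-release suffixes), and the threshold comes solely from the comment above.

```python
import re
from importlib.metadata import PackageNotFoundError, version

# Threshold taken from this thread: the fix is said to be in dvc-task 0.1.2.
FIXED_IN = (0, 1, 2)


def parse(v):
    """Keep the leading numeric components, e.g. "0.1.2.dev3" -> (0, 1, 2)."""
    m = re.match(r"\d+(?:\.\d+)*", v)
    return tuple(int(p) for p in m.group(0).split(".")) if m else ()


def has_fix(v):
    """True if version string v is at or above the assumed fixed release."""
    return parse(v) >= FIXED_IN


def installed_dvc_task_has_fix():
    """Check the installed dvc-task; None means it is not installed."""
    try:
        return has_fix(version("dvc-task"))
    except PackageNotFoundError:
        return None


if __name__ == "__main__":
    print("dvc-task has fix:", installed_dvc_task_has_fix())
```

With the 0.1.0 reported above, `has_fix("0.1.0")` is False, so upgrading dvc-task (or DVC itself, which pulls it in) would be the next step.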

karajan1001 avatar Aug 29 '22 05:08 karajan1001

@karajan1001 I updated dvc and now everything is running like a charm! Thanks

JohnTheDeere avatar Aug 29 '22 11:08 JohnTheDeere

Closing as stale (resolved?)

efiop avatar Oct 04 '22 18:10 efiop