dask-jobqueue icon indicating copy to clipboard operation
dask-jobqueue copied to clipboard

Generalise submission and cancellation arguments

Open AlecThomson opened this issue 1 year ago • 7 comments

Closes #640

AlecThomson avatar May 24 '24 09:05 AlecThomson

I found that the HTCondor class already had something similar. I've added this to the base Job class in core. The downside is that this adds some extra boilerplate.

I've reworked the HTCondorJob and SLURMJob to make use of the new functionality

AlecThomson avatar May 25 '24 01:05 AlecThomson

Also, the test failures in slurm (and maybe others) looks related to this change.

guillaumeeb avatar May 31 '24 15:05 guillaumeeb

CI is now mostly happy on main so I've just merged in so we can see up to date CI failures.

jacobtomlinson avatar Aug 06 '24 13:08 jacobtomlinson

Sorry for the long lead time on this, everyone. I got myself tripped up between the methods on the base and inheriting classes - as noted by @guillaumeeb's comment. I believe I've got this all sorted now. Hopefully the CI will catch any lingering issues.

AlecThomson avatar Aug 07 '24 05:08 AlecThomson

Thanks @AlecThomson! Looks like there are some linting issues (make sure you run pre-commit install) and some slurm issues.

jacobtomlinson avatar Aug 07 '24 10:08 jacobtomlinson

Hmm - looks like some kind of timing error on the test. I don't quite understand why it's failing... 🤔

>                   assert time() < start + QUEUE_WAIT
E                   assert 1723021595.3378592 < (1723021535.2718754 + 60)
E                    +  where 1723021595.3378592 = time()

https://github.com/dask/dask-jobqueue/actions/runs/10278490598/job/28450471395#step:7:425

AlecThomson avatar Aug 07 '24 10:08 AlecThomson

It is calling cluster.scale(n) and then waiting for the cluster to scale. The time assertion is just a timeout, so it's not scaling to the correct number in the time allowed.

Note: We don't use client.wait_for_workers(n) because that checks for "at least n workers" so doesn't wait when scaling down (xref https://github.com/dask/distributed/pull/6374).

jacobtomlinson avatar Aug 07 '24 10:08 jacobtomlinson