Lifetime parameters
When I first tried dask_jobqueue, I was used to LocalCluster from distributed, where **kwargs is really **worker_kwargs, and I was confused about why some options made it through to the worker while others, such as lifetime, did not. Stepping through the code, I saw that the **kwargs for JobQueueCluster are propagated to the options key of the worker argument to SpecCluster and then on to the **kwargs of dask_jobqueue.core.Job, where they are ignored. I wonder if lifetime, lifetime-stagger, and lifetime-restart should be handled by dask_jobqueue.core.Job like other command_args are?
Also, I wonder if there should be a warning for any unhandled **kwargs that reach dask_jobqueue.core.Job, since a user (like me) who sends arguments that make it that far down is probably confused (I don't think arguments can get that far in **kwargs unless they were not used higher up?). It took me a while to realize that, as far as JobQueueCluster is concerned, the dask_jobqueue.core.Job subclasses more or less are the dask Worker. Renaming extra to dask_worker_additional_args, as suggested in #323, would help with some of this confusion.
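To make the suggestion concrete, here is a minimal sketch of what warning on unhandled keyword arguments could look like. It is not dask-jobqueue's actual code: ToyJob and its KNOWN_KWARGS set are hypothetical stand-ins for dask_jobqueue.core.Job and the options it handles.

```python
import warnings


class ToyJob:
    """Hypothetical stand-in for dask_jobqueue.core.Job (not the real class)."""

    # Assumed set of options this toy class knows how to handle.
    KNOWN_KWARGS = {"cores", "memory", "processes", "death_timeout"}

    def __init__(self, **kwargs):
        unknown = sorted(set(kwargs) - self.KNOWN_KWARGS)
        if unknown:
            # Tell the user which arguments would otherwise be silently dropped.
            warnings.warn(
                f"Ignoring unhandled keyword arguments: {unknown}", UserWarning
            )
        # Keep only the options the class actually understands.
        self.options = {k: v for k, v in kwargs.items() if k in self.KNOWN_KWARGS}


# Record warnings so the effect is visible when run as a script.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    job = ToyJob(cores=4, memory="16GB", lifetime="55m")

print([str(w.message) for w in caught])
```

Raising an error instead of warning would be a one-line change (raise ValueError in place of warnings.warn); which of the two is preferable is a policy choice.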
Not sure if you have seen it, but https://github.com/dask/dask-jobqueue/pull/398 is an attempt to raise an error for ignored parameters.
Not much time currently to work on this unfortunately ...
Ah, I was sure @lesteve mentioned something like this in an issue not so long ago! Thanks @lesteve.
Ah, okay, #386 is the same as what I mentioned in my second paragraph above.
> I wonder if lifetime, lifetime-stagger, and lifetime-restart should be handled by dask_jobqueue.core.Job like other command_args are?
It is true that lifetime parameters are important to dask-jobqueue. They could be added for this reason. But maybe we should define some policy, either:
- Remove all kwargs that are not related to the queuing system. All worker kwargs rely on extra.
- Remove all kwargs not related to the queuing system, and use a wildcard **kwargs that we propagate.
- Add the worker kwargs we believe are important for dask-jobqueue, and rely on extra for the others (the closest to the current implementation).
If there are some thoughts here from others...
First a status of declared worker kwargs:
- protocol, security and interface are used for both Scheduler and Worker, so they are declared in JobQueueCluster and propagated to Job implementations.
- Every unused kwarg from JobQueueCluster is propagated to Job implementations, but will generate an error if it is not used there.
- cores and memory are used to compute the nthreads, nworkers and memory-limit Worker args, but are not directly linked to Worker kwargs, as they are also used for Job directives.
- processes is used to compute nthreads, and to get the nworkers Worker kwarg.
- name, nanny, death_timeout and local_directory are used to add the associated CLI Worker args.
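The status above can be sketched roughly as follows. This is an illustration under assumptions, not the code in dask_jobqueue.core: worker_cli_args is a hypothetical helper, memory is taken as a plain byte count, and the flag names follow the dask-worker CLI as I understand it (older releases used --nprocs where --nworkers appears here).

```python
def worker_cli_args(cores, memory_bytes, processes=1, name=None,
                    nanny=True, death_timeout=None, local_directory=None):
    """Build a dask-worker-style argument list from job-level settings.

    Hypothetical helper for illustration only.
    """
    # cores and memory describe the whole job, so they are split per process.
    nthreads = cores // processes
    memory_limit = memory_bytes // processes
    args = ["--nthreads", str(nthreads), "--memory-limit", str(memory_limit)]
    if processes > 1:
        args += ["--nworkers", str(processes)]
    # The remaining kwargs map one-to-one onto CLI flags.
    if name is not None:
        args += ["--name", name]
    args.append("--nanny" if nanny else "--no-nanny")
    if death_timeout is not None:
        args += ["--death-timeout", str(death_timeout)]
    if local_directory is not None:
        args += ["--local-directory", local_directory]
    return args


example_args = worker_cli_args(cores=8, memory_bytes=16 * 10**9, processes=2,
                               name="worker-0", death_timeout=60)
print(" ".join(example_args))
```

Note that lifetime and its companions do not appear anywhere in this mapping, which is why they currently have to go through extra.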
Then, a clarification:
> Remove all kwargs not related to queuing system, and use a wildcard **kwargs that we propagate.
We do not use the Nanny or Worker class, and won't. With SpecCluster and the use of Job, that is not easy to do, so we can forget this solution.
So that leaves us with:
- Remove all kwargs that are not related to the queuing system. All worker kwargs rely on worker_extra_args. This means removing name, nanny, death_timeout and local_directory. We still need processes and memory, I guess.
- Add the worker kwargs we believe are important for dask-jobqueue, and rely on extra for the others (the closest to the current implementation): are lifetime, lifetime-stagger, and lifetime-restart important enough? What other keywords?
My opinion here is to stay with the current situation. Maybe we already have too many optional kwargs for Workers, but I don't want to remove any of them, as they are really important.
The only other occurrences of extra kwargs in the docs are the lifetime* ones and resources. I think they are advanced use cases, and can remain outside of the Job implementation constructor kwargs.
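For context, this is roughly how those advanced options are passed through extra (referred to as worker_extra_args elsewhere in this thread). It is a sketch only: every value is an illustrative assumption, and the cluster construction is left commented out because it needs dask_jobqueue and a real batch scheduler.

```python
# Keyword arguments one might pass to a JobQueueCluster subclass such as SLURMCluster.
# All values are illustrative assumptions, not recommendations.
cluster_kwargs = dict(
    cores=4,
    memory="16GB",
    walltime="01:00:00",
    # Worker-only options without a dedicated kwarg go through `extra`,
    # which is appended to the dask-worker command line.
    extra=[
        "--lifetime", "55m",         # stop the worker shortly before the walltime
        "--lifetime-stagger", "4m",  # spread shutdowns so workers do not all stop at once
        "--resources", "GPU=2",      # abstract resources advertised to the scheduler
    ],
)

# from dask_jobqueue import SLURMCluster
# cluster = SLURMCluster(**cluster_kwargs)

print(" ".join(cluster_kwargs["extra"]))
```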
I can propose a rule to decide whether to add a Worker kwarg to our Job kwarg:
If a Worker kwarg default needs to be updated for a good behavior of Dask on a job queueing system or if this kwarg needs to be coherent between Scheduler and Workers, then it must be declared as a Job kwarg.
death_timeout, local_directory, nthreads, memory-limit and nworkers (and so the related cores, memory and processes) clearly fall into this category. So does interface, as do protocol and security, which are needed by both Scheduler and Workers. name is an exception: it is required by the SpecCluster mechanism. That leaves nanny, which could be removed, but is also of high importance in some setups.
Others proposed here or in the doc do not match the rule above.