
Provide custom header

mrocklin opened this issue 6 years ago · 14 comments

Currently we take Pythonic keyword arguments and use them to generate a job_header. While often convenient, this transformation can also get in the way. Sometimes things would be simpler if we just provided our own job header directly. This can be particularly convenient for picky clusters, or for IT users that are more comfortable with job scripts than with Python arguments.

Should we accept a job_header= keyword or configuration value to override the job_header construction?
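
For illustration, a minimal sketch of what such an override could look like (the job_header= keyword is the proposal here, not an existing parameter, and the header values are made up):

    from dask_jobqueue import PBSCluster

    # Hypothetical: hand dask-jobqueue a complete header verbatim instead of
    # having it generated from cores=/memory=/queue= keyword arguments.
    custom_header = """#PBS -q regular
    #PBS -A my_project
    #PBS -l select=1:ncpus=36:mem=100GB
    #PBS -l walltime=02:00:00"""

    cluster = PBSCluster(
        cores=36,
        memory="100GB",
        job_header=custom_header,  # hypothetical keyword proposed in this issue
    )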

mrocklin avatar Aug 20 '19 20:08 mrocklin

No strong feeling on this one. My only small concern is that it adds a bit of complexity with a non-trivial interaction with the other parameters.

If we do that, it'd be good to be very clear in the docstring and mention which parameters are overridden by using job_header. There will be parameters for which this is not 100% clear cut: for example, should job_extra be ignored if job_header is specified? What about config_name, where some parameters from the named config would be used but others silently ignored? Another possibility is to have a job_header_template, so that you can choose to reuse some parameters, e.g. cores and memory, rather than building the header entirely by hand.

Taking a step back, at the moment the job_script is built from small pieces in each cluster class (with some logic in the base class as well), and sometimes it is quite hard to get an overall view of what the job_script is going to look like. Having a jinja template (or something simpler if possible) was discussed at one point; maybe that's a good idea to investigate. Maybe it would help to have an overall view of the job_script (including job_header) built from the Python arguments, and reduce the need for an additional job_header argument?

Side-comment: IIRC for picky clusters you can use job_extra (by the way, job_extra and env_extra are not very good names ...). job_extra lines go at the end of the header, and we have had reports where they overrode earlier header values. Note: this is the kind of edge case that will not work uniformly across cluster configurations and job queue schedulers.
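
For reference, a sketch of that job_extra workaround (job_extra was the keyword name at the time; the directive values here are illustrative):

    from dask_jobqueue import PBSCluster

    # job_extra appends raw directives at the end of the generated header,
    # which is why on some schedulers they can override earlier lines.
    cluster = PBSCluster(
        cores=36,
        memory="100GB",
        job_extra=["-l select=1:ncpus=36:mem=100GB", "-m abe"],
    )
    print(cluster.job_script())  # inspect where the extra lines end up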

lesteve avatar Aug 23 '19 07:08 lesteve

I think that implementing #133 and giving users the ability to override the template would be a good solution.

guillaumeeb avatar Aug 25 '19 13:08 guillaumeeb

I ran into problems with LSFCluster where it was injecting lines into the header that LSF would not accept. I wanted something like a job_remove= keyword. But this can probably be fixed by improving the LSFCluster class and removing some options that are, perhaps, too specific to the clusters of the people who maintain that cluster class.

Additionally, the terms that the group used were very different from project and queue. They had other terms like partition. We started from a job submission script that we knew worked, and had to reverse engineer which Python parameters affected which job submission script outputs. It felt incorrect. But this can probably be fixed with better documentation.

mrocklin avatar Aug 25 '19 15:08 mrocklin

One approach would be to allow users to explicitly skip lines that their cluster manager finds troublesome. I put this into a commit in the spec-rewrite branch for now (I need it for testing) here: https://github.com/dask/dask-jobqueue/pull/307/commits/172ecd43917f2f35a6d51d1fcb825267521425f1
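
Roughly, the commit filters out any generated header line matching a user-supplied string; this later landed as the header_skip kwarg (see the closing comment). A simplified sketch of that logic, not the actual implementation:

    def job_script(self):
        # Drop generated header lines containing any of the user-supplied
        # header_skip substrings, e.g. header_skip=["-R", "-M", "-n"].
        lines = self.job_header.split("\n")
        lines = [
            line for line in lines
            if not any(skip in line for skip in self.header_skip)
        ]
        header = "\n".join(lines)
        ...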

What do people think of this approach?

mrocklin avatar Aug 25 '19 23:08 mrocklin

My feeling is that currently there are (implicitly) two tiers of support for the job schedulers:

  • support: SGE, PBS, SLURM
  • best-effort support: all the others including LSF

In my mind, moving a scheduler to the "support" tier needs at least a docker-compose CI set-up. Another great thing to have is one (ideally more) active person with access to this job scheduler so that modifications can be tested on a real cluster.

Out of curiosity, what kind of lines did you want to be removed from the header?

lesteve avatar Aug 26 '19 11:08 lesteve

To get things to work on Summit (Oak Ridge National Labs machine, currently #1 in the Top 500) I had to remove -R, -M and -n.
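
For context: in LSF, -n sets the number of job slots, -M a memory limit, and -R a resource-requirement string, so the generated header would contain lines along these lines (values illustrative, not the exact LSFCluster output):

    #BSUB -n 8
    #BSUB -M 16000
    #BSUB -R "span[hosts=1]"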

mrocklin avatar Aug 26 '19 14:08 mrocklin

One approach would be to allow users to explicitly skip lines that their cluster manager finds troublesome. I put this into a commit in the spec-rewrite branch for now (I need it for testing) here: 172ecd4 What do people think of this approach?

I have to say I don't like it. I really think we should provide a templated header or a templated job script, and give the possibility to replace it.

guillaumeeb avatar Aug 26 '19 14:08 guillaumeeb

So we would have something like:

header = """
#PBS -l walltime={walltime}
#PBS -l select=1:ncpus={nthreads}:mem={memory}
...
"""

And then rather than build up the job_header we just plug it in?

def job_script(self):
    # Render the user-supplied header template with the collected options.
    header = self.job_header.format(**self.header_options)
    ...

mrocklin avatar Aug 26 '19 14:08 mrocklin

If so, I'm in favor

mrocklin avatar Aug 26 '19 14:08 mrocklin

It might be tricky though. I think that there are some lines that are added only conditionally.

mrocklin avatar Aug 26 '19 14:08 mrocklin

Yes, that's the idea.

Using Jinja2 should allow conditionals!
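
A minimal sketch of what a Jinja2-templated header could look like (the template and variable names are illustrative, not an existing dask-jobqueue API):

    from jinja2 import Template

    header_template = Template("""\
    #PBS -l walltime={{ walltime }}
    {% if resource_spec -%}
    #PBS -l {{ resource_spec }}
    {% else -%}
    #PBS -l select=1:ncpus={{ cores }}:mem={{ memory }}
    {% endif -%}
    """)

    # The conditional keeps a user-supplied resource_spec and the generated
    # select= line from both appearing in the header.
    print(header_template.render(walltime="02:00:00", cores=36, memory="100GB"))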

guillaumeeb avatar Aug 26 '19 15:08 guillaumeeb

My general feeling is that LSFCluster is a bit rough around the edges. I would be inclined towards improving LSFCluster in the short term to do what you need and leaving the jinja2/templating option for later.

To get things to work on Summit (Oak Ridge National Labs machine, currently #1 in the Top 500) I had to remove -R, -M and -n.

@mrocklin could you let us know what you replaced -R, -M and -n with? I am guessing you still need to specify somehow:

  • the memory used by your job
  • the number of cores used by your job
  • that all of the allocated cores should be on the same node

It might be tricky though. I think that there are some lines that are added only conditionally.

yes, for example setting resource_spec would replace your select=1... line I think.

I think at the moment there are a few different ways to get the job header to make the job scheduler happy, and it is not very consistent across job schedulers. For example, resource_spec does not exist in LSFCluster and SLURMCluster but exists in SGECluster and PBSCluster.
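
Concretely, a sketch of the asymmetry (values illustrative):

    from dask_jobqueue import PBSCluster, SLURMCluster

    # PBSCluster accepts a raw resource_spec that replaces the generated
    # select= line in the header...
    pbs = PBSCluster(cores=36, memory="100GB",
                     resource_spec="select=1:ncpus=36:mem=100GB")

    # ...while SLURMCluster has no such keyword; extra directives go through
    # job_extra instead (the keyword name at the time).
    slurm = SLURMCluster(cores=36, memory="100GB",
                         job_extra=["--exclusive"])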

Using Jinja2 should allow conditionals!

It would but IMO the readability of the jinja2 template is also something important to keep in mind.

lesteve avatar Aug 26 '19 15:08 lesteve

@mrocklin could you let us know what you replaced -R, -M and -n with? I am guessing you still need to specify somehow:

Nothing. I suspect that it is defaulting to giving me one full node per job

mrocklin avatar Aug 26 '19 15:08 mrocklin

Nothing. I suspect that it is defaulting to giving me one full node per job

😄 Yeah, with Summit, there's no point using less than a full node!!

guillaumeeb avatar Aug 26 '19 15:08 guillaumeeb

Closing this: the header_skip kwarg is implemented; for the rest we should rely on a Jinja template.

guillaumeeb avatar Aug 30 '22 06:08 guillaumeeb