Support for OpenPBS/PBS Pro Scheduler

Open nmichlo opened this issue 5 years ago • 7 comments

It would be highly useful to add support for the OpenPBS/PBS Pro scheduler. As far as I understand, SLURM is very similar?

From a very brief look at the codebase, it looks like all the SLURM-related code is already split out into its own module. I would offer to try to contribute this feature, depending on available time and with appropriate direction, although I am not intimately familiar with either scheduler.

Downstream this would also lead to support for the hydra submitit plugin.

nmichlo avatar Sep 18 '20 18:09 nmichlo

Hello, this would be great.

I think the best thing would be to implement this as a plugin: because I don't have access to a cluster with OpenPBS, I won't be able to test it thoroughly or commit to maintaining it. To detect breaking changes in the API, we can set up some CI integration with your plugin.

There is already a very small guide inside the documentation. https://github.com/facebookincubator/submitit/blob/master/docs/plugins.md

The main assumption in submitit is that there exists some kind of filesystem on the cluster, that is accessible both by the host and by the job.

To get started, the easiest approach is to fork this repo and create a "pbs" folder inside it (next to "slurm"). We'll work out how to move it to a plugin later on. For now, add your plugin directly here: https://github.com/facebookincubator/submitit/blob/89fbe3144fa5ad3674178ecba8fd53da21f072ca/submitit/core/plugins.py#L31

You'll need to implement 4 classes:

  • PbsExecutor: takes a Python function, starts the job on the cluster and returns a job object with a job id. To get started, you can look at the "mixin" PicklingExecutor class and the SlurmExecutor. The simplest way to test it is to use the PbsExecutor directly instead of integrating with the AutoExecutor. The PicklingExecutor, which you should inherit from, does the following:

    • makes a pickle from the function
    • calls self._submit_command(self._submitit_command_str) (to be implemented by you)
      • typically you just run python -u -m submitit.core._submit WORKING_FOLDER on the cluster
      • return a Job instance with the correct job id
    • copies the pickle to job.submitted_pickle
  • PbsJob: represents a job, with methods to get its current state and to cancel it. The main thing you should override at first is get_info, which gets the status of the job.

  • PbsJobEnvironment: reads, from inside the job, the environment variables set by PBS. You need to implement the job_id method. This is the first thing called by the job when it starts (see submitit_main) to find the files with the pickled function.

  • PbsInfoWatcher: this is used to provide some caching/batching for job.get_info(). You don't need it for a prototype, but you'll need it later on. You can look at the DebugInfoWatcher to make a dummy InfoWatcher.
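
As a starting point, here is a minimal sketch of the PBS-specific logic that the four classes above would end up wrapping. It assumes standard PBS behaviour: qsub prints the new job identifier on stdout, the running job sees that identifier in the PBS_JOBID environment variable, and qstat -f reports a job_state line. All function names are hypothetical and are not part of submitit; wiring them into subclasses of PicklingExecutor, Job, JobEnvironment and InfoWatcher is left out, and the resource directives will need adapting to your cluster.

```python
import re
from typing import Mapping, Optional

# NOTE: every function name here is hypothetical; none of this is submitit API.


def make_submission_file_text(
    command: str, folder: str, job_name: str = "submitit", walltime: str = "01:00:00"
) -> str:
    """Executor side: build a PBS batch script that runs `command`.

    `folder` is assumed to live on the filesystem shared by the host and the job.
    """
    lines = [
        "#!/bin/bash",
        f"#PBS -N {job_name}",
        f"#PBS -l walltime={walltime}",
        f"#PBS -o {folder}/stdout.log",
        f"#PBS -e {folder}/stderr.log",
        "",
        command,
    ]
    return "\n".join(lines) + "\n"


def job_id_from_qsub_output(output: str) -> str:
    """Executor side: qsub prints an identifier such as '12345.pbs-server'."""
    job_id = output.strip()
    if not re.fullmatch(r"\d+(\[\d*\])?(\.\S+)?", job_id):
        raise ValueError(f"Could not parse a job id from qsub output: {output!r}")
    return job_id


def job_id_from_env(environ: Mapping[str, str]) -> str:
    """Job side: PBS exports the identifier of the running job as PBS_JOBID."""
    return environ["PBS_JOBID"]


def job_state_from_qstat(qstat_full_output: str) -> Optional[str]:
    """Watcher side: extract the state letter (e.g. 'Q' queued, 'R' running)
    from the text printed by `qstat -f <job_id>`; None if no state is found."""
    match = re.search(r"^\s*job_state\s*=\s*(\S+)\s*$", qstat_full_output, re.MULTILINE)
    return match.group(1) if match else None
```

The command passed to make_submission_file_text would typically be the python -u -m submitit.core._submit WORKING_FOLDER line mentioned above. Keeping the same identifier format on both the qsub side and the PBS_JOBID side matters, since the job uses its id to find the pickled function.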

gwenzek avatar Sep 21 '20 09:09 gwenzek

Thank you for the detailed response. That gives me a good place to start, if I can get around to it, hopefully!

nmichlo avatar Sep 26 '20 14:09 nmichlo

Hi all, I am just wondering if anyone has already attempted to create a PBS plugin? In any case, I will start looking into this. Best,

Aneoshun avatar Mar 22 '23 14:03 Aneoshun

@Aneoshun Unfortunately not, due to time constraints. I had started looking into it but made no real progress; fortunately, I was able to use an alternative scheduler at the time.

nmichlo avatar Mar 23 '23 09:03 nmichlo

> Hi all, I am just wondering if anyone has already attempted to create a PBS plugin? In any case, I will start looking into this. Best,

I would be interested in that too; if someone finds a clean solution, it would be nice if they shared it here :)

alexisthual avatar Mar 23 '23 17:03 alexisthual

@nmichlo I have access to a PBS cluster. However, I am only able to use a single node with 4 GPUs. I have tried 'Horovod' as well, but I found it very slow and failed to train a model successfully. Could you please tell me which alternative scheduler you used?

Alam45 avatar Jun 22 '23 08:06 Alam45

@Alam45, I switched back to SLURM. Sorry I couldn't be of more help.

However, I would still love to see this feature implemented.

nmichlo avatar Jun 23 '23 10:06 nmichlo