Support for OpenPBS/PBS Pro Scheduler
It would be highly useful to add support for the OpenPBS/PBS Pro scheduler; as far as I understand, SLURM is quite similar?
From a very brief look at the codebase, it already looks like all the SLURM-related code is split out into its own module. I would offer to try to contribute to this feature, depending on available time and with appropriate direction, although I am not intimately familiar with either scheduler.
Downstream, this would also enable support in the Hydra submitit plugin.
Hello, this would be great.
I think the best thing would be to implement this as a plugin: since I don't have access to a cluster with OpenPBS, I won't be able to test it thoroughly or commit to maintaining it. To detect breaking changes in the API, we can set up some CI integration with your plugin.
There is already a small guide in the documentation: https://github.com/facebookincubator/submitit/blob/master/docs/plugins.md
The main assumption in submitit is that there exists some kind of filesystem on the cluster that is accessible to both the host and the job.
To get you started, the easiest path is to fork this repo and create a "pbs" folder inside it (next to "slurm"). We'll work out how to move it to a plugin later on. For now, add your plugin directly here: https://github.com/facebookincubator/submitit/blob/89fbe3144fa5ad3674178ecba8fd53da21f072ca/submitit/core/plugins.py#L31
You'll need to implement 4 classes:

- `PbsExecutor`: takes a Python function, starts the job on the cluster and returns a job object with a job id. To get you started, you can look at the "mixin" `PicklingExecutor` class and the `SlurmExecutor`. The simplest way to test it is by directly using the `PbsExecutor` instead of integrating with the `AutoExecutor`. The `PicklingExecutor`, which you should inherit from, does the following:
  - makes a pickle from the function
  - calls `self._submit_command(self._submitit_command_str)` (to be implemented by you)
    - typically you just run `python -u -m submitit.core._submit WORKING_FOLDER` on the cluster and return a Job instance with the correct job id
  - copies the pickle to `job.submitted_pickle`
- `PbsJob`: represents a job, with methods to get its current state and to cancel it. The main thing you should override at first is `get_info`, to get the status of the job.
- `PbsJobEnvironment`: reads the job environment to extract the variables set by PBS. You need to implement the `job_id` method. This is the first thing called by the job when it starts (see `submitit_main`) to find the files with the pickled function.
- `PbsInfoWatcher`: this is used to provide some caching/batching for `job.get_info()`. You don't need it for a prototype, but you'll need it later on. You can look at the `DebugInfoWatcher` to make a dummy InfoWatcher.
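To make the PBS-specific pieces above more concrete, here is a rough standalone sketch of the parts a prototype would need: parsing the job id from `qsub` output, parsing the job state from `qstat -f`, and reading the `PBS_*` variables inside a running job. The helper names (`parse_qsub_output`, `parse_qstat_state`) and the exact variable mapping are hypothetical, not submitit API; the output formats and variable names should be double-checked against your cluster's PBS version:

```python
import os
import re


def parse_qsub_output(output: str) -> str:
    """qsub normally prints the full job id (e.g. '1234.pbsserver') on stdout.
    PbsExecutor._submit_command would run qsub, then use this to build the Job."""
    return output.strip().splitlines()[-1].strip()


def parse_qstat_state(qstat_f_output: str) -> str:
    """Extract the job state (Q, R, E, ...) from `qstat -f JOB_ID` output,
    which contains a line like 'job_state = R'."""
    match = re.search(r"job_state\s*=\s*(\w+)", qstat_f_output)
    return match.group(1) if match else "UNKNOWN"


class PbsJobEnvironment:
    """Minimal stand-in for the JobEnvironment subclass: maps the fields
    submitit needs to the PBS_* variables set inside a running job.
    (Variable names taken from PBS Pro; verify them on your cluster.)"""

    _env = {
        "job_id": "PBS_JOBID",
        "array_task_id": "PBS_ARRAY_INDEX",
        "nodes": "PBS_NODEFILE",
    }

    def job_id(self) -> str:
        # First thing the job calls on startup to locate its pickled function.
        return os.environ["PBS_JOBID"]
```

In a real plugin these would live inside the `PbsExecutor`/`PbsJob`/`PbsJobEnvironment` classes described above rather than as free functions; they are split out here only so the parsing logic is easy to test without a cluster.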
Thank you for the detailed response. That gives me a good place to start; hopefully I can get around to it!
Hi all, I am just wondering if anyone has already attempted to create a PBS plugin? In any case, I will start looking into this. Best,
@Aneoshun Unfortunately not, due to time constraints. I had started looking into it, but I made no real progress; fortunately, I was able to use an alternative scheduler at the time.
I would be interested in that too; if someone finds a clean solution, it'd be nice if they shared it here :)
@nmichlo I have access to a PBS cluster. However, I am only able to use a single node with 4 GPUs. I have tried Horovod as well, but I found it very slow and failed to train a model successfully. Could you please share the alternative scheduler that you've used?
@Alam45, I switched back to SLURM. Sorry I couldn't be of more help.
However, I would still love to see this feature implemented.