Feature: Handle multi-cluster environments

Open angel-devicente opened this issue 2 years ago • 4 comments

I'm using reportseff in a multi-cluster environment, from a machine that runs slurmdbd while the slurmctld daemons run on other machines. Running sacct there is no problem, since I can do sacct -M <server> or even sacct -M all, which is very handy for central administration of the multi-cluster environment.

With reportseff, I can do something similar with reportseff --extra-args "-M <server>", but as it stands this does not work, because db_inquirer.py issues the command command_args = "scontrol show partition".split(), which would need the -M <server> specification as well.

Giving the whole extra-args to the scontrol command is not, I suppose, a good option, since sacct and scontrol don't share the same args. Perhaps a new "-M" parameter could be added, so that if provided it is simply added to both the sacct and scontrol commands? I would add it myself, but I'm not sure what is done with the partition information in reportseff, or how you would handle the situation when "-M" refers to more than one server.
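A minimal sketch of that idea, assuming a hypothetical helper named with_cluster and illustrative base commands (neither is taken from reportseff's source; only the "scontrol show partition" call is quoted from db_inquirer.py above). It places "-M <cluster>" right after the executable name, so the same helper can serve both sacct and scontrol:

```python
from typing import List, Optional


def with_cluster(base_args: List[str], cluster: Optional[str] = None) -> List[str]:
    """Return a copy of base_args with "-M <cluster>" inserted after the executable.

    Hypothetical helper for illustration; not part of reportseff.
    """
    args = list(base_args)
    if cluster:
        # Put the option directly after the program name so it is parsed
        # as an option rather than as part of the subcommand.
        args[1:1] = ["-M", cluster]
    return args


# Illustrative base commands; the real sacct arguments in reportseff may differ.
SACCT_BASE = ["sacct", "-P", "-n", "--format=JobID,State,Elapsed"]
SCONTROL_BASE = "scontrol show partition".split()

print(with_cluster(SACCT_BASE, "cluster_a"))
print(with_cluster(SCONTROL_BASE, "cluster_a"))
print(with_cluster(SCONTROL_BASE))  # no cluster given: command is unchanged
```

Keeping the cluster as a dedicated parameter, rather than reusing extra-args, avoids passing sacct-only options to scontrol.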

angel-devicente avatar Oct 01 '23 09:10 angel-devicente

Can you specify what is "not working"? Currently, scontrol is only used to get the time limits for each partition. If you don't use partition time limits it should be a no-op; on multiple clusters it could cause some inaccuracies if the same partition is specified multiple times with different time limits.

Here are my first thoughts:

  • An option to disable scontrol calls if the problem is the slurmctld daemon is running on other machines.
  • An option to specify a server to reportseff that would append it to sacct and scontrol.

The main issue I foresee with this is that a call with -M all may clobber job ids. reportseff uses the jobid as a unique identifier, and if two clusters each have a job with the same id, the sacct parsing will get mangled.
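To illustrate the collision, here is a small sketch using made-up rows in a Cluster|JobID|State layout. The rows and both function names are hypothetical and this is not how reportseff stores jobs; it only shows why a jobid-only key loses data and how a (cluster, jobid) key would avoid that:

```python
from typing import Dict, Iterable, Tuple

# Made-up rows for illustration only: two clusters reuse job id 1001.
rows = [
    "cluster_a|1001|COMPLETED",
    "cluster_b|1001|FAILED",
]


def index_by_jobid(lines: Iterable[str]) -> Dict[str, str]:
    """Key on the job id alone: a later row overwrites an earlier one."""
    jobs: Dict[str, str] = {}
    for line in lines:
        _cluster, jobid, state = line.split("|")
        jobs[jobid] = state
    return jobs


def index_by_cluster_and_jobid(lines: Iterable[str]) -> Dict[Tuple[str, str], str]:
    """Key on (cluster, job id): both rows survive."""
    jobs: Dict[Tuple[str, str], str] = {}
    for line in lines:
        cluster, jobid, state = line.split("|")
        jobs[(cluster, jobid)] = state
    return jobs


print(index_by_jobid(rows))              # {'1001': 'FAILED'}
print(index_by_cluster_and_jobid(rows))  # both entries kept
```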

Feel free to add the server option. Handling duplicate job ids is more challenging as it would require a significant rewrite of jobs and job collections (which is needed to handle retries anyways).

troycomi avatar Oct 02 '23 12:10 troycomi

By "not working" I meant that since I'm running reportseff on a machine where no slurmctld is running, the scontrol command just hangs.

Option two is what I had in mind: being able to add something like "-M", which would then be passed on to both the sacct and scontrol commands inside reportseff.

When I have some time, I will add this and submit a PR.

angel-devicente avatar Oct 04 '23 14:10 angel-devicente

Added a PR with --cluster option

Lafond-LapalmeJ avatar Sep 24 '24 15:09 Lafond-LapalmeJ

Ah, excellent, many thanks for the PR. If I have some time next week, I'll try it.

angel-devicente avatar Sep 28 '24 07:09 angel-devicente

Should be implemented in v2.8.0. Thanks @Lafond-LapalmeJ for the work!

Please open an issue if things don't look as expected.

troycomi avatar Oct 03 '24 14:10 troycomi