
update to run_sorter_jobs() and slurm

Open MarinManuel opened this issue 1 year ago • 7 comments

I am trying to run spikeinterface sorters on a HPC cluster and needed to pass specific parameters to slurm/sbatch.

I added the option to pass arbitrary arguments to sbatch in the engine_kwargs.

Let me know if that's causing any issues
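To illustrate the behaviour this PR adds (as described later in the thread, keys in `sbatch_args` are automatically prefixed with `--` and passed to `sbatch`), here is a minimal sketch of how such a dict could map onto the command line. `build_sbatch_command` is an illustrative helper, not actual spikeinterface API:

```python
# Sketch only: each key in `sbatch_args` is prefixed with "--" and
# appended to the sbatch invocation, followed by the script path.
def build_sbatch_command(script_path, sbatch_args=None):
    """Flatten a dict of sbatch options into an argv list."""
    command = ["sbatch"]
    for key, value in (sbatch_args or {}).items():
        command.extend([f"--{key}", str(value)])
    command.append(script_path)
    return command

cmd = build_sbatch_command(
    "si_script_0.py",
    {"cpus-per-task": 32, "mem": "32G", "gres": "gpu:1"},
)
print(cmd)
# ['sbatch', '--cpus-per-task', '32', '--mem', '32G', '--gres', 'gpu:1', 'si_script_0.py']
```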

MarinManuel avatar Jun 29 '24 20:06 MarinManuel

@MarinManuel thank you! This is great!

The SLURM deployment has been unexplored on our side, so we are very glad that it works for you.

Would you mind sharing the script that you are using? I'd love to add a How-To page on deploying multiple spike sorting jobs on SLURM.

alejoe91 avatar Jun 30 '24 07:06 alejoe91

Hey @MarinManuel, sorry for the delay in testing this locally. I tried it on our cluster and it seems to be working well! I had a couple of other comments below; it might also be worth adding a test for this. I have pasted one in the dropdown below. I don't think I can push to your fork, but feel free to add it to src/spikeinterface/sorters/tests/test_launcher.py underneath test_run_sorter_jobs_slurm.

If you think of any functionality this is missing, please let me know / feel free to expand the test. This will also require adding pytest-mock to the dev dependencies in pyproject.toml in the main repo folder. It would be cool to also test the generated .py files sometime, but that's outside the scope of this PR.

The test
import tempfile  # needed at the top of test_launcher.py

from spikeinterface.sorters import run_sorter_jobs

def test_run_sorter_jobs_slurm_kwargs(mocker, tmp_path, job_list):
    """
    Mock `subprocess.run()` to check that engine_kwargs are
    propagated to the call as expected.
    """
    # First, mock `subprocess.run()`, set up a call to `run_sorter_jobs`
    # then check the mocked `subprocess.run()` was called with the
    # expected signature. Two jobs are passed in `jobs_list`, first
    # check the most recent call.
    mock_subprocess_run = mocker.patch(
        "spikeinterface.sorters.launcher.subprocess.run"
    )

    tmp_script_folder = tmp_path / "slurm_scripts"

    engine_kwargs = dict(
        tmp_script_folder=tmp_script_folder,
        sbatch_args={
            "cpus-per-task": 32,
            "mem": "32G",
            "gres": "gpu:1",
            "any_random_kwarg": 12322,
        }
    )
    run_sorter_jobs(
        job_list,
        engine="slurm",
        engine_kwargs=engine_kwargs,
    )

    script_0_path = f"{tmp_script_folder}/si_script_0.py"
    script_1_path = f"{tmp_script_folder}/si_script_1.py"

    expected_command = [
        "sbatch", "--cpus-per-task", "32", "--mem", "32G", "--gres", "gpu:1", "--any_random_kwarg", "12322", script_1_path
    ]
    mock_subprocess_run.assert_called_with(expected_command, capture_output=True, text=True)


    # Next, check the first call (which sets up `si_script_0.py`)
    # also has the expected arguments.
    expected_command[9] = script_0_path
    assert mock_subprocess_run.call_args_list[0].args[0] == expected_command

    # Next, check that defaults are used properly when no kwargs are
    # passed. This will default to `_default_engine_kwargs` as
    # set in `launcher.py`
    run_sorter_jobs(
        job_list,
        engine="slurm",
        engine_kwargs={"tmp_script_folder": tmp_script_folder},
    )
    expected_command = [
        "sbatch", "--cpus-per-task", "1", "--mem", "1G", script_1_path
    ]
    mock_subprocess_run.assert_called_with(expected_command, capture_output=True, text=True)

    # Finally, check that the `tmp_script_folder` is generated on the
    # fly as expected. A random foldername is generated, just check that
    # the folder to which the scripts are saved is in the `tempfile` format.
    run_sorter_jobs(
        job_list,
        engine="slurm",
        engine_kwargs=None,  # TODO: test defaults
    )
    tmp_script_folder = "_".join(tempfile.mkdtemp(prefix="spikeinterface_slurm_").split("_")[:-1])
    assert tmp_script_folder in mock_subprocess_run.call_args_list[-1].args[0][5]

Some other small things:

  • I think that the other, currently existing test (test_run_sorter_jobs_slurm) might need to be updated to the new way of passing the sbatch_args.
  • I think it is worth raising an error if any key other than tmp_script_folder or sbatch_args appears in engine_kwargs. The reason is that this PR is backward-incompatible with previous versions but will not error; it will silently fail to pass the kwargs on to SLURM (e.g. if cpus_per_task is passed to engine_kwargs directly, as it was previously, it will just be ignored). So it is good to raise an error here to let the user know about the new API.
  • Also, might as well assert / raise an error so that engine_kwargs is only not None when engine="slurm".
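The guard suggested in the first bullet could look something like the sketch below; `validate_slurm_engine_kwargs` is a hypothetical helper name, not the actual implementation in launcher.py:

```python
# Sketch: reject any engine_kwargs key other than `tmp_script_folder` or
# `sbatch_args`, so callers of the old API (e.g. passing `cpus_per_task`
# directly) fail loudly instead of being silently ignored.
ALLOWED_SLURM_KEYS = {"tmp_script_folder", "sbatch_args"}

def validate_slurm_engine_kwargs(engine_kwargs):
    unknown = set(engine_kwargs) - ALLOWED_SLURM_KEYS
    if unknown:
        raise ValueError(
            f"Unexpected engine_kwargs keys {sorted(unknown)}. "
            "Pass sbatch options in the 'sbatch_args' dict, e.g. "
            "engine_kwargs={'sbatch_args': {'cpus-per-task': 32}}."
        )

# An old-style call now raises instead of being silently dropped:
try:
    validate_slurm_engine_kwargs({"cpus_per_task": 32})
except ValueError as e:
    print(e)
```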

JoeZiminski avatar Jul 29 '24 14:07 JoeZiminski

Hi @JoeZiminski, I tried to integrate your comments into my PR. Hopefully I did it right. The only thing I did not do was

Also, might as well assert / raise an error so that engine_kwargs is only not None when engine="slurm".

I wasn't entirely sure what you meant, but I think sbatch would launch with default parameters if engine_kwargs=None, so that shouldn't be a problem? Also, I didn't know how to write a test to check that the exception was raised correctly.
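For reference, checking that an exception is raised is straightforward with pytest.raises. A minimal, self-contained sketch (the validation function here is a stand-in for the real check, not the actual PR code):

```python
import pytest

def raises_on_bad_key(engine_kwargs):
    # Stand-in for the validation this PR adds: only these two
    # keys are accepted for the slurm engine.
    if set(engine_kwargs) - {"tmp_script_folder", "sbatch_args"}:
        raise ValueError("Unexpected engine_kwargs key")

def test_bad_key_raises():
    # The test passes only if the expected ValueError is raised
    # and its message matches the given pattern.
    with pytest.raises(ValueError, match="Unexpected engine_kwargs"):
        raises_on_bad_key({"cpus_per_task": 32})
```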

MarinManuel avatar Aug 16 '24 01:08 MarinManuel

Hey @MarinManuel, so sorry for the delay with this! I'm around again now, please feel free to ping me and we can merge this ASAP.

Just a couple of small things I think:

  • please see a couple of small comments on the tests
  • there are a few GitHub suggestions (e.g. from Zach) about [edit: above]. These can be committed by selecting 'Commit suggestion', changing the commit message (e.g. just 'docstring fix' or something is okay) and then 'Commit changes'. It's a cool GitHub feature!
  • Any other comments that are outdated or have been addressed can be resolved.

Also, might as well assert / raise an error so that engine_kwargs is only not None when engine="slurm"

Sorry, this was the other way round: the concern is a user who is unfamiliar with SLURM (e.g. using 'dask', say) passing engine_kwargs under the false impression that they will do something for all engines. Alternatively, we could rename engine_kwargs to slurm_engine_kwargs, as they only relate to SLURM. Actually, this is better; maybe they can be called slurm_engine_kwargs or just slurm_kwargs?

After this I can do one last pass and test locally again then I think it's good to go! Apologies the review took so long, this new feature is a super-useful addition.

JoeZiminski avatar Sep 10 '24 10:09 JoeZiminski

Hi @JoeZiminski Please let me know how that looks

MarinManuel avatar Sep 11 '24 18:09 MarinManuel

Hi @MarinManuel, I'm incredibly sorry, I misunderstood from the docstring how engine_kwargs was intended; I thought it was just for SLURM, based on the below:

old docstring
engine_kwargs : dict
        In the case of engine="slurm", possible kwargs are:
        - tmp_script_folder: str, default None
            the folder in which the job scripts are created. Default: directory created by
            the `tempfile` library
        - sbatch_args: dict
          arguments to be passed to sbatch. They will be automatically prefixed with --.
          Arguments must be in the format SLURM specifies; see the
          [documentation for `sbatch`](https://slurm.schedmd.com/sbatch.html) for a list of possible arguments

It might be worth reverting that change so that all kwargs come under engine_kwargs as before. The new docstring can look as below. I'm happy to do this if you give me access to your fork, as I appreciate it will be annoying to undo something you just did due to my mistake!

New docstring
    engine_kwargs : dict
        Parameters to be passed to the underlying engine.
        Defaults are:
        * loop : None
        * joblib : n_jobs=1, backend="loky"
        * multiprocessing : max_workers=2, mp_context=None
        * dask : client=None
        * slurm : tmp_script_folder=None
        for the `slurm` entry:
            This dictionary contains arguments to be passed to sbatch.
            They will be automatically prefixed with --.
            Arguments must be in the format SLURM specifies; see the
            [documentation for `sbatch`](https://slurm.schedmd.com/sbatch.html) for a list of possible arguments
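To make the per-engine defaults concrete, here is a sketch of how defaults from the docstring might be merged with user-supplied engine_kwargs. The `_default_engine_kwargs` dict mirrors the name mentioned earlier in this thread, but the values and the `resolve_engine_kwargs` helper are illustrative, not the actual launcher.py code:

```python
# Illustrative per-engine defaults; slurm sbatch defaults taken from the
# expected values in the test above (cpus-per-task=1, mem=1G).
_default_engine_kwargs = {
    "loop": {},
    "joblib": {"n_jobs": 1, "backend": "loky"},
    "multiprocessing": {"max_workers": 2, "mp_context": None},
    "dask": {"client": None},
    "slurm": {
        "tmp_script_folder": None,
        "sbatch_args": {"cpus-per-task": 1, "mem": "1G"},
    },
}

def resolve_engine_kwargs(engine, engine_kwargs=None):
    """Start from the engine's defaults, then overlay user kwargs."""
    kwargs = dict(_default_engine_kwargs[engine])
    kwargs.update(engine_kwargs or {})
    return kwargs

print(resolve_engine_kwargs("slurm", {"tmp_script_folder": "/tmp/scripts"}))
```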

Some other very minor things that came up when manually testing, then I think good to go!

  • Need to add pytest-mock here and here in the pyproject.toml as it was a new dependency added in the test.
  • the test is failing on Windows because some os commands used in run_sorter_jobs assume we are on Linux, so we should skip it when not running on Linux. We need to add from platform import system at the top of test_launcher.py and decorate the test with @pytest.mark.skipif(system() != 'Linux', reason="Assumes we are on Linux to run SLURM").
  • When manually testing, it was hard to figure out what was being passed to SLURM via the CLI. I think we can add a quick print before the subprocess call to make this clear, e.g. adding print(f"subprocess called with command {progr}\n") here.

Then all good from my end, sorry about this last-minute back and forth and thanks again for this super useful addition!

JoeZiminski avatar Sep 12 '24 10:09 JoeZiminski

Hi @MarinManuel thanks for this, apologies please resend the invite I will fix the tests and conflicts, thanks!

JoeZiminski avatar Sep 25 '24 17:09 JoeZiminski

Hi @MarinManuel. Thanks a lot for this improvement, and we are very sorry for the long delay. This is OK for us.

Thank you @JoeZiminski for taking care of this.

samuelgarcia avatar Jun 12 '25 08:06 samuelgarcia