
feat: Add RemoteSLURMCluster & RemoteSLURMJob #502

Open alexandervaneck opened this issue 4 years ago • 14 comments

What does this PR change?

It adds support for RemoteSLURMCluster which spins up dask-workers through the SLURM REST API instead of through subprocess.Popen.
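Roughly, usage looks like the minimal sketch below; the standard keyword arguments are inherited from SLURMCluster, while the REST-specific ones shown here (`api_url`) are illustrative placeholders rather than the final signature.

```python
from dask.distributed import Client
from dask_jobqueue import RemoteSLURMCluster  # added by this PR

# Hypothetical REST endpoint; the real constructor arguments may differ.
cluster = RemoteSLURMCluster(
    api_url="http://slurmrestd.example.com:6820",
    queue="regular",
    cores=4,
    memory="16GB",
)
cluster.scale(jobs=2)  # jobs are submitted via the REST API, not subprocess.Popen
client = Client(cluster)
```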

Why is this useful?

In environments where the Python process doesn't have direct access to the SLURM CLI (e.g. a Docker container), we still want to be able to use dask-jobqueue to interface with SLURM.

Open questions

  1. I've been able to write a minimal end-to-end test. The implementation largely inherits from SLURMCluster, which is sufficiently tested. Where could we invest more in tests?
  2. I've not been able to test this against a working SLURM REST API cluster, since I am still waiting on my colleagues to activate it. I'm opening this PR early to get initial reviews.

Thank you for allowing me to contribute 🙇‍♂️ @jacobtomlinson @guillaumeeb

Resolve #502

alexandervaneck avatar Jul 05 '21 13:07 alexandervaneck

TODO: Add the SLURM REST API to ci/slurm and remove the mocking from test_end_to_end. We can use the SLURM cluster instead.

(cc @jacobtomlinson I think this is a better idea)

alexandervaneck avatar Jul 05 '21 16:07 alexandervaneck

Given that you've already written the mocked test there's probably little harm in leaving it in, but adding a CI test would be good.

(Also don't worry about squashing and force pushing, we will squash on merge. It actually makes it harder to review because we can't easily see what has changed between reviews)

jacobtomlinson avatar Jul 06 '21 09:07 jacobtomlinson

> Given that you've already written the mocked test there's probably little harm in leaving it in, but adding a CI test would be good.
>
> (Also don't worry about squashing and force pushing, we will squash on merge. It actually makes it harder to review because we can't easily see what has changed between reviews)

Unfortunately, I have to remove this test because the aresponses dependency cannot be installed with conda. In any case, I'm currently working on making the tests work with the ci/slurm Docker environment. That should provide plenty of test coverage for us to accept this PR in good conscience.

(I'll stop force pushing :D)

alexandervaneck avatar Jul 06 '21 09:07 alexandervaneck

@lesteve Would you have any tips on how I can update the ci/slurm integration test Docker setup and have it be used during the SLURM integration tests?

For this PR to pass we would need:

  1. Update the SLURM version to 20.11 (the version in which the REST API was released).
  2. Add slurmrestd somehow to have an endpoint we can send requests to in the tests.

Update SLURM

I pushed an image to Docker Hub and added it to the FROM line in ci/slurm/Dockerfile.

slurmrestd

Perhaps you know, off the top of your head, where I should look for configuration examples for this? Any direction you'd prefer would be appreciated.

I'll make time to look at this today 👍

Alternatively, I could update this PR with unit tests only (with mocking) and we could tackle the integration tests in another PR (since the ci.yml action would need to build the newer SLURM Docker image first).

alexandervaneck avatar Jul 07 '21 08:07 alexandervaneck

Please stop force pushing 😂. I can't see what changes you are making.


jacobtomlinson avatar Jul 08 '21 13:07 jacobtomlinson

> Please stop force pushing 😂. I can't see what changes you are making.


I'll try my absolute best. 😬 It's part of my normal workflow to rebase/retouch commits to make them look nice and to the point.

alexandervaneck avatar Jul 08 '21 13:07 alexandervaneck

> Update the SLURM version to 20.11 (the version in which the REST API was released).

As you might see, we were using images from https://github.com/giovtorres/slurm-docker-cluster, but it has not been updated on Docker Hub for two years. The simplest thing to do here would probably be to copy-paste the Dockerfile from this project, update the Slurm version, and build and push the image through our CI to the daskdev Docker user. It could then be used for running the Python tests.

> Add slurmrestd somehow to have an endpoint we can send requests to in the tests.

I've never configured a Slurm cluster, and I'm not sure any maintainer here has... So you'll have to search or experiment, since https://slurm.schedmd.com/rest.html does not seem to give all the needed information... In the end this is probably a few lines of configuration, and one more daemon to launch in the docker-compose setup.

> Alternatively, I could update this PR with unit tests only (with mocking) and we could tackle the integration tests in another PR (since the ci.yml action would need to build the newer SLURM Docker image first).

This is an alternative, as updating the CI here might be a bit complicated, and we've already done it. But it is a really important thing to do.

guillaumeeb avatar Jul 12 '21 19:07 guillaumeeb

@jacobtomlinson I got it to work with Unix socket authentication. Slurm does not accept unauthenticated requests. Let me know what you think of this implementation.

Something inside Slurm makes it impossible to share the Unix socket via the docker-compose file (it fails with a permission error), so we have to run those tests directly in the slurmrestd container.
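For reference, a minimal sketch of what talking to slurmrestd over its Unix socket with aiohttp looks like; the socket path and the v0.0.36 OpenAPI prefix (the one shipped with Slurm 20.11) are assumptions about the local setup.

```python
import asyncio
import aiohttp

SOCKET_PATH = "/var/run/slurmrestd/slurmrestd.sock"  # hypothetical socket location

async def ping_slurmrestd():
    # With the rest_auth/local plugin, slurmrestd authenticates the caller from
    # the Unix socket peer credentials, so no token is needed here.
    connector = aiohttp.UnixConnector(path=SOCKET_PATH)
    async with aiohttp.ClientSession(connector=connector) as session:
        # The host part of the URL is ignored for Unix sockets but must be present.
        async with session.get("http://localhost/slurm/v0.0.36/ping") as resp:
            print(resp.status, await resp.json())

asyncio.run(ping_slurmrestd())
```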

JWT authentication does not seem to work yet; tomorrow I'll investigate whether the latest version, 20-11-8-1, can make it work. 20-11-4-1 (the current version in the Dockerfile) seems to be broken.

@guillaumeeb thank you for your insights.

> The simplest thing to do here would probably be to copy-paste the Dockerfile from this project, update the Slurm version, and build and push the image through our CI to the daskdev Docker user. It could then be used for running the Python tests.

I had the same thought and made a new SLURM Docker image definition for us. :)

> In the end this is probably a few lines of configuration, and one more daemon to launch in the docker-compose setup.

You were right 👍 However, I must say that the Slurm docs leave a lot to be desired. I ended up reading their code to make sense of the configuration/requests.

alexandervaneck avatar Jul 12 '21 23:07 alexandervaneck

I've also managed to add the JWT HTTP configuration; however, the integration tests fail. Given that the posted payload is correct, this leads me to believe that slurmrestd doesn't actually have working JWT support yet.

The Slurm NEWS file mentions a couple of bug fixes in versions later than 20-11-4-1 (the version currently installable on Ubuntu): https://github.com/SchedMD/slurm/blob/master/NEWS. So I'll have to investigate whether something is wrong with the token generation/authentication or whether this is a bug in Slurm.
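For context, JWT-based access to slurmrestd boils down to roughly the sketch below; the X-SLURM-USER-NAME/X-SLURM-USER-TOKEN headers and the `scontrol token` command come from the Slurm REST docs, while the endpoint version prefix and the payload layout are assumptions that may differ between Slurm releases.

```python
import aiohttp

async def submit_job(base_url, user, token, script):
    # `token` is a JWT, typically obtained on the cluster with `scontrol token`.
    headers = {
        "X-SLURM-USER-NAME": user,
        "X-SLURM-USER-TOKEN": token,
    }
    # Illustrative payload shape; the real one in this PR may look different.
    payload = {
        "script": script,
        "job": {"name": "dask-worker", "current_working_directory": "/tmp"},
    }
    async with aiohttp.ClientSession(headers=headers) as session:
        async with session.post(
            f"{base_url}/slurm/v0.0.36/job/submit", json=payload
        ) as resp:
            return resp.status, await resp.json()
```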

alexandervaneck avatar Jul 13 '21 12:07 alexandervaneck

> I've also managed to add the JWT HTTP configuration; however, the integration tests fail. Given that the posted payload is correct, this leads me to believe that slurmrestd doesn't actually have working JWT support yet.
>
> The Slurm NEWS file mentions a couple of bug fixes in versions later than 20-11-4-1 (the version currently installable on Ubuntu): https://github.com/SchedMD/slurm/blob/master/NEWS. So I'll have to investigate whether something is wrong with the token generation/authentication or whether this is a bug in Slurm.

I figured out JWT authentication; now I still need to figure out why we're seeing:

Unclosed connector
connections: ['[(<aiohttp.client_proto.ResponseHandler object at 0x7f39d33c7048>, 1059921.899742882)]']
connector: <aiohttp.connector.UnixConnector object at 0x7f3a0578d7f0>
Task was destroyed but it is pending!
task: <Task pending coro=<AdaptiveCore.adapt() done, defined at /opt/anaconda/lib/python3.6/site-packages/distributed/deploy/adaptive_core.py:179> wait_for=<Future pending cb=[<TaskWakeupMethWrapper object at 0x7f39d34bc5e8>()]> cb=[IOLoop.add_future.<locals>.<lambda>() at /opt/anaconda/lib/python3.6/site-packages/tornado/ioloop.py:688]>

🤔

EDIT:

Turns out

Task was destroyed but it is pending!
task: <Task pending coro=<AdaptiveCore.adapt() done, defined at /opt/anaconda/lib/python3.6/site-packages/distributed/deploy/adaptive_core.py:179> wait_for=<Future pending cb=[<TaskWakeupMethWrapper object at 0x7fd92fdbf768>()]> cb=[IOLoop.add_future.<locals>.<lambda>() at /opt/anaconda/lib/python3.6/site-packages/tornado/ioloop.py:688]>

is always raised, even when using SLURMCluster directly. @jacobtomlinson @lesteve this doesn't seem like something I have to look into further, right?

alexandervaneck avatar Jul 13 '21 13:07 alexandervaneck

@jacobtomlinson @lesteve this is now ready for a proper review; all the features explained in the issue are in. Please let me know your thoughts 🙇 they help immensely.

alexandervaneck avatar Jul 13 '21 15:07 alexandervaneck

I hit a bit of a snag working with aiohttp. Because SLURM responses are sometimes (most of the time?) not parsable, the http-parser library fails when reading the response. Unfortunately, I have not been able to work out why these responses cannot be read.

In any case, setting AIOHTTP_NO_EXTENSIONS, as suggested in https://github.com/aio-libs/aiohttp/issues/1843, fixes this. However, I have no idea how we should control for this in _submit_job. 🤔
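For illustration, the workaround amounts to something like the sketch below; since aiohttp only checks AIOHTTP_NO_EXTENSIONS when it is first imported, setting it inside _submit_job would likely be too late, so a module-level switch seems to be the only option.

```python
import os

# Force aiohttp to use its pure-Python HTTP parser instead of the C extension.
# This must happen before aiohttp is imported anywhere in the process.
os.environ.setdefault("AIOHTTP_NO_EXTENSIONS", "1")

import aiohttp  # noqa: E402  (intentionally imported after setting the env var)
```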

alexandervaneck avatar Jul 29 '21 09:07 alexandervaneck

@AlexanderVanEck I see you pushed some commits about a month ago. Do you still feel this PR is ready for review?

guillaumeeb avatar Aug 31 '21 14:08 guillaumeeb