reframe icon indicating copy to clipboard operation
reframe copied to clipboard

[feat] Add support for the Flux scheduler

Open vsoch opened this issue 3 years ago • 20 comments

This will add a basic scheduler that can interact with Flux: https://github.com/flux-framework/flux-core

Today is the first day I've ever used (and developed for) reframe, so I will need to ask for guidance on the best way to run tests, as either we will need a container base with flux or to mock something. The way I was able to develop and test is by adding a tutorial that I ran in a container! I also suspect I'll need to add proper documentation, if you could point me to the best places to do that. Thank you!

  • [ ] Discuss testing strategy
  • [x] Documentation updates?

Signed-off-by: vsoch [email protected]

vsoch avatar Sep 16 '22 20:09 vsoch

Can I test this patch?

jenkins-cscs avatar Sep 16 '22 21:09 jenkins-cscs

Maybe @jenkins-cscs ... maybe! I just need an education for how to best go about it.

vsoch avatar Sep 16 '22 21:09 vsoch

Hi @vsoch and thanks for you contribution! And also for taking the time to provide a full tutorial as well! Regarding testing, I think that testing it through a container triggered by a Github action is totally fine. We do the same for the various modules systems backends. As for the scheduler backends, we don't use Github actions yet, because simply we didn't have a containerised workflow for them and, therefore, we're relying on the @jenkins-cscs buddy. I know that @teojgo has been experimenting with testing a Slurm installation for CI/CD with containers, so ideally, we want to move that to GH actions, too.

As for the docs, they are in RST format (sorry for that, I see your tutorial –of course, logically– is in Markdown). We will need a simple entry in the list of available scheduler backends with the link to the Flux docs.

As for your tutorial, if you could possibly convert it to RST, that'd be great. Then you can add an entry here and it will be picked up, built and listed in the tutorials page.

vkarak avatar Sep 17 '22 16:09 vkarak

Will get these fixes in soon! And rst is totally good - I use it for a lot of my docs as well.

vsoch avatar Sep 17 '22 17:09 vsoch

ok! This took a hot minute, but the new workflow should be ready to test here!

vsoch avatar Sep 17 '22 18:09 vsoch

Okay changes are in!

You could expand the unit tests in test_schedules.py and run the unit tests inside the container with your settings file using the --rfm-user-config and possibly the --rfm-system options of ./test_reframe.py. I could also give it a try this week.

And yes I'd be happy to do that - are there a standard set that you like to run for reframe, or should I ping the Flux core devs to see what they would use?

vsoch avatar Sep 19 '22 20:09 vsoch

And yes I'd be happy to do that - are there a standard set that you like to run for reframe, or should I ping the Flux core devs to see what they would use?

No, these are unit tests for reframe to test the various scheduler backends, practically, that the API is implemented correctly. They are in the unittests/test_schedulers.py. Apart from the Slurm-specific ones (check the slurm_only fixture), there are some that are meant to be valid for all schedulers. These are the test_prepare* unit tests which exercise the emit_preamble() for the various schedulers (that should be trivial for flux to implement), the test_submit*, the test_cancel* and a couple more that check some corner cases. For the test_prepare* family of tests you essentially need to provide an implementation of the directives that you expect in the form of _expected_flux_directives(job) and _expected_flux_directives_minimal(job). These exercise that the backends emit the right preamble if they get a minimally defined job or one with more details. For the rest, I would simply add flux in the parameterized scheduler fixture and wait to see the failures and try to fix them :-) The tests that are meant to run only for a specific backend will be skipped, so you don't need to worry about them. But you're not done yet! If you try to run the unit tests right away, only those that can be executed locally will run (the test_prepare* practically). You need to pass a custom config that defines a partition with a flux scheduler and then run the unit tests with ./test_reframe.py --rfm-user-config=flux-config.py. Inside you container, you could also run reframe's unit tests this way before your tutorial example. I hope it's less complicated than it sounds, but l am happy to help!

vkarak avatar Sep 20 '22 18:09 vkarak

Ok, I can take a shot hopefully soon! I'll ping you if I have any questions.

vsoch avatar Sep 20 '22 18:09 vsoch

@vkarak I took a first shot! I got the tests to run locally so (fingers crossed) the container in the CI environment works too!

https://github.com/reframe-hpc/reframe/pull/2598/commits/5109b8ec6ca732e8d293d64865e43269187f0664

I'm glad we did this - I was totally missing functionality to check for the max wait time and to cancel.

vsoch avatar Sep 21 '22 00:09 vsoch

And it looks like we need to disable running flux tests for the normal CI - how would you like to try doing that? Just a figure that is based on being able to import flux? Also I forgot to run the tests with flux start in the container - doh!

vsoch avatar Sep 21 '22 19:09 vsoch

And it looks like we need to disable running flux tests for the normal CI - how would you like to try doing that? Just a figure that is based on being able to import flux?

I've seen that and I'm trying to figure out what's the way to do that. It's the first scheduler backend that uses Python bindings of some sort. One option is to skip the test if you get a JobSchedulerError in the make_job fixture. Another could be to avoid registering the backend if it's not there.

vkarak avatar Sep 21 '22 20:09 vkarak

OK just made a tweak that (fingers crossed) should result in green! python isn't in the container (but python3 is).

vsoch avatar Sep 22 '22 09:09 vsoch

Codecov Report

Base: 86.69% // Head: 86.18% // Decreases project coverage by -0.50% :warning:

Coverage data is based on head (621e88b) compared to base (9f0d167). Patch coverage: 36.28% of modified lines in pull request are covered.

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #2598      +/-   ##
==========================================
- Coverage   86.69%   86.18%   -0.51%     
==========================================
  Files          59       60       +1     
  Lines       10868    10975     +107     
==========================================
+ Hits         9422     9459      +37     
- Misses       1446     1516      +70     
Impacted Files Coverage Δ
reframe/core/schedulers/flux.py 31.63% <31.63%> (ø)
reframe/core/systems.py 89.95% <50.00%> (-1.40%) :arrow_down:
reframe/core/backends.py 93.93% <85.71%> (+3.93%) :arrow_up:

Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here.

:umbrella: View full report at Codecov.
:loudspeaker: Do you have feedback about the report comment? Let us know in this issue.

codecov-commenter avatar Sep 22 '22 20:09 codecov-commenter

Ah so close!! Linting failed!! I added assert flux so the variable is used. :crossed_fingers:

vsoch avatar Sep 22 '22 20:09 vsoch

I might need some help thinking through this one: https://github.com/reframe-hpc/reframe/actions/runs/3104434148/jobs/5037509247#step:6:449 as it works for me locally (so I cannot reproduce to debug!)

vsoch avatar Sep 22 '22 20:09 vsoch

I might need some help thinking through this one: https://github.com/reframe-hpc/reframe/actions/runs/3104434148/jobs/5037509247#step:6:449 as it works for me locally (so I cannot reproduce to debug!)

You were faster than me! I was about to open a PR to your branch, but you just rebased your branch and invalidated mine :-P I will rebase mine and open it now. It improves how scheduler (and generally) backends are registered, so that you get nicer messages when you try to use flux and the bindings are not there:

reframe: failed to initialize runtime: failed to initialize partition 'tresa:default': could not register scheduler backend: no flux Python bindings found

And in the unit tests:

unittests/test_schedulers.py::test_prepare[flux] SKIPPED (could not register scheduler backend: no flux Python bindings found)                                                                           [  0%]

vkarak avatar Sep 22 '22 21:09 vkarak

Oops sorry about that! I'll wait for your PR before I take another shot!

vsoch avatar Sep 22 '22 21:09 vsoch

My PR is done now!

vkarak avatar Sep 22 '22 21:09 vkarak

OK - my hypothesis is that the bug was a result of adding another miniconda environment, and installing pytest to it, so then even though we use flux start, the pytest we use is from that environment and somehow can't see flux. We will find out soon!

vsoch avatar Sep 22 '22 21:09 vsoch

okay I'm going to look closer at the test_cancel_with_grace test- there must be something different on my host than in the CI.

vsoch avatar Sep 22 '22 21:09 vsoch

ok - two test failures that I still can't reproduce locally. Any things I can try?

vsoch avatar Sep 25 '22 15:09 vsoch

I could try increasing the sleep again in case flux needs more time to cancel given the CI worker (maybe it's slower?)

vsoch avatar Sep 25 '22 16:09 vsoch

okay I have an idea! Flux also has a kill functionality for a job - the cancel might be too soft here.

vsoch avatar Sep 25 '22 16:09 vsoch

okay just pushed a change so the job is killed and rebased with master. If that doesn't work I think we need to look more closely at exactly what PID is expected to be killed (vs. what Flux is doing).

vsoch avatar Sep 25 '22 17:09 vsoch

@vsoch The problem with the unit test failure is not Flux's problem. It's something wrong with the local scheduler implementation; I will have a look at it and try to fix it. This test is a bit flaky and I also cannot reproduce it locally with your container 😬

vkarak avatar Sep 25 '22 18:09 vkarak

okay sounds good! I'll be on my computer most of the day so please ping me if I can be of help. Really excited to get this working!

vsoch avatar Sep 25 '22 18:09 vsoch

Still no luck, but I just noticed that you run the unit tests and reframe etc. in the container using flux, like flux start .... Could you try removing that and see if Github actions become happy again? (PS: I could not reproduce it locally even with flux start...)

Also the intention of that unit test is to check that the local scheduler kills any deeply spawned process. Practically, the local scheduler spawns its job as a new session, so that later when it kills it, all the processes in the session get killed too.

vkarak avatar Sep 25 '22 19:09 vkarak

Without flux start we won't have a flux instance running - so I don't think the flux tests would pass. Using flux start (with the command) is the typical / suggested use case, and the other option would be interactive, e.g.,:

$ flux start --test-size=4
# the above is akin to starting an interactive shell you can exit from
# interact with flux here

if we start a flux instance before (via an interactive shell) that wouldn't work in the CI. If you just want to do a sanity check to remove flux start I can do it, but I don't think it's going to work.

vsoch avatar Sep 25 '22 20:09 vsoch

okay - just pushed a skip condition that will skip the failing tests when running with flux, and run them after. This should serve as a sanity check to see if the local tests are failing because of flux start. If it's not that, my next guess is something about running in the container.

vsoch avatar Sep 25 '22 20:09 vsoch

if we start a flux instance before (via an interactive shell) that wouldn't work in the CI. If you just want to do a sanity check to remove flux start I can do it, but I don't think it's going to work.

Hmm, until I spotted this I was running the unit tests in the container as usual without flux start and they seemed to work:

unittests/test_schedulers.py::test_submit[flux] PASSED
unittests/test_schedulers.py::test_submit_max_pending_time[flux] PASSED

vkarak avatar Sep 25 '22 20:09 vkarak