ReFrame: keeping a test system busy, new execution policy?

I would like to extend ReFrame to cover the use case of generating a constant workload on a test/warm-spare system.

./reframe -C my_config.py --tag workload1 -r --exec-policy=continuous --queue-depth=10

Where the main loop would be something like:

while queue_depth < arg.queue_depth:
    start(random.choice(tests))

with the desired behavior to always maintain a queue of queue_depth pending jobs for the scheduler to evaluate.
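
The loop above could be sketched roughly as follows. This is only an illustration, runnable on its own: `FakeScheduler`, `submit`, `tick` and `keep_queue_full` are hypothetical stand-ins for a real batch scheduler, not ReFrame APIs, and the stop criterion is simply a fixed number of polling rounds.

```python
import random


class FakeScheduler:
    """Hypothetical stand-in for a batch scheduler (not a ReFrame API)."""

    def __init__(self, rng):
        self.pending = []   # jobs submitted but not yet started
        self.started = []   # jobs the scheduler has picked up
        self._rng = rng

    def submit(self, test):
        self.pending.append(test)

    def tick(self):
        # Pretend the scheduler starts between 0 and 3 pending jobs.
        n_start = min(len(self.pending), self._rng.randint(0, 3))
        for _ in range(n_start):
            self.started.append(self.pending.pop(0))


def keep_queue_full(scheduler, tests, queue_depth, rounds, rng):
    """Top the pending queue up to queue_depth on every polling round."""
    for _ in range(rounds):
        while len(scheduler.pending) < queue_depth:
            scheduler.submit(rng.choice(tests))
        scheduler.tick()
    return scheduler


rng = random.Random(0)
sched = keep_queue_full(FakeScheduler(rng), ["t1", "t2", "t3"],
                        queue_depth=10, rounds=5, rng=rng)
print(len(sched.pending), len(sched.started))
```

The outer loop is what keeps the queue topped up: the inner `while` alone would stop as soon as the queue first reaches the requested depth.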

I am willing to do the implementation myself, but would like some feedback on how this idea would be received and possibly some suggestions on how to implement it. For example, should it be a new execution policy? If so, what should it be called?

Nov 30 '18 22:11 brandongc

@brandongc Could you please explain a bit more? I am kind of getting the big picture, but I'm missing some information. What is the stop criterion for this mode of execution? Time limit, keyboard interrupt or running out of tests?

It seems that you want to test the scheduler, but what aspects do you want to test? Its throughput or how well it schedules?

Currently, the asynchronous execution policy submits up to max_jobs simultaneous jobs and tries to keep the running tasks (i.e., submitted jobs) close to this limit until it runs out of tests. At first, this sounds close to what you want to do, doesn't it?

Dec 03 '18 07:12 vkarak

Sure. I am thinking of the first two stop criteria. And the idea is not just testing the scheduler, but also the hardware and system settings (e.g. Linux kernel version/options). I am also working on trying to run some performance tools system-wide, so it would be great to have a known workload with built-in performance tests.

Some use cases I am interested in (in no particular order):

We have a testing and development system that also serves as "hot" spares for the production machines. However, this system is idle 99% of the time. We could run linpack in a loop, but it would be preferable to have an approximation of the real production workload running at all times to stress the hardware. This would likely be implemented via a systemd script which would automatically start filling the queues on system boot.

Evaluating the utility of system-wide collection of performance counters (e.g. L2 miss rate). In this case, having a known workload running for a fixed time period would be beneficial.

Measuring the impact of changes to scheduler settings (e.g. planning vs backfill stage time in Slurm) on metrics like submit vs start time distributions. This needs a longer running time, as schedulers take a while to reach a steady state of behavior with respect to job placement, etc.

And in addition to the usual use case of regression testing, running tests more than once could allow collection of some statistics on the performance numbers (e.g. did our system change increase variability?).
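
As a rough illustration of the last point, repeated runs of the same test could be summarized with the standard library alone. The `summarize` helper and the numbers below are made up for the example; nothing here comes from ReFrame itself.

```python
import statistics


def summarize(samples):
    """Return mean, sample standard deviation and coefficient of variation."""
    mean = statistics.mean(samples)
    stdev = statistics.stdev(samples)
    return {"mean": mean, "stdev": stdev, "cv": stdev / mean}


# Made-up performance numbers, purely illustrative.
before = [100.0, 101.0, 99.0, 100.0]
after = [100.0, 110.0, 90.0, 100.0]

# Same mean in both cases, but a larger spread after the change.
print(summarize(before))
print(summarize(after))
```

Comparing the coefficient of variation before and after a system change is one simple way to answer the "did variability increase?" question; other measures would work just as well.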

Dec 03 '18 17:12 brandongc

I see what you are trying to do and it makes sense. We have also been thinking of similar functionality, but it's not high on our priority list, since we have Jenkins for launching ReFrame repeatedly.

From the implementation point of view, what you are looking for is another type of Runner rather than an ExecutionPolicy. The Runner essentially implements the triply nested loop over the regression tests, system partitions and programming environments. The execution policies are responsible for running the different steps of the regression test pipeline and for doing all the necessary bookkeeping. I think the async policy fits exactly what you need: have ReFrame spawn jobs asynchronously. All you need to do is feed it an infinite sequence of tests; that's why I suggested another Runner. In fact, even the current Runner could do the job (partially), if you were calling it here as follows:

                runner.runall(itertools.cycle(checks_matched))

This would make ReFrame run indefinitely until you type Ctrl-C! The only thing remaining is to implement a time limit for ReFrame's execution that would cause it to exit.
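
One possible way to add such a time limit is to wrap `itertools.cycle` in a generator that stops yielding once a deadline has passed. This is a sketch only: `cycle_until` and its `clock` parameter are hypothetical names, and it assumes the runner accepts any iterable of tests, as the `runall` call above suggests.

```python
import itertools
import time


def cycle_until(items, time_limit, clock=time.monotonic):
    """Yield items round-robin until time_limit seconds have elapsed."""
    deadline = clock() + time_limit
    for item in itertools.cycle(items):
        if clock() >= deadline:
            return
        yield item


# Example: hand out test names for roughly 0.05 seconds.
names = list(cycle_until(["check_a", "check_b"], 0.05))
print(len(names))
```

The call would then look like `runner.runall(cycle_until(checks_matched, 3600))`. Note that this only stops feeding new tests; anything already submitted would still have to finish or be cancelled separately.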

Dec 03 '18 18:12 vkarak

There is a small complication with what I've suggested: how to deal with the stage and output directories. If you are using the async policy, just cycling over the tests will eventually make them step on each other. Nonetheless, I believe that this is the direction you need to look into for the implementation.
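
One conceivable way around the directory clash is to tag every submission with a running counter so that no two instances of the same test share a directory. The `numbered_runs` helper and the `run-NNNNNN` naming scheme below are assumptions made for illustration, not how ReFrame lays out its stage or output directories.

```python
import itertools
import os


def numbered_runs(tests, stage_root):
    """Yield (test, stage_dir) pairs with a distinct directory per submission."""
    for run_id, test in enumerate(itertools.cycle(tests)):
        # The counter keeps repeated runs of the same test apart.
        yield test, os.path.join(stage_root, test, f"run-{run_id:06d}")


for test, stage_dir in itertools.islice(numbered_runs(["a", "b"], "stage"), 4):
    print(test, stage_dir)
```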

Dec 03 '18 19:12 vkarak

OK thanks for the great feedback and suggestions.

Dec 03 '18 19:12 brandongc

Hey @brandongc, did you have any progress with this?

Feb 04 '19 15:02 vkarak

Not yet, unfortunately. With a lot of higher-priority work and the holidays, I have not worked on this. It is still in my queue though.

Feb 14 '19 16:02 brandongc

Sure! I was just checking in, cos it's an interesting use case.

Feb 14 '19 16:02 vkarak

Bumping up (again) this issue's priority as it's something that we need, too.

Mar 16 '23 13:03 vkarak