Hv issue719 job manager threaded job start
@soxofaan maybe good to rediscuss from this point (local unit tests are passing)
by the way, I don't think you had to create a new PR, you just could have continued on https://github.com/Open-EO/openeo-python-client/pull/730 by pushing to its feature branch (we now have three open PRs for this feature I think, which gets a bit messy)
worked on your initial comments @soxofaan
currently checking why some unit tests are not passing
@soxofaan I do not think the failing unit tests in 3.11 are due to my changes?
indeed, that problem has been fixed on master by now
One issue I still see and want to resolve is that launching jobs (to queue them for start) is currently still tied to the backend load.
This means that when running at full capacity you will not already 'precreate' jobs which can instantly start running.
> when running at full capacity you will not already 'precreate' jobs which can instantly start running
I'm not sure that pre-creating jobs, instead of creating them on the fly like we currently do, will make that big of a difference: the time to create a job is usually negligible compared to the time required to start a job. Or do you experience different timings?
@soxofaan adjusted the unit tests and resolved the small bugs.
I also proposed to split off the `_JobManagerWorkerThread`,
to allow it in the future to also contain alternative threads for downloading, job worker thread pools, ...
@soxofaan as discussed, removed the queues from the design; we now use the thread pool's internal work queue.
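As a rough sketch of what "using the internal threadpool queue" can look like (hypothetical names, not the actual job manager API): work submitted to a `ThreadPoolExecutor` is buffered in its own internal work queue, so no explicit `queue.Queue` needs to be managed by the job manager.

```python
# Hedged sketch: start_job is a placeholder, not the real backend call.
from concurrent.futures import ThreadPoolExecutor


def start_job(job_id: str) -> str:
    # Placeholder for the real "start this batch job" request.
    return f"started {job_id}"


with ThreadPoolExecutor(max_workers=2) as pool:
    # Submitting more tasks than workers is fine: the executor's internal
    # work queue buffers pending tasks until a worker thread is free.
    futures = [pool.submit(start_job, job_id) for job_id in ["job-1", "job-2", "job-3"]]
    results = [f.result() for f in futures]
```

The design benefit is that queue handling (buffering, thread-safe hand-off, draining on shutdown) comes for free with the executor instead of being reimplemented in the job manager.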
@soxofaan not sure why these tests are failing, but I think the code is at a good point to reevaluate.
Uncertain whether we need to post-process after shutting down the worker pool, given the

```python
while sum(job_db.count_by_status(statuses=["not_started", "created", "queued", "queued_for_start", "running"]).values()):
```

loop. As long as we have "not started" or "queued for start" states (which the post-processing would touch), we remain in the loop.
> Uncertain whether we need to post-process after shutting down the worker pool, given the `while(sum(...))` loop.
Good point. However, that probably only works out now because only the job start happens in a side thread. Once we add threaded result downloading or other features, that `while(sum(...))` loop is not going to guarantee that all the (side) work is done yet.
@soxofaan ready for review
Ran a small stress test with 30 short-lived jobs (10 parallel jobs):

- Total time (standard): 2728.00 seconds
- Total time (threaded): 2147.00 seconds
- Total time gain: 581.00 seconds (21.30% faster)
So the time between creating a job and running a job became about 20% shorter. This does need to be put in perspective: these gains are small compared to the actual duration of entire openEO jobs...
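For reference, the reported percentage follows directly from the raw numbers:

```python
# Sanity check of the stress-test figures quoted above.
standard = 2728.00  # seconds, 30 jobs without threaded start
threaded = 2147.00  # seconds, 30 jobs with threaded start

gain = standard - threaded
speedup_pct = gain / standard * 100
```

Here `gain` comes out to 581.00 seconds and `speedup_pct` to roughly 21.3, matching the reported numbers.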
FYI: I merged master in your feature branch hv_issue719-job-manager-threaded-job-start to trigger a new build
(just pushed a merge of master to resolve conflict and trigger a test rerun)
ok tests pass again :partying_face:
some remaining todos:

- [x] changelog entry covering both the threaded aspect, as well as the new API `JobDatabaseInterface.get_by_indices` for job db implementers to implement
- [x] double check the lost sleep in the dummy task (see higher)
Awesome; shall I pick up here for the lost sleep dummy task?
I was already looking into DummyTask anyway, because my IDE (pycharm) flags some issues about writing to read-only attributes
FYI: as this is a very long running PR with messy history, I tried to rebase the feature branch of this PR to clean it up a bit (e.g. commit squashing) to improve the signal-noise ratio and created PR #806 . The diff is practically identical to #736 (except for one empty line).
The discussion can stay here (#736), but #806 will be merged in the end
ok, I decided to merge this (through PR #806) in 41710047e0f02ba546de14390f1d390b0e2f6a5b :partying_face: