openeo-python-client Hv issue719 job manager threaded job start

Feb 19 '25 15:02 HansVRP

@soxofaan maybe good to rediscuss from this point (local unit tests are passing)

Feb 19 '25 15:02 HansVRP

by the way, I don't think you had to create a new PR, you just could have continued on https://github.com/Open-EO/openeo-python-client/pull/730 by pushing to its feature branch (we now have three open PRs for this feature I think, which gets a bit messy)

Feb 20 '25 11:02 soxofaan

worked on your initial comments @soxofaan

currently checking why some unit tests are not passing

Feb 21 '25 14:02 HansVRP

@soxofaan I do not think the failing unit tests in 3.11 are due to my changes?

indeed, that problem has been fixed on master by now

Feb 24 '25 16:02 soxofaan

One issue I still see and want to resolve is that launching jobs (to queue them for start) is still tied to the backend load.

This means that when running at full capacity, you will not already 'precreate' jobs that can start running instantly.

Feb 25 '25 12:02 HansVRP

when running at full capacity you will not already 'precreate jobs' which can instantly start running

I'm not sure that pre-creating jobs, instead of creating them on the fly like we currently do, will make that big of a difference, as the time to create a job is usually negligible compared to the time required to start a job. Or do you experience different timings?

Feb 25 '25 15:02 soxofaan

@soxofaan adjusted the unit tests and resolved the small bugs.

I also proposed to split off the _JobManagerWorkerThread, so that in the future it can also contain alternative threads for downloading, jobworkerthreadpools, ...

Feb 26 '25 11:02 HansVRP

@soxofaan as discussed, I removed the queues from the design; it now uses the thread pool's internal queues.
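For readers following along, a minimal sketch of what relying on the thread pool's internal queue can look like. It only uses the standard library's concurrent.futures.ThreadPoolExecutor; start_job, start_jobs_threaded, the job ids and the "queued" status are hypothetical stand-ins, not the actual job manager code from this PR.

```python
import time
from concurrent.futures import ThreadPoolExecutor, as_completed


def start_job(job_id: str) -> tuple[str, str]:
    """Hypothetical stand-in for the slow 'start job' request to the backend."""
    time.sleep(0.05)  # simulate network latency
    return job_id, "queued"


def start_jobs_threaded(job_ids: list[str], max_workers: int = 4) -> dict[str, str]:
    """Submit all start requests to the pool and collect the resulting statuses."""
    statuses: dict[str, str] = {}
    # The executor keeps pending tasks in its own internal queue,
    # so the caller does not need an explicit queue.Queue.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [pool.submit(start_job, job_id) for job_id in job_ids]
        for future in as_completed(futures):
            job_id, status = future.result()
            statuses[job_id] = status
    return statuses
```

Calling start_jobs_threaded(["job-1", "job-2", "job-3"]) returns a dict mapping each id to "queued"; the main thread only blocks while collecting results, not while each individual request is in flight.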

Mar 14 '25 15:03 HansVRP

@soxofaan not sure why these tests are failing, but I think the code is at a good point to re-evaluate.

Uncertain whether we need to post-process after shutting down the worker pool, given the

while sum(job_db.count_by_status(statuses=["not_started", "created", "queued", "queued_for_start", "running"]).values()) > 0

loop.

As long as there are jobs in the not_started or queued_for_start states (the states the post-processing would touch), we remain in the loop.

Mar 20 '25 10:03 HansVRP

Uncertain whether we need to post process after shutting down the worker pool, given the

good point, however, that probably only works out now with doing the start in a side thread. Once we add threaded result downloading or other features, that while(sum(...)) is not going to guarantee that all the (side) work is done yet.
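To illustrate the concern, a rough sketch under stated assumptions: InMemoryJobDB, download_results and run are hypothetical stand-ins (only the count_by_status name and the status list come from the discussion above). The status-based loop can exit while side tasks are still in flight, so the outcomes are only collected after the pool has shut down.

```python
import time
from concurrent.futures import Future, ThreadPoolExecutor

ACTIVE = ["not_started", "created", "queued", "queued_for_start", "running"]


class InMemoryJobDB:
    """Hypothetical stand-in for a job database; only mimics count_by_status."""

    def __init__(self, statuses: dict[str, str]):
        self.statuses = statuses

    def count_by_status(self, statuses: list[str]) -> dict[str, int]:
        counts: dict[str, int] = {}
        for status in self.statuses.values():
            if status in statuses:
                counts[status] = counts.get(status, 0) + 1
        return counts


def download_results(job_id: str) -> str:
    """Hypothetical slow side task that outlives the job's 'finished' status."""
    time.sleep(0.1)
    return f"{job_id}: downloaded"


def run(job_db: InMemoryJobDB) -> list[str]:
    pending: list[Future] = []
    with ThreadPoolExecutor(max_workers=2) as pool:
        while sum(job_db.count_by_status(statuses=ACTIVE).values()) > 0:
            for job_id, status in job_db.statuses.items():
                if status in ACTIVE:
                    # Pretend the backend finished the job, then hand the
                    # download to a side thread without waiting for it.
                    job_db.statuses[job_id] = "finished"
                    pending.append(pool.submit(download_results, job_id))
        # The loop has exited, yet downloads may still be in flight here.
    # Leaving the "with" block waits for the pool to shut down; only now is
    # it safe to post-process the outcome of every side task.
    return [future.result() for future in pending]
```

In this sketch the loop ends after a single pass because every job is marked "finished" immediately, while the downloads need another 0.1 seconds each. That gap is why a final processing step after the pool shutdown is still useful.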

Mar 21 '25 14:03 soxofaan

@soxofaan ready for review

Apr 18 '25 15:04 HansVRP

Ran a small stress test for 30 short lived jobs (10 parallel jobs).

Total time (standard): 2728.00 seconds
Total time (threaded): 2147.00 seconds
Total time gain: 581.00 seconds (21.30% faster)

So the time between creating a job and running it became about 20% shorter. This does need to be put in perspective: these gains are small compared to the actual duration of entire openEO jobs...
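For reference, the reported gain follows directly from the two totals. A small sketch of the arithmetic (the helper name relative_gain is just for illustration):

```python
def relative_gain(standard_s: float, threaded_s: float) -> tuple[float, float]:
    """Return the absolute gain in seconds and the gain relative to the standard run in percent."""
    gain = standard_s - threaded_s
    return gain, 100.0 * gain / standard_s


gain_s, gain_pct = relative_gain(2728.00, 2147.00)
print(f"Total time gain: {gain_s:.2f} seconds ({gain_pct:.2f}% faster)")
# prints: Total time gain: 581.00 seconds (21.30% faster)
```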

Apr 18 '25 18:04 HansVRP

FYI: I merged master in your feature branch hv_issue719-job-manager-threaded-job-start to trigger a new build

May 22 '25 15:05 soxofaan

(just pushed a merge of master to resolve conflict and trigger a test rerun)

May 26 '25 09:05 soxofaan

ok tests pass again :partying_face:

some remaining todos

[x] changelog entry covering both the threaded aspect, as well as the new API JobDatabaseInterface.get_by_indices for job db implementers to implement
[x] double check the lost sleep in the dummy task (see above)

Aug 20 '25 10:08 soxofaan

Awesome; shall I pick up here for the lost sleep dummy task?

Aug 20 '25 12:08 HansVRP

I was already looking into DummyTask anyway, because my IDE (pycharm) flags some issues about writing to read-only attributes

Aug 20 '25 12:08 soxofaan

FYI: as this is a very long-running PR with a messy history, I tried to rebase the feature branch of this PR to clean it up a bit (e.g. commit squashing) to improve the signal-to-noise ratio and created PR #806. The diff is practically identical to #736 (except for one empty line).

The discussion can stay here (#736), but #806 will be merged in the end

Sep 09 '25 09:09 soxofaan

ok, I decided to merge this (through PR #806) in 41710047e0f02ba546de14390f1d390b0e2f6a5b :partying_face:

Sep 09 '25 14:09 soxofaan