Hv issue719 job manager threaded job start

Open HansVRP opened this issue 11 months ago • 12 comments

HansVRP avatar Feb 19 '25 15:02 HansVRP

@soxofaan maybe good to rediscuss from this point (local unit tests are passing)

HansVRP avatar Feb 19 '25 15:02 HansVRP

by the way, I don't think you had to create a new PR, you just could have continued on https://github.com/Open-EO/openeo-python-client/pull/730 by pushing to its feature branch (we now have three open PRs for this feature I think, which gets a bit messy)

soxofaan avatar Feb 20 '25 11:02 soxofaan

worked on your initial comments @soxofaan

currently checking why some unit tests are not passing

HansVRP avatar Feb 21 '25 14:02 HansVRP

@soxofaan I do not think the failing unit tests in 3.11 are due to my changes?

indeed, that problem has been fixed on master by now

soxofaan avatar Feb 24 '25 16:02 soxofaan

One issue I still see and want to resolve is that launching jobs (to queue them for start) is currently still tied to the backend load.

This means that when running at full capacity you will not already 'precreate jobs' which can instantly start running

HansVRP avatar Feb 25 '25 12:02 HansVRP

when running at full capacity you will not already 'precreate jobs' which can instantly start running

I'm not sure that pre-creating jobs instead of on the fly like we currently do will make that big of a difference as the time to create a job is usually negligible compared to the required time to start a job. Or do you experience different timings?

soxofaan avatar Feb 25 '25 15:02 soxofaan

@soxofaan adjusted the unit tests and resolved the small bugs.

I also proposed splitting off the _JobManagerWorkerThread, so that in the future it can also contain alternative threads for downloading, job worker thread pools, ...

HansVRP avatar Feb 26 '25 11:02 HansVRP

@soxofaan as discussed, removed the queues from the design; it now uses the thread pool's internal queues.
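
To illustrate the idea of relying on the thread pool's own queue, here is a minimal sketch. It is not the code from this PR: `StartJobPool` and `start_job` are hypothetical names, and the "start" request is faked. It only shows that `concurrent.futures.ThreadPoolExecutor` already queues submitted work internally, so no separate `queue.Queue` is needed for hand-off.

```python
from concurrent.futures import Future, ThreadPoolExecutor
from typing import List


def start_job(job_id: str) -> str:
    """Hypothetical stand-in for the (slow) 'start job' request to a backend."""
    return f"{job_id}:queued"


class StartJobPool:
    """Hypothetical helper: submits start requests and collects finished ones."""

    def __init__(self, max_workers: int = 2):
        # The executor keeps its own internal work queue.
        self._executor = ThreadPoolExecutor(max_workers=max_workers)
        self._futures: List[Future] = []

    def submit(self, job_id: str) -> None:
        self._futures.append(self._executor.submit(start_job, job_id))

    def collect_done(self) -> List[str]:
        """Return results of finished tasks and keep the rest pending."""
        done, pending = [], []
        for future in self._futures:
            (done if future.done() else pending).append(future)
        self._futures = pending
        return [future.result() for future in done]

    def shutdown(self) -> None:
        # Blocks until all submitted tasks have finished.
        self._executor.shutdown(wait=True)


pool = StartJobPool(max_workers=2)
for i in range(3):
    pool.submit(f"job-{i}")
pool.shutdown()
started = sorted(pool.collect_done())
print(started)
```

The main loop can call something like `collect_done()` on each iteration to pick up finished start requests without blocking.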

HansVRP avatar Mar 14 '25 15:03 HansVRP

@soxofaan not sure why these tests are failing, but I think the code is at a good point to reevaluate.

Uncertain whether we need to post-process after shutting down the worker pool, given the

while ( sum( job_db.count_by_status( statuses=["not_started", "created", "queued", "queued_for_start", "running"] ).values()

loop.

As long as there are jobs in the not_started or queued_for_start states (the ones the postprocessing would touch), we remain in the loop.
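
A minimal sketch of how such a loop behaves, assuming the truncated condition above is compared against zero. `ToyJobDb`, `NEXT_STATUS`, `advance` and `run_until_done` are hypothetical stand-ins for illustration, not the job manager's actual implementation; only the `count_by_status(statuses=...)` call shape is taken from the snippet above.

```python
from collections import Counter
from typing import Dict, List

ACTIVE_STATUSES = ["not_started", "created", "queued", "queued_for_start", "running"]

# Hypothetical, simplified status progression for the toy example.
NEXT_STATUS = {
    "not_started": "queued_for_start",
    "queued_for_start": "queued",
    "created": "queued",
    "queued": "running",
    "running": "finished",
}


class ToyJobDb:
    """Hypothetical in-memory stand-in for a job database."""

    def __init__(self, statuses: List[str]):
        self.statuses = list(statuses)

    def count_by_status(self, statuses: List[str]) -> Dict[str, int]:
        return dict(Counter(s for s in self.statuses if s in statuses))


def advance(job_db: ToyJobDb) -> None:
    """Move every non-terminal job one step along the toy progression."""
    job_db.statuses = [NEXT_STATUS.get(s, s) for s in job_db.statuses]


def run_until_done(job_db: ToyJobDb, max_rounds: int = 100) -> int:
    """Loop while any job is still in an active status; return the round count."""
    rounds = 0
    while sum(job_db.count_by_status(statuses=ACTIVE_STATUSES).values()) > 0:
        advance(job_db)
        rounds += 1
        if rounds >= max_rounds:
            raise RuntimeError("toy loop did not terminate")
    return rounds


db = ToyJobDb(["not_started", "running", "finished"])
rounds = run_until_done(db)
print(rounds, db.statuses)
```

Because "not_started" and "queued_for_start" are in the list of active statuses, the loop cannot exit while a job is still waiting for its start to be processed.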

HansVRP avatar Mar 20 '25 10:03 HansVRP

Uncertain whether we need to post process after shutting down the worker pool, given the

good point, however, that probably only works out now, while only the job start is done in a side thread. Once we add threaded result downloading or other features, that while(sum(...)) is not going to guarantee that all the (side) work is done yet.
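
A small sketch of that concern, under stated assumptions: `download_results` is a hypothetical side task (threaded downloading does not exist in this PR), and the status dictionary is a toy. It shows a situation where the status-based loop condition is already false while a side thread is still busy, so a final round of post-processing after shutting down the pool is still needed.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


def download_results(job_id: str, release: threading.Event) -> str:
    """Hypothetical side task that outlives the job's 'running' status."""
    release.wait(timeout=5)  # simulate a download that takes a while
    return f"{job_id}:downloaded"


release = threading.Event()
executor = ThreadPoolExecutor(max_workers=1)

# The job itself is already finished according to the (toy) job database.
statuses = {"job-1": "finished"}
future = executor.submit(download_results, "job-1", release)

active = {"not_started", "created", "queued", "queued_for_start", "running"}
loop_would_exit = not any(s in active for s in statuses.values())
side_work_pending = not future.done()

# Let the side task finish, wait for the pool, then post-process what is left.
release.set()
executor.shutdown(wait=True)
post_processed = future.result()
print(loop_would_exit, side_work_pending, post_processed)
```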

soxofaan avatar Mar 21 '25 14:03 soxofaan

@soxofaan ready for review

HansVRP avatar Apr 18 '25 15:04 HansVRP

Ran a small stress test for 30 short lived jobs (10 parallel jobs).

image

Total time (standard): 2728.00 seconds
Total time (threaded): 2147.00 seconds
Total time gain: 581.00 seconds (21.30% faster)

So the time between creating a job and running a job became about 20% shorter. This does need to be put in perspective: these gains are small compared with the actual duration of entire openEO jobs...
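
For reference, the reported gain follows directly from the two totals. The per-job figure below is an added assumption: it simply spreads the gain evenly over the 30 jobs, which the measurement itself does not confirm.

```python
standard_s = 2728.0  # total time, standard job manager (from the stress test)
threaded_s = 2147.0  # total time, threaded job start (from the stress test)
n_jobs = 30

gain_s = standard_s - threaded_s
gain_pct = 100 * gain_s / standard_s
# Assumption: gain spread evenly over all jobs.
gain_per_job_s = gain_s / n_jobs

print(f"Total time gain: {gain_s:.2f} seconds ({gain_pct:.2f}% faster)")
print(f"Average gain per job: {gain_per_job_s:.1f} seconds")
```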

HansVRP avatar Apr 18 '25 18:04 HansVRP

FYI: I merged master in your feature branch hv_issue719-job-manager-threaded-job-start to trigger a new build

soxofaan avatar May 22 '25 15:05 soxofaan

(just pushed a merge of master to resolve conflict and trigger a test rerun)

soxofaan avatar May 26 '25 09:05 soxofaan

ok tests pass again :partying_face:

some remaining todos

  • [x] changelog entry covering both the threaded aspect, as well as the new API JobDatabaseInterface.get_by_indices for job db implementers to implement
  • [x] double check the lost sleep in the dummy task (see higher)

soxofaan avatar Aug 20 '25 10:08 soxofaan

Awesome; shall I pick up here for the lost sleep dummy task?

HansVRP avatar Aug 20 '25 12:08 HansVRP

I was already looking into DummyTask anyway, because my IDE (pycharm) flags some issues about writing to read-only attributes

soxofaan avatar Aug 20 '25 12:08 soxofaan

FYI: as this is a very long running PR with messy history, I tried to rebase the feature branch of this PR to clean it up a bit (e.g. commit squashing) to improve the signal-to-noise ratio and created PR #806. The diff is practically identical to #736 (except for one empty line).

The discussion can stay here (#736), but #806 will be merged in the end

soxofaan avatar Sep 09 '25 09:09 soxofaan

ok, I decided to merge this (through PR #806) in 41710047e0f02ba546de14390f1d390b0e2f6a5b :partying_face:

soxofaan avatar Sep 09 '25 14:09 soxofaan