feat: Add background-task for kernel-pull-progress
refs backend.ai-issue#227
- Manager TODOs
- [x] Spawn a background task when creating a new session, returning the task ID to the client after max_wait seconds.
- [x] Let the background task relay the kernel_pull_progress events.
About Background-Task
The bgtask contains two asynchronous functions: one selects the kernel's status from the 'kernels' table, and the other updates the progress. ~~To ensure synchronization, I used an asyncio lock.~~ The update_progress function updates the progress status in reaction to KernelPullProgress events, and it updates the progress reporter accordingly. The manager then sends the progress reporter to the client, and the client displays the kernel-pull progress using the reporter.
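To illustrate the structure described above, here is a minimal, hypothetical sketch of a bgtask with two coroutines: one polling the kernel status (standing in for a SELECT on the 'kernels' table) and one updating a shared progress reporter from pull-progress events. All names (`ProgressReporter`, `poll_kernel_status`, `update_progress`, `pull_bgtask`) and the queue-based event delivery are illustrative assumptions, not the actual Backend.AI implementation.

```python
import asyncio
from dataclasses import dataclass


@dataclass
class ProgressReporter:
    """Illustrative stand-in for the progress reporter sent to the client."""
    current: int = 0
    total: int = 100


async def poll_kernel_status(status_queue: asyncio.Queue) -> str:
    """Stand-in for polling the kernel's status from the 'kernels' table."""
    while True:
        status = await status_queue.get()
        if status in ("RUNNING", "ERROR"):
            return status


async def update_progress(reporter: ProgressReporter,
                          event_queue: asyncio.Queue) -> None:
    """React to KernelPullProgress-style events and update the reporter."""
    while True:
        evt = await event_queue.get()
        if evt is None:  # sentinel: pulling finished
            return
        reporter.current = evt


async def pull_bgtask(status_queue: asyncio.Queue,
                      event_queue: asyncio.Queue) -> ProgressReporter:
    """Run both coroutines concurrently; no lock is needed because both
    run on the same event loop and the reporter is updated atomically."""
    reporter = ProgressReporter()
    updater = asyncio.create_task(update_progress(reporter, event_queue))
    await poll_kernel_status(status_queue)
    await event_queue.put(None)  # stop the updater once the kernel is up
    await updater
    return reporter
```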
Codecov Report
Merging #477 (60bfc69) into main (15acec7) will not change coverage. The diff coverage is n/a.
:exclamation: The current head 60bfc69 differs from the pull request's most recent head e2f01f4. Consider uploading reports for the commit e2f01f4 to get more accurate results.
```diff
@@           Coverage Diff           @@
##             main     #477   +/-   ##
=======================================
  Coverage   48.89%   48.89%
=======================================
  Files          54       54
  Lines        8908     8908
=======================================
  Hits         4356     4356
  Misses       4552     4552
```
I have finished fixing the code to reflect your feedback. The following images show the difference between when max_wait is 0 and when it is not.
- When the value of max_wait equals 0
- When max_wait is set to some seconds
Also, the code in client-py is fixed. Formally, I did not check 'background_task'; after the fix, I check 'background_task'. If it is true, the kernel-pull progress is displayed to the client while the image is being pulled (downloaded); if not, the client cannot see the progress and just waits.
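A minimal sketch of the client-py behavior described above, assuming the creation response carries a `background_task` field with the bgtask ID; the function name and return strings are purely illustrative, not the real client API:

```python
def handle_create_response(resp: dict) -> str:
    """Decide how the client reacts to a session-creation response.

    Illustrative sketch only: the real client would open an SSE stream
    for the bgtask ID and render the kernel-pull progress.
    """
    task_id = resp.get("background_task")
    if task_id:
        # A bgtask ID was returned: follow it and show pull progress.
        return f"show progress for bgtask {task_id}"
    # No bgtask ID: fall back to waiting silently until the session is ready.
    return "wait without progress"
```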
Just a note: Formally -> Formerly / Previously This made me confused for a moment. ;)
I have no idea how to send some data (e.g., kernel_id, root_ctx, ...) to this as a separate function. So, I changed it to a callable class and separated it from the '_create' function.
A callable class is generally considered an anti-pattern unless you are writing a kind of low-level library. Just pass any required information as additional arguments.
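As a sketch of the reviewer's suggestion: instead of a callable class that stores kernel_id and root_ctx as instance state, bind them as plain arguments, e.g. with `functools.partial`. The function names and parameters here are hypothetical, chosen only to mirror the data mentioned in the comment:

```python
import asyncio
from functools import partial


async def pull_progress_task(reporter, kernel_id: str, root_ctx: dict) -> str:
    """Hypothetical bgtask body that receives its context as arguments
    rather than reading it from a callable-class instance."""
    # ... would use kernel_id and root_ctx to track the pull progress ...
    return f"tracking {kernel_id}"


def make_task(kernel_id: str, root_ctx: dict):
    """Bind the extra data without a class; the bgtask framework only
    needs a callable that accepts the progress reporter."""
    return partial(pull_progress_task, kernel_id=kernel_id, root_ctx=root_ctx)
```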
To handle multi-node scenarios (cluster sessions) where multiple agents generate the same KernelPullProgressEvent with their own progresses, I think there should be logic to aggregate them at the session level and calculate the "total" progress of the multiple agents. Maybe this could be handled on the client side, by observing multiple bgtasks concurrently.
> To handle multi-node scenarios (cluster sessions) where multiple agents generate the same `KernelPullProgressEvent` with their own progresses, I think there should be a logic to aggregate them in the session-level and calculate the "total" progress of the multiple agents. Maybe this could be handled in the client-side, by observing multiple bgtasks concurrently.
To implement this in the client-side, we need to:
- manager: Return the multiple kernel IDs and bgtask IDs for individual kernels from `api.session._create()`
- client-py: Open multiple SSE connections to the bgtask IDs and show the per-kernel progress like the multi-layer progress view in Docker CLI's pull command implementation, or aggregate the per-kernel progresses into a single progress bar.
To implement this in the manager-side:
- manager: Return the single bgtask ID just like now.
- manager: In the bgtask, aggregate the multiple kernel pull progress events that belong to the current session (which may have multiple kernels) and return the aggregated result.
Personally, I think it's better to implement this on the client side: depending on the client-side implementation, we would then have runtime options for whether to expose more per-kernel details or show only the overall progress.
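The client-side aggregation proposed above can be sketched in a few lines. This is an illustrative assumption of how a client might combine per-kernel progress values observed from multiple bgtask SSE streams into one overall percentage; `aggregate_progress` and its input shape are hypothetical:

```python
def aggregate_progress(per_kernel: dict[str, tuple[int, int]]) -> float:
    """Combine per-kernel (current, total) pairs into an overall percentage.

    per_kernel maps a kernel ID to its (current, total) progress, e.g. as
    collected from one bgtask SSE stream per kernel in a cluster session.
    """
    done = sum(cur for cur, _ in per_kernel.values())
    total = sum(tot for _, tot in per_kernel.values())
    return 100.0 * done / total if total else 0.0
```

Weighting by each kernel's own total (rather than averaging percentages) keeps the overall bar accurate when image sizes differ between kernels.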
Sorry for the late reply. I have a question.
- command 1:
  ```console
  backend.ai run --cluster-mode multi-node --cluster-size 2 'print("hello")' python
  ```
- command 2:
  ```console
  backend.ai run -c 'import os; print("Hello world, {}".format(os.environ["CASENO"]))' -r cpu=1 -r mem=320m -e 'CASENO=$X' --env-range=X=case:1,2 python
  ```

Which one do "multi-node scenarios (cluster sessions)" mean, command 1 or command 2?
And, if we implement the multi-node scenario on the client side, is there nothing left to fix on the manager side?