backend.ai-manager icon indicating copy to clipboard operation
backend.ai-manager copied to clipboard

feat: Add background-task for kernel-pull-progress

Open youngjun0627 opened this issue 4 years ago • 8 comments

refs backend.ai-issue#227

  • Manager TODOs
    • [x] Spawn a background task when creating a new session, returning the task ID to the client after max_wait seconds.
    • [x] Let the background task relay the kernel_pull_progress events.

About Background-Task

The bgtask contains two asynchronous functions, one of them selects kernel's status from 'kernels' table, the other updates progress. ~To ensure synchronization, I use Asyncio-lock.~ Update_progress function updates progress status, reacting to KernelPullprogress Event. And, Progress Reporter is updated by the function. So, manager sends the progress-reporter to client, and, client displays kernel-pull-progress using the reporter.

youngjun0627 avatar Sep 01 '21 10:09 youngjun0627

Codecov Report

Merging #477 (60bfc69) into main (15acec7) will not change coverage. The diff coverage is n/a.

:exclamation: Current head 60bfc69 differs from pull request most recent head e2f01f4. Consider uploading reports for the commit e2f01f4 to get more accurate results Impacted file tree graph

@@           Coverage Diff           @@
##             main     #477   +/-   ##
=======================================
  Coverage   48.89%   48.89%           
=======================================
  Files          54       54           
  Lines        8908     8908           
=======================================
  Hits         4356     4356           
  Misses       4552     4552           

Continue to review full report at Codecov.

Legend - Click here to learn more Δ = absolute <relative> (impact), ø = not affected, ? = missing data Powered by Codecov. Last update 15acec7...e2f01f4. Read the comment docs.

codecov[bot] avatar Sep 08 '21 06:09 codecov[bot]

I have finished to fix the code reflected your feedback. And, following images show difference between when max_wait is 0 or not.

  • Value of max-wait is equal 0 image

  • Set max-wait seconds image

Also, the code in client-py is fixed. Formally, I don't check 'background_task'. After fixed, I check 'background_task'. If true, kernel-pull-progress is displayed to client during pulling(downloading image). If not, client can't see the progress and wait it.

youngjun0627 avatar Sep 09 '21 07:09 youngjun0627

I have finished to fix the code reflected your feedback. And, following images show difference between when max_wait is 0 or not.

  • Value of max-wait is equal 0 image
  • Set max-wait seconds image

Also, the code in client-py is fixed. Formally, I don't check 'background_task'. After fixed, I check 'background_task'. If true, kernel-pull-progress is displayed to client during pulling(downloading image). If not, client can't see the progress and wait it.

Just a note: Formally -> Formerly / Previously This made me confused for a moment. ;)

achimnol avatar Sep 09 '21 07:09 achimnol

I have no idea how to send some data(e.g, kernel_id, root_ctx, ...) to this as separate function . So, I change this to callable class and separate this from '_create' function.

youngjun0627 avatar Oct 23 '21 08:10 youngjun0627

Callable class is generally considered an anti-pattern unless you are writing a kind of low-level library. Just pass any required information as additional arguments.

achimnol avatar Oct 31 '21 03:10 achimnol

To handle multi-node scenarios (cluster sessions) where multiple agents generate the same KernelPullProgressEvent with their own progresses, I think there should be a logic to aggregate them in the session-level and calculate the "total" progress of the multiple agents. Maybe this could be handled in the client-side, by observing multiple bgtasks concurrently.

achimnol avatar Oct 31 '21 03:10 achimnol

To handle multi-node scenarios (cluster sessions) where multiple agents generate the same KernelPullProgressEvent with their own progresses, I think there should be a logic to aggregate them in the session-level and calculate the "total" progress of the multiple agents. Maybe this could be handled in the client-side, by observing multiple bgtasks concurrently.

To implement this in the client-side, we need to:

  • manager: Return the multiple kernel IDs and bgtask IDs for individual kernels from api.session._create()
  • client-py: Open multiple SSE connections to the bgtask IDs and show the per-kernel progress like multi-layer progress view in Docker CLI's pull command implementation, or aggregate the per-kernel progresses into a single progress bar.

To implement this in the manager-side:

  • manager: Return the single bgtask ID just like now.
  • manager: In the bgtask, aggregate the multiple kernel pull progress events that belongs to the current session (which may have multiple kernels) and return the aggregated result.

Personally, I think it's better to implement in the client side, as depending on the client-side implementation, we would have runtime options whether to expose more per-kernel details or show the overall progress only.

achimnol avatar Oct 31 '21 05:10 achimnol

Sorry for the late reply. I have a question.

  • command 1 : backend.ai run --cluster-mode multi-node --cluster-size 2 'print("hello")' python
  • command 2 : backend.ai run -c 'import os; print("Hello world, {}".format(os.environ["CASENO"]))' -r cpu=1 -r mem=320m -e 'CASENO=$X' --env-range=X=case:1,2 python Which one does multi-node scenarios (cluster sessions) mean, command 1 or command 2?

And, if we implement the multi scenario in the client-side, is there no need to fix it anymore?

youngjun0627 avatar Nov 27 '21 11:11 youngjun0627