
Running multiple agents on a single node

Open achimnol opened this issue 3 years ago • 2 comments

To run two or more Backend.AI agents on a single GPU node, we need to do the following:

  • [x] Have a config option to use separate IPC socket directories. (lablup/backend.ai-agent#347)
  • [ ] Have localized (i.e., in the agent.toml file) device mask config options for all compute plugins
  • [ ] Distinguish which agent instance owns a running kernel container when rescanning them and when processing the Docker event (lifecycle events) stream
    • We need to add a new label when creating the container to tag which agent owns that container.
  • [ ] Test and improve our compute plugins (including CUDA) to handle statistics from non-owned containers
    • We should deprecate support for CentOS 7 because it does not support the NS PID mapping query and our fallback routine is not safe when multiple agents run on the same node.
  • [ ] Write guide documentation on how to configure and deploy multi-agent-on-single-node setups.
  • [ ] Q: Should we have separate watchers for each agent instance or a unified one?
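
As an illustration of the ownership-label item above, here is a minimal sketch of tagging containers with the owning agent at creation time and filtering on that tag when rescanning. It is not the project's implementation: the label key `ai.backend.owner`, the `ContainerInfo` record, and the agent IDs are all hypothetical names chosen for the example, and the container list is a plain in-memory stand-in for whatever the Docker API returns.

```python
from dataclasses import dataclass, field

# Hypothetical label key; the actual key name is not specified in this issue.
OWNER_LABEL = "ai.backend.owner"


@dataclass
class ContainerInfo:
    """Minimal stand-in for a container record from a Docker listing."""
    container_id: str
    labels: dict = field(default_factory=dict)


def make_labels(agent_id: str) -> dict:
    """Labels an agent would attach when it creates a kernel container."""
    return {OWNER_LABEL: agent_id}


def owned_by(agent_id: str, containers: list) -> list:
    """Keep only the containers whose owner label matches this agent."""
    return [c for c in containers if c.labels.get(OWNER_LABEL) == agent_id]


containers = [
    ContainerInfo("c1", make_labels("agent-a")),
    ContainerInfo("c2", make_labels("agent-b")),
    ContainerInfo("c3"),  # unlabeled, e.g. created before the label existed
]

# Each agent sees only its own containers when rescanning.
mine = owned_by("agent-a", containers)
print([c.container_id for c in mine])
```

The same predicate could be applied to each incoming Docker lifecycle event so that an agent ignores events for containers it does not own. How unlabeled containers should be treated is left open here; the sketch simply excludes them.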

achimnol avatar Jun 27 '22 04:06 achimnol

refs #208

achimnol avatar Jul 04 '22 05:07 achimnol

  1. https://github.com/lablup/backend.ai/pull/624 Can this PR resolve the second item of this issue (localized device mask config options)?

  2. https://github.com/lablup/backend.ai/pull/712 And can this PR resolve the third item of this issue (distinguishing which agent instance owns a running kernel container)? I simply let each agent filter out the kernels that are not registered in its own kernel registry, because I think a single kernel cannot be spawned by more than one agent.

fregataa avatar Sep 15 '22 08:09 fregataa

  1. feat: allow passing blocklist to compute plugin ctx read from local config #624 Can this PR resolve the second item of this issue (localized device mask config options)?

I think we should take an "allowlist" approach instead of a "blocklist" one, because it makes the configuration more human-friendly for multi-agent-on-single-node setups.
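
To make the allowlist/blocklist comparison concrete, here is a small sketch of how a per-agent device mask could be resolved under either style. The function name `visible_devices`, its parameters, and the device IDs are hypothetical; they do not correspond to actual agent.toml keys or plugin APIs.

```python
def visible_devices(all_devices, allowlist=None, blocklist=None):
    """Return the devices one agent may use, in the node's original order.

    Exactly one of allowlist/blocklist is expected; giving both is an error.
    """
    if allowlist is not None and blocklist is not None:
        raise ValueError("specify either an allowlist or a blocklist, not both")
    if allowlist is not None:
        unknown = set(allowlist) - set(all_devices)
        if unknown:
            raise ValueError(f"unknown devices in allowlist: {sorted(unknown)}")
        allowed = set(allowlist)
        return [d for d in all_devices if d in allowed]
    blocked = set(blocklist or ())
    return [d for d in all_devices if d not in blocked]


node_gpus = ["0", "1", "2", "3"]

# Allowlist style: each agent states what it owns.
agent_a = visible_devices(node_gpus, allowlist=["0", "1"])
agent_b = visible_devices(node_gpus, allowlist=["2", "3"])

# Blocklist style: agent A must enumerate everything it does NOT own.
agent_a_blocked = visible_devices(node_gpus, blocklist=["2", "3"])

print(agent_a, agent_b, agent_a_blocked)
```

With an allowlist, each agent's config reads as a direct statement of its share of the node. With a blocklist, every agent has to list the devices assigned to all other agents, which grows harder to keep consistent as agents or devices are added.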

  2. feat: skip containers owned by other agents in the same host #712 And can this PR resolve the third item of this issue (distinguishing which agent instance owns a running kernel container)? I simply let each agent filter out the kernels that are not registered in its own kernel registry, because I think a single kernel cannot be spawned by more than one agent.

#712 is not what I want. We should attach explicit labels to containers when creating them to distinguish the owner agent. Ultimately, I'm going to remove kernel_registry altogether and migrate to the reconciliation loop design.
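
For readers unfamiliar with the term, the general shape of a reconciliation pass can be sketched as below. This is only a generic illustration of the pattern, not the planned Backend.AI design: the `reconcile` function and the kernel IDs are hypothetical, and the "actual" set is assumed to come from listing containers that carry this agent's owner label.

```python
def reconcile(desired: set, actual: set) -> dict:
    """One reconciliation pass: diff desired state against observed state.

    desired: kernel IDs this agent is supposed to be running.
    actual:  kernel IDs observed in containers owned by this agent.
    """
    return {
        "create": sorted(desired - actual),   # missing, should be started
        "destroy": sorted(actual - desired),  # leftover, should be cleaned up
    }


actions = reconcile(desired={"k1", "k2"}, actual={"k2", "k3"})
print(actions)
```

In this pattern the agent keeps no authoritative in-memory registry of running kernels; it repeatedly derives the required actions from the observed containers, which is why explicit owner labels on the containers matter.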

achimnol avatar May 05 '23 06:05 achimnol