backend.ai Running multiple agents on a single node

To run two or more Backend.AI agents on a single GPU node, we need to do the following:

[x] Have a config option to use separate IPC socket directories. (lablup/backend.ai-agent#347)
[ ] Have localized (i.e., in the agent.toml file) device mask config options for all compute plugins
[ ] Distinguish which agent instance owns a running kernel container when rescanning them and when processing the Docker event (lifecycle events) stream
- We need to add a new label when creating the container to tag which agent owns that container.
[ ] Test and improve our compute plugins (including CUDA) to handle statistics from non-owned containers
- We should deprecate support for CentOS 7 because it does not support the NS PID mapping query and our fallback routine is not safe when multiple agents run on the same node.
[ ] Write a guide on how to configure and deploy multi-agent-on-single-node setups.
[ ] Q: Should we have separate watchers for each agent instance or a unified one?
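
A minimal sketch of the ownership-label item in the checklist above, not the actual agent code. The label key `example.backend.owner-agent`, the `ContainerInfo` type, and both helper functions are hypothetical names used only for illustration. The idea is that each agent stamps its own ID onto the containers it creates, and then looks only at that label when rescanning containers or handling lifecycle events.

```python
from __future__ import annotations

from dataclasses import dataclass, field

# Hypothetical label key; the real key name is not decided in this issue.
OWNER_LABEL = "example.backend.owner-agent"


@dataclass
class ContainerInfo:
    """Simplified stand-in for what a container listing would return."""
    container_id: str
    labels: dict[str, str] = field(default_factory=dict)


def owner_labels(agent_id: str) -> dict[str, str]:
    """Labels to attach when this agent creates a kernel container."""
    return {OWNER_LABEL: agent_id}


def owned_by(agent_id: str, containers: list[ContainerInfo]) -> list[ContainerInfo]:
    """Keep only the containers explicitly tagged with this agent's ID.

    Containers without the label (or tagged by another agent) are skipped,
    so a rescan or an event handler never touches a foreign container.
    """
    return [c for c in containers if c.labels.get(OWNER_LABEL) == agent_id]


containers = [
    ContainerInfo("c1", owner_labels("agent-a")),
    ContainerInfo("c2", owner_labels("agent-b")),
    ContainerInfo("c3"),  # unlabeled, e.g. not created by any agent
]
mine = owned_by("agent-a", containers)
```

The same check would apply to the event stream: an event for a container whose label does not match the local agent ID would simply be ignored.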

Jun 27 '22 04:06 achimnol

refs #208

Jul 04 '22 05:07 achimnol

https://github.com/lablup/backend.ai/pull/624 Can this PR resolve the second item of this issue (localized device mask config options)?
https://github.com/lablup/backend.ai/pull/712 And can this PR resolve the third item of this issue (distinguishing which agent instance owns a running kernel container)? I simply let each agent filter out the kernels that are not registered in its own kernel registry, because I think a single kernel cannot be spawned by more than one agent.

Sep 15 '22 08:09 fregataa

feat: allow passing blocklist to compute plugin ctx read from local config #624 Can this PR resolve the second item of this issue (localized device mask config options)?

I think we should take an "allowlist" approach instead of a "blocklist" one, to make the configuration of multi-agent-on-single-node setups more human-friendly.
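
To illustrate the difference, here is a rough sketch assuming a node with four devices split between two agents. The function `resolve_visible_devices` and its parameters are hypothetical and are not the actual plugin API. With an allowlist each agent names only what it owns, while with a blocklist each agent has to name everything it does not own.

```python
from __future__ import annotations

from typing import Optional, Sequence


def resolve_visible_devices(
    all_devices: Sequence[str],
    allowlist: Optional[Sequence[str]] = None,
    blocklist: Optional[Sequence[str]] = None,
) -> list[str]:
    """Return the devices one agent may use (hypothetical helper).

    Exactly one style is expected per agent; mixing both is rejected
    to keep the configuration unambiguous.
    """
    if allowlist is not None and blocklist is not None:
        raise ValueError("specify either an allowlist or a blocklist, not both")
    if allowlist is not None:
        allowed = set(allowlist)
        return [d for d in all_devices if d in allowed]
    blocked = set(blocklist or ())
    return [d for d in all_devices if d not in blocked]


all_devices = ["0", "1", "2", "3"]

# Allowlist style: each agent states the devices it owns.
agent1_allow = resolve_visible_devices(all_devices, allowlist=["0", "1"])
agent2_allow = resolve_visible_devices(all_devices, allowlist=["2", "3"])

# Blocklist style: each agent must enumerate the other agent's devices.
agent1_block = resolve_visible_devices(all_devices, blocklist=["2", "3"])
agent2_block = resolve_visible_devices(all_devices, blocklist=["0", "1"])
```

Both styles give the same split here, but with a blocklist every agent's config has to be updated whenever a device is added to the node, which is easy to get wrong.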

feat: skip containers owned by other agents in the same host #712 And can this PR resolve the third item of this issue (distinguishing which agent instance owns a running kernel container)? I simply let each agent filter out the kernels that are not registered in its own kernel registry, because I think a single kernel cannot be spawned by more than one agent.

#712 is not what I want. We should attach explicit labels to containers when creating them to distinguish the owner agent. Ultimately I'm going to remove kernel_registry entirely and migrate to a reconciliation loop design.
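
A very rough sketch of what a label-based reconciliation step could look like, under the assumption that every container carries an owner label and a kernel ID label. The label keys and the `reconcile` function are hypothetical; the real design is not specified in this thread. The point is that ownership is read from the containers themselves rather than from an in-memory registry.

```python
from __future__ import annotations

from typing import Iterable, Mapping

# Hypothetical label keys, used only for this illustration.
OWNER_LABEL = "example.backend.owner-agent"
KERNEL_LABEL = "example.backend.kernel-id"


def reconcile(
    agent_id: str,
    desired_kernel_ids: set[str],
    container_labels: Iterable[Mapping[str, str]],
) -> tuple[list[str], list[str]]:
    """Compare the desired kernels with the containers this agent owns.

    Returns (to_create, to_destroy) as sorted lists of kernel IDs.
    Containers labeled for other agents are never considered.
    """
    actual = {
        labels[KERNEL_LABEL]
        for labels in container_labels
        if labels.get(OWNER_LABEL) == agent_id and KERNEL_LABEL in labels
    }
    to_create = sorted(desired_kernel_ids - actual)
    to_destroy = sorted(actual - desired_kernel_ids)
    return to_create, to_destroy


found_on_host = [
    {OWNER_LABEL: "agent-a", KERNEL_LABEL: "k1"},
    {OWNER_LABEL: "agent-b", KERNEL_LABEL: "k2"},  # owned by another agent
    {OWNER_LABEL: "agent-a", KERNEL_LABEL: "k3"},  # no longer desired
]
to_create, to_destroy = reconcile("agent-a", {"k1", "k4"}, found_on_host)
```

Running this step periodically (and on lifecycle events) would let an agent converge on the desired state after a restart without needing a persisted registry.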

May 05 '23 06:05 achimnol