Running multiple agents on a single node
To run two or more Backend.AI agents on a single GPU node, we need to do the following:
- [x] Have a config option to use separate IPC socket directories. (lablup/backend.ai-agent#347)
- [ ] Have localized (i.e., in the `agent.toml` file) device mask config options for all compute plugins
- [ ] Distinguish which agent instance owns a running kernel container when rescanning them and when processing the Docker event (lifecycle events) stream
  - We need to add a new label when creating the container to tag which agent owns that container.
- [ ] Test and improve our compute plugins (including CUDA) to handle statistics from non-owned containers
  - We should deprecate support for CentOS 7 because it does not support the NS PID mapping query and our fallback routine is not safe when multiple agents run on the same node.
- [ ] Write a guide on how to configure and deploy multi-agent-on-single-node setups.
- [ ] Q: Should we have separate watchers for each agent instance or a unified one?
refs #208
- https://github.com/lablup/backend.ai/pull/624 Can this PR resolve the second item of this issue (localized device mask config options)?
- https://github.com/lablup/backend.ai/pull/712 And can this PR resolve the third item of this issue (distinguishing which agent instance owns a running kernel container)? I just let each agent filter out the kernels that are not registered in its own kernel registry, because I think a single kernel cannot be spawned by multiple different agents.
- feat: allow passing blocklist to compute plugin ctx read from local config #624 Can this PR resolve the second item of this issue (localized device mask config options)?
I think we should take an "allowlist" approach instead of a "blocklist" one, because it makes the configuration of multi-agent-on-single-node setups more human-friendly.
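To illustrate the difference, here is a minimal sketch of how a device mask could be resolved under either approach. The function name `visible_devices` and its parameters are hypothetical and are not taken from the Backend.AI code base or from #624.

```python
from typing import Iterable, List, Optional, Set


def visible_devices(
    all_devices: Iterable[str],
    allowlist: Optional[Set[str]] = None,
    blocklist: Optional[Set[str]] = None,
) -> List[str]:
    """Return the device IDs one agent may use, keeping discovery order.

    Hypothetical helper for illustration only.
    """
    if allowlist is not None and blocklist is not None:
        raise ValueError("specify either an allowlist or a blocklist, not both")
    if allowlist is not None:
        # Only the explicitly named devices are visible.
        return [d for d in all_devices if d in allowlist]
    if blocklist is not None:
        # Everything except the named devices is visible.
        return [d for d in all_devices if d not in blocklist]
    return list(all_devices)


gpus = ["0", "1", "2", "3"]

# Allowlist: each agent names only the devices it owns.
agent_a = visible_devices(gpus, allowlist={"0", "1"})
agent_b = visible_devices(gpus, allowlist={"2", "3"})

# Blocklist: each agent must enumerate every device owned by the others.
agent_a_blocked = visible_devices(gpus, blocklist={"2", "3"})
```

With an allowlist, adding a GPU to the node does not silently change what an existing agent sees. With a blocklist, every agent's config would have to be updated to exclude the new device.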
- feat: skip containers owned by other agents in the same host #712 And can this PR resolve the third item of this issue (distinguishing which agent instance owns a running kernel container)? I just let each agent filter out the kernels that are not registered in its own kernel registry, because I think a single kernel cannot be spawned by multiple different agents.
#712 is not what I want. We should attach explicit labels to containers when creating them to distinguish the owner agent. Ultimately I'm going to remove kernel_registry entirely and migrate to the reconciliation loop design.
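As a rough sketch of the label-based approach, the snippet below tags each container with its owner at creation time and filters on that tag when rescanning. The label key `ai.backend.owner`, the `ContainerInfo` type, and both helper functions are assumptions made for illustration, not the actual agent implementation.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterable, List

# Hypothetical label key; the real key would be chosen by the agent code.
OWNER_LABEL = "ai.backend.owner"


@dataclass
class ContainerInfo:
    """Minimal stand-in for what a container listing would return."""
    container_id: str
    labels: Dict[str, str] = field(default_factory=dict)


def make_owner_labels(agent_id: str) -> Dict[str, str]:
    """Labels to attach when an agent creates a kernel container."""
    return {OWNER_LABEL: agent_id}


def owned_by(containers: Iterable[ContainerInfo], agent_id: str) -> List[ContainerInfo]:
    """Keep only containers whose owner label matches this agent."""
    return [c for c in containers if c.labels.get(OWNER_LABEL) == agent_id]


containers = [
    ContainerInfo("c1", make_owner_labels("agent-a")),
    ContainerInfo("c2", make_owner_labels("agent-b")),
    ContainerInfo("c3"),  # unlabeled, e.g. not created by any agent
]
mine = owned_by(containers, "agent-a")
```

Because ownership is stored on the container itself, it survives an agent restart and does not depend on in-memory state such as a kernel registry. The same check could be applied to each lifecycle event by looking up the labels of the container the event refers to.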